
DSI408: Real Application Clusters Internals
Electronic Presentation

D16333GC10
Production 1.0
April 2003
D37990

Authors
Xuan Cong-Bui
John P. McHugh
Michael Müller

Technical Contributors and Reviewers
Michael Cebulla
Lex de Haan
Bill Kehoe
Frank Kobylanski
Roderick Manalac
Sundar Matpadi
Sri Subramaniam
Harald van Breederode
Jim Womack

Publisher
Glenn Austin

Copyright © 2003, Oracle. All rights reserved.

This documentation contains proprietary information of Oracle Corporation. It is provided under a license agreement containing restrictions on use and disclosure and is also protected by copyright law. Reverse engineering of the software is prohibited. If this documentation is delivered to a U.S. Government Agency of the Department of Defense, then it is delivered with Restricted Rights and the following legend is applicable:

Restricted Rights Legend

Use, duplication or disclosure by the Government is subject to restrictions for commercial computer software and shall be deemed to be Restricted Rights software under Federal law, as set forth in subparagraph (c)(1)(ii) of DFARS 252.227-7013, Rights in Technical Data and Computer Software (October 1988).

This material or any portion of it may not be copied in any form or by any means without the express prior written permission of the Education Products group of Oracle Corporation. Any other copying is a violation of copyright law and may result in civil and/or criminal penalties.

If this documentation is delivered to a U.S. Government Agency not within the Department of Defense, then it is delivered with Restricted Rights, as defined in FAR 52.227-14, Rights in Data-General, including Alternate III (June 1987).

The information in this document is subject to change without notice. If you find any problems in the documentation, please report them in writing to Worldwide Education Services, Oracle Corporation, 500 Oracle Parkway, Box SB-6, Redwood Shores, CA 94065. Oracle Corporation does not warrant that this document is error-free.

Oracle and all references to Oracle Products are trademarks or registered trademarks of Oracle Corporation.

All other products or company names are used for identification purposes only, and may be trademarks of their respective owners.

DSI408: Real Application Clusters Internals
Volume 1 - Student Guide

D16333GC10
Edition 1.0
April 2003
D37988


DSI408: Real Application Clusters Internals
Volume 2 - Student Guide

D16333GC10
Edition 1.0
April 2003
D37989


Contents
Preface
I  Course Overview: DSI 408: RAC Internals


Prerequisites I-2
Course Overview I-3
Practical Exercises I-5

Section I: Introduction
1  Introduction to RAC
Objectives 1-2
Why Use Parallel Processing? 1-3
Scaleup and Speedup 1-5
Scalability Considerations 1-7
RAC Costs: Synchronization 1-9
RAC Costs: Global Resource Directory 1-10
RAC Costs: Cache Coherency 1-12
RAC Terminology 1-14
Terminology Translations 1-16
Programmer Terminology 1-18
History 1-19
History Overview 1-20
Internalizing Components 1-21
Oracle7 1-22
Oracle8 1-23
Oracle8i 1-24
Oracle9i 1-25
Summary 1-26

2  Introduction to RAC Internals


Objectives 2-2
Simple RAC Diagram 2-3
One RAC Instance 2-4
Internal RAC Instance 2-5
Oracle Code Stack 2-6
RAC Component List 2-7
Module Relation View 2-8
Alternate Module Relation View 2-9
Module, Code Stack, Process 2-10
Operating System Dependencies (OSD) 2-11
Platform-Specific RAC 2-12
OSD Module: Example 2-13
Summary 2-15
References 2-16


Section II: Architecture


3  Cluster Layer: Cluster Monitor


Objectives 3-2
RAC and Cluster Software 3-3
Generic CM Functionality: Distributed Architecture 3-4
Generic CM Functionality: Cluster State 3-5
Generic CM Functionality: Node Failure Detection 3-6
Cluster Layer and Cluster Manager 3-7
Oracle-Supplied CM 3-8
Summary 3-9

4  Cluster Group Services and Node Monitor


Objectives 4-2
RAC and CGS/GMS and NM 4-3
Node Monitor (NM) 4-4
RDBMS SKGXN Membership 4-5
NM Groups 4-6
NM Internals 4-7
Node Membership 4-8
Instance Membership Changes 4-10
NM Membership Death 4-12
Starting an Instance: Traditional 4-13
Starting an Instance: Internal 4-14
Stopping an Instance: Traditional 4-15
Stopping an Instance: Internal 4-16
NM Trace and Debug 4-17
Cluster Group Services (CGS) 4-18
Configuration Control 4-19
Valid Members 4-20
Membership Validation 4-23
Membership Invalidation 4-24
CGS Reconfiguration Types 4-26
CGS Reconfiguration Protocol 4-27
Reconfiguration Steps 4-28
IMR-Initiated Reconfiguration: Example 4-30
Code References 4-32
Summary 4-33

5  RAC Messaging System


Objectives 5-2
RAC and Messaging 5-3
Typical Three-Way Lock Messages 5-4
Asynchronous Traps 5-5
AST and BAST 5-6
Message Buffers 5-7
Message Buffer Queues 5-8

Messaging Deadlocks 5-9


Message Traffic Controller (TRFC) 5-10
TRFC Tickets 5-11
TRFC Flow 5-13
Message Traffic Statistics 5-15
IPC 5-18
IPC Code Stack 5-19
Reference Implementation 5-20
KSXP Wait Interface to KSL 5-21
KSXP Tracing 5-22
KSXP Trace Records 5-23
SKGXP Interface 5-24
Choosing an SKGXP Implementation 5-25
SKGXP Tracing 5-26
Possible Hang Scenarios 5-27
Other Events for IPC Tracing 5-28
Code References 5-29
Summary 5-30
6  System Commit Number


Objectives 6-2
System Commit Number 6-3
Logical Clock and Causality Propagation 6-4
Basics of SCN 6-5
SCN Latching 6-7
Lamport Implementation 6-8
Lamport SCN 6-9
Limitations on SCN Propagation 6-10
max_commit_propagation_delay 6-11
Piggybacking SCN in Messages 6-12
Periodic Synchronization 6-13
SCN Generation in Earlier Versions of Oracle 6-14
Code References 6-15
Summary 6-16

7  Global Resource Directory: Formerly the Distributed Lock Manager


Objectives 7-2
RAC and Global Resource Directory (GRD) 7-3
DLM History 7-4
DLM Concepts: Terminology 7-5
DLM Concepts: Resources 7-6
DLM Concepts: Locks 7-7
DLM Concepts: Processes 7-8
DLM Concepts: Shadow Resources 7-9
DLM Concepts: Copy Locks 7-10
Resource or Lock Mastering 7-11
Basic Resource Structures 7-12

DLM Structures 7-13


Lock Mode Changes 7-16
Simple Lock Changes on a Resource 7-17
Changes on a Resource with Deadlock 7-18
DLM Functions 7-19
DLM Functionality in Global Enqueue Service Daemon (LMD0) 7-20
DLM Functionality in Global Enqueue Service Monitor (LMON) 7-22
DLM Functionality in Global Cache Service Process (LMS) 7-23
DLM Functionality in Other Processes 7-24
Configuring GES Resources 7-25
Configuring GES Locks 7-26
Configuring GCS Resources 7-27
Configuring GCS Locks 7-28
Configuring DLM processes 7-29
Logical to Physical Nodes Mapping 7-30
Buckets to Logical Nodes Mapping 7-31
Mapping for a New Node Joining the Cluster 7-32
Remapping When Node Joins 7-34
Mapping Broadcast by Master Node 7-35
Master Node Determination for GES 7-36
Master Node Determination for GCS 7-37
Dump and Trace of Remastering 7-38
DLM Functions 7-39
kjual Connection to DLM 7-40
kjual Flow 7-42
kjpsod Flow 7-43
DML Enqueue Handling Flow: Example 7-44
Step 1: P1 Locks Table in Share Mode 7-45
Step 2: P2 Locks Table in Share Mode 7-46
Step 3: P2 Does Rollback 7-47
Step 4: P1 Locks Table in Exclusive Mode 7-48
Step 5: P3 Locks Table in Share Mode 7-49
Step 6: P1 Does Rollback 7-50
Steps 1 and 2: Code Flow 7-51
Step 1: kjusuc Flow Detail 7-52
Step 2: kjusuc Flow Detail 7-54
Step 3: Code Flow 7-55
Step 3: kjuscl Flow Detail 7-56
Step 4: Code Flow 7-57
Step 4: kjuscv Flow Detail 7-58
Step 5: kjuscv Flow Detail 7-60
Step 6: kjuscl Flow Detail 7-61
Code References 7-63
Summary 7-64
References and Further Reading 7-65


8  Cache Coherency (Part One): Enqueues/Non-PCM


Objectives 8-2
Cache Coherency: Enqueues 8-3
Enqueue Types 8-6
Enqueue Structure 8-7
Examining Enqueues 8-8
Enqueues and DLM 8-9
Source Tree for Non-PCM Lock Flow 8-10
Lock Modes 8-11
Lock Compatibility 8-12
Deadlock Detection: The Classic Deadlock 8-13
Deadlock Detection: A More General Example 8-15
Deadlock Detection and Resolution 8-16
Timeout-Based Deadlock Detection 8-17
Deadlock Graph Printout 8-18
Deadlock Flow 8-19
Deadlock Flow: One Node 8-21
Deadlock Flow: Two Nodes 8-22
Parallel DML (PDML) Deadlocks 8-23
Deadlock Detection Algorithm 8-24
Deadlock Validation Steps 8-27
Code References 8-28
Summary 8-29

9  Cache Coherency (Part Two): Blocks/PCM Locks


Objectives 9-2
Cache Coherency: Blocks 9-3
Block Cache Contention 9-4
Earlier Cache Coherency: Oracle8 Ping Protocol 9-5
Earlier Cache Coherency: Oracle8i CR Server 9-6
Earlier Cache Coherency: Oracle8i CR Server 9-7
Oracle9i Cache Fusion Protocol 9-8
GCS (PCM) Locks 9-9
PCM Lock Attributes 9-10
Lock Modes 9-11
Lock Roles 9-12
Past Image 9-13
Local Lock Role 9-14
Global Lock Role 9-15
Block Classes 9-16
Lock Elements (LE) 9-17
Allocation of New LE 9-18
Hash Chain of LE 9-19
Block to LE Mapping 9-20
Queues of LE for LMS 9-21
LMSn Free of LE 9-22
Cache Fusion Examples: Overview 9-23

Cache Fusion: Example 1 9-25


Cache Fusion: Example 2 9-26
Cache Fusion: Example 3 9-27
Cache Fusion: Example 4 9-28
Cache Fusion: Example 5 9-29
Cache Fusion: Example 6 9-30
Cache Fusion: Example 7 9-31
Cache Fusion: Example 8 9-32
Cache Fusion: Example 9 9-33
Cache Fusion: Example 10 9-34
Cache Fusion: Example 11 9-35
Views 9-36
Parameters 9-39
Summary 9-40
10 Cache Fusion 1: CR Server
Objectives 10-2
Cache Fusion: Consistent Read Blocks 10-3
Consistent Read Review 10-4
Getting a CR Buffer 10-5
Getting a CR Buffer in Oracle9i Release 2 10-7
CR Server in Oracle9i Release 2 10-8
CR Requests 10-9
Light Work Rule 10-11
Fairness 10-12
Statistics 10-13
Wait Events 10-14
Fixed Table X$KCLCRST Statistics 10-15
CR Requestor-Side Algorithm 10-16
CR Requestor-Side AST Delivery 10-21
CR Requestor-Side CR Buffer Delivery 10-22
CR Server-Side Algorithm 10-23
Summary 10-27
11 Cache Fusion 2: Current Block: XCUR
Objectives 11-2
Cache Fusion: Current Blocks 11-3
PCM Locks and Resources 11-4
Fusion: Long Example 11-5
Initial State 11-7
Step 1: Instance 3 Performs SELECT 11-8
Lock Changes in Instance 3 11-9
Lock Changes in Instance 2 11-10
Step 2: Instance 2 Performs SELECT 11-11
Lock Changes in Instance 2 11-12
Step 3: Instance 2 Performs UPDATE 11-13
Lock Changes in Instance 2 11-14

Lock Changes in Instance 3 11-15


Step 4: Instance 1 Performs UPDATE 11-16
Lock Changes in Instance 2 11-17
Lock Changes in Instance 1 11-18
Step 5: Instance 3 Performs SELECT 11-19
Lock Changes in Instance 3 11-20
Step 6: Instance 1 Performs WRITE 11-21
Lock Changes in Instance 2 11-22
Lock Changes in Instance 1 11-23
Tables and Views 11-24
Summary 11-26
12 Cache Fusion Recovery
Objectives 12-2
NonCache Fusion OPS and Database Recovery 12-3
Cache Fusion RAC and Database Recovery 12-4
Overview of Fusion Lock States 12-5
Instance or Crash Recovery 12-6
SMON Process 12-7
First-Pass Log Read 12-8
Block Written Record (BWR) 12-9
BWR Dump 12-10
Recovery Set 12-11
Recovery Claim Locks 12-12
IDLM Response to RecoveryClaimLock Message on PCM Resource 12-13
No Lock Held by Recovering Instance on the PCM Resource 12-14
Recovery Claim Locks 12-15
Second-Pass Log Read 12-17
Large Recovery Set and Partial IR Lock Mode 12-19
Lock Database Availability During Recovery 12-22
Handling BASTs on Recovery Buffers 12-23
IR of Nonfusion Blocks 12-24
Failures During Instance Recovery 12-26
Memory Contingencies 12-28
Code References 12-29
Summary 12-31
Section III: Platforms
13 Linux Platform
Objectives 13-2
Linux RAC Architecture 13-3
Storage: Raw Devices 13-4
Extended Storage 13-5
Linux Cluster Software 13-6
OCMS 13-7
OCMS Components 13-8

WDD, NM, and CM Flow (Up to version 9.2.0.1) 13-9


Watchdog Daemon 13-10
Hangcheck, NM, and CM Flow (After version 9.2.0.2) 13-11
Hangcheck Module 13-12
Node Monitor (NM) 13-13
Cluster Manager 13-14
Linux Port-Specific Code 13-15
Cluster Manager 13-16
skgxpt and skgxpu 13-17
Installing RAC on Linux 13-18
Running RAC on Linux 13-21
Starting CM 13-22
Starting WDD 13-23
Starting NM 13-24
Starting CM 13-25
Debugging 13-26
Summary 13-27
References 13-28
14 HP-UX Platform
Objectives 14-2
HP-UX RAC Architecture 14-3
HP-UX Cluster Software 14-4
HP-UX Port-Specific Code 14-5
SKGXP (UDP Implementation) 14-6
SKGXP: Lowfat 14-7
Installing RAC on HP-UX 14-8
Running RAC on HP-UX 14-9
Debugging on HP-UX 14-10
Summary 14-11
15 Tru64 Platform
Objectives 15-2
Tru64 RAC Architecture 15-3
Shared Disk Systems 15-4
Tru64 Cluster Software 15-5
Tru64 Port-Specific Code 15-6
Node Monitor: SKGXN 15-7
IPC: SKGXP 15-8
SKGXPM: RDG 15-9
Installing RAC on Tru64 15-11
Debugging on Tru64 15-12

Useful Tru64 Commands 15-13


Summary 15-15
16 AIX Platform
Objectives 16-2
AIX RAC Architecture 16-3
AIX SP Clusters 16-4
AIX HACMP Clusters 16-5
AIX Cluster Software 16-6
AIX Cluster Layer 16-7
AIX Port-Specific Code 16-8
RAC on AIX Stack 16-9
Node Monitor (NM) 16-10
Installing RAC on AIX 16-12
Debugging on AIX 16-14
Summary 16-15
References 16-16
17 Other Platforms
Objectives 17-2
RAC Architecture: Solaris 17-3
RAC Architecture: Windows 17-4
RAC Architecture: OpenVMS 17-5
Port-Specific Code 17-6
Installing RAC 17-7
Summary 17-8
Section IV: Debug
18 V$ and X$ Views and Events
Objectives 18-2
V$ and GV$ Views 18-3
List of Views 18-4
Old and New Views 18-5
V$ Views for Lock Information 18-6
X$ Tables 18-7
Events 18-8
19 KST and X$TRACE
Objectives 19-2
KST: X$TRACE 19-3
KST Concepts 19-4
KST Concepts 19-6
Circular Buffer 19-7

Data Structure kstrc 19-8


Trace Control Interfaces 19-9
KST Initialization Parameters 19-10
KST Trace Control Interfaces 19-12
KST Fixed Table Views 19-14
KST Trace Output 19-15
KST Current Instrumentation 19-18
KST Performance 19-19
KST: Examples 19-20
KST Sample Trace File 19-24
KST Demonstration 19-25
DIAG Daemon 19-26
DIAG Daemon: Features 19-27
DIAG Daemon: Design 19-29
DIAG Daemon: Startup and Shutdown 19-33
DIAG Daemon: Crash Dumping 19-34
Summary 19-36
20 ORADEBUG and Other Debugging Tools
Objectives 20-2
ORADEBUG 20-3
Flash Freeze 20-5
LKDEBUG 20-6
NSDBX 20-7
HANGANALYZE 20-8
Summary 20-9
References 20-10
Appendix A: Practices
Appendix B: Solutions


Course Overview

DSI 408: RAC Internals


Prerequisites

Before taking this course, you should have:
- Taken DSI 401, 402, and 403 so that you know about the server internals on crashes, dumps, transactions, block handling, and recovery systems
- Taken the Real Application Clusters (RAC) administration course so that you know about the external view of RAC
- Performed at least one RAC installation and assisted in at least one RAC debugging case

Prerequisites
The prerequisites ensure that the course is useful to you, instead of being too hard, and that
the instructor need not cover basic material.
You must have your TAO account ready for examining source code.

DSI408: Real Application Clusters Internals I-2

Course Overview

The course includes the following four sections:
- Introduction
- Architecture
- Platforms
- Debug

Subjects that are not covered include:
- Utilities (srvctl, OCFS, HA)
- Performance tuning
- Pre-Oracle9i versions (OPS)


Course Overview
This course contains four sections. It is scheduled to take four days but does not require
one day per section. Most of the time is spent on the Architecture section.
Introduction
The Introduction section provides a summary of the public RAC architecture and its
accurate terminology. An overview of architecture changes between versions is also given.
Architecture
The Architecture section covers the theory of operation of RAC. The RAC code stack is
examined from the bottom up. There are many references to the source code.
Platforms
The Platforms section covers the differences and architectural details of RAC
implementation on different platforms. Installation issues and known gotchas are
included.


Course Overview (continued)


Debug
The Debug section provides a detailed explanation of the trace and dump mechanisms that
are placed inside RAC for fault location. A number of practical exercises use these
mechanisms.
Subjects not Covered
This course does not cover utility modules that are not part of the primary core RAC
functionality. It also does not cover some of the external programs that RAC depends on.
Performance is not covered as a separate topic. The knowledge from this course should be
sufficient to identify performance bottlenecks that are purely relevant to RAC; otherwise,
tuning is the same as for a single instance.
For Oracle Parallel Server (OPS), you should review earlier courses, which point out the differences between RAC and OPS; the RAC knowledge in this course is not applicable to OPS.


Practical Exercises


The course includes practical exercises.


Exercises run on a shared Solaris cluster.


Practical Exercises
The cluster hardware is shared between students and other classes; this prevents practices that involve node shutdown or breaking the interconnect.


[Figure: Diagram mapping the course sections onto the RAC component stack: SQL Layer and Buffer Cache on top of GES/GCS, CGS, the Node Monitor, and the Cluster Manager. Section I (Introduction) and Section II labels indicate which parts of the stack each section covers.]


Introduction to RAC


Objectives

After completing this lesson, you should be able to do the following:
- Review the design objectives of Real Application Clusters (RAC)
- Relate Oracle9i RAC to its predecessors


Why Use Parallel Processing?

- Scaleup: Increased throughput
- Speedup: Increased performance or faster response
- Higher availability
- Support for a greater number of users


Why Use Parallel Processing?


Scaleup: Increased Throughput
Parallel processing breaks a large task into smaller subtasks that can be performed
concurrently. With tasks that grow larger over time, a parallel system that also grows (or
scales up) can maintain a constant time for completing the same task.
Speedup: Increased Performance
For a given task, a parallel system that can scale up improves the response time for
completing the same task.
For decision support system (DSS) applications and parallel queries, parallel
processing decreases the response time.
For online transaction processing (OLTP) applications, speedup cannot be expected
due to the overhead of synchronization. Depending on the precise circumstances, a
decrease in performance can occur.


Why Use Parallel Processing? (continued)


Higher Availability
Because each node running in the parallel system is isolated from other nodes, a single node
failure or crash should not cause other nodes to fail. Other instances in the parallel server
environment remain up and running.
The operating system's failover capabilities and the fault tolerance of the distributed cluster software are important infrastructure components.
Support for a Greater Number of Users
Each node can support several users because each node has its own set of resources, such as
memory, CPU, and so on. As nodes are added to the system, more users can also be added,
allowing the system to continue to scale up.


Scaleup and Speedup

[Figure: Original system: one hardware unit completes 100% of the task in a given time. Cluster system scaleup: added hardware completes up to 200% or 300% of the task in the same time. Cluster system speedup: the task is split (50% each) across hardware units and completes in less time.]

Scaleup and Speedup


Scaleup
Scaleup is the capability of providing continued increases in throughput in the presence of
limited increases in processing capability while keeping the time constant:
Scaleup = (volume_parallel / volume_original) - time for interprocess communication
For example, if 30 users consume close to 100% of the CPU during their normal
processing, adding more users would cause the system to slow down due to contention for
limited CPU cycles. By adding CPUs, however, extra users can be supported without
degrading performance.
Speedup
Speedup is the capability of providing continued increases in speed in the presence of
limited increases in processing capability while keeping the task constant:
Speedup = (time_original / time_parallel) - time for interprocess communication
Speedup results in resource availability for other tasks. For example, if queries normally
take 10 minutes to process, and running in parallel reduces the time to 5 minutes, then
additional queries can run without introducing the contention that might occur if they were
to run concurrently.

Scaleup and Speedup (continued)


Speedup (continued)
Example 1: A particular application might take N seconds to fully scan and produce a summary of a 1 GB table.
With scaleup, if the table doubles in size, then doubling hardware resources should allow
the query to still complete in N seconds.
With speedup, if the table does not grow in size, doubling the hardware resources should
allow the query to complete in N/2 seconds.
Example 2: A particular application might have 100 users, each getting a three-second
response on queries.
With scaleup, if the number of users doubles in size, then doubling hardware resources
should allow response time to remain at three seconds.
With speedup, if the number of users remains the same, doubling the hardware resources
should reduce the response time. This occurs only if the three-second activity can be
broken down into two separate activities that can run independently of each other.
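The two examples above reduce to simple arithmetic. The sketch below ignores the interprocess communication term in the formulas; the function names are illustrative, not from the course:

```python
def scaleup(volume_parallel, volume_original):
    """Scaleup: how much more work completes while time is held constant."""
    return volume_parallel / volume_original

def speedup(time_original, time_parallel):
    """Speedup: how much faster the same, fixed task completes."""
    return time_original / time_parallel

# Example 1, scaleup: the table doubles in size and the hardware doubles,
# so twice the volume still completes in the same N seconds.
print(scaleup(volume_parallel=2.0, volume_original=1.0))   # 2.0

# Example 1, speedup: the table stays 1 GB, the hardware doubles,
# and the scan drops from N to N/2 seconds.
print(speedup(time_original=10.0, time_parallel=5.0))      # 2.0
```

In practice the interprocess communication overhead subtracts from both ratios, which is why perfect 2x figures are rarely achieved.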
A Success Example of Scaleup
The following testimonial is from the internal RAC mailing list. This was a response to
a question about the ease of changing a single instance to an RAC system.
Just yesterday, we tested with a customer a migration from single instance to two-node
RAC on Solaris. They were using Veritas DBE/AC for the cluster system.
These are the steps we took:
1. Node 1 Server running 9i single instance at approx 80% CPU load.
2. Connection through Transparent Application Failover with 40 retries and a delay of
five seconds.
3. Alter shared initialization file to set Cluster Database = true and add extra
parameters for the second node (bdump location and so on).
4. Shut down Database on Node 1.
5. Start up Database on Node 2 using new initialization file.
6. Start up Database on Node 1 using new initialization file.
At this point we had 85% of users on Node 1 and 15% on Node 2.
7. Run a script to disconnect sessions on Node 1 to allow them to load balance across
to Node 2.
At this point we had 50% of users on Node 1 and 50% on Node 2. The database was no
longer highly loaded and we were able to add more (now load-balanced) users.
The application was written in Java and was TAF-aware (i.e., it knew to retry transactions
with certain warning messages). Once we added the second node, the TPMs per Node
remained approximately the same so we had over 1.9 x improvement in TPMs, which was
pretty good scaling.
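Step 3 of the migration amounts to a handful of shared initialization file changes. A minimal sketch, assuming instance names rac1 and rac2 and hypothetical dump paths (the testimonial does not give the actual values):

```
*.cluster_database = true
*.cluster_database_instances = 2
rac1.instance_number = 1
rac2.instance_number = 2
rac1.background_dump_dest = /u01/app/oracle/admin/db/bdump1
rac2.background_dump_dest = /u01/app/oracle/admin/db/bdump2
```

Each instance also needs its own redo thread and undo configuration; the point of the example is only that the single-instance-to-RAC switch is a parameter change, not a rebuild.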


Scalability Considerations

- Hardware: Disk I/O
- Internode communication: High bandwidth and low latency
- Operating system: Number of CPUs (for example, SMP)
- Cache coherency and the Global Cache Service
- Database: Design
- Application: Design


Scalability Considerations
It is important to remember that if any of these six areas are not scalable (no matter how
scalable the other areas are), parallel cluster processing may not be successful.
Hardware scalability: High bandwidth and low latency offer the maximum scalability.
A high amount of remote I/O may prevent system scalability, because remote I/O is
much slower than local I/O.
Bandwidth of the communication interface is the total size of messages that can be
sent per second. Latency of the communication interface is the time required to place
a message on the interconnect. It indicates the number of messages that can be put on
the interconnect per unit of time.
Operating system: Nodes with multiple CPUs and methods of synchronization in the
OS can determine how well the system scales. Symmetric multiprocessing can
process multiple requests to resources concurrently.


Scalability Considerations (continued)


"The processes that manage local resource coordination in a cluster database are identical to the local resource coordination processes in single instance Oracle. This means that row and block level access, space management, system change number (SCN) creation, and data dictionary cache and library cache management are the same in Real Application Clusters as in single instance Oracle. If the resource is modified by more than one instance, then RAC performs further synchronization on a global level to permit shared access to this block across the cluster. Synchronization in this case requires internode messaging as well as the preparation of consistent read versions of the block and the transmission of copies of the block between memory caches within the cluster database." (See Oracle9i Real Application Clusters Concepts, Release 2 (9.2), Part Number A96597-01, Chapter 5, "Real Application Clusters Resource Coordination.")
Database scalability: Database scalability depends on how well the database is
designed (for example, how the data files are arranged, how well the locks are
allocated, and how well the objects are partitioned).
Scalability of the application: Application design is one of the keys to taking
advantage of the other elements of scalability. Regardless of how well the hardware
and database scale, parallel processing does not work as desired if the application
does not scale.
A typical cause for the lack of scalability is one common shared resource that must be
accessed often. This causes the otherwise parallel operations to serialize on this bottleneck.
A high latency in the synchronization increases the cost of synchronization, counteracting
the benefits of parallelization. This is a general limitation and not a RAC-specific
limitation.


RAC Costs: Synchronization

To scale, there is a cost in synchronization:
- Scalability = Synchronization
- Less synchronization = Speedup and scaleup

Synchronization is necessary to maintain cache coherency in RAC.


RAC Costs: Synchronization


Synchronization is a necessary part of parallel processing, but for parallel processing to be
advantageous, the cost of synchronization must be determined.
Synchronization provides the coordination of concurrent tasks and is essential for parallel
processing to maintain data integrity or correctness. Proper locking between disjoint SGAs
(Oracle instances) must be maintained to ensure correct data. This is cache coherency.
Partitioning can help reduce synchronization costs because there are fewer
concurrent tasks (that is, fewer concurrent users modifying the same set of data).
An application that modifies a small set of data can cause a high overhead for
synchronization if performed in disjoint SGAs.
Contention occurs between instances using a single block or row, such as a table with
one row that is used to generate sequence numbers.
Two ways to synchronize:
- Locks: latches, enqueues, locks
- Messages: send/wait for messages
Synchronization = Amount × Cost
- Amount: How often do you need to synchronize?
- Cost: How expensive is it to synchronize?
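The Amount × Cost model above can be made concrete with a small sketch. The per-operation costs below are hypothetical round numbers for illustration only, not Oracle measurements; the point is that an inter-instance message is orders of magnitude more expensive than a local latch, so the same Amount yields a very different total:

```python
def sync_overhead(amount: int, cost_us: float) -> float:
    """Total synchronization overhead = Amount (how often) x Cost (how expensive),
    expressed here in microseconds."""
    return amount * cost_us

# Hypothetical costs: a local latch get vs. an inter-instance message.
local_latch = sync_overhead(amount=100_000, cost_us=1.0)      # 100,000 us total
global_message = sync_overhead(amount=100_000, cost_us=200.0) # 20,000,000 us total
```

Reducing either factor (synchronize less often, or make each synchronization cheaper) reduces the total overhead; Cache Fusion attacks the Cost factor, application partitioning attacks the Amount factor.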
DSI408: Real Application Clusters Internals I-17

Levels of Synchronization
Row level (database):
- Oracle row-locking feature
- Maximizes concurrency
- SCN coherency
Local cache level (intra-instance):
- Every buffer in the cache is protected by logical semaphores (spin latches)
- Access to buffers is synchronized
Global Cache Fusion (inter-instance DLM):
- Every buffer in every cache is tracked by the GCS
- Cache coherency / cache consistency

- Latches: CACHE BUFFERS CHAINS, CACHE BUFFER HANDLES
- Global Resource Directory managed by the Global Cache Service (GCS)
  (the old DLM in pre-9i releases)
- Cache coherency: the synchronization of data in multiple caches so that
  reading a memory location by way of any cache returns the most recent data
  written to that location by way of any other cache. Sometimes called cache
  consistency.

DSI408: Real Application Clusters Internals I-18

Levels of Synchronization: Row Level

[Diagram: within one instance, foreground processes fg1 and fg2 update row1
and row2 in database blocks 100 and 101; the Global Cache (iDLM) is shown but
not involved, because row-level locking is handled in the database itself.]

Enqueues are local locks that serialize access to various resources. This
wait event indicates a wait for a lock that is held by another session (or
sessions) in an incompatible mode to the requested mode. See
<Note:29787.1> (about V$LOCK) for details of which lock modes are
compatible with which. Enqueues are usually represented in the format
"TYPE-ID1-ID2", where:
- "TYPE" is a 2-character text string
- "ID1" is a 4-byte hexadecimal number
- "ID2" is a 4-byte hexadecimal number
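The "TYPE-ID1-ID2" representation can be decoded mechanically. The sketch below is our own illustration (the helper name and the sample TX enqueue values are made up, not taken from a real system); Oracle itself exposes these fields through views such as V$LOCK:

```python
def parse_enqueue_name(name):
    """Split an enqueue name like 'TX-00050021-00000154' into its
    2-character type and the two 4-byte IDs (decoded from hex)."""
    lock_type, id1, id2 = name.split("-")
    if len(lock_type) != 2:
        raise ValueError("enqueue type must be a 2-character string")
    return lock_type, int(id1, 16), int(id2, 16)

# Example: a hypothetical TX (transaction) enqueue.
lock_type, id1, id2 = parse_enqueue_name("TX-00050021-00000154")
```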

DSI408: Real Application Clusters Internals I-19

Levels of Synchronization: Local Cache

[Diagram: foreground processes fg1 and fg2 update row1 and row2 through
buffers in the instance's buffer cache (BCache); access to the cached copies
of blocks 100 and 101 is synchronized within the instance.]

DSI408: Real Application Clusters Internals I-20

Levels of Synchronization: Global Cache

[Diagram: two instances, each with its own buffer cache (BCache), update row1
and row2 in blocks 100 and 101; the updates are coordinated through the
Global Cache (iDLM) and its Global Resource Directory.]

Global resources: inter-instance synchronization mechanisms that provide
cache coherency for Real Application Clusters. The term can refer to both
Global Cache Service (GCS) resources and Global Enqueue Service (GES)
resources.

DSI408: Real Application Clusters Internals I-21

We need a cache

- Serialization is the easiest method to manage concurrency, but conversely
  it costs in terms of system throughput.
- Evolutions of Oracle minimize the set of tasks that are serialized.
- Sequencing operations guarantees consistency of data.
- But: it minimizes the level of concurrency of the system.
- And: the time to complete a sequence of operations depends on the slowest
  element: the disks.

[Diagram: several foreground processes (fg) serialize on access to the
database blocks.]

Given a set of tasks [T1, T2, ..., Tn] that arrive at times [t1 < t2 < ... < tn],
suppose that the system has enough processing units to allow the maximum
potential level of parallelism for these tasks. You can approach the problem
of running all the tasks in minimal time (maximum throughput) in at least two
modes:
1) Execute the tasks sequentially, as they arrive; the last to arrive waits
   until the previous ones have terminated. This does not use the potential
   parallelism of your machine.
   Good: easy to implement. Bad: performance.
2) Implement a lock/wait infrastructure and allow all the tasks to run freely
   until they are blocked by some other task(s). The effective degree of
   parallelism is at its maximum when the set of synchronization points is
   minimal.

DSI408: Real Application Clusters Internals I-22
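The second mode (a lock/wait infrastructure where tasks run freely and block only at synchronization points) can be sketched with ordinary threads. This is a generic illustration, not Oracle code: the lock serializes only the one shared update, and everything else in each task may proceed in parallel:

```python
import threading

def run_parallel_with_locks(num_tasks, increments):
    """Run tasks concurrently; a lock serializes only the shared update,
    which is the single synchronization point of each task."""
    counter = 0
    lock = threading.Lock()

    def task():
        nonlocal counter
        for _ in range(increments):
            with lock:          # block only while touching the shared resource
                counter += 1    # the sole serialized step

    threads = [threading.Thread(target=task) for _ in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

# Four tasks, each incrementing the shared counter 10,000 times -> 40000.
total = run_parallel_with_locks(4, 10_000)
```

The fewer statements sit inside the `with lock:` block, the smaller the serialized fraction and the higher the effective degree of parallelism.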

Coherency

[Diagram: resource 1,0x100 is held in shared (S) mode by two instances; one
foreground selects row1 with a query that started at SCN 900 and another
selects row2 with a query that started at SCN 1010, each against its own
buffer cache copy of block 100; the on-disk block is at SCN 800. The system
reaches a maximum level of concurrency.]

Ex: ALTER SYSTEM DUMP DATAFILE 5 BLOCK 4690;

ALTER SYSTEM DUMP DATAFILE {'filename'}|{filenumber}
  |--- BLOCK MIN {blockno} BLOCK MAX {blockno} ---|
  |--- BLOCK {blockno} ---------------------------|

Note: the block dump reports the buffer cache copy of the block if the block
is CURRENT/dirty in the current instance.
alter session set events 'immediate trace name BUFFER level <RDBA>';

DSI408: Real Application Clusters Internals I-23

Coherency: Costs of Locks

DSI408: Real Application Clusters Internals I-24

Fixed*/Releasable 1:M Lock Model (Static)

(*) Starting with 9i, the fixed locking mode was removed.

[Diagram: in the 1:M model, one lock element in the Global Cache (iDLM)
covers many database blocks (blocks 100-104), regardless of which of them an
instance actually caches.]

GC_FILES_TO_LOCKS = 1=100:2=0:3=1000:4-5=0EACH
GC_FILES_TO_LOCKS = {file_list=lock_count[!blocks][EACH][:...]}

PCM lock names:
- Type is always BL (because PCM locks are buffer locks).
- ID1 is the block class (described in Classes of Blocks).
- ID2: for fixed locks, ID2 is the lock element (LE) index number obtained by
  hashing the block address (see the GV$LOCK_ELEMENT/GV$GC_ELEMENT fixed
  view); for releasable locks, ID2 is the database address of the block.

Non-PCM locks:
CF      Controlfile Transaction         PF      Password File
CI      Cross-Instance Call Invocation  PR      Process Startup
DF      Datafile                        PS      Parallel Slave Synchronization
DL      Direct Loader Index Creation    Q[A-Z]  Row Cache
DM      Database Mount                  RT      Redo Thread
DX      Distributed Recovery            SC      System Commit Number
FS      File Set                        SM      SMON
IN      Instance Number                 SN      Sequence Number
IR      Instance Recovery               SQ      Sequence Number Enqueue
IS      Instance State                  ST      Space Management Transaction
IV      Library Cache Invalidation      SV      Sequence Number Value
KK      Redo Log Kick                   TA      Transaction Recovery
L[A-P]  Library Cache Lock              TT      Temporary Table
MM      Mount Definition                TX      Transaction
MR      Media Recovery
N[A-Z]  Library Cache Pin
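The GC_FILES_TO_LOCKS syntax above can be expanded mechanically. The parser below is our own sketch, not Oracle code, and it ignores the optional "!blocks" grouping factor; it maps each file number to its lock count and whether the EACH modifier applies:

```python
def parse_gc_files_to_locks(value):
    """Expand 'file_list=lock_count[EACH][:...]' into {file#: (lock_count, each)}.
    The optional '!blocks' grouping factor is not handled in this sketch."""
    result = {}
    for clause in value.split(":"):
        files, count = clause.split("=")
        each = count.endswith("EACH")
        n = int(count[:-4] if each else count)
        for part in files.split(","):
            if "-" in part:              # a file range such as 4-5
                lo, hi = map(int, part.split("-"))
                nums = range(lo, hi + 1)
            else:
                nums = [int(part)]
            for f in nums:
                result[f] = (n, each)
    return result

# File 1 gets 100 locks, files 2, 4, and 5 get 0 (releasable) locks,
# file 3 gets 1000 locks; EACH applies per file in the 4-5 range.
locks = parse_gc_files_to_locks("1=100:2=0:3=1000:4-5=0EACH")
```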

DSI408: Real Application Clusters Internals I-25

False Pinging

[Diagram: lock element LE 23 in the Global Cache (iDLM) covers several data
block addresses; the instance's buffer cache (BCache) holds dirty buffers for
several of the blocks it covers while fg1 is updating.]

When another instance needs access to dba 100, the owning instance must ping
(write out) all the dirty blocks that are covered by the same LE.
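The false-pinging effect follows directly from the 1:M mapping. The modulo hash below is purely illustrative (the real LE hash is internal to Oracle), but it shows the mechanism: with fewer lock elements than blocks, unrelated dirty blocks land on the same LE, so a request for one block forces writes of the others:

```python
def lock_element(dba, num_lock_elements):
    """Illustrative 1:M mapping: many block addresses hash to one LE."""
    return dba % num_lock_elements

def blocks_to_ping(dirty_dbas, requested_dba, num_lock_elements):
    """All dirty blocks covered by the requested block's LE must be written
    (pinged), even if the requester never touches them."""
    le = lock_element(requested_dba, num_lock_elements)
    return [d for d in dirty_dbas if lock_element(d, num_lock_elements) == le]

# Blocks 100, 103, 105, and 123 are dirty; another instance asks for 100.
# With 23 lock elements, dba 123 shares an LE with dba 100 and is falsely
# pinged along with it.
victims = blocks_to_ping([100, 103, 105, 123], 100, 23)
```

The 1:1 releasable model on the next slide removes this effect by giving every cached block its own lock element.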

DSI408: Real Application Clusters Internals I-26

Releasable 1:1 Lock Model (Dynamic)

[Diagram: in the 1:1 model, each cached block has its own lock element
(LE 100, LE 105, ...); the instance's buffer cache holds dba 101, 103, and
105 while fg1 is updating, and only the blocks actually cached are covered.]

break on GC_ELEMENT_NAME
select inst_id, GC_ELEMENT_NAME, CLASS, MODE_HELD
from gv$gc_element where GC_ELEMENT_NAME > 20970000
order by GC_ELEMENT_NAME;

The output lists one row per instance for each element, for example
GC_ELEMENT_NAME values 20971522, 20971523, 20971913, 20971914, 20976209, and
20976210, with their CLASS and MODE_HELD. A GC_ELEMENT_NAME, split as a DBA
(hex), yields file# and block#.

DSI408: Real Application Clusters Internals I-27
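Splitting a GC_ELEMENT_NAME as a data block address follows the standard relative DBA layout (top 10 bits file number, low 22 bits block number); the same decoding is exposed by DBMS_UTILITY.DATA_BLOCK_ADDRESS_FILE and DATA_BLOCK_ADDRESS_BLOCK. A quick sketch:

```python
def split_dba(dba):
    """Split a relative data block address into (file#, block#):
    the top 10 bits are the file number, the low 22 bits the block number."""
    return dba >> 22, dba & 0x3FFFFF

# 20971522 = 0x1400002 -> file 5, block 2
file_no, block_no = split_dba(20971522)
```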

Scalability
Scaleup
Scaleup is the capability to provide continued increases in
throughput in the presence of limited increases in processing
capability while keeping time constant:
Scaleup = (volume parallel) / (volume original)

Speedup
Speedup is the capability to provide continued increases in speed in
the presence of limited increases in processing capability, while
keeping the task constant:
Speedup = (time original) / (time parallel)
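The two definitions above can be computed directly; the sample figures in the comments are illustrative, not measurements:

```python
def scaleup(volume_parallel, volume_original):
    """Scaleup = (volume parallel) / (volume original), time held constant."""
    return volume_parallel / volume_original

def speedup(time_original, time_parallel):
    """Speedup = (time original) / (time parallel), task held constant."""
    return time_original / time_parallel

# A task that took 90 s on one node and 30 s on three nodes shows a
# speedup of 3.0 (linear); tripling the processed volume in the same
# time shows a scaleup of 3.0.
```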


DSI408: Real Application Clusters Internals I-28

RAC Costs: Global Resource Directory

- Single instance: synchronization of concurrent tasks and access to shared
  resources.
- The Global Resource Directory (GRD) records information about how
  resources are used within a cluster database. The Global Cache Service
  (GCS) and Global Enqueue Service (GES) manage the information in this
  directory. Each instance maintains part of the global resource directory
  in its System Global Area (SGA).

RAC Costs: Global Resource Directory


In single-instance environments, locking coordinates access to a common resource, such as
a row in a table. Locking prevents two processes from changing the same resource (or row)
at the same time.
In RAC environments, internode synchronization is critical because it maintains proper
coordination between processes on different nodes, preventing them from changing the
same resource at the same time. Internode synchronization guarantees that each instance
sees the most recent version of a block in its buffer cache.

DSI408: Real Application Clusters Internals I-29

RAC Costs: Global Resource Directory (continued)


Resource coordination within Real Application Clusters occurs at both an instance level
and at a cluster database level. Instance level resource coordination within Real
Application Clusters is referred to as local resource coordination. Cluster level
coordination is referred to as global resource coordination.
The processes that manage local resource coordination in a cluster database are identical to
the local resource coordination processes in single instance Oracle. This means that row
and block level access, space management, system change number (SCN) creation, and
data dictionary cache and library cache management are the same in Real Application
Clusters as in single instance Oracle.
If the resource is modified by more than one instance, then RAC performs further
synchronization on a global level to permit shared access to this block across the cluster.
"Synchronization in this case requires internode messaging as well as the preparation of
consistent read versions of the block and the transmission of copies of the block between
memory caches within the cluster database." (See Oracle9i Real Application Clusters
Concepts Release 2 (9.2), Part Number A96597-01, Chapter 5, Real Application Clusters
Resource Coordination.)
Note: Global Cache Service (GCS) and Global Enqueue Service (GES) do not interfere
with row-level locking and vice versa. Row-level locking is a transaction feature.

DSI408: Real Application Clusters Internals I-30

RAC Costs: Cache Coherency

Cache coherency is the technique of keeping multiple copies of an object
consistent between different Oracle instances.

RAC Costs: Cache Coherency


Maintaining cache coherency is an important part of a cluster. Cache coherency is the
technique of keeping multiple copies of an object consistent between different Oracle
instances (or disjoint caches) on different nodes.
Global cache management ensures that access to a master copy of a data block in an SGA
is coordinated with the copy of the block in other SGAs.
Therefore, the most recent copy of a block in all SGAs contains all changes that are made
to that block by any instance in the system, regardless of whether those changes have been
committed on the transaction level. Full redo protection of the block changes is maintained.

DSI408: Real Application Clusters Internals I-31

RAC Costs: Cache Coherency

[Diagram: three nodes, each running an instance (A, B, and C) with its own
SGA; the GES/GCS components of the three instances communicate with each
other to keep the caches coherent.]

RAC Costs: Cache Coherency (continued)


The cost (or overhead) of cache coherency is the need, before any access to a
specific shared resource, to first check with the other instances whether this
particular access is permitted. The algorithms optimize the need to coordinate
on each and every access, but some overhead is incurred.
The GCS tracks the locations, modes, and roles of data blocks. The GCS therefore also
manages the access privileges of various instances in relation to resources. Oracle uses the
GCS for cache coherency when the current version of a data block is in one instance's
buffer cache and another instance requests that block for modification. If an instance reads
a block in exclusive mode, then in subsequent operations multiple transactions within the
instance can share access to a set of data blocks without using the GCS. This is true,
however, only if the block is not transferred out of the local cache. If the block is
transferred out of the local cache, then the GCS updates the Global Resource
Directory to show that the resource has a global role; whether the resource's
mode converts from exclusive to another mode depends on how other instances
use the resource.

DSI408: Real Application Clusters Internals I-32

RAC Terminology

- Cache coherency
- Resources and locks
- Global and local
- GCS and GES, or PCM and non-PCM
- GRM or DLM
- Node, instance, cluster, and process

RAC Terminology
Cache coherency means that the contents of the caches in different nodes are in a welldefined state with respect to each other. Cache coherency identifies the most up-to-date
copy of a resource, which is also called the master copy. In case of node failure, no vital
information is lost (such as committed transaction state), and atomicity is maintained. This
requires additional logging or copying of data but is not part of the locking system.
A resource is an identifiable entity; that is, it has a name or reference. The entity referred
to is usually a memory region, a disk file, or an abstract entity; the name of the resource is
the resource. A resource can be owned or locked in various states, such as exclusive or
shared.
By definition, any shared resource is lockable. If it is not shared, there is no access
conflict. If it is shared, access conflicts must be resolved, typically with a lock. The terms
lock and resource, although they refer to entirely separate objects, are therefore
(unfortunately) used interchangeably.
A global resource is one that is visible and used throughout the cluster. A local resource
is used by only one instance. It may still have locks to control access by the multiple
processes of the instance, but there is no access to it from outside the instance.
DSI408: Real Application Clusters Internals I-33

RAC Terminology (continued)


Data buffer cache blocks are the most obvious and most heavily used global resource.
There are other data item resources that are global in the cluster, such as transaction
enqueues and database data structures. The data buffer cache blocks are handled by the
Global Cache Service (GCS), also called Parallel Cache Management (PCM). The
non-data-block resources are handled by the Global Enqueue Service (GES), also called
non-Parallel Cache Management (non-PCM).
The Global Resource Manager (GRM) keeps the lock information valid and correct
across the cluster.
From the module skgxn.h:

Node: An individual computer with one or more CPUs, some memory, and access
to disk storage (generally capable of running an instance of OPS).

Cluster: A collection of loosely coupled nodes that support a parallel
Oracle database.

Cluster Membership: The set of active nodes in a cluster. These are the
nodes that are "alive" and have access to shared resources (that is, shared
disk). Nodes that are not in the current cluster membership must not have
access to shared resources.

Instance: Distributed services typically are made up of several identical
components, one on each node of a cluster. One of these components will be
called an "instance." For example, an OPS database will have an Oracle
instance running on each node.

Process: For the purposes of this interface, a process is a unit of
execution. On some operating systems, this may be equivalent to an OS
process. On others, it may be equivalent to an OS thread. A process is
considered terminated when it can no longer execute, pending OS requests are
completed/canceled, and any process-local resources are released.
Note that the older OPS terms are used in the code, but the terms are also valid for RAC.

DSI408: Real Application Clusters Internals I-34

Terminology Translations

- Terminology depends on the speaker:
  - Product managers to sales or marketing
  - Support, technical teams, development
- Terminology depends on the version:
  - Older terms tend to stay in code
  - Variable names and prefixes reflect the older name
  - Newer names reflect newer application or functionality

Terminology Translations
RAC = OPS. OPS is the older term. See the History slide (#19) in this lesson.
Row Cache = Dictionary Cache. Row Cache is the older term. It is the SGA area to cache
database dictionary information. It is a global resource.
Distributed Lock Manager (DLM) = Global Resource Manager (GRM). DLM is the older
term; GRM has slightly more functionality. The terms are used for any locking system that
can handle several processes, typically (but not necessarily) on several nodes.
DLM = IDLM = UDLM. The DLM term is a very general term, but also refers to the
external operating-system-supplied DLM used by Oracle7. IDLM refers to the Integrated
DLM introduced in Oracle8. UDLM is the Universal DLM, that is, the reference
implementation of a DLM made on the Solaris platform. It is often called by its code
reference skgxn-v2.
Some of the RAC processes have retained their old names but are described with a
different purpose:
LMON: Global Enqueue Service Monitor, previously Lock Monitor
LMD: Global Enqueue Service Daemon, previously Lock Monitor Daemon
LMS: Global Cache Service Processes, previously Lock Manager Services
DSI408: Real Application Clusters Internals I-35

Terminology Translations (continued)


Terminology in This Course
This course reflects the mixed usage of similar terms and aligns more with the terminology
of code than with the externalized names.

DSI408: Real Application Clusters Internals I-36

Programmer Terminology

- Client or user: calling code
- Callback: routine to execute when the called program has new information

Programmer Terminology
Inside the code, comments often refer to the programmer's point of view.
Client and user are used interchangeably; both refer to the calling code.
Client code can register interest in a service by giving a pointer to a data
structure that is to be updated, or a routine that is to be called, when the
service has completed the required action.

DSI408: Real Application Clusters Internals I-37

History

- Real Application Clusters (RAC) is the current product.
- RAC has some similarity to Oracle Parallel Server (OPS):
  - Has the same end-user capability: a clustered database
  - Scales better because of better internal handling of cache coherency
  - Has some internal, fundamental changes in the global cache

History
Oracle Parallel Server (OPS) historically had a bad reputation; it was not scalable. Most
applications ran slower on an OPS system than on a single instance. There was a need to
carefully determine which instance performed DML on which tables or (more accurately)
on which blocks. With RAC this need has been eliminated, resulting in true scalability.
Although RAC borrows much code from OPS, the official policy is not to mention that
RAC is an evolved version of OPS. Oracle does not want the bad reputation of OPS to
adversely affect the reputation of RAC in the market. Internally (in the code), the OPS
heritage in RAC is evident.

DSI408: Real Application Clusters Internals I-38

History Overview

- OPS 6 was not in production and was available only on limited platforms.
- OPS 7 was platform generic, relying on an external DLM.
- OPS 8 had the Integrated Distributed Lock Manager.
- OPS 8i had Cache Fusion Stage 1.
- RAC 9i has Cache Fusion Stage 2.
- The database layout for different versions has not changed.

History Overview
Some components have undergone changes in scope and name. The system that ensures
that access to a block is coherent is the Global Cache Manager in Oracle9i. In Oracle8i and
Oracle8, this was the Integrated Distributed Lock Manager. Earlier it was an external
operating-system-supplied service that the Oracle processes called. The Cluster Group
Service of Oracle9i and Oracle8i was the Group Membership Services module in Oracle8
and (before that) part of the external Distributed Lock Manager.
Although there have been many changes to the architecture in the instance, the database
structure has changed only marginally. Separate redo threads and undo spaces are still
used.

DSI408: Real Application Clusters Internals I-39

Internalizing Components

[Diagram: In Oracle7, the RDBMS calls a DLM API (simulated callbacks,
enqueue translation); the DLM, CM, and operating system are external, and no
local state is kept in the instance. In Oracle8, the IDLM (callbacks,
enqueues) is inside the RDBMS and keeps local state in SGA memory; only the
CM and operating system remain external.]

Internalizing Components
The development of RAC has internalized more operating system components for each
version. As an example, the diagram on the slide shows the internalization of the
Distributed Lock Manager (DLM) in the development of Oracle7 to Oracle8. Instead of
calling the external operating system whenever any lock status needed checking by the
DLM API module, the IDLM module in the Oracle server only needs to examine its SGA.
The RDBMS routines did not in principle need to reflect the change.
The earlier versions had the DLM external, which limited the functionality (lowest
common denominator effect) that the Oracle server could rely on, and the need to pass
data to external services. Data transfer used pipes or network communication to the
external processes; control for I/O completion used Asynchronous Trap (AST)
mechanisms, polling mechanisms, or blocked waits. Internal communication inside the
Oracle server (even between the various background processes) can use the common
SGA memory area that includes latches and enqueues.
This is merely illustrative and is not an accurate summary of the changes made.
The Oracle8 to Oracle9i development similarly internalized the GMS interface (that is, the
Node Monitor (NM) functionality), relying on only the Cluster Manager (CM) interface
routines.
DSI408: Real Application Clusters Internals I-40

Oracle7

The differences between a non-OPS server and an OPS-enabled Oracle server
were few:
- Database structure changes
  - Separate redo per instance
  - Separate undo per instance
- Addition of the LCK process in the instance

Oracle7
OPS in Oracle7 consisted of the database structural changes for cluster operation (as in all
versions) and the addition of the LCK process that communicated with the external DLM.
The instances not only coordinated global cache coherency through the DLM but also used
the DLM as the communication channel for registering into the OPS cluster.
The method for sending the SCN or other messages was platform specific.
External DLM
The external DLM usage had the following characteristics:
- It had to be running before any instance started.
- Resources and locks had to be adequately configured.
- Death of the DLM on a node implied death of all its clients on the node.
- OPS/DLM diagnostics had to have port-specific lock dumps.
- Internode parallel query code had to be port specific.

DSI408: Real Application Clusters Internals I-41

Oracle8

First stage in internalizing cluster communications:
- Oracle's own lock manager in the Oracle server
- New communication path for clusterwide messages
- New background processes LMD and LMON
- Cluster state communication through the external Group Membership Service
  (GMS)

Oracle8
The internal DLM meant that resource allocation was inside the Oracle server. Diagnostic
lock dumps no longer needed to be port specific. The Oracle server, version 8 (and later),
started communicating with the cluster services of the operating system. The interface
consisted of the GMS, an Oracle-specified API. The GMS functionality included:
- Supplying each instance with the current set of registered members,
  clusterwide
- Notifying other members when a member joins or leaves
- Automatically deregistering dead processes/instances from their groups
- Interfacing with the node monitor for cluster events

DSI408: Real Application Clusters Internals I-42

Oracle8i

- Cache Fusion Stage 1
  - Read/write blocks sent via the interconnect and not through the disk
  - CR server process BSP
- More cluster communication functions as part of Oracle server code
  - GMS functionality split into Cluster Group Services (CGS) and Node
    Monitor (NM) in skgxn-v2
  - Lock Manager structures in the shared pool

Oracle8i
The Cache Fusion Stage 1 satisfied some types of block requests across the cluster
communication paths (rather than via disk) and made use of the messaging services.
The Oracle8 GMS has been split into OSD and Oracle kernel components. Node monitor
OSD skgxn is extended from monitoring a single client per node to arbitrarily named
process groups. The rest of the GMS functionality is moved into Oracle as CGS. A
distributed name service is added to CGS.
LMON executes most of the CGS functionality:
Joins the skgxn process group representing the instances of the specified group
Connects to other members and performs synchronization to ensure that all of them
have the same view of group membership

DSI408: Real Application Clusters Internals I-43

Oracle9i

- Cache Fusion Stage 2
  - Write/write blocks handled concurrently
  - GCS and GES instead of IDLM
- Enhanced instance availability
  - Instance Member Reconfiguration (IMR)
  - New recovery features
- Enhanced messaging for inter-instance communication

Oracle9i
The remainder of this course is based on Oracle9i.

DSI408: Real Application Clusters Internals I-44

Summary

In this lesson, you should have learned how to:
- Determine whether to use RAC in application design
- Describe RAC improvements over its predecessor

DSI408: Real Application Clusters Internals I-45

Introduction to RAC Internals


Objectives

After completing this lesson, you should be able to do the following:
- Outline the RAC architecture with internal references
- Relate the RAC-related modules to the Oracle code stack

DSI408: Real Application Clusters Internals I-47

Simple RAC Diagram

[Diagram: three nodes, each running an instance (SGA, processes), connected
by a high-speed interconnect and sharing a cluster disk/file system.]

Simple RAC Diagram


The node contains more than just the instance. It includes the operating system, network
stacks for various protocols, disk software, and a number of Oracle noninstance processes:
Listener, Intelligent Agent, and the foreground/shadow server processes.
The instance has its usual complement of background processes (more so with the RAC
configuration). They connect to the disk system, the network, and the high-speed
interconnect.
The cluster disk or file system may be mirrored, RAID-based, SAN/Fiber-based, or JBOD
(just a bunch of disks). If it is a clusterwide file system, it can contain the Oracle home
code. The clusterwide disks can be host-managed (that is, the controller is part of the node)
but are serviced to the cluster and equivalent to clusterwide disks. Local disks are of little
interest to RAC but are used for noncommon files where the common disks are raw disks.
Note: There are some issues with node-specific files of the Intelligent Agent or password
file orapw when using a cluster file system. The solution varies with the platform and the
CFS that are used.

DSI408: Real Application Clusters Internals I-48

One RAC Instance

- SGA contains (but is not limited to): library, row, and buffer caches;
  Global Resource Directory
- RAC-related background processes: LMON, LMD, LMS, LCK, DIAG
- Other background processes: DBW0, PMON, LGWR, SMON, and so on; PQ, jobs,
  dispatchers and servers
- Foreground processes not shown; the CM runs on the node outside the
  instance

[Diagram: a node hosting one instance with its SGA and background processes,
alongside the Cluster Manager (CM).]

One RAC Instance


This is the traditional view of an instance and its background processes. All processes are,
however, the same program (oracle.exe or oracle), just instantiated with different
startup parameters (see source opirip and WebIV Note:33174.1). On Windows, this is
more apparent; there is clearly only one Oracle process showing in the Task Manager, but
with a number of threads.
All caches in the SGA are either global and must be coherent across all instances, or they
are local. The library, row (also called dictionary), and buffer caches are global. The large
and Java pool buffers are local. For RAC, the Global Resource Directory is global in itself
and also used to control the coherency.
The LMON process communicates with its partner process on the remote nodes. Other
processes may have message exchanges with peer processes on the other nodes (for
example, PQ). The LMS and LMD processes, for example, may directly receive requests
from remote processes.
The Cluster Monitor (CM) system communicates with the other CMs on other nodes and is
not part of the Oracle RAC instance. But it is a necessary component.

DSI408: Real Application Clusters Internals I-49

Internal RAC Instance

- kqlm: library cache (fusion)
- kqr: dictionary/row cache
- kcl: buffer cache
- ksi: instance locks
- kjb: Global Cache Service
- kju: Global Enqueue Service
- CGS (kjxg): Cluster Group Services
- NM (skgxn v2): Node Monitor
- IPC (skgxp): interprocess communication

[Diagram: the instance code stack: kql/kqr/kqlm/ksi/kcl on top of the
GCS (kjb) and GES (kju), which sit on CGS (kjxg) and NM (skgxn v2), with
IPC (skgxp) alongside; the NM layer talks to the external CM.]

Internal RAC Instance


This is an internal view of some of the instance code stack and the RAC-relevant sections
and modules.
The NM layer is the communication layer to the CM. The IPC services facilitate other
process-to-process communication between different instances.
The CGS maintains the state of the RAC cluster, knowing which instances are in the
cluster and which are not. Contrast this with node availability.
The GRD is the data structure that stores Global Enqueue and Global Cache objects; it is
aware of every clusterwide resource. Resources are typically a buffer element, like a data
buffer, or a data file, but can also be abstract entities, such as an enqueue or NM resource.
The three buffer caches are used by the various user foreground processes by calling
handling routines (kqlm, kqr, kcl) for allocation, deallocation, and locking. The
handling routines maintain coherency by using kcl. The data buffer cache is the sole user
of the GCS.
Note: Other skg-interfaces, such as skgfr (disk I/O), are not shown.

DSI408: Real Application Clusters Internals I-50

Oracle Code Stack

OCI   Oracle Call Interface
UPI   User Program Interface
OPI   Oracle Program Interface
KK    Kernel Compilation Layer
KX    Kernel Execution Layer
K2    Kernel Distributed Execution Layer
NPI   Network Program Interface
KZ    Kernel Security Layer
KQ    Kernel Query Layer
RPI   Recursive Program Interface
KA    Kernel Access Layer
KD    Kernel Data Layer
KT    Kernel Transaction Layer
KC    Kernel Cache Layer
KS    Kernel Services Layer
KJ    Kernel Lock Management Layer
KG    Kernel Generic Layer
S     Operating System Dependencies

Oracle Code Stack


The first few characters of the routine and structure names indicate which layer in the code
stack they come from.
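The prefix-to-layer convention can be sketched as a longest-prefix lookup. The mapping data comes from the table above; the function itself is our own illustration, not an Oracle utility:

```python
# Prefix-to-layer table, transcribed from the code stack slide.
LAYERS = {
    "oci": "Oracle Call Interface",
    "upi": "User Program Interface",
    "opi": "Oracle Program Interface",
    "npi": "Network Program Interface",
    "rpi": "Recursive Program Interface",
    "kk": "Kernel Compilation Layer",
    "kx": "Kernel Execution Layer",
    "k2": "Kernel Distributed Execution Layer",
    "kz": "Kernel Security Layer",
    "kq": "Kernel Query Layer",
    "ka": "Kernel Access Layer",
    "kd": "Kernel Data Layer",
    "kt": "Kernel Transaction Layer",
    "kc": "Kernel Cache Layer",
    "ks": "Kernel Services Layer",
    "kj": "Kernel Lock Management Layer",
    "kg": "Kernel Generic Layer",
    "s":  "Operating System Dependencies",
}

def layer_of(symbol):
    """Map a routine/structure name to its code-stack layer by trying the
    longest matching prefix first (3, then 2, then 1 characters)."""
    for length in (3, 2, 1):
        layer = LAYERS.get(symbol[:length].lower())
        if layer:
            return layer
    return "unknown"

# kcl... routines belong to the Kernel Cache Layer; skgxp to the OSD layer.
```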

DSI408: Real Application Clusters Internals I-51

RAC Component List

This course examines the following RAC component list:
- Cluster Layer and Cluster Manager (CM)
- Node Monitor (NM)
- Cluster Group Services (CGS)
- Global Cache Service and Global Enqueue Service (GCS and GES)
- Interprocess Communication (IPC)
- Cache Fusion in the GCS
- Cache Fusion Recovery

RAC Component List


This course examines the components listed in the slide. This is the stack, with the most
fundamental module listed first (with some exceptions).

DSI408: Real Application Clusters Internals I-52

Module Relation View

ORACLE

DLM (GRD)

GCS

GES

CGS/IMR

DRM/FR

IPC

KSXP

NM

SKGXN

SKGXP

2-53


Module Relation View


GCS: Global Cache Service, or PCM locks
GES: Global Enqueue Service, or non-PCM locks
DRM/FR: Dynamic Resource Mastering/Fast Reconfiguration. Only partially activated in
a standard Oracle9i Release 2 installation.
IMR: Instance Membership Recovery. LMON handles instance death and split brain (two
networks).
KSXP: Multiplexing service (multithreaded layer). Allows DLM to do a lazy send;
ksxp informs client after send is completed.
NM: Node Monitor. Instances joining and leaving the cluster
IPC: Interprocess Communication. There is usually a choice of underlying protocols to
use, depending on the platform and hardware. The default is UDP (lightweight; consumes
no permanent resources or connections); alternatives include memory-mapped I/O (an
enhanced IPC interface used by Cache Fusion) and port-based communication.
CGS: Cluster Group Service. Handles synchronizing the bitmap. Also a name service for
publishing and querying configuration data. CGS in Oracle9i is changed from earlier
versions to speed up reconfiguration.
DSI408: Real Application Clusters Internals I-53

Alternate Module Relation View

Client
code
PQ
kcl

ksq

KSXP SKGXP

ksi

DLM
CGS

2-54


DSI408: Real Application Clusters Internals I-54

Module, Code Stack, Process

2-55

The same code is present in all foreground and background processes.
Modules may be constrained to run in a specific process.


Module, Code Stack, Process


Although the running Oracle server consists of several processes (both foreground and
background), remember that this is the same program that runs in all processes. Processes
are limited to performing a set of functions, and thus some code is active in only some
processes. Thus there is no LMON program module, but some routines in the KJB source
modules have a comment stating that the function runs only in the LMON process. This can
be confusing when examining code in which one process calls another.
Cross process calls require a message or posting, and execution may have to wait until the
called process starts executing; in other words, a context switch must occur.
On the Windows platform, there is only one process. The various Oracle server processes
are implemented as threads inside this program.

DSI408: Real Application Clusters Internals I-55

Operating System Dependencies


(OSD)

Code that must be separate for each platform is


typically collected in OSD modules.
Generic version: Runs on development system
Reference version: Classic version ported to all
platforms
Platform version: Optimized and specialized;
several versions may exist.

2-56

OSD code is bracketed with #ifdef #endif in


some modules.


Operating System Dependencies (OSD)


This applies to many other Oracle server products or functions but is much more visible
with RAC.
If the platform dependency is small, it may be bracketed by the #ifdef #endif
construction; otherwise, a common routine is called in an OSD module, which is
appropriately rewritten for each platform. Such modules are generic. For example, refer to
the skgxnr.c module.
For some OSD modules, there may be more than one version. For example, the IPC
implementation has a number of protocols to be used. One OSD module with the same
interface is written for each protocol. Only one module is linked to the Oracle server, thus
deciding the IPC protocol to be used.
Where several implementations are possible, a reference module is constructed. This is
runnable on all platforms and is the lowest common denominator. It proves functionality
and is used to verify the correct functionality of the other, specialized versions of the
module. However, it may not actually be used in production.

DSI408: Real Application Clusters Internals I-56

Platform-Specific RAC
Higher layers
SQL, Transaction, Data

Cache KC*
Service KS*

GES and GCS KJ*


Generic Layer KG*
(common functions)
Platform Specific Code
OSD S*

These are kernel


routines, so the names
start with K.
Service routines start
with KS.
OSD routines start with
S or SS.
OSD code is written by
the porting groups.

Operating System
Routines

2-57


Platform-Specific RAC
Many RAC problems are platform specific. The Operating System Dependency (OSD)
layer therefore must be examined for the platform concerned. The subdirectory is called
sosd or osds.
This cannot be examined in TAO with cscope; you need the vobs access.
OSD code is partially available at
/export/home/ssupport/920/rdbms/src/server/osds.

DSI408: Real Application Clusters Internals I-57

OSD Module: Example

SKGXP
2

U
D
P

T
C
P

H
M
P

SKGXP
module,
3 alternative
versions

3
5
4

OS routines

2-58

skgxp.h
Generic interface
skgxp.c
Reference
implementation
sskgxpu.c
UDP implementation,
port-specific
sskgxph.c
HMP implementation,
port specific (HP-UX)


OSD Module: Example


A module that needs to call the operating system must be port specific. Calling an I/O
routine may vary in name, arguments, and other particulars between platforms, even
though they give the same functionality.
The skgxp module has an official upward API (1). Internally, there are some common
functions and one way of achieving the necessary communication function of the SKGXP.
The UDP option, for example, performs the required OS-related calls through the OS API
(3) that send, receive, check status, and so on, by using UDP packets. It also possibly has
some code to hide or simulate functions so that the common set (2) is maintained. The
functions are similar for the other protocol options.
The reference implementation is made to compile and work on all platforms, but the whole
module is additionally rewritten by most platform groups. As explained previously, a
platform group makes several versions by using different protocols. This is selected at link
time by using the appropriate library. The HMP module, shown in this example, is only
available on the HP platform.

DSI408: Real Application Clusters Internals I-58

OSD Module: Example (continued)


Dependencies on the OSD Module
For the skgxp module, some OSD variants have additional interfaces callable from
higher modules. The kcl module, for example, can call for a special memory map pointer
for the HMP protocol. Higher levels in the stack have #ifdef #endif bracketed calls
to the extended sskgxph.

DSI408: Real Application Clusters Internals I-59

Summary

In this lesson, you should have learned about the:


RAC architecture outline with internal references
Relationship between the RAC-related modules
and the Oracle code stack

2-60


DSI408: Real Application Clusters Internals I-60

References

Main sources for general RAC information:


RAC Web site
http://rac.us.oracle.com:7778

RAC Pack repository on OFO


http://files.oraclecorp.com/content/AllPublic/
Workspaces/RAC%20Pack-Public/

WebIV
Check folder Server.HA.RAC

2-61


DSI408: Real Application Clusters Internals I-61

Cluster Layer

Cluster Monitor


Objectives

After completing this lesson, you should be able to:


Describe the generic Cluster Manager (CM)
functionality
Outline the interaction between CM and RAC
cluster layers

3-63


DSI408: Real Application Clusters Internals I-63

RAC and Cluster Software


Node
Instance

Caches

ksi/ksq/kcl
GRD
CGS
NM

I
P
C

Other
nodes
(not
shown)

CM

3-64


Cluster Layer in RAC


The cluster layer is not part of the RAC instance. The Cluster Manager (CM) is part of the
cluster layer.
It has its own communication path with the peer cluster software on other nodes. It can
determine the status of other nodes in the cluster but does not maintain any consistent view.
Most of the synchronization and consistency is handled in the Node Monitor (NM).

DSI408: Real Application Clusters Internals I-64

Generic CM Functionality:
Distributed Architecture

3-65

Local cluster manager daemons


All daemons make up the Cluster Manager
One daemon elected as master node


Generic CM Functionality: Distributed Architecture


Every node in the cluster must have one or more local CM daemons running. The set of all CM
daemons makes up the Cluster Manager. The CM daemons on all nodes communicate with
one another. The CM daemons on all nodes may elect a master node, which is responsible
for managing cluster state transitions.
Upon communication failure, the remaining CM daemons form a new cluster using an
established protocol and re-elect a master if necessary.
The CM and the RAC cluster are distinct entities acting as physically distinct services. The
CM is responsible for cluster consistency. The CM detects and manages cluster state
transitions. The CM coordinates RAC cluster recovery brought about by cluster state
transitions.

DSI408: Real Application Clusters Internals I-65

Generic CM Functionality:
Cluster State

3-66

State change
Cluster Incarnation Number
Cluster Membership List
IDLM Membership List


Generic CM Functionality: Cluster State


A cluster is said to change state when one or more nodes join or leave the cluster. This
transition is complete when the cluster moves from a previous stable configuration to a
new one. Each stable configuration is identified by a number called the cluster incarnation
number. Every state change in the cluster monotonically increases the cluster incarnation
number.
The set of all nodes in a cluster form a cluster membership list. The set of all nodes in the
cluster where the RAC IDLM is running form an IDLM membership list. Every node in a
cluster is identified by a node-ID provided by the CM, which remains unchanged during
the lifetime of a cluster. The IDLM uses this node-ID to identify and distinguish between
members in the IDLM membership list.

DSI408: Real Application Clusters Internals I-66

Generic CM Functionality:
Node Failure Detection

3-67

Node failure detection


Communication failure detection


Generic CM Functionality: Node Failure Detection


To ensure the integrity of the cluster, the CM must detect node failures. The RAC cluster
may suspect node failure (for example, a communication failure with a node), in which case it may:
Freeze activity and expect a message from the CM to start reconfiguration
Inform the CM of an error condition and await reconfiguration notification after a
new stable cluster state is established
If the CM and RAC cluster are to detect the same communication failures, the CM should
monitor cluster health on the same physical circuit used by the RAC cluster (for example,
on HP, use of HMP). Performance considerations may require the CM and RAC cluster to
use separate virtual circuits.
If the CM and RAC cluster are using separate physical circuits, the CM should be aware of
the RAC cluster's physical circuit and monitor for cluster health via the same circuit. The
CM may provide for physical circuit redundancy for failover and performance.
RAC cluster reconfiguration begins after the cluster has reached a new stable state. The
CM must be able to handle nested state transitions and communicate these state
changes to the RAC cluster.
Nested cluster transitions interrupt any in-process RAC cluster reconfiguration.
DSI408: Real Application Clusters Internals I-67

Cluster Layer and Cluster Manager

Node
Instance

NM

RAC cluster registers the


instance in the CM.
Primarily the LMON
process
Secondarily other I/O-capable processes (DBWR, PQ-slaves, and so on)
Obtains Node-ID from
cluster

CM

3-68


Cluster Layer and Cluster Manager


The Cluster Manager is a vendor- or Oracle-provided facility to communicate between all
the nodes in the cluster about node state. The CM uses a different protocol or channel. It
uses heartbeat and sanity checks to validate node status. The RAC processes communicate
directly with each other, but the CM is not the communication channel.
CM is used to monitor the node health, detect the failure of a node, and manage the node
membership in the cluster.
The CM handles nodes, not instances.
Registration and I/O Stop
To cope with the various failure scenarios, such as process termination and broken
communication, several RAC processes use the SKGXN to register in the CM. This is
described in more detail in the lesson titled Cluster Group Services and Node Monitor.

DSI408: Real Application Clusters Internals I-68

Oracle-Supplied CM

3-69

For the Linux and Windows platforms, the CM


software component is part of the Oracle
distribution.
RAC high availability extension functionality
makes use of the CM.


Oracle-Supplied CM
The Oracle-supplied CM is covered in the Linux platform lesson later in this course.
With the Oracle-supplied CM, the integration with the RAC cluster is somewhat closer,
which blurs the distinction.

DSI408: Real Application Clusters Internals I-69

Summary

In this lesson, you should have learned how to


Describe the generic Cluster Manager (CM)
functionality
Outline the interaction between CM and RAC
cluster layers

3-70


DSI408: Real Application Clusters Internals I-70

Cluster Group Services


and Node Monitor


Objectives

After completing this lesson, you should be able to do


the following:
Describe the functionality of the cluster
configuration components
Node Monitor
Cluster Group Services

4-73

Identify the function of cluster configuration


components in dumps and traces


DSI408: Real Application Clusters Internals I-73

RAC and CGS/GMS and NM


Node
Instance

NM: Node Monitor


CGS: Cluster Group
Services

Caches

ksi/ksq/kcl
GRD
CGS/GMS
NM

I
P
C

Other
nodes
(not
shown)

CM

4-74


RAC and CGS/GMS and NM


In Oracle8, the Group Membership Service (GMS) was used. This has changed, and the
functionality is now in the CGS and NM layer.

DSI408: Real Application Clusters Internals I-74

Node Monitor (NM)

Provides node membership information


Notifies clients of any change in membership
status
Members joining
Members leaving

4-75

Provides query facility for management tools


Source reference skgxn.c


Node Monitor (NM)


The Node Monitor provides the interface to other modules for determining cluster
resources status, that is, node membership. It obtains the status of a cluster resource from
the Cluster Manager for remote nodes and provides the status of cluster resources of the
local node to the Cluster Manager.
skgxn has a passive interface; group events are delivered through constant polling by
clients.
The core of skgxn is the Distributed Process Group facility.

DSI408: Real Application Clusters Internals I-75

RDBMS SKGXN Membership

4-76

The purpose of membership is to determine which


instances are in the RAC cluster.
Nonmembers must not access the common
database files.


Group Membership
A process can register with a group on behalf of an instance that includes multiple
processes. It is important that, when the member deregisters from the group, the other
instance processes do not access the shared cluster resources (such as shared disk) after the
remaining group members have been informed of the deregistration. Otherwise, the
deregistered instance may overwrite changes that are made by the surviving instances.
To protect against this situation, the processes of an instance can share the membership of
the process that is registered with the group. These processes register as slave members,
specifying the member ID of the member that registered as a normal (primary) member.
The deregistration of the primary member must not be propagated to the group's other
primary members until all the associated slave members have also deregistered.

DSI408: Real Application Clusters Internals I-76

NM Groups

A process group is a named clusterwide resource.


Processes throughout the cluster register with the
process group by sending their:
IPC port ID and other bootstrap information
Node name and other information for administrative
tools to use

A process can register either as a primary member


or a slave member.
Primary member: Registers on behalf of the
instance
Slave member: Registers using the primary
member's member ID

4-77


NM Groups
On registration, a process provides:
Private member data that can be retrieved only by other members and that consists of
IPC port ID and other bootstrap information
Public member data that can be retrieved by any skgxn client and that consists of
node name and other information for administrative tools to use
Primary members should ensure that all slaves are terminated on deregistering from the
group. Failure to do so is a bug or malfunction. LMON is the primary member of an
instance.
Slave members are all I/O-capable clients.

DSI408: Real Application Clusters Internals I-77

NM Internals

NM interface (skgxn.v2) is the OSD interface for


Generic Node/Process Monitor.
skgxncin: Defines an OSD context and returns a
handle
skgxnreg: Registers with process group as
primary member (LMON)
skgxnsrg: Registers with process group as slave
member
skgxnpstat: Polls/waits for process group status

4-78

skgxn is a passive interface; group events are


delivered through constant polling by clients.


NM Internals
Source Notes
The basic concept is Process Groups. This implementation relies on the UNIX Distributed
Lock Manager (UDLM) architecture, the same as the first version of the DLM that was
external.
skgxnpstat receives group membership changes. The client must call it to get group
changes (passive). The interface itself does not call back; the caller must check the state bit
in the context.
The process must call skgxnpstat to receive any state event changes. An example is
skgxpwait in IPC to receive an event such as I/O completion.
These routines are normally part of a daemon loop (LMON).

DSI408: Real Application Clusters Internals I-78

Node Membership

Each node index is represented by a bit.


0=down, 1=up
When nodes join or depart, the bitmap is rebuilt
and communicated to all members.
The bitmap is stored globally in the cluster with
a resource name of type MM or MN.
skgxnmap
1

4-79


Node Membership
The bitmap is stored globally in the cluster by using the UDLM as a global repository to
store global information, and uses the global notification mechanism of the UDLM. The
global repository stores the bitmap.
The UDLM reserves a storage space for each resource in a Resource Value Block (RVB).
That space is limited to 16 bytes. Multiple resources and RVBs can be used for large
clusters. These are stored in persistent resources. Persistent resources survive crashes and
are recoverable. They are stored in the UDLM space struct kjurvb (see kjuser.h
for more information).

DSI408: Real Application Clusters Internals I-79

Node Membership (continued)


The resource names used by the Solaris reference SKGXN have the format res[0] =
opcode, res[1-*] = "type<grpname>". These are not exposed through
V$LOCK but can be dumped through lkdebug. The types are MM, MN, and MP. For
instance, the reference SKGXN uses two resources to represent a group bitmap and has the
names:
MAP1 : (0x0 "MM<grpname>")
MAP2 : (0x1 "MM<grpname>")
Some of the important resource names are:
JOIN : (0x00000042 "MM<grpname>")
SYNC1 : (0x00000040 "MM<grpname>")
SYNC2 : (0x00000041 "MM<grpname>")
Member : (memno "MN<grpname>")
Private Data : (memno+(n*256) "MN<grpname>")
Public Data : (memno+(n*256) "MP<grpname>")

DSI408: Real Application Clusters Internals I-80

Instance Membership Changes

1. Registration at startup or deregistration at


shutdown
2. Bitmap updated
3. Reread of bitmaps
4. Propagate change to CGS (not shown)
Instance 1

Instance 2

Instance n

LMD0
1
LMON/NM
2

LMON/NM
3

LMON/NM
3

Cluster Layer - CM

4-81


Node Membership Changes


skgxncin is called to initialize or join a cluster when mounting the database. This
initializes a context and calls skgxnreg to register as primary with a process group
(slaves call skgxnsrg). This translates at the lower layers as registration to a particular
group. NM reads the existing bitmap to identify the members, then locates the index to
where the joining node should be, and turns that bit on (a zero-to-one transition). The bitmap
is then invalidated. The bitmap itself is valid; it is the status of it that has changed. This
state change forces a reread.
The only way for existing members to know whether a member has joined or left is when
the bitmap is invalidated (skgxn_mapinv), that is, marked dubious.
At startup or shutdown, several iterations of reading/set bitmap/invalidate are made,
setting state fields as invalid or dubious in the RVB (see skgxnbc bitmap operations
where skgxnbcINV = 4, Invalidate map).
A member joins and invalidates the status flag of the bitmap. The other members see this
event change in their skgxnpstat call. Periodically, NM calls skgxnpstat to read
the bitmap. If the bitmap is invalid, then it initiates reconfiguration, and group membership
has changed via skgxngeRCFG. Reconfiguration at this layer means rebuilding the
bitmap. The status calls are in the LMON loop.
DSI408: Real Application Clusters Internals I-81

NM Membership Changes (continued)


Note: Rebuilding the entire bitmap may involve nested joins or deletes. The NM should
be able to handle this.
In 8.1.7, the DLM can detect reconfiguration because it can now talk to the NM API
directly. In release 8.0, you went through the GMS. The DLM detects reconfiguration
by calling status from NM. The DLM then calls CGS to do incarnation/synchronization.
In 8.0, it was hierarchical NM->GMS->DLM->RDBMS.
This node membership bitmap model is the referenced implementation and has not
changed in Oracle9i.
When you register, you tell NM what group you register to. The NM is responsible for
tracking your membership. The way it keeps track is through UDLM resources (MN, MM).
These resources are global and persistent. Through these resources, the NM can pull up
the right bitmap (in the event that there are multiple databases on the cluster).
Note: On platforms that do not use UDLM (such as Tru64), they may use the same
reference implementation but call out to a different DLM.
Each member maintains a lock on a member's resource. If a node exits, then this
resource becomes invalid.
If an instance is alive, then there is a holder of the MM resource. If the instance exits,
then there is no holder. That is how the NM knows when an instance joins or exits the
group. This may be a lengthy process as the NM goes and checks the MM resource to
see whether it is dubious or not.
After the entire bitmap is rebuilt, it sends an event upward NM->CGS->DLM with a
new bitmap. CGS synchronizes the bitmap as there are transient operations active. CGS
must synchronize the cluster to be sure that all members get the reconfiguration event
and that they all see the same bitmap.
This is how the NM API is implemented on Solaris. MN/MM resources are only used for
the Node Monitor. On Solaris, the UDLM is part of the cluster software. It should not be
confused with the IDLM, which is part of the Oracle kernel.

DSI408: Real Application Clusters Internals I-82

NM Membership Death

4-83


NM Membership Death
Given a bitmap composed of eight nodes, all of which are up, skgxnpstatus is called.
This call in turn calls skgxn_neighbor to determine the right-side neighbor and
skgxn_test_member_alive to determine that neighbor's status, rather than scanning
the entire bitmap. This avoids having all nodes read the entire bitmap. The read is
protected; invalidating the bitmap requires a write lock.
Note: Reconfiguration may not happen simultaneously in all nodes. This is why the CGS
layer above must do the synchronization.

DSI408: Real Application Clusters Internals I-83

Starting an Instance: Traditional

Instance A runs, and


instance B starts:
1. B registers
2. Notification
3. Reconfiguration

Instance A
2
LMD0
LMON
3

CM
Instance B
LMD0

LMON

4-84


Starting an Instance: Traditional


Assume that instance A is running and instance B starts and joins the RAC cluster:
1. Instance B registers with the CM.
2. The CM notifies all instances that the cluster has changed.
3. The instances adjust themselves. This involves reconfiguration of the Cluster Group
Services, which in turn reconfigures the Global Resource Manager.

DSI408: Real Application Clusters Internals I-84

Starting an Instance: Internal

LMON trace, Instance A

Instance A

*** 2002-08-23 17:40:04.496


kjxgmpoll reconfig bitmap: 0 1
*** 2002-08-23 17:40:04.497
kjxgmrcfg: Reconfiguration
started, reason 1

CGS/GMS
NM
Communication via CM
Instance B
CGS/GMS
NM

4-85

Alert log, Instance B


Fri Aug 23 17:40:04 2002
ALTER DATABASE
MOUNT
Fri Aug 23 17:40:04 2002
lmon registered with NM instance id 2 (internal mem no
1)


Starting an Instance: Internal


The node up/down state is communicated via the CM and is not shown.
The NM of instance B registration is with the CM, and thus communicated to the NM of
instance A. The NM in both nodes could communicate via their own IPC link, but
registration is done via the CM, because the instances do not know of each other's
existence before they are running.

DSI408: Real Application Clusters Internals I-85

Stopping an Instance: Traditional

Both instances run,


and instance B stops:
1. Deregistration
2. Notification
3. Reconfiguration

Instance A
2
LMD0
LMON
3
CM
Instance B
LMD0

LMON

4-86


Stopping an Instance: Traditional


Assume that both instances are running and that instance B stops in an orderly manner:
1. Instance B deregisters.
2. The CM sends the notification to the other registered members.
3. The instances adjust themselves. This involves reconfiguration of the Cluster Group
Services, which in turn reconfigures the Global Resource Manager.

DSI408: Real Application Clusters Internals I-86

Stopping an Instance: Internal

LMON trace, Instance A


*** 2002-08-23 17:45:04.596
kjxgmpoll reconfig bitmap: 0
*** 2002-08-23 17:45:04.597
kjxgmrcfg: Reconfiguration
started, reason 1

Instance A
CGS/GMS
NM
Communication via CM
Instance B
CGS/GMS
NM

4-87


Stopping an Instance: Internal


The NM and CM layers cannot detect a sudden instance death. This situation is handled by
the IMR in CGS.

DSI408: Real Application Clusters Internals I-87

NM Trace and Debug

Event 29718 traces calls to the CGS.

4-88


NM Trace and Debug


More details are covered in the debug section.

DSI408: Real Application Clusters Internals I-88

Cluster Group Services (CGS)

Provides a reliable and synchronized view of


cluster instance membership
CGS checks membership validity regularly.
Reconfiguration is stable.
Split-brain scenarios are avoided.

4-89

Distributed Repository is used by GES/GCS.


The CGS functionality is executed by LMON.


Cluster Group Services (CGS)


The complexity of CGS lies in producing the reliable cluster view of member instances.
Apart from checking that members remain valid, it also requires a stable reconfiguration
algorithm and detection of split-brain situations, where the communication path between
nodes is lost.
The distributed repository infrastructure provides:
Process Group Service with skgxn
Synchronization Service
Name Service
Stores IPC port IDs and other data needed for inter-instance communication

DSI408: Real Application Clusters Internals I-89

Configuration Control

Each configuration has an incarnation number.


Incremented with each change in group
membership
Must be the same in all instances when
configuration is complete

For reconfiguration, an additional substate


number identifies the step within the recovery
sequence.
Set to 0
Incremented after each synchronization

4-90


Configuration Control
In the CGS, the most important data values are the incarnation number and the synchronization state.

DSI408: Real Application Clusters Internals I-90

Valid Members

Valid members must:


Perceive themselves to be part of the same
database group (no split-brain group)
Be able to communicate among themselves
(LMON and GES/GCS channels)
Be able to update the control file periodically
Failure of these requirements activates the Instance
Membership Reconfiguration (IMR).

4-91


Valid Members
The CGS checks whether members in the database group are valid. It ensures that all
members are operating on the same configuration.
All members vote, detailing which incarnation they are voting on and a bitmap of
membership as they perceive it to be.
The member that tallies the votes waits for all members of the last incarnation to register
that they have received the reconfiguration.
Instance Membership Reconfiguration (IMR)
This is a component part of the CGS layer. Source is in kjxgr.h and kjxgr.c.

DSI408: Real Application Clusters Internals I-91

Instance Membership Reconfiguration (IMR) (continued)


From kjxgr.h - OPS Instance Membership Recovery Facility
DESCRIPTION
The IMR facility is intended to provide a means for
verifying the connectivity of instances of a database group
and to expedite the removal from the group of members that
are not connected, thus removing potential impediments to
database performance. Connectivity, in this context, can
take a variety of forms. The three forms of connectivity
that are monitored by the initial implementation of this
facility are:
skgxn (group membership)
skgxp (IPC)
disk (database files)
When an instance is considered to be not connected along any
of the monitored channels, the remaining instances perform a
reconfiguration of the group membership. During this
reconfiguration, the instances vote on what they perceive
the membership to be and a single arbiter assesses the votes
and attempts to arrive at an optimal subnetwork, which is
then published as the voting results. Both the voting and
the results publishing are done via the control file. All
members read the results and, based on the bitmap of the
membership in the results, either commit suicide or continue
with the reconfiguration.
This facility also expedites recovery by initiating recovery
upon detection of a loss of connectivity and offers an
opportunity to guarantee database integrity by providing
well-defined points in the recovery process to fence a
member that is perceived to be disconnected. At initial
implementation, however, the facility only guarantees
integrity to the extent that the skgxn group membership
facility guarantees integrity. In other words, the
membership as determined via the voting is synchronized with
the skgxn membership as a means of ensuring that I/O is not
generated by departing members. A more aggressive approach
would be to fence I/O of the departing members, thereby
allowing reconfiguration to complete with a guarantee of
database integrity even while the instance remains active.

DSI408: Real Application Clusters Internals I-92

Instance Membership Reconfiguration (IMR) (continued)


This facility also offers the feature of a limited guarantee
of database integrity by allowing a periodic check of
membership that can be employed to limit the potential for
I/O generation after a hang.
This facility may also be employed to propagate arbitrary
reconfiguration requests to the database group membership.
In the initial implementation, the only reconfiguration
events propagated through this facility are for
communications errors and for a detected member death.

DSI408: Real Application Clusters Internals I-93

Membership Validation

Instance A
Instance B

LMON
CGS (IMR)

Instance C

LMON

CKPT

CGS (IMR)
CKPT

LMON
CGS (IMR)
CKPT

CKPT writes to the control file every three seconds.

Control file
4-94

Copyright 2003, Oracle. All rights reserved.

Membership Validation
The CKPT process updates the control file every three seconds, an operation known as the
heartbeat. CKPT writes into a single block that is unique for each instance; thus no
coordination between instances is required. This block or record is called the checkpoint
progress record and is handled specially. The MAXINSTANCES clause of CREATE DATABASE
controls the number of these records. The heartbeat also occurs in single
instance mode.
LMON sends messages to the other LMON processes. If the send fails or no message is
received within the timeout, then reconfiguration is triggered. The LMON message send
failure detection is controlled by _cgs_send_timeout. The default value is 300
seconds.
Control file update failure is controlled by _controlfile_enqueue_timeout. The
default value is 900 seconds.
Reducing these values could cause false failure detection under heavy load. Using values
that are too large could cause hang-like conditions, where a bad instance member remains
undetected.
Note: Although the description is of a process doing a particular job, the code is part of the
CGS layer.
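The staleness check described above can be sketched in a few lines. This is an illustrative model only (the real logic lives in kjxgr.c); the function name and heartbeat dictionary are invented, while the 300-second default comes from _cgs_send_timeout.

```python
# Toy model of IMR heartbeat checking. CKPT stamps each instance's
# checkpoint progress record every 3 seconds; a member whose stamp is
# older than the timeout is a candidate for reconfiguration.
CGS_SEND_TIMEOUT = 300  # seconds, default of _cgs_send_timeout

def stale_members(last_heartbeat, now, timeout=CGS_SEND_TIMEOUT):
    """Return instance numbers whose heartbeat is older than `timeout`."""
    return sorted(i for i, t in last_heartbeat.items() if now - t > timeout)

# Instance 2 stopped heartbeating ~6 minutes ago; 0 and 1 are current.
print(stale_members({0: 1000.0, 1: 999.0, 2: 650.0}, now=1002.0))
```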
DSI408: Real Application Clusters Internals I-94

Membership Invalidation

Members are evicted if:


A communications link is down
There is a split-brain (more than one subgroup) and
the member is not in the largest subgroup
The member is perceived to be inactive

4-95

An IMR initiated protocol results in an eviction


message: ORA-29740.
Vendor clusterware may perform node eviction.


Membership Invalidation
IMR-initiated eviction of a member is not performed if a group membership change occurs
before the eviction can be executed.
Deciding the Membership
All members attempt to obtain a lock on a control file record (the Result Record) for
updating. The instance that obtains the lock tallies the votes from all members.
The group membership must conform to the decided membership before allowing the
GCS/GES reconfiguration to proceed; a skgxn reconfiguration with the correct
membership must be observed.
Vendor Clusterware
Vendor clusterware may also perform node evictions in the event of a cluster split-brain.
IMR detects a possible split-brain and waits for the vendor clusterware to resolve the
split-brain. If the vendor clusterware does not resolve the split-brain within
_IMR_SPLITBRAIN_RES_WAIT (default value of 600 milliseconds), then IMR
proceeds with evictions.
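The voting step can be illustrated with a small sketch: each member votes with the set of members it can see, and the arbiter keeps the largest subgroup, favoring the one that contains the lowest-numbered member on a tie. This is only a model of the "optimal subnetwork" idea, not the kjxgr.c algorithm; all names are invented.

```python
def resolve_split_brain(votes):
    """votes maps member number -> set of members it can communicate with
    (including itself). Members with the same view form a subgroup; keep
    the largest one, breaking ties toward the subgroup containing the
    lowest-numbered member."""
    groups = {}
    for member, seen in votes.items():
        groups.setdefault(frozenset(seen), set()).add(member)
    return max(groups.values(), key=lambda g: (len(g), -min(g)))

# Instances 0 and 1 see each other; instance 2 is isolated.
print(sorted(resolve_split_brain({0: {0, 1}, 1: {0, 1}, 2: {2}})))
```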

DSI408: Real Application Clusters Internals I-95

Membership Invalidation (continued)


Bug#2209228 (fix 9.0.1.4/9.2) IMR RESOLVES SPLIT-BRAIN CONTRARY TO
CLUSTERWARE RESOLUTION
Bug#2401370 (fix 10.0) SPLIT-BRAIN WAIT IS NOT LONG ENOUGH.
Set _imr_splitbrain_res_wait in milliseconds; for example, for a 10-minute wait,
specify _imr_splitbrain_res_wait=600000.

DSI408: Real Application Clusters Internals I-96

CGS Reconfiguration Types

Group membership change


Initiated by skgxn
Caused by instance starting up or shutting down

Communications error
Initiated by IMR
Caused by communications error to either LMON or
GES/GCS

Detected member death


Initiated by IMR
Caused by instance failing to issue heartbeat to the
control file

4-97


CGS Reconfiguration Types


Note: The skgxn code is the NM.

DSI408: Real Application Clusters Internals I-97

CGS Reconfiguration Protocol

Reconfiguration is initiated by skgxn or IMR.

Six reconfiguration steps ensure that members


have the same view of the group.
One instance coordinates activities with all
instances for CGS steps.
GES/GCS reconfiguration starts when CGS
reconfiguration is complete.

4-98


CGS Reconfiguration Protocol


The reconfiguration is initiated when skgxn (that is, the NM) indicates a change in the
database group or the Instance Membership Recovery (IMR) detects a problem.
Reconfiguration is initially managed by the CGS, then the DLM (GES/GCS)
reconfiguration starts.
The coordinating instance is called the master node. This is usually known at the start of
reconfiguration, but if it is not known, one is nominated (typically the node that triggered
the reconfiguration). The master node hangs until all members send their reconfiguration
or incarnation acknowledgment. skgxnpstat should pick up the reconfiguration event.
CGS can handle nested reconfiguration events.
When the CGS reconfiguration steps are complete, the GES/GCS or IDLM reconfiguration
is started.

DSI408: Real Application Clusters Internals I-98

Reconfiguration Steps

Step 1:
a. Complete pending broadcast with RCFG status.
b. Freeze name service activity.
c. Freeze the lock database

Step 2:
a. Determine valid membership, Instance Membership
Recovery.
b. Synchronize incarnation.
c. Increment incarnation number.

Step 3:
Verify instance name uniqueness.

4-99


Reconfiguration Steps
LMON trace file excerpt
*** 2002-08-23 17:26:01.262
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 1 0.
*** 2002-08-23 17:26:01.266
Name Service frozen
kjxgmcs: Setting state to 1 1.
*** 2002-08-23 17:26:01.367
Obtained RR update lock for sequence 1, RR seq 1
*** 2002-08-23 17:26:01.370
Voting results, upd 0, seq 2, bitmap: 0 1
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 2 2.
Performed the unique instance identification check
kjxgmps: proposing substate 3

DSI408: Real Application Clusters Internals I-99

Reconfiguration Steps

Step 4:
Delete nonlocal name service entries.

Step 5:
a. Republish local name entries.
b. Resubmit pending requests.

Step 6:
a. Publish LMD processes IPC port-ids in the name
service.
b. Unfreeze name service.

Step 7:
Return reconfiguration RCFG event to GES/GCS.

4-100


Reconfiguration Steps (continued)


LMON trace file excerpt (continued)
kjxgmcs: Setting state to 2 3.
Name Service recovery started
Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 2 4.
Multicasted all local name entries for publish
Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 2 5.
Name Service normal
Name Service recovery done
*** 2002-08-23 17:26:01.378
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 2 6.
*** 2002-08-23 17:26:01.474
*** 2002-08-23 17:26:01.474
Reconfiguration started

DSI408: Real Application Clusters Internals I-100

IMR-Initiated Reconfiguration: Example


Broken
communication

Instance A
Instance B

LMON
CGS (IMR)

Instance C

LMON

CKPT

CGS (IMR)
CKPT

LMON
CGS (IMR)
CKPT

Control file
4-101


IMR-Initiated Reconfiguration: Example


The scenario is a broken communication link. Instance C is no longer sending or receiving
LMON messages but otherwise is working normally.
Control File Vote Result Records (CFVRR) contains:

seq#  inst#  bitmap
2     0      110
2     1      110
2     2      001

The CFVRR is stored in the same block as the heartbeat in the control file checkpoint
progress record (see kjxgr.c/h).
Alert log in Instance C
Errors in file
/export/oracle/app/admin/rac/bdump/rac2_lmon_10911.trc:

Instance C is evicted. Its bit does not show up in the other members' list of valid
members, so it must leave the cluster.
ORA-29740: evicted by member 0, group incarnation 3
LMON: terminating instance due to error 29740
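The suicide decision each member makes after reading the vote results can be sketched as a bitmap test. The helper below is hypothetical; the bitmap 0b011 mirrors the example above, where only instances 0 and 1 survive into the new incarnation.

```python
def must_leave(my_instance, result_bitmap):
    """True if this member's bit is absent from the published vote-result
    bitmap, in which case it terminates (as with ORA-29740 above)."""
    return not (result_bitmap >> my_instance) & 1

SURVIVORS = 0b011  # bits for instances 0 and 1; instance 2's bit is clear
print([i for i in range(3) if must_leave(i, SURVIVORS)])
```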

DSI408: Real Application Clusters Internals I-101

LMON Trace Instance A


Commented and slightly edited for brevity.
Failure is detected by LMON.
*** 2002-08-19 15:26:40.360
kjxgrcomerr: Communications reconfig: instance 1 (2,2)

IMR initiates CGS reconfiguration.


kjxgrrcfgchk: Initiating reconfig, reason 3
*** 2002-08-19 15:26:46.469
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 2 0.
*** 2002-08-19 15:26:46.473
Name Service frozen
kjxgmcs: Setting state to 2 1.

The instance which obtained the RR lock tallies the vote result from all nodes and
updates the CFVRR.
*** 2002-08-19 15:26:46.592
Obtained RR update lock for sequence 2, RR seq 2
*** 2002-08-19 15:27:29.198
kjxgfipccb: msg 0x80000001002babe8, mbo
0x80000001002babe0, type 22, ack 0, ref 0, stat 3
kjxgfipccb: Send timed out, stat 3 inst 1, type 22, tkt
(32144,0)
:
:
*** 2002-08-19 15:28:27.526
kjxgrrecp2: Waiting for split-brain resolution, upd 0,
seq 3
*** 2002-08-19 15:28:28.127
Voting results, upd 0, seq 3, bitmap: 0

The evicted instance is terminated.

CGS reconfiguration is proposed.

Evicting mem 1, stat 0x0007 err 0x0002


kjxgmps: proposing substate 2
kjxgmcs: Setting state to 3 2.
Performed the unique instance identification check
:
:
*** 2002-08-19 15:28:37.802

CGS/GES reconfiguration and instance recovery is started by the surviving instance.


Reconfiguration started
Synchronization timeout interval: 600 sec
List of nodes: 0,

DSI408: Real Application Clusters Internals I-102

Code References

4-103

kjxg* : the CGS layer


kgxg* : the NM/CGS (still called GMS) layer
Skgxn.v2


DSI408: Real Application Clusters Internals I-103

Summary

In this lesson, you should have learned about the:


Node Monitor functionality
Reconfiguration sequence

4-104


DSI408: Real Application Clusters Internals I-104

RAC Messaging System


Objectives

After completing this lesson, you should be able to do


the following:
Outline the messaging subsystem architecture
Describe the trace options for IPC layers

5-107


DSI408: Real Application Clusters Internals I-107

RAC and Messaging


Node
Instance

Messages used for:


Lock changes
Data blocks
Cluster
information

Caches

ksi/ksq/kcl
GRD
CGS
NM

I
P
C

Other
nodes
(not
shown)

CM

5-108


RAC and Messaging


Messaging is used by the lock system.
Messaging is used for both intra-instance communication (between the processes of the
same instance on the same node) and inter-instance communication (between processes
on other nodes).
Messaging is used by:
The LMON process to communicate with the other LMON processes
The LMD process to communicate with the other LMD processes
Any process's lock client performing direct send operations

DSI408: Real Application Clusters Internals I-108

Typical Three-Way Lock Messages


1, 2: Direct Send
3:
Memory Copy
4:
Deferred

Instance R

Instance H

1
4

Instance M

5-109


Typical Three-Way Lock Messages


The DLM or GRM/GES functionalities are explained in later lessons.
Assume that instance R needs a block from the H instance and that the resource lock is
managed by the M instance.
1. The requester instance, R, sends a message to the master node, M. This is a critical
message, so this uses Direct Send. The transport protocol returns an
acknowledgment (not shown).
2. The master node, M, sends a command to forward the resource to the holding
instance, H. This, too, is a Direct Send.
3. The holding instance, H, sends the resource to the requesting instance, R. This is a
Memory Copy (memcpy) message, and the resource is received into its
destination memory.
4. The requestor instance, R, sends an acknowledgment for the resource message to
the master node, M. This message is not critical for response, so it is sent Deferred;
that is, it is placed in a queue for sending when convenient.
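The four steps above can be laid out as a small trace-producing sketch; the transport names follow the slide, and everything else (function name, instance labels) is illustrative.

```python
def three_way_transfer(requester="R", master="M", holder="H"):
    """Return the message trace for the three-way lock sequence:
    two critical direct sends, one memory-copy transfer of the block,
    and a final acknowledgment queued for deferred sending."""
    return [
        (requester, master, "direct"),     # 1: lock request to the master
        (master, holder, "direct"),        # 2: forward command to the holder
        (holder, requester, "memcpy"),     # 3: block lands in R's buffer
        (requester, master, "deferred"),   # 4: ack queued, sent when convenient
    ]

print(three_way_transfer())
```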

DSI408: Real Application Clusters Internals I-109

Asynchronous Traps

To communicate the status of lock requests, GES


uses two types of asynchronous traps (ASTs) or
interrupts:
Acquisition AST (AST)
Blocking AST (BAST)

Lock status may reflect late or lost messages.


V$LOCK_ELEMENT (or X$LE) columns MODE_HELD,
RELEASING, or ACQUIRING are non-zero

5-110


Asynchronous Traps
When a process requests a lock on a resource, the GES sends a blocking AST to
notify the processes that currently own locks on that resource in incompatible modes.
Upon notification, owners of the locks can relinquish them to permit access to the
requestor.
When a lock is obtained, an acquisition AST is sent to tell the requester that it now owns
the lock.
To determine whether a blocking AST has been sent by a requestor or whether an
acquisition AST has been sent by the blocker (or owner of the lock), query the fixed
view GV$LOCK_ELEMENT or X$LE and check which bits are set. Examples for
incompatible modes are shared and exclusive modes.
An acquisition AST acts like a wakeup call.

DSI408: Real Application Clusters Internals I-110

AST and BAST

Each IDLM client process has an AST queue.


The following operations take place when an LMD
delivers an AST:
LMD hangs an AST structure in the IDLM client AST
queue.
LMD posts the IDLM client.
IDLM client has to scan the AST queue to process
the delivered AST.

5-111

LMS delivers a BAST to a process that owns a


lock that conflicts with a converting request.


AST and BAST


ASTs are delivered by LMD or LMS to the process that has submitted a lock request.
In the earlier releases of Oracle9i, all messages went to the LMD on the remote node,
which had to repost the message to the actual waiting process.

DSI408: Real Application Clusters Internals I-111

Message Buffers

The two message buffer types are:


KJCCMSG_T_REGULAR
KJCCMSG_T_BATCH

5-112


Message Buffers
Any sender or receiver allocates a message structure (or message buffer) before sending
or receiving a message.
KJCCMSG_T_BATCH is mostly used in reconfiguration or in remastering, or after
delivering a buffer in cache fusion.
There are three pools of messages:
REGULAR: With initial #buffers = processes*2 + 2*10 + 10 + 20
BATCH: With initial #buffer = processes*2 + 2*10 + 10 + 20
RESERVE: With initial #buffer = min(2*processes, 1000)
If the REGULAR pool is exhausted, then more allocations are done from the shared
pool.
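The initial pool sizes can be computed directly from the formulas above; for example, with processes = 150 the REGULAR and BATCH pools start at 350 buffers each and RESERVE at 300. A small helper (hypothetical name) to evaluate them:

```python
def initial_pool_sizes(processes):
    """Evaluate the initial message-buffer counts quoted in the text."""
    regular = processes * 2 + 2 * 10 + 10 + 20   # processes*2 + 50
    batch   = processes * 2 + 2 * 10 + 10 + 20
    reserve = min(2 * processes, 1000)           # capped at 1000
    return {"REGULAR": regular, "BATCH": batch, "RESERVE": reserve}

print(initial_pool_sizes(150))
```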

DSI408: Real Application Clusters Internals I-112

Message Buffer Queues


Allocate

MsgPool

OutstandingQueue
OutstandingQueue

FreeMsgQueue
FreeMsgQueue
Release
Send-done
PendingSendQueue
PendingSendQueue

Direct send

Send
SendQueue
Indirect send

5-113


Message Buffer Queues


Several queues are in place to hold message buffers if they come from the SGA message
pools (like the REGULAR, BATCH, and RESERVE pools). This is done to facilitate
recovery of the message buffers.
OutstandingQueue, FreeMsgQueue, and PendingSendQueue are per process.
SendQueue and MsgPool are per instance. There is a threshold to trigger a process to
start releasing free available message buffers back to the shared message pools.

DSI408: Real Application Clusters Internals I-113

Messaging Deadlocks

5-114

Messaging can cause deadlocks to appear.


To avoid such deadlock situations, Oracle
introduced a Traffic Controller.


Messaging Deadlocks
Messaging can cause deadlocks to appear. If you are waiting to send a message to
acquire a lock and there is another process waiting on the lock that you hold, then you
will not be checking on BASTs and so will not see that you are blocking someone. If
many writers are trying to send messages and no one is reading messages to free up
message buffer space, there can be a deadlock.
Like the interface, the messaging protocol is port specific. A message is typically less
than 128 bytes, so the interconnect must be low latency. In addition, the number of
messages can be high. It typically depends on the number of locks or resources.
Basically the more locks or resources, the higher the traffic. In Oracle8, the number of
message buffers depended on the number of resources; in Oracle7, the number depended
on the number of locks.

DSI408: Real Application Clusters Internals I-114

Message Traffic Controller (TRFC)

5-115

Circumvents the possibility of send deadlocks


Uses the send buffer in kjga


Message Traffic Controller (TRFC)


The TRFC is used to control the DLM traffic between all the nodes in the cluster by
buffering sender sends (in case of network congestion) and making the sender wait on
a send until the network window is big enough. This is managed by using tickets to
control the message flow.

DSI408: Real Application Clusters Internals I-115

TRFC Tickets

5-116

A number of tickets are kept in a pool.


A sender must acquire the ticket(s) before
performing a send operation.
Tickets are returned to the pool by the receiver
after reception.
GV$DLM_TRAFFIC_CONTROLLER shows the status
of the ticketing and the send buffers.


TRFC Tickets
You use flow control to ensure that the remote receivers (LMD or LMS) have just the
right amount of messages to process. New requests from senders wait outside after
releasing the send latch, in case receivers run out of network buffer space. Tickets are
used to determine the network buffer space available.
Clients that want to send first get the required number of tickets from the ticket pool and
then send. The used tickets are released back to the pool by the receivers (LMS or LMD)
according to the remote receiver report of how many messages the remote receiver has
seen. Message sequence numbers of sending nodes and remote nodes are attached to
every message that is sent.
The maximum number of available tickets is a function of the network send buffer size.
If at any time tickets are not available, senders have to buffer the message, allowing
LMD or LMS to send the message on availability of the ticket. A node relies on
messages to come back from the remote node to release tickets for reuse. In most cases
this works, because most of the client requests eventually result in an ACK or ND.

DSI408: Real Application Clusters Internals I-116

TRFC Tickets (continued)


However, in some very specific and rare cases this may not be true. For instance, if
an application makes a large number of asynchronous blocking convert requests
without expecting notifications, you have a case where a request does not result in a
reply for some time. To force a reply from the remote node, you send a null request
to the remote node, forcing the remote node to send a null ACK back. Thus, if the
ticket level dips too low, you send a null request to the remote node.
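The ticketing behavior described across these pages can be condensed into a toy model: a send consumes a ticket, an empty pool forces the message onto a queue for LMD or LMS, the receiver's acknowledgments replenish the pool, and a low ticket level triggers a null request. The 500-ticket start appears later in the text; the low-water threshold here is an invented illustration value, and this is not the kjct implementation.

```python
class TicketPool:
    """Toy TRFC flow-control model."""
    def __init__(self, tickets=500, low_water=10):
        self.tickets = tickets
        self.low_water = low_water
        self.queued = []             # messages waiting for tickets

    def send(self, msg):
        if self.tickets == 0:
            self.queued.append(msg)  # LMD/LMS sends it once tickets return
            return "queued"
        self.tickets -= 1
        # Force a reply when the level dips too low, so tickets come back.
        return "null_req" if self.tickets < self.low_water else "sent"

    def ack(self, n):                # receiver returns n tickets
        self.tickets += n

pool = TicketPool(tickets=2, low_water=1)
print([pool.send(m) for m in ("a", "b", "c")])
```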

DSI408: Real Application Clusters Internals I-117

TRFC Flow
Node 1

Node 2
Tickets are sent back
to requestor side by
attaching the number
of ACK tickets in the
message header.

Msg.
Msg.
Queued messages
waiting for tickets

LMD
LMS

No more
tickets

LMD
LMS

Msg.

sender

Tickets available

Msg.
Tickets depleted,
NULL_REQ message

5-118


TRFC Flow
At the beginning, the number of available tickets is 500. One sent message consumes
one ticket. Each node maintains several counters for each communication partner.
AvailBuf: Number of buffers that are available to receive new messages (buffers
attributed to KSXP interface)
RecMsg: Number of messages received whose type is not TEST,
NULL-REQ, or NULL-ACK
AvailMsg: Number of messages received (all types)
The pseudocode is:
if AvailBuf >= AvailMsg (if there are sufficient buffers)
then AckTickets = AvailMsg
else if RecMsg == AvailMsg (no NULL-REQUEST yet)
then AckTickets = AvailBuf
else if AvailMsg - RecMsg > AvailBuf (too many NULL-REQUEST)
then AckTickets = 0
else AckTickets = AvailBuf - (AvailMsg - RecMsg)

DSI408: Real Application Clusters Internals I-118

TRFC Flow (continued)


Node 2 sends ACK tickets to node 1 to replenish the number of available tickets and
decrement AvailMsg, RecMsg, and AvailBuf with ACK tickets.
For more details, refer to kjcts_sndmsg, kjctr_updatetkt,
kjctr_collecttkt, and kjctcnrs (null request sent).
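The AckTickets pseudocode translates directly into a runnable function. The final branch computes AvailBuf - (AvailMsg - RecMsg), reading the last line of the pseudocode as a subtraction, since that is the only choice that keeps the result between 0 and AvailBuf; variable names follow the counters in the text.

```python
def ack_tickets(avail_buf, rec_msg, avail_msg):
    """Tickets to acknowledge, per the AckTickets pseudocode."""
    if avail_buf >= avail_msg:            # sufficient buffers
        return avail_msg
    if rec_msg == avail_msg:              # no NULL-REQUEST yet
        return avail_buf
    if avail_msg - rec_msg > avail_buf:   # too many NULL-REQUESTs
        return 0
    return avail_buf - (avail_msg - rec_msg)   # assumed subtraction

print(ack_tickets(avail_buf=5, rec_msg=3, avail_msg=6))
```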

DSI408: Real Application Clusters Internals I-119

Message Traffic Statistics

System statistics V$SYSSTAT


gcs messages sent: number of PCM messages
sent
ges messages sent: number of non-PCM
messages sent

5-120

V$DLM_MISC reports statistics on messages of


local instance.


Message Traffic Statistics


V$DLM_MISC is a direct view of x$kjifst.

DSI408: Real Application Clusters Internals I-120

Message Traffic Statistics (continued)


V$DLM_MISC
SQL> select name, value from V$DLM_MISC;
Name                                       Value
------------------------------------------ ----------
messages sent directly                         203662
  Messages sent directly without going through a queue (tickets available)
messages flow controlled                          104
  Messages queued (and to be sent by LMD or LMS)
messages sent indirectly                          148
messages received logical                      178579
  Messages received
flow control messages sent                          0
  Null requests sent + null acknowledgments sent
flow control messages received                      1
gcs msgs received                                1587
  Number of PCM messages received
gcs msgs process time(ms)                        1867
  PCM message processing time (should also include CR build time)
ges msgs received                              177013
  Number of non-PCM messages received
ges msgs process time(ms)                       30485
  Non-PCM message processing time
msgs causing lmd to send msgs                   59070
  LMD receives a message, processes it, and has to send another
  message to end processing
lmd msg send time(ms)                            6104
gcs side channel msgs actual                       16
gcs side channel msgs logical                     154
gcs pings refused                                   0
  When a ping is sent because of a conflict in PCM locks
  (S and X, for example), incremented when the ping cannot be processed
gcs writes refused
  Incremented when a write request is processed and the processing is aborted
gcs error msgs
  Incremented when an error message is received (rare)
gcs out-of-order msgs                               0
gcs immediate (null) converts                      16
  Number of PCM converts done immediately (because compatible);
  resource granted mode was NULL
gcs immediate cr (null) converts                 1177
gcs immediate (compatible) converts                15
  Number of PCM converts done immediately (because compatible);
  resource granted mode was not NULL
DSI408: Real Application Clusters Internals I-121

Message Traffic Statistics (continued)


V$DLM_MISC
SQL> select name, value from V$DLM_MISC;
Name                                       Value
...
gcs immediate cr (compatible) converts             25
gcs blocked converts                               10
gcs queued converts                                 0
gcs blocked cr converts                            16
gcs compatible basts                                2
gcs compatible cr basts (local)                  1212
gcs cr basts to PIs                                 0
dynamically allocated gcs resources                 0
dynamically allocated gcs shadows                   0
gcs recovery claim msgs                             0
gcs write request msgs                              4
gcs flush pi msgs                                   6
gcs write notification msgs                         0
gcs retry convert request                           0
gcs forward cr to pinged instance                   0
gcs cr serve without current lock                   0
msgs sent queued                                  248
  Number of messages dequeued from the queued-messages list
msgs sent queue time (ms)                        9731
msgs sent queued on ksxp                       203910
  Incremented when a message send is completed by ksxp
msgs sent queue time on ksxp (ms)              499892
msgs received queue time (ms)                  311453
msgs received queued                           178600
implicit batch messages sent                        6
implicit batch messages received                   46
gcs refuse xid                                      0
gcs ast xid                                         0
gcs compatible cr basts (global)                    7
messages received actual                       177777
process batch messages sent                         2
process batch messages received                   224
msgs causing lms(s) to send msgs                   21
lms(s) msg send time(ms)                           10

DSI408: Real Application Clusters Internals I-122

IPC

The IPC component:


Handles component-level demultiplexing

5-123

Parallel Query (IPQ)


Cache (data blocks)
DLM (GES)
Internal context

Handles Connection Management or Name Service


Integration
Integrates with the Post/Wait model used in the
Oracle server
Uses asynchronous request management,
including state management

IPC
Because IPC was more synchronous in the releases before Oracle9i, the OPS systems
were more prone to hanging in this component. IPQ used its own interface (SKGXF).

DSI408: Real Application Clusters Internals I-123

IPC Code Stack

IPQ client

Cache client

DLM client
CGS client

KSXP

KSXP: Main IPC


Wait interface
Tracing
Message passing
Memory mapping

SKGXP: OSD-dependent module

SKGXP

5-124


IPC Code
The SKGXP module is the OSD module. The source that is available on tao includes
the reference implementation. This has extensive comments in skgxp.h.

DSI408: Real Application Clusters Internals I-124

Reference Implementation

For internal QA
Simple code for easy portability
Interface example
Uses standard protocols for communication
TCP/IP
UDP

5-125


Reference Implementation
There are several reference implementations because there are several standard
protocols that can be used. These are available for the various ports.
Hardware vendors use the reference implementation as a starting point and replace the
protocol with their own optimized high-speed interconnect software by using their
hardware. This makes it very platform dependent.

DSI408: Real Application Clusters Internals I-125

KSXP Wait Interface to KSL

kslwat

ksl wait
facility
IPC

5-126

Default

IO

skgpwait

Net

ksxpwait

ksldwat

ksnwait

skgxpwait

skgfrwat odm_io

nsevwait


KSXP Wait Interface to KSL


When Oracle processes expect something to happen, they usually update something in
the shared memory and wake up (post) some other Oracle process, and then wait to be
posted back.
Posts are considered unreliable, and there is no direct correlation between
receiving a post and the state change that has occurred.
The wait facility allows processes to synchronize on I/O completion from a
single I/O source or on a local post.

DSI408: Real Application Clusters Internals I-126

KSXP Tracing

Event 10401
Bit flags

5-127

0x01 Minimal in tracefile


0x04 BID tracking
0x08 Slow send debugging
0x10 Dump ksxp trace information to trace file via
ksdwrf instead of KST

KST tracing with _trace_events=10401:8:ALL


KSXP Tracing
For more information, refer to the handling of event 10401 in ksxp.c. KST tracing
is covered in a later module.
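Because the level is a bitmask, flag combinations can be decoded mechanically. The table and helper below restate the slide's flags; the function name is invented.

```python
KSXP_TRACE_FLAGS = {
    0x01: "minimal trace",
    0x04: "BID tracking",
    0x08: "slow send debugging",
    0x10: "dump via ksdwrf",
}

def decode_10401_level(level):
    """List the slide's flag names that are set in an event 10401 level."""
    return [name for bit, name in sorted(KSXP_TRACE_FLAGS.items())
            if level & bit]

print(decode_10401_level(8))      # the level used in the KST example
```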

DSI408: Real Application Clusters Internals I-127

KSXP Trace Records

All KSXP trace records contain the string 'KSXP'.


client says which component is performing the
operation (see ksxpcid.c). Cache = 1, DLM = 2,
IPQ = 3, CGS = 5.
krqh is the pointer to KSXP-level request handle.
srqh is the pointer to SKGXP-level request handle.
srqh is useful in correlating KSXP and SKGXP
tracing.

522683FB:000182BD
6
5 10401 39
KSXPQRCVB: ctx 2ec5a84 client 2 krqh 301c1bc srqh 301c218
buffer 2faca80

5-128


KSXP Trace Records


The label states the record type:
KSXPQRCVB: Queue a receive buffer (shown above)
KSXPWAIT: Message completion
KSXPRCV: Message receive completion
KSXPMCPY: Remote memory copy
KSXPMPRP: Memory update

DSI408: Real Application Clusters Internals I-128

SKGXP Interface

Port Connection Interface

Memory Mapped Interface

5-129

Ports: Communication endpoints


Connections
Request handlers
skgxpcon, skgxpvsnd, skgxpvrcv, skgxpwait
Region
Buffer
Buffer ID (BID)
skgxprgn, skgxpprp, skgxpmcpy


SKGXP Interface
The Port Connection interface is for asynchronous use: the client code submits a number of
requests to the interface and attempts to overlap the completion of these requests with
useful computation. This overlap of communication with computation acts to hide the
latency costs of remote communication.
Ports represent communication endpoints. Connections are used to cache information
regarding communication endpoints. Request handlers represent outstanding requests to
the interface (primarily outstanding message receives and sends).
Synchronization is provided by skgxpwait. Synchronization is integrated with the
standard VOS layer post/wait mechanism, allowing Oracle processes to block
waiting for outstanding network IPC or for a post from another process in the local instance.
The buffer cache uses the memory-mapped interface for cache fusion and parallel query
clients.
Regions are large areas of memory (such as the SGA). Clients that want to receive data
into their region prepare buffers in the region to receive data via the prepare call. The
output of the prepare call is a buffer ID or BID. BIDs are copy-by-value structures that
are transferred to remote instances via the lock manager. The BIDs are then used to
transfer data directly to the prepared buffer of the requesting process in the remote
instance.
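The prepare/BID/copy sequence can be sketched as follows. The registry stands in for the lock manager carrying the copy-by-value BID to the remote instance; all names are invented, and the real interface (skgxprgn, skgxpprp, skgxpmcpy) works on prepared memory regions, not Python objects.

```python
_bid_registry = {}   # stand-in for BID state carried via the lock manager

def prepare(region, offset, length):
    """Prepare a buffer inside a region; return an opaque, copy-by-value BID."""
    bid = ("BID", len(_bid_registry))
    _bid_registry[bid] = (region, offset, length)
    return bid

def remote_copy(bid, data):
    """Place data directly into the prepared buffer named by the BID."""
    region, offset, length = _bid_registry[bid]
    n = min(len(data), length)
    region[offset:offset + n] = data[:n]

sga = bytearray(8)                 # toy 'region'
bid = prepare(sga, offset=2, length=4)
remote_copy(bid, b"abcd")
print(sga)
```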
DSI408: Real Application Clusters Internals I-129

Choosing an SKGXP Implementation

libskgxp.so contains the skgxp that is linked to


Oracle.
libskgxpd.so is a dummy implementation and
writes error messages when called.
Resolves linkage problems in non-RAC systems

5-130


Choosing an SKGXP Implementation


Swap the libskgxp.so library with the library of your choice and relink using the
makefiles. Problems can occur if the files are not accessible via LD_LIBRARY_PATH
or if the file protections are changed. The library that is linked in may be noted in
the LMON trace file.

DSI408: Real Application Clusters Internals I-130

SKGXP Tracing

Event 10402
Bit flags in level

5-131

KSXP_OSDTR_ERROR
KSXP_OSDTR_META
KSXP_OSDTR_SEND
KSXP_OSDTR_RCV
KSXP_OSDTR_WAIT
KSXP_OSDTR_MCPY
KSXP_OSDTR_MUP

0x01
0x02
0x04
0x08
0x10
0x20
0x40


SKGXP Tracing
The levels for the event have changed considerably in Oracle9i Release 2. Examine
source skgxp.h, ksxp.c for details in older versions. In Oracle9i Release 1 (and
earlier), it was:
0x00040000
trace meta functions
0x00080000
trace send
0x00100000
trace receive
0x00200000
trace wait
0x00400000
trace cancel
0x00800000
trace post
0x02000000
trace unusual or error conditions
0x04000000
trace remote memory copies
0x08000000
trace buffer update notifications

DSI408: Real Application Clusters Internals I-131

Possible Hang Scenarios

5-132

If the Node Monitor and the IPC use different


protocols
Temporary drop-out on network


Possible Hang Scenarios


On systems where the IPC traffic and node monitor communication traffic are on
separate networks, a hang may result when the IPC network fails, because the Node
Monitor has no knowledge of the IPC network. On Solaris, version 7, the DLM is the
Node Monitor, and it may or may not send traffic over the same interface that Oracle
uses.
Hangs Related to Send Timeouts or Send Failures
Oracle takes serious action in response to either of these events. The DLM times out on
sends a total of three times (about 10 minutes) before declaring a receiver unreachable.
Because the IPC interface guarantees reliable delivery, either event is taken to mean that
the instance is no longer reachable and should be removed from the cluster. The
instance goes into the reconfiguration state waiting for notification that the instance is
gone. If the timeout or failure was spurious, a hang results. Hangs that show IPC send
timeouts might indicate this condition.
To work around this problem, find the destination node that is thought to have failed and
shut it down.


Other Events for IPC Tracing

29726: DLM IPC trace event. Level 9 and above turns on skgxp tracing.
29718: CGS trace event. Level 10 and above turns on skgxp tracing.
10392: Parallel Query (kxfp). Level 127 turns on skgxp tracing.


Other Events for IPC Tracing


At high trace levels, these events turn on IPC tracing as a side effect of tracing
their functional stack.


Code References

kjc.h: Kernel Lock Manager Communication layer


ksxp.*: Kernel Service X (cross instance) IPC


Summary

In this lesson, you should have learned:


About the messaging components
How to activate tracing of IPC


System Commit Number


Objectives

After completing this lesson, you should be able to do the following:
Explain the function of the System Commit Number (SCN)
Describe SCN propagation schemes


System Commit Number


[Slide diagram: a cluster node running an instance with its local SCN; other nodes (not shown) each hold their own SCN, coordinated through the CM.]


System Commit Number (SCN)


The SCN represents the logical clock of the database. As such, it has to be global in the
RAC. This is not possible without extra hardware, but it can be simulated well enough by
synchronizing the instance-local SCNs. By using the SGA, you can handle process-to-process SCN coordination in a non-RAC environment.


Logical Clock and Causality Propagation

Oracle uses SCNs to order events.
An update commits with an SCN.
Any process that tries to get an SCN at a later time must always receive a greater or equal SCN value.
There is no ambiguity in the order of events and their SCN.
You must synchronize SCNs between instances from time to time in a RAC environment.
The synchronization activity is called causality propagation.


Logical Clock and Causality Propagation


In computation, the association of an event with an absolute real time is not essential; you
need to know only an unambiguous order of events.
In RAC, the causality may suffer:
Assume that process 1 on instance 1 performs an update and commit with SCN.
Process 2 on instance 2, which tries to get the SCN later, is not guaranteed to obtain a
higher or equal SCN value. Sometimes process 2 does not see the committed changes
that were made by process 1 even if a read is done after the committed change.
In practice, this occurrence is rare and the time window where it can occur is very small.


Basics of SCN

SCN wrap: 2 bytes
SCN base: 4 bytes
Monotonically increasing
Current SCN and Snapshot SCN

[Slide diagram: the 6-byte SCN laid out as the SCN wrap above the SCN base.]


Basics of SCN
Much can be said about the SCN and the nature of causality.
The essentials are:
The SCN must always increase and may skip a number of values.
The SCN must be kept in sync between multiple instances.
- In RAC: Between all instances mounting the database
- In distributed databases: All instances that are involved in a distributed
transaction (that is, when using database links)
- Synchronizing means using the highest known SCN. Otherwise it conflicts with
the requirement to increase.
Dependencies (causality) between changes must be maintained (for example, in
multiple changes to the same block by different transactions).
For more information, refer to Note 33015.1.
There is some distinction between the Current SCN that is used for a commit and the
Snapshot SCN that is used for a Consistent Read (CR) operation. The Snapshot SCN is the
highest SCN seen or used by the instance.
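As a sketch, the wrap/base pair can be treated as one 48-bit counter, with the 2-byte wrap sitting above the 4-byte base. This is an illustration of the layout described above, not Oracle's internal code.

```python
# Sketch: a 6-byte SCN as a 2-byte wrap over a 4-byte base. The base
# increments until it overflows 32 bits; the overflow carries into the wrap.
def scn_value(wrap, base):
    assert 0 <= wrap < 2**16 and 0 <= base < 2**32
    return (wrap << 32) | base

def scn_advance(wrap, base, by=1):
    total = scn_value(wrap, base) + by
    return total >> 32, total & 0xFFFFFFFF  # new (wrap, base)

# Crossing the 32-bit boundary bumps the wrap:
print(scn_advance(0, 0xFFFFFFFF))  # (1, 0)
```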


Basics of SCN (continued)


At startup, the SCNs across the nodes are initialized to the database SCN (the highest SCN
recorded at the last shutdown), which is synchronized across the cluster. All nodes have the
same SCN at startup.
The SCN from a kernel standpoint is a service. Before a client can use an SCN or call
CURRENT SCN, GET NEXT SCN, or GET SNAPSHOT SCN routines, it must initialize
the service. That initialization uses the database SCN.


SCN Latching

Updating the 6 bytes of an SCN must be atomic.
Latching modes are supported for compare and swap (CAS) primitives:

CAS Primitive           Latch-Free Access    Access with Latch
None                    Reads                Writes
32-bit CAS primitives   Reads and writes     SCN wrap changes only
64-bit CAS primitives   Reads and writes     Never


SCN Latching
If the operation to update or increment the SCN cannot be performed as an atomic or
single CPU instruction, you must latch or lock the SCN data structure so that the other
processes do not see an invalid SCN.
Latchless CAS operations are controlled by the following initialization parameters:
_disable_latch_free_SCN_writes_via_32cas
The default is False (that is, enabled by default).
_disable_latch_free_SCN_writes_via_64cas
The default is True (that is, disabled by default, even if it is supported on the platform).
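The latch-free write path can be modeled as a compare-and-swap retry loop. The sketch below simulates the CAS primitive in Python; on the real platform it is a single atomic instruction, which is why no latch is needed. This is illustrative only, not Oracle code.

```python
# Sketch of a latch-free "get next SCN" built on compare-and-swap semantics.
class AtomicScn:
    def __init__(self, value=0):
        self._value = value

    def load(self):
        return self._value

    def cas(self, expected, new):
        # Simulates the hardware primitive: atomically store new and
        # report success only if the current value equals expected.
        if self._value == expected:
            self._value = new
            return True
        return False

def next_scn(scn):
    while True:              # retry until our CAS wins the race
        cur = scn.load()
        if scn.cas(cur, cur + 1):
            return cur + 1

s = AtomicScn(700)
print(next_scn(s))  # 701
```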


Lamport Implementation

Assign a time SCN(x) to an event x, such that for any events a and b, if a -> b then SCN(a) <= SCN(b).
Mechanism to assign logical time:
Each instance increments its local SCN between two successive COMMITs.
If instance A sends a message m to instance B, then m also contains instance A's current SCN (SCNA) at the time that m is sent. When instance B receives message m, instance B sets its SCN to max(SCNA, instance B's current SCN).
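The rule on the slide can be sketched directly. The Instance class and the message shape are hypothetical, but the max() step is exactly the Lamport rule described.

```python
# Sketch of the Lamport rule: the sender piggybacks its SCN on every
# message; the receiver takes the maximum of the received and local SCNs.
class Instance:
    def __init__(self, scn=0):
        self.scn = scn

    def commit(self):
        self.scn += 1        # local event: advance the logical clock
        return self.scn

    def send(self):
        return {"scn": self.scn}   # SCN piggybacked on the message

    def receive(self, msg):
        self.scn = max(self.scn, msg["scn"])

a, b = Instance(scn=705), Instance(scn=701)
a.commit()                   # instance A commits at SCN 706
b.receive(a.send())          # instance B catches up
print(b.scn)  # 706
```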


Lamport Implementation
Earlier, Oracle OPS had a choice of SCN propagations, some of them using platform-specific hardware protocols. The Lamport scheme was the reference implementation.


Lamport SCN

Oracle9i RAC uses the Lamport scheme:


Attaches SCNs on each lock message
Guarantees partial ordering only
Preserves causality through periodic pinging of the SC lock
Is more efficient, because each node can generate SCNs simultaneously


Lamport SCN
The Lamport SCN propagation assumes that there is a constant exchange of messages. If
an instance does many commits on blocks where it has cached all data, the SCN will not
change at the other nodes, as there are no messages sent. This is solved with a periodic
SCN update.
The SC global resource or lock is used to communicate the SCN for the periodic update.
Its value field contains the current SCN, and the instance holding the exclusive lock can
update the field. You can think of the SC lock as a dummy lock that is used if the SCN
has not been propagated recently through other lock or message activity.
For more information, refer to kjm.c.
Source References
The message sending routines in kjc.c will insert the current SCN into every message at
scn_kjctmsg. Messages that are received by LMD (9.0) or LMS (9.2) compare and
update the local SCN if the local SCN is lower.
The SCN is shown in message dump/traces.


Limitations on SCN Propagation


[Slide diagram: a timeline of SCN synchronization between instance 1 and instance 2. Both sync at SCN 701; on instance 1, Tx1 starts and commits (SCN 702), followed by Tx3 and Tx7 (SCN 707); on instance 2, Tx2 starts and commits before the next sync, and Tx8 starts and commits afterward (SCN 708). SCN sync points: 701, 702, 707.]

Limitations on SCN Propagation


If the beginning of Tx2 is later than the commit of Tx1 and less than the time delay
max_commit_propagation_delay, then Tx2 may not see the changes that are
made by Tx1.
Note that there is an implicit protocol in the kernel to synchronize the SCN every three
seconds by using LCK piggybacking of the SCN in DLM messages. In case of
communication problems, these messages are subject to the traffic controller.
Problems with SCN synchronization may manifest themselves as ORA-600 [2662] errors
(see note 28929.1).
If Tx2 wants to read a block that is used by Tx1, it builds a CR buffer based on too low an
SCN (701), because the local SCN for that buffer is still valid and they are not
synchronized yet.
If the local low SCN is later than Tx1's commit SCN, Tx2 sees the changes from Tx1. It is
OK to see the change early; it absolutely has to be seen after
max_commit_propagation_delay.
The SCN limitation is only evident in operations that do not cause lock messages to be
exchanged. Between max_commit_propagation_delay timeouts, the SCN is
synchronized via the LCK process and messaging, which are very dependent on the type
of work performed.
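The anomaly reduces to a single comparison between a query's snapshot SCN and the change's commit SCN. The following is a sketch with illustrative names, not Oracle code.

```python
# Sketch: a committed change is visible to a reader only when the reader's
# snapshot SCN has reached the commit SCN. In the slide's timeline, Tx2 on
# instance 2 still runs at SCN 701 while Tx1 committed at 702 on instance 1.
def change_visible(snapshot_scn, commit_scn):
    return snapshot_scn >= commit_scn

print(change_visible(701, 702))  # False: the anomaly window
print(change_visible(707, 702))  # True: after the next SCN sync
```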

max_commit_propagation_delay

max_commit_propagation_delay is the delay time to propagate SCN changes to the other nodes after a commit.
max_commit_propagation_delay is given in centiseconds; the default is 700 (7 seconds).
The SCN is also propagated with every lock message.
The max_commit_propagation_delay parameter has several effects.


max_commit_propagation_delay
With Lamport SCN, every instance maintains locally generated SCNs. When it generates
a new SCN, an instance does not need to synchronize the SCN within the
max_commit_propagation_delay amount of time. Instances can increase their
locally generated SCNs based on global SCNs.
max_commit_propagation_delay < 1 second
Each time LGWR writes to the redo log (that is, with every commit):
- LGWR sends a message to the SCN resource (SC, 0, 0) master to update SCN.
- LGWR sends a message to every active instance to update SCN.
1 second < max_commit_propagation_delay < 7 seconds
Each time LGWR writes to the redo log, it also sends a message to the SCN
Resource Master to update the SCN.
If a Snapshot SCN is required by an instance and more than the
max_commit_propagation_delay time has elapsed since the last
synchronization event, then the process sends a message to the SCN resource master
to update the SCN.
7 seconds < max_commit_propagation_delay
Every three seconds, the LCK process sends a message to the SCN resource master
to update the SCN.
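The three regimes above can be condensed into a lookup. The sketch paraphrases the notes; the behavior strings and the treatment of the exact boundary values are assumptions.

```python
# Sketch: propagation behavior as a function of
# max_commit_propagation_delay (centiseconds). The boundary handling
# here (< 100, < 700, else) is an assumption for illustration.
def propagation_behavior(delay_cs):
    if delay_cs < 100:    # under 1 second
        return "LGWR messages the SC master and every instance on each commit"
    elif delay_cs < 700:  # between 1 and 7 seconds
        return "LGWR messages the SC master on each commit; snapshots refresh on demand"
    else:                 # 7 seconds and above
        return "LCK messages the SC master every 3 seconds"

print(propagation_behavior(700))  # the default setting
```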

Piggybacking SCN in Messages


[Slide diagram: a foreground (FG) process on instance A sends a message whose header carries the sender's SCN in scn_kjctmsg; the LMS process on instance B compares the received SCN (clk_val_kjxreqh) with its local SCN.]

The SCN of the instance sending a message is systematically stored in the message header.


Piggybacking the SCN in Messages


During any message preparation in instance A, the current SCN is added to the message
in the scn_kjctmsg field. On receiving any message, instance B compares the SCN in
field clk_val_kjxreqh to the current SCN. If it is greater, then it updates the local
SCN to the SCN received in the message.


Periodic Synchronization

Every three seconds, LCK0 calls kcsciln.
Called for SCNLCK and SCNSRV only
If Lamport is in use, then pings the SC lock
If not using Lamport, then updates the SCN Server or SC lock resource, depending on the scheme
Periodic synchronization does not occur if max_commit_propagation_delay is less than one second.

[Slide diagram: LCK0 on node 1 sends a KJX_GET_SCN_REQ message (1) to LMD0 on node 2, which answers with a simple ACK (2) that includes the SCN.]


Periodic Synchronization
The LCK0 timeout event, kcsmto, checks whether it is time for an SCN update.


SCN Generation in Earlier Versions of Oracle

The Lamport method was one of several.
The earlier choice of methods was less generic.


SCN Generation in Earlier Versions of Oracle


In Oracle8i:
DLM lock (SCNLCK). SC resource in DLM, slow, uses Lamport if
max_commit_propagation_delay >700 centiseconds
SCN server (SCNSRV). Uses port-specific OSDs for SCN server, uses Lamport if
max_commit_propagation_delay >700
In Oracle8:
SCNLCK (as above)
SCNSRV (as above)
Broadcast on commit (SCNBOC). Used in the DLM lock scheme when
max_commit_propagation_delay <100
Hardware clock (SCNCNT)
In Oracle7:
DLM lock implementation (SCNLCK) using the SC resource in DLM
SCN Server (SCNSRV) was never really implemented
Lamport: Implemented in DLM. This did not possess full causality preservation until
Oracle 7.3.4
Hardware clock: SP2 switch, for example

Code References

kcm.*: Kernel Cache Miscellaneous


kcs.*: Kernel Cache SCN Management
scn.h: Lamport implementation details
sparams.h: Some comments on SCN schemes


Summary

In this lesson, you should have learned how to:


Explain SCN propagation
Describe the purpose of the SCN in lock messages


Global Resource Directory

Formerly the Distributed Lock Manager


Objectives

After completing this lesson, you should be able to do the following:
Describe the Global Resource Directory concepts and components
Describe the global locking model of enqueues
Outline the internal resource allocations


RAC and Global Resource Directory (GRD)


Previously known as the Distributed Lock Manager (DLM)

[Slide diagram: within a node's instance, the caches and the ksi/ksq/kcl clients sit above the GRD/GCS/GES layer, with CGS and NM beneath, communicating over IPC with other nodes (not shown); the CM supports the stack.]


RAC and Global Resources


For particular applications or for historic reasons, the Distributed Lock Manager (DLM)
has many alternative terms that are used to describe it.
The Global Resource Directory (GRD) is the function that manages the locking or
ownership of all resources that are not limited to a single instance in RAC. Generally,
this is the same as a DLM, and the GRD is a DLM implementation, based on the IDLM
of Oracle8i and earlier releases.
The GRD can be considered to consist of Global Cache Services (GCS), which handles
the data blocks, and the Global Enqueue Service (GES), which handles the enqueues
and other global resources.
The terms GRD, GES, and GCS are the preferred terms, but DLM is the pervasive term
in all materials and, therefore, used in this course.


DLM History


Oracle7: External OS-based DLM


Oracle8: Integrated DLM
Oracle8i: Cache fusion 1, the CR problem
Oracle9i: Cache fusion 2


DLM History
The Oracle DLM comes out of the development that is performed primarily on SP2 and
HP DLMs for Oracle7, which were used where the vendors did not provide any DLM.
In Oracle 7, version 3, Digital, Sequent, NCR, and Pyramid used their own DLMs. They
were all different, as were the debugging tools and the output. The particular
functionality that was supported in each case also varied, which made it difficult for
Oracle to implement certain functions on some platforms at certain releases. Group-based locking is an example.
In Oracle7 DLMs, pipes facilitated the communication between the DLM daemons and
the client processes. In Oracle8, clients of the DLM have direct access to the DLM
structures in the SGA. This permits optimization of the communication path by allowing
clients to modify the structures directly and by waiting only on an LMD process to send
messages to remote nodes where remote operations must be performed. Therefore, local
lock operations can be considerably faster.
The DLM has been continuously improved with more views, better deadlock detection,
and changed message paths to eliminate needless context switches. The Cache Fusion
improvements are more of a change in how the client buffer handling routines use the
DLM.

DLM Concepts: Terminology


Resource: Any object accessed by the application


Client: Any process asking for a resource
Lock: An intention of a client on a resource
DLM services: Allow client applications to create,
modify, and delete locks that are shared
DLM database: Stores information on resources,
locks, processes


DLM Concepts: Terminology


Since Oracle8, the DLM database has been integrated in the Oracle SGA (that is, part of
the IDLM).
Directory Node Structures: Area in DLM memory that stores which node is the lock
master for each lock. In Oracle8i, the master node is always the directory node. In
Oracle9i, the dynamic remastering uses a lookup table to map the hashed master key to
the actual master (this is explained later), but it is not named the directory node.

DLM Concepts: Resources

The DLM does not provide the ability to lock the objects themselves.
The DLM provides the resources as the lockable entity.
The client code defines what this resource represents and what protocols are satisfactory to access it. There are two resource types:
PCM resources are for block buffers.
Non-PCM resources are row locks (transaction enqueues), file locks, and instance locks.

[Slide diagram: resource [0x10000f8][0x1],[BL] with its Grant Q, Convert Q, and lock value block.]


Resources
A resource is just a name. Each resource can have a list of locks that are currently
granted to users. This list is called the Grant Q. Similarly there is a Convert Q, which is
a queue of locks that are waiting to be converted. In addition, a resource has a 16-byte
lock value block (LVB) that contains a small amount of data. The LVB is used in some
resources. For example, the PS resource for parallel query slaves uses it to pass the
kxfpqd structure to the other nodes.
The two resource types have different data structures.


DLM Concepts: Locks

A client (user) must get a lock on a resource to be able to use what it represents.
Two types, with different data structures:
Enqueues: Locks on non-PCM resources
Lock Elements: Locks on PCM resources
Locks can be acquired in various modes in accordance with a matrix of compatible modes.

[Slide diagram: a lock (lockp, with PID and GID/DID fields) on the Grant Q of resource [0x10000f8][0x1],[BL].]


Locks
If the "lock before use" rule has not been followed by the Oracle programmer, then that
is a bug. It may not show up as system or data corruption for some time.
The DLM lock modes and the Oracle locking modes are not identical. The locking
matrix for the DLM is covered in later slides. The lock matrix depends on the type of
lock.
Locks are placed on a resource. When a process has a lock on the grant queue of the
resource, it is said to own the resource. Imprecise usage also talks of owning the
lock.
The example in the slide shows a lock on the Grant Q of the resource. The lock may be
either process- or group-owned. If it is process-owned, the PID field shows which
process holds the lock. In the case of group-owned locks, the GID field has a group
number, and the DID field has the Transaction ID (TxID) of the client transaction.


DLM Concepts: Processes

A representation in the DLM of a process that requested or acquired the locks

[Slide diagram: a lock (lockp, with PID and GID/DID) on the Grant Q of resource [0x10000f8][0x1],[BL], pointing to a process entry (procp, with PID).]


Process-Based Versus Session-Based Locking


In a simple implementation, a DLM provides a lock to a process. This works fine when
the process-to-session mapping is maintained. In MTS and XA, however, the session
may migrate or multiple processes may contribute to a transaction. It is preferable to be
able to provide a session-based identifier to control access to the lock. This is what
group-owned locking does. Generally, Oracle provides the transaction ID as the group
ID, and then anyone working on that transaction simply provides that XID and lock
operations are honored.
Domains
Domains are largely redundant in Oracle8 because there is a DLM for each database.
Although present in Oracle8, the domain functionality is largely unused.


DLM Concepts: Shadow Resources

Resources are mastered on a node.
The master node has all resource information, such as full grant queues and convert queues.
The shadow resource exists on any other node that has an interest in this resource; it knows only about locks on its own node.

[Slide diagram: the master node's copy of resource [0x10000f8][0x1],[BL] carries the Grant Q and Convert Q for all nodes; the shadow node's copy carries only the local Grant Q and Convert Q.]


Persistent Resources
The shadow resource exists on any other node that has an interest in a resource, that is,
any node on which a lock is open against that resource.
A persistent resource is maintained in a dubious state in the DLM following the closure
of all locks on it when the processes holding the locks exited abnormally while holding
a lock in PW or EX mode.
Recovery Domain (rdomain)
A recovery domain is the mechanism by which persistent resources can be recovered.
Each persistent resource is linked to a recovery domain. There is one such domain per
database.


DLM Concepts: Copy Locks

When a lock is held on a node other than the master node, the master keeps a copy of the lock locally.

[Slide diagram: the owner node holds the lock (lockp) on its shadow copy of resource [0x10000f8][0x1],[BL]; the master node keeps a corresponding copy lock on the Grant Q of the master copy.]


DLM Concepts: Copy Locks


There is only one copy of the lock for every other node that has an interest in this
resource. The copy lock is held at the highest mode at which the other node holds a lock.
This is the information that the master node requires. The other node maintains all the
other information that is required.
The master node has the master lock, and the local node has the shadow lock.


Resource or Lock Mastering

The DLM maintains information about the locks on all nodes that are interested in a given resource.
Lock mastering is distributed among all nodes in the cluster.
The master node contains the description of the resource and at least the lock on this resource with the highest locking mode.
The master node for a resource is computed by using several arrays: res_hash_val_kjga (for non-PCM resources) and pcm_hv_kjga (for PCM resources).


Resource or Lock Mastering


The DLM mastering algorithm chooses one node to manage the relevant information of
a resource and its locks, on a resource-by-resource basis; this node is referred to as the
master node.
The res_hash_val_kjga and pcm_hv_kjga arrays are updated at
reconfiguration when a node joins or leaves the cluster. The update minimizes resource
migration. Each element of the arrays is a bucket and contains a physical node number.
For non-PCM resources, you hash the resource name to obtain a bucket number bidx
and then look up the master node number with res_hash_val_kjga[bidx].
These arrays are private to each node. The algorithm is covered in detail in later lessons.
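The lookup can be sketched as follows. The hash function and the bucket array contents are made up; only the name-to-bucket-to-node indirection mirrors the res_hash_val_kjga description above.

```python
# Sketch: mapping a non-PCM resource name to its master node through a
# bucket array. The indirection lets reconfiguration remap buckets to
# surviving nodes without rehashing every resource name.
N_BUCKETS = 8
res_hash_val = [0, 1, 2, 0, 1, 2, 0, 1]  # bucket -> physical node number

def master_node(resname):
    bidx = hash(resname) % N_BUCKETS     # hash the resource name
    return res_hash_val[bidx]

name = (0x10000F8, 0x1, "BL")
print(master_node(name) in (0, 1, 2))  # True
```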


Basic Resource Structures

Resource name: Unique name to identify the resource. This is three ub4 numbers, the last interpreted as a character pair.
Value block: Area in memory that is used to store information about the resource
Granted queue: Locks granted on resources
Convert queue: Locks in the process of converting from one mode to another


Basic Resource Structures


Each non-PCM resource is identified in the cluster by its name (for example, struct
kjr).
The name consists of three integers of 4 bytes (ub4 n[3]).
For non-PCM resources or enqueues: n[0] is set to id1, n[1] is set to id2, and n[2]
receives string values, such as DI or LB.
A PCM resource is identified by a name with two integers, with the third integer
character pair implied as BL.
The DLM uses the resource name to compute the resource master node.
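Packing such a name can be sketched as below; the big-endian byte order used for the character pair is an assumption for illustration, not a statement about the kernel's layout.

```python
# Sketch: building the three-ub4 resource name. id1 and id2 fill n[0] and
# n[1]; the two-character type (e.g. "DI", "LB", "BL") is packed into n[2].
import struct

def resource_name(id1, id2, locktype):
    n2 = struct.unpack(">I", locktype.encode("ascii").ljust(4, b"\0"))[0]
    return (id1, id2, n2)

name = resource_name(0x10000F8, 0x1, "BL")
print([hex(x) for x in name])  # ['0x10000f8', '0x1', '0x424c0000']
```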


DLM Structures

PCM (GCS) and non-PCM (GES) resources are kept separate and use separate code paths.
GES:
Resource table: kjr and kjrt
Lock table: kjlt
Processes: kjpt
GCS:
Resource table: kjbr
Lock table: kjbl


DLM Structures
The separation of GES and GCS resource handling is new to Oracle9i. The earlier
versions had more common structures and code paths.
There are differences in these structures between versions 9.0.1 and 9.2.
kjr (partial)
kjurvb    valblk_kjr;           /* the value of the lock */
kjurn     resname_kjr;          /* the resource name */
kjsolk    grant_q_kjr;          /* list of granted resources */
kjsolk    convert_q_kjr;        /* list of resources being converted */
kjsolk    req_q_kjr;            /* list of open reqs when master_node unknown */
kjsolk    scan_q_kjr;           /* For the DLMD to perform move_scan_cvt etc */
ub2       grant_count_kjr[6];   /* count of # of locks at each level */
ub1       granted_bits_kjr;
ub1       entry_kjr;            /* dir, master, local */
kjuvlst   valstate_kjr;         /* state of valblk */
ub2       master_node_kjr;      /* ID of the node mastering the resource */
kjsolk    hash_q_kjr;           /* hash list : hp */
kjsolkl   *hp_kjr;
ub1       options_kjr;          /* same as open option */
ub1       remaster_kjr;
kjulevel  next_cvt_kjr;         /* Global next cvt. mode */

DLM Structures (continued)


Lock manager Resource Table structure
typedef struct kjrt
{
  kjsolkl  *reshash_kjrt;         /* resource hash bucket array */
  ub4      n_reshash_kjrt;
  ub4      *res_bucket_seq_kjrt;
  ksspa    res_cache_kjrt[3];     /* cache of freeable resources */
  ub4      res_cnt_kjrt[3];       /* count on cached resources */
  boolean  clear_cache_kjrt;      /* How should we clear the cache */
  ub4      res_cache_sz_kjrt;     /* size of resource cached */
  sb2      pral_kjrt;             /* Flag indicating preallocation of object
                                     Values: -0-need,1-have,2-don't */
  ksspa    *res_parent_kjrt;      /* parent of resources */
  ksllt    *latch_kjrt;           /* resource freelist latch array */
  ub4      num_lst_kjrt;          /* number resource freelist */
} kjrt;

Lock manager Lock Table
typedef struct kjlt
{
  ksllt  *latch_kjlt;             /* tab latch */
  ksspa  gpar_kjlt;               /* parent of group locks */
  ub2    num_lst_kjlt;            /* number of lock freelist */
} kjlt;

Process table Structure
typedef struct kjpt
{
  ub4      maxproc_kjpt;          /* maximum number of items in table */
  ub4      clnt_kjpt;             /* # local clients */
  ub4      n_prochash_kjpt;
  kjsolkl  *prochash_kjpt;
  ksllt    *latch_kjpt;           /* FreeList Latch */
} kjpt;


DLM Structures (continued)


/* PCM resource structure */
typedef struct kjbr {                 /* 68 bytes on sun4u */
  kjsolk  hash_q_kjbr;                /* hash list : hp */
  ub4     resname_kjbr[2];            /* the resource name */
  kjsolk  scan_q_kjbr;                /* chain to lmd scan q of grantable resources */
  kjsolk  grant_q_kjbr;               /* list of granted resources */
  kjsolk  convert_q_kjbr;             /* list of resources being converted */
  ub4     diskscn_bas_kjbr;           /* scn(base) known to be on disk */
  ub2     diskscn_wrap_kjbr;          /* scn(wrap) known to be on disk */
  ub2     writereqscn_wrap_kjbr;      /* scn(wrap) requested for write */
  ub4     writereqscn_bas_kjbr;       /* scn(base) requested for write */
  struct kjbl *sender_kjbr;           /* lock elected to send block */
  ub2     senderver_kjbr;             /* version# of above lock */
  ub2     writerver_kjbr;             /* version# of lock below */
  struct kjbl *writer_kjbr;           /* lock elected to write block */
  ub1     mode_role_kjbr;             /* one of 'n', 's', 'x' && one of 'l' or 'g' */
  ub1     flags_kjbr;                 /* ignorewip, free etc. */
  ub1     rfpcount_kjbr;              /* refuse ping counter */
  ub1     history_kjbr;               /* resource operation history */
  kxid    xid_kjbr;                   /* split transaction ID */
} kjbr;

/* kjbl - PCM lock structure
** Clients and most of the DLM will use the KJUSER* or KJ_* modes and kscns */
typedef struct kjbl {                 /* 52 bytes on sun4u */
  union {                             /* discriminate lock@master and lock@client */
    struct {                          /* for lock@master */
      kgglk        state_q_kjbl;      /* link to chain to resource */
      kjbopqi      *rqinfo_kjbl;      /* target bid */
      struct kjbr  *resp_kjbl;        /* pointer to my resource */
    } kjbllam;                        /* KJB Lock Lock At Master */
    struct {                          /* for lock@client */
      ub4  disk_base_kjbl;            /* disk version(base) for replay */
      ub2  disk_wrap_kjbl;            /* disk version(wrap) for replay */
      ub1  master_node_kjbl;          /* master instance# */
      ub1  client_flag_kjbl;          /* flags specific to client locks */
      ub2  update_seq_kjbl;           /* last update to master */
    } kjbllac;                        /* KJB Lock Lock At Client */
  } kjblmcd;                          /* KJB Lock Master Client Discriminant */
  void *remote_lockp_kjbl;            /* pointer to client lock or shadow */
  ub2  remote_ver_kjbl;               /* remote lock version# */
  ub2  ver_kjbl;                      /* my version# */
  ub2  msg_seq_kjbl;                  /* client->master seq# */
  ub2  reqid_kjbl;                    /* requestid for convert */
  ub2  creqid_kjbl;                   /* requestid for convert that has been cancelled */
  ub2  pi_wrap_kjbl;                  /* scn(wrap) of highest pi */
  ub4  pi_base_kjbl;                  /* scn(base) of highest pi */
  ub1  mode_role_kjbl;                /* one of 'n', 's', 'x' && one of 'l' or 'g' */
  ub1  state_kjbl;                    /* _L|_R|_W|_S, notify, which q, lock type */
  ub1  node_kjbl;                     /* instance lock belongs to */
  ub1  flags_kjbl;                    /* lock flag bits */
  ub2  rreqid_kjbl;                   /* save the reqid */
  ub2  write_wrap_kjbl;               /* last write request version(wrap) */
  ub4  write_base_kjbl;               /* last write request version(base) */
  ub4  history_kjbl;                  /* lock operation history */
} kjbl;

Lock Mode Changes

[Slide diagram: a state diagram of a lock moving between the GRANT QUEUE and the CONVERT QUEUE. A newly requested lock enters the grant queue; a compatible conversion is performed in place on the grant queue; an incompatible conversion moves the lock to the convert queue; once the conversion is granted, the lock returns to the grant queue.]


Lock Changes
Locks are placed on the resource grant or convert queue. If the lock mode changes, then
it is moved between the queues.
If several locks exist on the grant queue, then they must be compatible. Locks of the
same mode are not necessarily compatible with another of the same mode. The
compatibility matrix of the various locks differs between GES and GCS locks.
Compatible in-place conversions are typically downgrades, converting to a lesser mode.
Some exceptions exist and are covered later.
A lock can leave the convert queue under any of the following conditions:
Process requests the lock termination (that is, removes the lock).
Process cancels the conversion; the lock is moved back to the grant queue in
previous mode.
The requested mode is compatible with the most restrictive lock in the grant queue
and with all the previous modes of the convert queue, and the lock is in the head of
the convert queue. Convert requests are processed first in, first out (FIFO).
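The head-of-queue rule can be sketched with a toy compatibility matrix; only NL, CR, and EX are modeled here, and the real GES and GCS matrices differ.

```python
# Sketch: a conversion at the head of the FIFO convert queue is granted
# only if the requested mode is compatible with every granted mode and
# with the previous modes of all earlier converters. Toy matrix only.
COMPAT = {
    ("NL", "NL"): True, ("NL", "CR"): True, ("NL", "EX"): True,
    ("CR", "NL"): True, ("CR", "CR"): True, ("CR", "EX"): False,
    ("EX", "NL"): True, ("EX", "CR"): False, ("EX", "EX"): False,
}

def can_grant_head(requested, granted_modes, earlier_old_modes):
    others = list(granted_modes) + list(earlier_old_modes)
    return all(COMPAT[(requested, m)] for m in others)

# EX cannot be granted while a CR lock remains on the grant queue:
print(can_grant_head("EX", ["CR", "NL"], []))  # False
print(can_grant_head("EX", ["NL"], []))        # True
```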


Simple Lock Changes on a Resource


[Slide diagram: five snapshots of a resource's grant and convert queues. Legend: NL = Null, CR = Concurrent Read, EX = Exclusive Write. Snapshot 1: A:CR granted. Snapshot 2: A:CR and B:CR granted. Snapshot 3: A:CR, B:CR, and C:CR granted. Snapshot 4: A:NL, B:CR, and C:CR granted. Snapshot 5: A:NL and B:CR granted; C:CR converting to EX on the convert queue.]

Simple Lock Changes on a Resource

Example of a resource getting locks placed on its grant and convert queues:
1. A shareable read lock (Concurrent Read) is granted.
2. Another shareable read lock is granted. The two locks are compatible and
   can reside on the grant queue together.
3. A third shareable read lock is placed on the grant queue.
4. One lock converts to NULL. This conversion can be done in place because
   it is a simple downgrade.
5. Another lock attempts to convert to exclusive write. It has to be placed
   on the convert queue.

DSI408: Real Application Clusters Internals I-168

Changes on a Resource with Deadlock

[Diagram: four snapshots of the resource's queues.
1. Grant: A:CR, B:CR, C:CW
2. Grant: B:CR, C:CW; Convert: A:CR -> EX
3. Grant: C:CW; Convert: A:CR -> EX, B:CR -> PR
4. Grant: C:NL; Convert: A:CR -> EX, B:CR -> PR
Legend: NL = Null, CR = Concurrent Read, PR = Protected Read,
CW = Concurrent Write, EX = Exclusive Write]

Changes on a Resource with Deadlock

The convert queue is a first in, first out (FIFO) queue. This may lead to
deadlock situations.
1. Two shareable read locks and a concurrent write lock are granted.
2. Lock A attempts an upgrade to exclusive write (no other access allowed).
   This mode is incompatible with the modes of B and C, so the lock is
   placed on the convert queue. A note is kept of the old mode, in case the
   conversion is canceled.
3. Lock B attempts an upgrade to protected read mode (no other writers
   allowed). This mode is incompatible with C's mode, so B is also placed
   on the convert queue.
4. Lock C downgrades to NULL (no restrictions on other access). Lock A
   cannot complete its conversion because, even though exclusive write is
   compatible with NULL, it is not compatible with lock B's old shared read
   mode. Lock B could complete its conversion, but it is queued behind
   lock A.
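The head-of-queue rule that produces this deadlock can be sketched as
follows. The struct and function names are invented for illustration; the
matrix is the standard DLM compatibility matrix, used here only to
reproduce the blocking in the example above:

```c
#include <assert.h>

enum mode { NL, CR, CW, PR, PW, EX };

/* Standard DLM compatibility matrix. */
static const int compat[6][6] = {
    { 1, 1, 1, 1, 1, 1 },   /* NL */
    { 1, 1, 1, 1, 1, 0 },   /* CR */
    { 1, 1, 1, 0, 0, 0 },   /* CW */
    { 1, 1, 0, 1, 0, 0 },   /* PR */
    { 1, 1, 0, 0, 0, 0 },   /* PW */
    { 1, 0, 0, 0, 0, 0 },   /* EX */
};

struct cvt { enum mode held, requested; };

/* FIFO rule: only the head of the convert queue may complete, and only
 * if its requested mode is compatible with every granted mode and with
 * the held (old) modes of the converters queued behind it. */
int head_can_convert(const struct cvt *q, int qlen,
                     const enum mode *granted, int ng)
{
    if (qlen == 0)
        return 0;
    for (int i = 0; i < ng; i++)
        if (!compat[q[0].requested][granted[i]])
            return 0;
    for (int i = 1; i < qlen; i++)
        if (!compat[q[0].requested][q[i].held])
            return 0;
    return 1;
}
```

With the slide's final state (grant queue holds C:NL; convert queue holds
A:CR->EX, then B:CR->PR), A at the head is blocked by B's held CR mode,
even though B alone could have converted.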

DSI408: Real Application Clusters Internals I-169

DLM Functions

Distributed: The DLM exists in each instance of the cluster.
- Coordinates requests for, and access to, shared resources between
  different instances
- Keeps an inventory of all locks
- Grants locks and notifies processes when a resource becomes available
- Notifies owners of a lock when other processes request the lock
Fault tolerance: The DLM can survive n-1 node failures.
Deadlock detection: The DLM must be able to detect and report deadlock.

DLM Functions
Interprocess communication is critical to the DLM because it is distributed.
Being distributed permits the DLM to share the load of mastering
(administering) resources. As a result, you may lock a resource on one node
but actually have to communicate with the LMD processes on another node
entirely. Fault tolerance requires that no vital information about locked
resources is lost, irrespective of how many DLM instances fail.
The durability of the database (that is, being able to recover blocks that
are lost in an aborted instance's buffer cache) is not a DLM function, but
global cache handling of blocks still uses the same log-before-write rule
to ensure durability.

DSI408: Real Application Clusters Internals I-170

DLM Functionality in
Global Enqueue Service Daemon (LMD0)

- Performing periodic scanning for move-scan-convert operations
- Performing periodic scanning of the timer queue for locks with expired
  timers
- Performing deadlock detection
- Processing incoming messages for non-PCM locks
There is only one LMD0 in 9.2.

DLM Functionality in Global Enqueue Service Daemon (LMD0)

The DLM or GRD consists of the GES component and the GCS component.
The move-scan-convert operation is a periodic check of whether a lock that
is currently waiting on the convert queue is eligible for the grant queue.
LMD0's loop: kjmdm
If the lock db is frozen:
- Stop any deadlock detection: kjdddei
- Freeze and reset: kjfzfcl
The lock db is either in a frozen or a running state. In the frozen state,
it is not possible to get any locks from the DLM or to create any new
resources. The DLM is frozen during reconfiguration so that the node
failure can be recovered from.
If the lock db is open:
1. Check for converting locks: kjcvscn.
2. Deadlock detection: kjddits/kjddscn.
3. Clean up recovery domains: kjprsem.
4. Update stats: kjxstc.
5. Send flow control messages: kjctssb.

DSI408: Real Application Clusters Internals I-171

DLM Functionality in Global Enqueue Service Daemon (LMD0) (continued)

LMD0 is the core of the DLM. If it were not for the odd unpleasant failure
or reintroduction, it would probably do well without LMON. Nonetheless,
LMD0 handles all lock operations and creation of resources, the detection
of deadlocks, and the sending of messages to other LMD0s.
Statistics are updated only if _lm_statistics is TRUE. In Oracle8i,
statistics for the two views V$DLM_CONVERT_LOCAL and
V$DLM_CONVERT_REMOTE require that event 29700 is also set. You also need
to set timed_statistics to TRUE for timing information to be valid.
Note: The _lm_statistics parameter does not exist in Oracle9.2 or in
Oracle9.0.1. It does exist in Oracle8.1.5 and Oracle8.1.6.

DSI408: Real Application Clusters Internals I-172

DLM Functionality in
Global Enqueue Service Monitor (LMON)

- Publishing the workload of the node (active PQ users, active PQ sessions)
- Processing naming-service requests that are queued by the client
- Polling the Cluster Manager to manage reconfiguration:
  - Instance joining the group
  - Instance leaving the group, by shutdown or node death
- Performing Dynamic Remastering (only if explicitly enabled)
There is only one LMON in 9.2.

DLM Functionality in Global Enqueue Service Monitor (LMON)

Dynamic Remastering (DMR) is not enabled by default in Oracle9i Release 2.
It can be enabled by setting _kcl_local_file_time in version 9.2.0.
LMON's loop: kjfcln
- Listens for local messages: kjcswmg
- Responds to reconfiguration events: kjfcrfg
- Cleans out the GES cache: kjrchc
Reconfiguration is perhaps the most significant of LMON's
responsibilities. It is used during the recovery from a node failure (or
other shutdown of a DLM instance) and during the startup of new DLM
instances. kjfcrfg is the reconfiguration routine.
The DLM caches resource and lock structures and, as already explained, has
freelists on which resources are placed when they are no longer needed.
kjrchc cleans out the DLM cache of resources; it is a housekeeping
operation.

DSI408: Real Application Clusters Internals I-173

DLM Functionality in
Global Cache Service Process (LMS)

- Scanning PCM resources that have grantable converting locks
- Processing the down-convert queue
- Flushing messages, if messages are enqueued and have exceeded
  _side_channel_batch_timeout
- Processing remote messages for PCM locks
The number of LMS processes is fixed by _lm_lms.
The default value is max(#CPU/4, 2).

DLM Functionality in Global Cache Service Process (LMS)

The down-convert queue is handled in kclpbi.
(The number of LMS processes can be dynamic and adjusted by workload if
_lm_dynamic_lms is set to TRUE. But this is not functioning in 9.2, so the
parameter should be left FALSE.)
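The default LMS count formula from the slide, max(#CPU/4, 2), can be
expressed directly. Integer division is an assumption; the function name is
invented for illustration:

```c
#include <assert.h>

/* Sketch of the documented default for the number of LMS processes:
 * max(#CPU / 4, 2). */
int default_lms_count(int ncpus)
{
    int n = ncpus / 4;
    return n > 2 ? n : 2;
}
```

So a 4-CPU node still gets two LMS processes, while a 16-CPU node gets
four.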

DSI408: Real Application Clusters Internals I-174

DLM Functionality in
Other Processes

DIAG process:
- Provides low-overhead in-memory tracing and logging
- Manages and maintains diagnosability across multiple instances
- Helps execute ORADEBUG on all nodes of the RAC cluster
All processes:
- Process PING for BUFFER-CACHE
- Process the deferred queue and the CR log-flush queue
- Adjust the local SCN (Lamport) when receiving DLM messages

DLM Functionality in Other Processes

The PING handling for the buffer cache was done by LCK in previous
versions.
The CR log-flush queue is handled in kclpto.
PMON still does all forms of cleanup after unexpected process death,
including the release of locks and other DLM calls (see kjplhd/kjgxda).

DSI408: Real Application Clusters Internals I-175

Configuring GES Resources

Initial allocation is:
- 64 if cluster_database is not set
- _lm_ress if the parameter is defined
- 1.1 * ( localres + (number_of_instances - 1) *
  localres / number_of_instances ) otherwise
If exhausted, then more resources are allocated in the shared_pool.
ges_ress in V$RESOURCE_LIMIT shows the high water mark.

Configuring GES Resources

GES resources are the non-PCM resources.
The localres value is the sum of local resources, which is calculated by:
localres = processes + dlm_locks + transactions +
           enqueue_resources + db_files + 7 +
           parallel_max_servers * cluster_database_instances +
           parallel_max_servers + cluster_database_instances + 200
To view the usage:
SELECT * FROM V$RESOURCE_LIMIT
WHERE RESOURCE_NAME LIKE 'ges%';
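The sizing arithmetic above can be sketched as plain C. The parameter
names mirror the init.ora parameters in the formula; the functions are
illustrative only, following the notes verbatim:

```c
#include <assert.h>

/* localres: sum of local resource counts, per the formula in the notes.
 * Each argument stands for the init.ora parameter of the same name. */
long localres(long processes, long dlm_locks, long transactions,
              long enqueue_resources, long db_files,
              long parallel_max_servers, long cluster_database_instances)
{
    return processes + dlm_locks + transactions + enqueue_resources +
           db_files + 7 +
           parallel_max_servers * cluster_database_instances +
           parallel_max_servers + cluster_database_instances + 200;
}

/* Initial GES resource allocation when _lm_ress is not set:
 * 1.1 * ( localres + (n-1) * localres / n ). */
long initial_ges_ress(long lres, long n_instances)
{
    double v = lres + (double)(n_instances - 1) * lres / n_instances;
    return (long)(1.1 * v);
}
```

For a two-instance cluster with localres = 1000, the initial allocation
works out to 1.1 * 1500 = 1650 resources.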

DSI408: Real Application Clusters Internals I-176

Configuring GES Locks

Initial allocation is:
- 128 if cluster_database is not set
- _lm_locks if the parameter is defined
- (localres + _enqueue_locks) + (number_of_instances - 1) *
  (localres + _enqueue_locks) / number_of_instances otherwise
If exhausted, then more locks are allocated in the shared_pool.
ges_locks in V$RESOURCE_LIMIT shows the high water mark.

Configuring GES Locks


The localres value is the same as in the previous slide.

DSI408: Real Application Clusters Internals I-177

Configuring GCS Resources

Initial allocation is:
- _gcs_resources if defined
- 2 * _db_block_buffers if primary/secondary instances are configured
  (RAC Guard, failover)
- max(1.1 * _db_block_buffers, 2500) otherwise
If exhausted, then more resources are allocated from the shared_pool in
increments of 1024.
gcs_resource in V$RESOURCE_LIMIT shows the high water mark.

Configuring GCS Resources

GCS resources are the PCM resources.
Note: The parameter for the default value is based on _db_block_buffers
(leading underscore), not db_block_buffers.
To view the usage:
SELECT * FROM V$RESOURCE_LIMIT
WHERE RESOURCE_NAME LIKE 'gcs%';

DSI408: Real Application Clusters Internals I-178

Configuring GCS Locks

Initial allocation is:
- _pcm_shadow_locks if defined
- max(1.1 * _db_block_buffers, 2500) otherwise
If exhausted, then more locks are allocated in the shared_pool in
increments of 1024.
gcs_shadows in V$RESOURCE_LIMIT shows the high water mark.

DSI408: Real Application Clusters Internals I-179

Configuring DLM Processes

Initial allocation is:
- _lm_procs if set
- max( (64 + 256) + (number_of_instances - 1), processes ) otherwise
If exhausted, then more structures are allocated in the shared_pool.
ges_procs in V$RESOURCE_LIMIT shows the high water mark.

DSI408: Real Application Clusters Internals I-180

Logical to Physical Nodes Mapping

- hash_node_kjga maps logical to physical node.
- hash_node_kjga[0] always contains one live node.
- This array is updated in a three-step reconfiguration.

[Diagram: physical nodes N1 through N5, some dead and some live.
hash_node_kjga contains the live physical node numbers (here 2, 3, 5,
-, -), so each logical node number maps to a live physical node.]

DSI408: Real Application Clusters Internals I-181

Buckets to Logical Nodes Mapping

[Diagram: the hash value of a resource name (Resource N, Resource M)
selects a bucket in res_hashed_val_kjga or pcm_hv_kjga. Initially every
bucket contains 0, so all buckets map through hash_node_kjga[0] to the
first instance (N1); N2 through N5 have no buckets yet.]

Buckets to Logical Nodes Mapping

Initially, the res_hashed_val_kjga and pcm_hv_kjga entries all point to
the first hash_node_kjga element, which must be the first instance to
start up.
The number of buckets is set by _lm_res_part; the default value is 1289.
Each element of res_hashed_val_kjga and pcm_hv_kjga is a bucket.
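The two-level lookup described above can be sketched as follows. The array
names mirror the kernel structures; their contents, and the lookup
function itself, are illustrative assumptions:

```c
#include <assert.h>

#define NBUCKETS 1289   /* default _lm_res_part */

/* A resource name hashes to a bucket; the bucket maps to a logical
 * node; hash_node_kjga maps the logical node to a live physical node. */
int physical_master(unsigned hash_value,
                    const int *bucket_to_logical, /* res_hashed_val_kjga
                                                     or pcm_hv_kjga */
                    const int *hash_node)         /* hash_node_kjga */
{
    int logical = bucket_to_logical[hash_value % NBUCKETS];
    return hash_node[logical];
}
```

The indirection is the point: reconfiguration only rewrites the small
bucket and node arrays, never the resource names themselves.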

DSI408: Real Application Clusters Internals I-182

Mapping for a New Node Joining the Cluster

[Diagram: with a second instance alive, hash_node_kjga now contains
2, 4, -, -, -, and the buckets of res_hashed_val_kjga and pcm_hv_kjga
are split between logical nodes 0 and 1 instead of all pointing to 0.]

Mapping for a New Node Joining the Cluster

When a second instance joins the cluster, hash_node_kjga reflects this
state. Then res_hashed_val_kjga and pcm_hv_kjga are updated. Each instance
publishes its weight, which is _gcs_resources if defined (otherwise, it is
_db_block_buffers). In the LMON trace file, you can see:
kjfcpiora: publish my weight 6331
...
res_master_weight for node 0 is 6331
res_master_weight for node 1 is 6331
Total master weight = 12662

DSI408: Real Application Clusters Internals I-183

Mapping for a New Node Joining the Cluster (continued)

The new instances joining the cluster compute the redistribution
differently from the old instances. In the example, node 4 computes values
for res_hashed_val_kjga and pcm_hv_kjga as follows:
- total_weight = sum of the weights of every alive node = 12662
- For each alive node in the cluster, avgpart = (weight_of_node /
  total_weight) * buckets = buckets / 2
- For each node in hash_node_kjga:
  - For i in 0 to (buckets - 1), if bucket i is not yet attributed and the
    current node does not have more than avgpart buckets, then attribute
    the bucket to the current node by setting pcm_hv_kjga[i] to the
    current node and marking the bucket as need_remastering.
  - For i in 0 to (buckets - 1), attribute bucket i to the nodes in a
    round-robin manner and mark the bucket as need_remastering.

DSI408: Real Application Clusters Internals I-184

Remapping When a Node Joins

[Diagram, step 1: buckets in res_hashed_val_kjga and pcm_hv_kjga that
must move are marked U (UNKNOWN) before being reassigned through
hash_node_kjga.]

Mapping When a Node Joins, on an Old Node

Old nodes update the arrays as follows:
- Compute avgpart = buckets / number_of_alive_nodes.
- Take buckets away from dead instances or from instances having more than
  avgpart buckets: for i in 0 to (buckets - 1), pnode =
  res_hashed_val_kjga[i]. If pnode is dead (shut down) or if pnode has
  more than avgpart buckets, then set res_hashed_val_kjga[i] to UNKNOWN.
- Attribute the buckets flagged UNKNOWN to under-allocated nodes: for i in
  0 to (buckets - 1), if the bucket has the UNKNOWN flag, then for k in 0
  to (number of alive nodes - 1), pnode = hash_node_kjga[k]. If pnode has
  fewer than avgpart buckets, then set res_hashed_val_kjga[i] = pnode.
- Apply the same calculation to update pcm_hv_kjga, but avgpart for each
  node is computed as weight(node) / sum_weight(every node).
Non-PCM resources are evenly distributed to every alive node, and PCM
resources are distributed based on the weight (or _db_block_buffers) of
the node.
For more details, refer to kjshashcfg.
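The old-node remapping pass can be sketched as follows. This assumes
uniform weights (so avgpart = buckets / live nodes) and at most 16 nodes;
all names are invented, and the remainder handling is a simplification of
what kjshashcfg actually does:

```c
#include <assert.h>

#define UNKNOWN (-1)

/* Buckets owned by a dead node, or by a node over its fair share
 * (avgpart), are marked UNKNOWN and then handed to under-allocated
 * live nodes.  alive[n] != 0 means node n is up; nnodes <= 16. */
void remap_buckets(int *bucket_owner, int nbuckets,
                   const int *alive, int nnodes)
{
    int nalive = 0, count[16] = { 0 };
    for (int n = 0; n < nnodes; n++)
        if (alive[n]) nalive++;
    int avgpart = nbuckets / nalive;

    /* Pass 1: free buckets of dead or over-allocated owners. */
    for (int i = 0; i < nbuckets; i++) {
        int owner = bucket_owner[i];
        if (owner == UNKNOWN || !alive[owner] || count[owner] >= avgpart)
            bucket_owner[i] = UNKNOWN;
        else
            count[owner]++;
    }
    /* Pass 2: give UNKNOWN buckets to nodes still under avgpart. */
    for (int i = 0; i < nbuckets; i++) {
        if (bucket_owner[i] != UNKNOWN) continue;
        for (int n = 0; n < nnodes; n++)
            if (alive[n] && count[n] < avgpart) {
                bucket_owner[i] = n;
                count[n]++;
                break;
            }
    }
    /* Remainder (nbuckets not divisible): round-robin over live nodes. */
    int rr = 0;
    for (int i = 0; i < nbuckets; i++) {
        if (bucket_owner[i] != UNKNOWN) continue;
        while (!alive[rr % nnodes]) rr++;
        bucket_owner[i] = rr % nnodes;
        rr++;
    }
}
```

Running this with all four buckets on node 0 and two live nodes splits
them evenly; with node 0 dead, everything migrates to node 1.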

DSI408: Real Application Clusters Internals I-185

Mapping Broadcast by Master Node

[Diagram: five nodes.
1: Each node sends hash_node_kjga[0].
2: The master node is determined (here, node 2).
3: The master node ID is sent to every node.
4: The master sends the hash tables to the other nodes.]

Mapping Broadcast by Master Node

The complete mapping table is broadcast to all members:
1. Send hash_node_kjga[0], indicating whether the current node is new or
   old.
2. After receiving every message, determine the lowest surviving node,
   which is elected as the master node (in this example, node 2).
3. Inform everyone which node is the master node.
4. The master node broadcasts pcm_hv_kjga and res_hashed_val_kjga to the
   other nodes in the cluster.
The broadcast is done in step 5 of reconfiguration, and only if the number
of alive nodes in the cluster is at least two.

DSI408: Real Application Clusters Internals I-186

Master Node Determination for GES

- If there is only one node in the cluster, then it is the master node.
- For RT or IR resources, the master node is hash_node_kjga[0].
- Otherwise, let key = sum of the resource name (three integers):
  - For TX enqueues with _lm_tx_delta > 0:
    master node = hash_node_kjga[ (key % 1289) % number_of_live_nodes ]
  - Otherwise:
    master node = res_hashed_val_kjga[ key % length(res_hashed_val_kjga) ]

Master Node Determination for GES


RT is the redo thread global enqueue, IR is the instance recovery serialization global
enqueue, and TX is the transaction enqueue.
The default value of _lm_tx_delta is 16.
The length refers to the number of elements.

DSI408: Real Application Clusters Internals I-187

Master Node Determination for GCS

- If there is only one node in the cluster, then it is the master node.
- Otherwise, let key = sum of the resource name (two integers), and
  master node = pcm_hv_kjga[ key % length(pcm_hv_kjga) ].

Master Node Determination for GCS


The algorithm is slightly different if dynamic resource remastering is active. It is not
active in 9.2.
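The selection rules on the last two pages can be sketched together. The
bucket arrays are stand-ins for res_hashed_val_kjga, pcm_hv_kjga, and
hash_node_kjga; the "key" is the sum of the resource name components, as
described above. Function names and signatures are illustrative:

```c
#include <assert.h>

/* GES master: single node wins outright; TX enqueues (with
 * _lm_tx_delta > 0) hash directly onto live nodes; everything else
 * goes through the res_hashed_val_kjga buckets. */
int ges_master(unsigned key, int is_tx, int lm_tx_delta,
               const int *hash_node, int n_live,
               const int *res_hv, int res_hv_len)
{
    if (n_live == 1)
        return hash_node[0];
    if (is_tx && lm_tx_delta > 0)
        return hash_node[(key % 1289) % n_live];
    return res_hv[key % res_hv_len];
}

/* GCS master: single node wins; otherwise go through pcm_hv_kjga. */
int gcs_master(unsigned key, const int *pcm_hv, int pcm_hv_len,
               const int *hash_node, int n_live)
{
    if (n_live == 1)
        return hash_node[0];
    return pcm_hv[key % pcm_hv_len];
}
```

The TX special case bypasses the bucket tables entirely, which keeps
transaction enqueues evenly spread regardless of bucket weighting.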

DSI408: Real Application Clusters Internals I-188

Dump and Trace of Remastering

- Query X$KJDRHV to see res_hashed_val_kjga.
- Query X$KJDRPCMHV to see pcm_hv_kjga.
- Event 29731, level 14, traces LMON remastering progress.

Dump and Trace of Remastering

Partial DESCRIBE of X$KJDRHV
Name            Type    Description
--------------- ------  ----------------------------------------------
KJDRHVID        NUMBER  bucket ID (from 1 to N)
KJDRHVCMAS      NUMBER  master node that this bucket is attributed to
KJDRHVPMAS      NUMBER  previous master (before reconfiguration)
KJDRHVRMCNT     NUMBER  number of reconfigurations

Partial DESCRIBE of X$KJDRPCMHV
Name            Type    Description
--------------- ------  ----------------------------------------------
KJDRPCMHVID     NUMBER  bucket ID (from 1 to N)
KJDRPCMHVCMAS   NUMBER  master node that this bucket is attributed to
KJDRPCMHVPMAS   NUMBER  previous master (before reconfiguration)
KJDRPCMHVRMCNT  NUMBER  number of reconfigurations
DSI408: Real Application Clusters Internals I-189

DLM Functions

The main DLM client APIs are:
- kjual: Connect to the DLM
- kjpsod: Disconnect from the DLM
- kjusuc: Synchronously open and convert a lock
- kjuscv: Synchronously convert a lock
- kjuscl: Synchronously close a lock
- kjuuc: Asynchronously open and convert a lock
- kjucv: Asynchronously convert a lock

DLM Functions
kjual is called when the Oracle shadow process is started.
kjpsod is called before the Oracle shadow process exits.
The other functions are used to manage only non-PCM resources and locks.

DSI408: Real Application Clusters Internals I-190

kjual: Connection to DLM

Every DLM client (local or remote process) is identified by a kjp
structure, which holds:
- OS process PID
- Process node number
- Process flags (such as DEAD, RMOT, LOCL)
- List of process-created DLM locks
- Queue of pending ASTs for the process
- Various statistics on lock conversion activity
For a local process, the structure is allocated by kjual at process start.
For a remote process, the structure is allocated by LMD when a lock
creation request comes from a remote instance.

kjual: Connection to DLM

Interesting members of the kjp structure are:
ub4 flg_kjp;                 /* process flag                               */
#define KJP_DEAD      0x0001 /* process is dead, pending cleaned up        */
#define KJP_LMON      0x0002 /* process is the DLM-MON                     */
#define KJP_DLMD      0x0004 /* process is DLMD                            */
#define KJP_RMOT      0x0008 /* remote process                             */
#define KJP_LOCL      0x0010 /* local process                              */
#define KJP_IOPENDING 0x0020 /* has i/o pending, don't remove              */
#define KJP_IID       0x0040 /* 'Important' process: death => inst termn   */
#define KJP_DLMS      0x0080 /* process is LMS                             */
#define KJP_DIAG      0x0100 /* process is DIAG                            */
#define KJP_RMRDR     0x0200 /* p. is reading a PT/HV struct, critical sec */

DSI408: Real Application Clusters Internals I-191

kjual: Connection to DLM (continued)

kjsolk  lock_q_kjp;    /* list of locks created by this process */
kjsolk  ast_q_kjp;     /* ast queue                             */
skgpid  pid_kjp;       /* OS pid of process                     */
kjftnid node_kjp;      /* ID of the node the process belongs to */
word    orapnum_kjp;   /* oracle process number                 */
ksupr   *oraproc_kjp;  /* oracle process structure address      */
ub4     loc_lck_cvt_tm_kjp[KJST_CONVTYPE];
                       /* cumulative time of local converts     */
ub4     loc_lck_cvt_ct_kjp[KJST_CONVTYPE];
                       /* cumulative number of local converts   */
ub4     rem_lck_cvt_tm_kjp[KJST_CONVTYPE];
                       /* cumulative time of remote converts    */
ub4     rem_lck_cvt_ct_kjp[KJST_CONVTYPE];
                       /* cumulative number of remote converts  */

DSI408: Real Application Clusters Internals I-192

kjual Flow

[Diagram: client process P1 (Pid-1) connects to the DLM.
1: Allocate and initialize the process descriptor in the Procs table.
2: Update ges_procs in V$RESOURCE_LIMIT.]

DSI408: Real Application Clusters Internals I-193

kjpsod Flow

[Diagram: client process P1 (Pid-1) disconnects from the DLM.
1: Flag the procp structure KJP_DEAD.
2: Clear pending ASTs and put structures back on the freelist.
3: Update ges_procs in V$RESOURCE_LIMIT.]

DSI408: Real Application Clusters Internals I-194

DML Enqueue Handling Flow: Example

In this example, three processes on two nodes work on the EMPLOYEE table:
1. P1 locks the table in share mode.
2. P2 locks the table in share mode.
3. P2 does a rollback.
4. P1 locks the table in exclusive mode.
5. P3 locks the table in share mode.
6. P1 does a rollback.
P1 and P2 are on node 1; P3 is on node 2.
The enqueue for EMPLOYEE is mastered on node 2.

DML Enqueue Handling Flow: Example


The steps in the slide are covered twice in the following slides, focusing first on the lock
states and then on the code references.

DSI408: Real Application Clusters Internals I-195

Step 1: P1 Locks Table in Share Mode

Instance 1
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  0            1          1           KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1 PID   OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ----- ------------ ----------
KJUSERPR  KJUSERPR 65549  2      16190 1            0

Instance 2
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  0            1          1           KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1 PID   OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ----- ------------ ----------
KJUSERPR  KJUSERPR 0      0      13354 0            0

Step 1: P1 Locks Table in Share Mode


The EMPLOYEE table in this example has object ID 0x6dfd, thus the enqueue is
[TM][0x6dfd][0]. The columns RESOURCE_NAME, ON_CONVERT_Q, ON_GRANT_Q,
MASTER_NODE, and NEXT_CVT_LEVEL are from V$DLM_RESS, and the columns
GRANT_LEVEL, REQUEST_LEVEL, TRANSACTION_ID0, TRANSACTION_ID1,
PID, OPEN_OPT_DEADLOCK, and OWNER_NODE are from V$DLM_ALL_LOCKS.
Column names may be abbreviated in the slides.

DSI408: Real Application Clusters Internals I-196

Step 2: P2 Locks Table in Share Mode

Instance 1
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  0            1          1           KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1 PID   OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ----- ------------ ----------
KJUSERPR  KJUSERPR 65551  2      16287 1            0
KJUSERPR  KJUSERPR 65549  2      16190 1            0

Instance 2
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  0            1          1           KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1 PID   OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ----- ------------ ----------
KJUSERPR  KJUSERPR 0      0      13354 0            0

Step 2: P2 Locks Table in Share Mode


There are no changes for instance 2 locks.

DSI408: Real Application Clusters Internals I-197

Step 3: P2 Does Rollback

Instance 1
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  0            1          1           KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1 PID   OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ----- ------------ ----------
KJUSERPR  KJUSERPR 65549  2      16190 1            0

Instance 2
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  0            1          1           KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1 PID   OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ----- ------------ ----------
KJUSERPR  KJUSERPR 0      0      13354 0            0

Step 3: P2 Does Rollback


There are no changes for instance 2. You are effectively in the same state as at step 1.

DSI408: Real Application Clusters Internals I-198

Step 4: P1 Locks Table in Exclusive Mode

Instance 1
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  0            1          1           KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1 PID   OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ----- ------------ ----------
KJUSEREX  KJUSEREX 65549  2      16190 1            0

Instance 2
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  0            1          1           KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1 PID   OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ----- ------------ ----------
KJUSEREX  KJUSEREX 0      0      13354 0            0

Step 4: P1 Locks Table in Exclusive Mode


This causes changes for both instances.

DSI408: Real Application Clusters Internals I-199

Step 5: P3 Locks Table in Share Mode

Instance 1
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  0            1          1           KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1 PID   OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ----- ------------ ----------
KJUSERPR  KJUSERPR 65549  2      16190 1            0

Instance 2
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  1            1          1           KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1 PID   OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ----- ------------ ----------
KJUSEREX  KJUSEREX 0      0      13354 0            0
KJUSERNL  KJUSERPR 131085 2      16199 1            1

Step 5: P3 Locks Table in Share Mode


One lock is in the convert queue (REQUEST_LEVEL is KJUSEREX and
GRANT_LEVEL is KJUSERNL) on instance 2. There is no change in instance 1.

DSI408: Real Application Clusters Internals I-200

Step 6: P1 Does Rollback

Instance 1
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  0            1          1           KJUSERNL

Instance 2
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  1            1          1           KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1 PID   OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ----- ------------ ----------
KJUSERNL  KJUSERNL 0      0      13354 0            0
KJUSERPR  KJUSERPR 131085 2      16199 1            1

Step 6: P1 Does Rollback


Instance 1 now has no rows in V$DLM_ALL_LOCKS. V$DLM_ALL_LOCKS is updated
on instance 2, but the lock is not removed. GRANT_LEVEL and REQUEST_LEVEL are
both set to KJUSERNL.

DSI408: Real Application Clusters Internals I-201

Steps 1 and 2: Code Flow

ktaiam   Kernel Transaction Access, Internal Allocate DML lock
ksqgtl   Get an enqueue
ksqcmi   General Get/Convert function (Change Mode Internal)
ksipget  Get a group lock
kjusuc   Synchronous upconvert

Steps 1 and 2: Code Flow

The next few slides show the same steps in greater detail. For each step,
there is an overview of the active code stack, followed by the
corresponding flow detail.
In step 1, P1 locks the table in share mode. In step 2, P2 locks the same
table in share mode. The difference shows up in the kjusuc processing.
ksqgtl: Get an enqueue, type = TM, id1 = table_object_id, id2 = 0,
timeout = infinite. Allocate the enqueue lock and hang the lock on the
appropriate resource before calling ksqcmi.
ksqcmi: General get/convert function. Register the wait event for the
specific enqueue. Compute the XID for the DLM. Set up the option for the
lock get (DEADLOCK detection required). Only the wait event "enqueue" is
registered here, in kjiwev; when kjusuc waits for the AST, the wait event
registered in kjiwev is used.
ksipget: Synchronous interface to the DLM for lock GET. Set up the DLM
resource name. Set the timeout to infinite. Increment "global lock sync
gets". On return from kjusuc, increment "global lock get time".
DSI408: Real Application Clusters Internals I-202

Step 1: kjusuc Flow Detail

[Diagram: P1 in instance 1.
1: Allocate Lock1.
2: Set the lock state to KJL_OPENING.
3: Allocate Res.1; Lock1 is linked to Proc.1 via lock_q_kjp/procp_kjl and
to Res.1 via resp_kjl.
4: Compute the master node.]

Step 1: kjusuc Flow Detail


1: Allocate lock1 and update V$RESOURCE_LIMIT.
2: Set lock state to KJL_OPENING.
3: Allocate resource1 and update V$RESOURCE_LIMIT.
4: Compute Master-node. This uses the algorithm explained earlier.
Because this is the first time the instance has shown interest in this resource, it has to
send a message to the master node.

DSI408: Real Application Clusters Internals I-203

Step 1: kjusuc Flow Detail

[Diagram: instance 1 and instance 2 (the master).
On instance 1: 5: Send KJX_OPEN_CONVERT_DIR_REQ to the master instance.
6: Put Lock1 on the convert queue. 7: Loop waiting for the AST.
On instance 2: 1: Allocate Proc.1 and Lock1. 2: Allocate Res.1.
3: Send KJX_CONV_AST_IND back to the requester.
Back on instance 1: 8: Lock granted. 9: AST to P1. 10: Complete.]

Step 1: kjusuc Flow Detail (continued)

5: Send a message to the master (directory) instance.
Now two activities occur in parallel in the two instances (instance number
in parentheses).
(1) 6: Put the lock on the convert queue, and hang it on the deadlock
queue. This lock type (TM) has an infinite timeout, so it is not attached
to the timer queue, but it is put on the deadlock queue because it can
become part of a deadlock.
(1) 7: Loop waiting for the AST by testing a flag, event = enqueue.
(2) 1: Allocate the process 1 descriptor and lock 1. Lock 1 is in the same
mode.
(2) 2: Because the resource has never been used in instance 2, it must be
created, and then it is linked to lock 1. Because this is the first time
that resource 1 is used in instance 2, the open convert request will be
successful.
(2) 3: Queue and send a message to the requester instance.
Instance 1 has been waiting to continue; the remaining steps happen in
instance 1.
8: Put the lock on the grant queue and remove it from the deadlock queue.
9: Send the AST to the client process by setting its flag.
10: Process the AST, clear KJL_OPENING, and exit.
DSI408: Real Application Clusters Internals I-204

Step 2: kjusuc Flow Detail

[Diagram: P2 in instance 1; Proc.1/Lock1 and Res.1 already exist.
1: Allocate Lock2.
2: Set the lock state to KJL_OPENING, KJL_CONVERTING.
3: Hang Lock2 on the existing Res.1.
4: Complete.]

Step 2: kjusuc Flow Detail

The resource already exists, so processing is simpler. This lock can be
granted immediately, because of the following:
- There is no incompatible lock locally.
- The requested mode is S, which is the same as the held mode. Granting
  another S mode lock does not increase the lock mode.
- There is no need to send a message to the master instance.
1. Allocate lock 2 and update V$RESOURCE_LIMIT.
2. Set the lock state to KJL_OPENING, KJL_CONVERTING.
3. Hang the lock on the existing resource 1.
4. Process the AST; clear KJL_OPENING, KJL_CONVERTING, and exit.

DSI408: Real Application Clusters Internals I-205

Step 3: Code Flow

ktaidm   Kernel Transaction Access, Internal Delete DML lock
ksqrcl   Release an enqueue
ksqcmi   General Get/Convert function (Change Mode Internal)
ksiprls  Release a group lock
kjuscl   Synchronous close

Step 3: Code Flow

In step 3, P2 releases the table share mode lock by doing a rollback.
ksiprls is the synchronous interface to the DLM for CLOSE lock. On return
from kjuscl, increment the "global lock releases" statistic.

DSI408: Real Application Clusters Internals I-206

Step 3: kjuscl Flow Detail

[Diagram: P2 in instance 1.
1: Set Lock2's state to KJL_CLOSING.
2: Remove Lock2 and Proc.2 from Res.1 (Proc.1/Lock1 remain).
3: Free Lock2.
4: Complete.]

Step 3: kjuscl Flow Detail

Because lock 1 is still attached to resource 1, resource 1 cannot be freed.
1. Set the lock state to KJL_CLOSING.
2. Remove lock 2 and process 2 from resource 1.
3. Free lock 2. Update V$RESOURCE_LIMIT.
4. Exit. Because removing lock 2 changes neither the held mode of
   resource 1 nor its request mode, no message is sent to the master node.

Step 4: Code Flow

ktagetg0 (Kernel Transaction Access, Get Generic DML lock)
  -> ksqcnv (Convert an enqueue)
  -> ksqcmi (General Get/Convert function, Change Mode Internal)
  -> ksipcon (Convert a group lock)
  -> kjuscv (Synchronous Convert)

Step 4: Code Flow


In step 4, P1 upgrades the table share mode lock to an exclusive lock.
ksqcnv is given the lock description obtained previously with kjusuc.
ksipcon is the synchronous interface to the DLM for lock conversion. It calls kjuscv with timeout = infinite, increments the global lock sync converts statistic, and updates the global lock convert time statistic.


Step 4: kjuscv Flow Detail

[Diagram: instance 1. P1 with Lock1 on Res.1: 1:Set KJL_CONVERTING, 2:Re-queue (from grant queue to convert queue), 3:Deadlock queue.]

Step 4: kjuscv Flow Detail


Resource 1 and lock 1 are allocated and linked.
Because satisfying this conversion would raise the resource's held mode from S to X, and instance 1 is not the master instance, a message must be sent to the master instance to see whether the conversion is possible.
1. Set the lock state to KJL_CONVERTING.
2. Move the lock from the grant queue to the convert queue of resource 1. Lock 1 is not hung on the timer queue because kjuscv is called with timeout = infinite.
3. Hang lock 1 on the deadlock queue, because lock 1 is deadlockable.
Note that the lock is hung on the timer queue and the deadlock queue only if the lock is local; in other words, the owning instance of the lock is the same as the local instance.


Step 4: kjuscv Flow Detail


[Diagram: instance 1 (P1, Lock1 on Res.1, LMD0) and instance 2 (LMD0, master copy of Res.1 and Lock1). Instance 1: 4:Send KJX_CONVERT_REQ, 5:Loop waiting; instance 2: 1:Convert, 2:Send KJX_CONV_AST_IND; instance 1: 6:Granted, 7:AST, 8:Complete.]

Step 4: kjuscv Flow Detail (continued)


4. Send message to the master (directory) instance.
5. Loop waiting for AST, by testing a flag event = enqueue.
Instance 2
1. Convert lock 1 from S to X immediately, because you are in the master instance
and there is no conflict.
2. Queue and send a message to the requester instance.
Instance 1 has been waiting to continue.
6. Put lock in grant queue and remove it from the deadlock queue.
7. Send AST to client process by setting its flag.
8. Process the AST; clear KJL_CONVERTING and exit.


Step 5: kjusuc Flow Detail


[Diagram: instance 1 and instance 2. P3: 1:Allocate Lock3 and Proc.3, 2:Set KJL_OPENING and KJL_CONVERTING, 3:Queue on the convert queue of Res.1, 4:Queue on the deadlock queue; 5:LMD0 sends KJX_CONV_AST_IND to instance 2.]

Step 5: kjusuc Flow Detail


In step 5, P3 requests a lock on the table in share mode. The code path is the same as for
steps 1 and 2, with processing in kjusuc.
1. Allocate lock 3 and process 3, update V$DLM_RESOURCE_LIMIT.
2. Set the state of lock 3 to KJL_OPENING, KJL_CONVERTING.
3. Put lock 3 in the convert queue for resource 1. Lock 3 is in conflict with lock 1 so
it cannot be granted immediately.
4. Put lock 3 in the deadlock queue.
5. Send a message to see if something has changed in the blocker instance. One
message is sent for every lock on the grant queue of resource 1 and in conflict with
lock 3. One message is also sent for every lock in the convert queue with a
previous mode conflicting with lock 3.


Step 6: kjuscl Flow Detail


[Diagram: instance 1 (P1, Lock1 on Res.1, LMD0) and instance 2 (master copy of Res.1 with Proc.1/Lock1 and Proc.3/Lock3). 1:Set KJL_CLOSING, 2:Change mode, 3:Send KJX_CONVERT_REQ, 4:Release Res.1, 5:Free Lock1, 6:Complete.]

Step 6: kjuscl Flow Detail


In step 6, P1 releases its exclusive table lock by doing a rollback in instance 1.
1. Set lock state to KJL_CLOSING.
2. Convert the lock from X mode to NULL mode.
3. Because converting from X to NULL lowers the held mode of resource 1, a
KJX_CONVERT_REQ message must be sent to the master instance.
4. Release resource 1 because there are no longer any locks on it, and update
V$RESOURCE_LIMIT.
5. Free lock 1 and update V$RESOURCE_LIMIT.
6. Exit.


Step 6: kjuscl Flow Detail


[Diagram: instance 2. On the master copy of Res.1 (Proc.1/Lock1, Proc.3/Lock3): 1:Convert Lock1, 2:Grant Lock3, 3:AST to P3, 4:Complete.]

Step 6: kjuscl Flow Detail (continued)


On receiving the KJX_CONVERT_REQ message from instance 1:
1. Lock 1 is converted from X to NULL.
2. Attempt to grant all locks on the convert queue for resource 1, because lock 1 has
been downgraded to NULL. This therefore grants lock 3.
3. An AST is sent to P3, which is still waiting from step 5.
4. P3 processes the AST, completes its lock acquisition, and exits the DLM, letting
the transaction continue.


Code References

kj*.*: Kernel Lock manager


kcl.*: Kernel Cache Lock background process


Summary

In this lesson, you should have learned about the:


Lock manager architecture
Main functional flow of global locks


References and Further Reading

Oracle8.0 DLM Under the Covers and Beside the Point, by Daniel Semler (1998)

References and Further Reading


Daniel Semler's paper is available under WEBIV reference note 72568.1.


Cache Coherency (Part One)

Enqueues/Non-PCM


Objectives

After completing this lesson, you should be able to do the following:
Describe enqueue types
Follow the locking and deadlock detection algorithms


Cache Coherency: Enqueues


[Diagram: a node running an instance with its caches; the ksi/ksq layers sit above the GRD (GES), CGS, NM, and CM components, which communicate with other nodes over IPC.]
There are over 70 types of enqueues, such as:
CF: Control Files
CI: Cross Instance Call
DM: Mount Lock
LB: Library Cache Lock
IR: Instance Recovery

RAC and Global Resources


The GRD consists of Global Cache Services (GCS), which handles the data blocks, and
Global Enqueue Service (GES), which handles enqueues and other global resources.
The enqueues (representing such things as transactions) have to be kept coherent across
instances.
The global resources covered by GES are the row cache (dictionary cache) and the library
cache.


Alphabetical List of Enqueues


The boldfaced items in the following list are not documented in Oracle9i Real Application
Clusters Deployment and Performance, Appendix A. Most of these items are also listed in
the Database Reference manual under V$LOCK.
AK: DLM Deadlock Detection
BR: Backup Recovery
CF: Controlfile Transaction
CI: Cross-Instance Call Invocation
CU: Bind Enqueue
DF: Datafile
DL: Direct Loader Index Creation
DM: Database Mount
DR: Distributed Recovery
DV: PL/SQL Diana Version
DX: Distributed TX
FS: File Set
HW: Space Management on Specific Segment
IN: Instance Number
IR: Instance Recovery
IS: Instance State
IV: Library Cache Invalidation
JQ: Job Queue
KK: Redo Log Kick
KM: Resource Manager Load
L[A-P]: Library Cache Lock
MM: Mount Definition
MR: Media Recovery
N[A-Z]: Library Cache Pin
OC: Outline Management
OL: Outline Management
PF: Password File
PI: Parallel Slaves
PR: Process Startup
PS: Parallel Slave Synchronization
Q[A-Z]: Row Cache
RT: Redo Thread


Alphabetical List of Enqueues (continued)


SC: System Commit Number
SM: SMON
SN: Sequence Number
SQ: Sequence Number Enqueue
SR: Synchronized Replication
SS: Sort Segment
ST: Space Management Transaction
SV: Sequence Number Value
SW: Resume/Suspend Change
TA: Transaction Recovery / Generic Transaction Enqueue
TM: DML Enqueue
TS: Temporary Segment (also Tablespace)
TT: Temporary Table
TX: Transaction
UL: User-Defined Locks
UN: User Name
US: Undo Segment, Serialization
WL: Being-Written Redo Log
XA: Instance Attribute Lock
XR: CKPT Direct Block Loader
XI: Instance Registration Lock
The list is not complete. Look for ksqget calls in the source code to get more
information.


Enqueue Types

Enqueues are broadly divided into:
Instance: instance mount and recovery; manage SCN
Transaction: locking tables and rows
Library cache, such as cursors
Dictionary cache
Parallel Query
User mode
Most enqueues are used in single and shared instances; a few are relevant to shared instances only.

Enqueue Types
Refer to WebIV Note 1020008.6 for a lock decoding script. The standard supplied
CATBLOCK script creates the view DBA_LOCK and DBA_LOCK_INTERNAL. These DBA
views do not expand the RAC-only enqueues.
User mode enqueues are created and used by applications; they are simple named
resources with no relation to server data structures.


Enqueue Structure

V$LOCK examines which locks are queued on the resources.
[Diagram: resource structure ksqrs, e.g. <TM,432,0>, with three lists hanging off it: Owners, Waiters, and Converters; lock structures ksqlk show the modes, for example S -> X and SX.]

Enqueue Structure
When access is required by a session, a lock structure ksqlk is obtained and a request is
made to gain access to the resource at a specific level (mode). The lock structure is placed
on one of the three linked lists (called the owner, waiter, and converter lists) that hang off
of the resource.


Examining Enqueues

V$LOCK: Locks held
V$ENQUEUE_STAT: Enqueue statistics by type

Examining Enqueues
In V$LOCK, the mode held (LMODE) and request (REQUEST) columns determine whether the
enqueue is an owner, waiter, or converter:

Held     Request  Enqueue is
Nonzero  Zero     Owner
Nonzero  Nonzero  Converter
Zero     Nonzero  Waiter

For V$ENQUEUE_STAT, the average time waited in milliseconds is
CUM_WAIT_TIME / TOTAL_WAIT#.
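The classification rules and the average-wait formula can be sketched as follows (an illustration; the function names are invented, the column semantics follow the text above):

```python
def enqueue_role(lmode, request):
    """Classify a V$LOCK row as owner, converter, or waiter.

    lmode   -- mode held (0 means nothing held)
    request -- mode requested (0 means nothing requested)
    """
    if lmode != 0 and request == 0:
        return "owner"
    if lmode != 0 and request != 0:
        return "converter"
    if lmode == 0 and request != 0:
        return "waiter"
    return "idle"  # neither held nor requested; not expected in V$LOCK

def avg_wait_ms(cum_wait_time, total_wait):
    # V$ENQUEUE_STAT: average time waited in milliseconds
    return cum_wait_time / total_wait if total_wait else 0.0
```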


Enqueues and DLM


Enqueues are requested by clients in the ksq layer. If it
must be a global enqueue, then a similarly named DLM
lock is requested in the kj layer.

Get:     ksqget -> ksipget -> kjusuc
Convert: ksqcnv -> ksipcon -> kjuscv
Release: ksqrcl -> ksiprls -> kjuscl

Local enqueues complete their processing in ksq (through ksqcmi); global enqueues continue through the ksi layer into the kju layer and the DLM.

Enqueues and DLM


Local enqueues have their processing completed in the ksq. Global enqueues are
processed further in ksi, kju, and so on.
Each enqueue resource has a corresponding DLM resource, and each enqueue lock has a
corresponding DLM lock.
Every DLM lock for global enqueue uses group-based locking, even though every process
in an Oracle instance belongs to the same group. The code distinguishes group-based
and process-owned locks, but there is no longer a group concept.
If there is a current transaction, then the transaction identifier (XID) is part of the DLM
lock identification and is then used for deadlock detection.
If there is no current Oracle transaction, then an identifier concatenating the thread number
(2 bytes), Oracle process ID (2 bytes), and ksuseq is used. ksuseq always begins with
a 0 for an Oracle process and is incremented for each identifier.
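As a hypothetical illustration of the identifier described above: the 2-byte widths for the thread number and process ID are from the text, while the 4-byte width for ksuseq and the big-endian layout are assumptions made for the sketch:

```python
import struct

def dlm_lock_owner_id(thread, pid, ksuseq):
    """Pack a process-owned lock identifier: thread number (2 bytes),
    Oracle process ID (2 bytes), and the per-process sequence ksuseq
    (width assumed to be 4 bytes here). ksuseq begins at 0 for an
    Oracle process and is incremented for each identifier."""
    return struct.pack(">HHI", thread, pid, ksuseq)
```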


Source Tree for Non-PCM Lock Flow

[Diagram: client layers KSQ, KGL, KQR, KQLM, and miscellaneous clients call into KSI, which calls KJU.]

Source Tree for Non-PCM Lock Flow


The ksq layer always calls the ksi layer with an XID to create the DLM lock. Other
layers, such as kqr or kqlm, call the ksi layer without an XID (process-owned) and
therefore do not use the deadlock detection feature of DLM.


Lock Modes

Enqueues are resources that are locked in various modes.
The DLM lock modes differ from other modules in naming.

DLM   Value  Local  Granted (Owner)  Other Grants   Used by GCS
NULL  0      NULL   No Access        Anything       yes
CR    1      SS     Read             Read or Write
CW    2      SX     Read or Write    Read or Write
PR    3      S      Read             Read           yes
PW    4      SSX    Read or Write    Read
EX    5      X      Read or Write    No Access      yes

Lock Modes
These are the GES lock modes. The naming differences between the DLM and the kernel
lock mode names result from historical reasons.
For GCS locks, only the NULL, Share, and Exclusive locks are used.


Lock Compatibility

          NL:NL  CR:SS  CW:SX  PR:S  PW:SSX  EX:X
NL:NL     Yes    Yes    Yes    Yes   Yes     Yes
CR:SS     Yes    Yes    Yes    Yes   Yes     No
CW:SX     Yes    Yes    Yes    No    No      No
PR:S      Yes    Yes    No     Yes   No      No
PW:SSX    Yes    Yes    No     No    No      No
EX:X      Yes    No     No     No    No      No

Lock Compatibility
Compatible locks can exist on the grant queue at the same time. The locks on the request
queue are incompatible with the locks on the grant queue and are incompatible with other
locks on the convert queue.
Note that although a PR or S mode is more restrictive, it is not compatible with the lesser
mode CW. This prohibits simple downgrading of the lock mode from PR to CW.
A special case exists for the PR and CW combination. A PR lock on the convert queue can
be compatible with the most restrictive mode lock on the grant queue (for example,
another PR lock) and still not be compatible with a less restrictive lock (the CW lock) on
the grant queue.
The GCS lock modes are NL:NL, PR:S, and EX:X.
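The compatibility matrix can be encoded directly. The sketch below is an illustration, with the Yes/No values of the table transcribed into Python; it is not kernel code:

```python
# DLM lock modes in value order 0..5 (kernel names: NULL, SS, SX, S, SSX, X)
MODES = ["NL", "CR", "CW", "PR", "PW", "EX"]

# compatibility matrix from the table above; rows and columns follow MODES order
COMPAT = [
    # NL    CR    CW     PR     PW     EX
    [True,  True, True,  True,  True,  True],   # NL
    [True,  True, True,  True,  True,  False],  # CR
    [True,  True, True,  False, False, False],  # CW
    [True,  True, False, True,  False, False],  # PR
    [True,  True, False, False, False, False],  # PW
    [True,  False, False, False, False, False], # EX
]

def compatible(held, requested):
    """True if a lock in mode `requested` can sit on the grant queue
    together with a lock already granted in mode `held`."""
    return COMPAT[MODES.index(held)][MODES.index(requested)]
```

The matrix is symmetric, so the roles of held and requested mode can be swapped; note the special case the text calls out, where PR is not compatible with the lesser mode CW.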


Deadlock Detection:
The Classic Deadlock
Timeline:
1. Process 1 locks resource R1 in mode X (OK).
2. Process 2 locks resource R2 in mode X (OK).
3. Process 1 requests resource R2 in mode X and waits.
4. Process 2 requests resource R1 in mode X and waits.
Deadlock.

Deadlock Detection: The Classic Deadlock


The slide shows the classic deadlock scenario. The resources in question could be anything.
In the server, they could be rows, tables, ITL slots, or library cache or row cache locks.
This situation can also occur in a RAC cluster, even where the processes are on separate
nodes.


Deadlock Detection:
The Classic Deadlock
[Diagram: the classic deadlock drawn as a wait-for graph spanning nodes N1 and N2, with master and shadow copies of each resource.
Legend: Nx = node x in a cluster; Px = process x on a node; Lx = lock x; Rym = resource y (master); Rys = resource y (shadow). Edges show a blocked convert request of a lock on a resource, a lock held in a blocking mode on a resource, and a process performing a lock operation on a resource.]

Deadlock Detection: The Classic Deadlock (continued)


The slide shows the classic deadlock as it is viewed by deadlock detection algorithms. In
this case, the two processes are on different nodes, and the resources are distributed with a
master and a shadow of each resource present. It is evident that in a multinode
environment, deadlock detection requires tracking lock converters and blockers from one
node to another. This task is performed by the LMD processes.


Deadlock Detection:
A More General Example
[Diagram: a more general wait-for graph spanning nodes N1 through N4, with processes P1 through P3 on several nodes, locks L1 through L8, and distributed resources R1 through R4 present as masters and shadows.
Legend: Nx = node x in a cluster; Px = process x on a node; Lx = lock x; Rym = resource y (master); Rys = resource y (shadow).]

Deadlock Detection: A More General Example


Whenever processes share resources, deadlock situations can occur. A simple deadlock
scenario occurs when entity A holds resource Y in exclusive mode, entity B holds resource
Z in exclusive mode, and each entity contends for the resource held by the other. If neither
entity is willing to give up its access rights on the held resource, a deadlock has occurred.
The owning entity that the lock manager uses to determine deadlocks is identified by an
ID that is passed to the lock manager during the lock open calls. This may be a process
identified by a PID, or an Oracle transaction identified by a deadlock ID (DID).
The lock manager performs deadlock detection whenever a request is made to convert a
lock and the request cannot be granted in a short period of time. As part of the convert
option in the lock convert call, the user specifies whether a particular lock will participate
in deadlock detection.
Wait-For Graph
In the context of lock operations, a wait-for graph is a graph where nodes are the
participating processes (or transactions) and resources, and the edges are the converting
and held locks.
In a generalized case, this graph involves multiple resources and locks being operated by
many processes and transactions spanning some or all of the nodes of a cluster.
A cycle in the wait-for graph indicates a deadlock situation.

Deadlock Detection and Resolution

Deadlock detection is done at several layers:
ksq resolves local deadlocks (non-RAC).
The DLM in kjd resolves global deadlocks.
Message deadlocks are prevented by the Message Traffic Controller (TRFC).
Oracle deadlock detection is driven by timeouts.

Deadlock Detection and Resolution


Deadlock detection can be performed whenever any lock is requested, or only when needed.
Finding out whether there is a deadlock can become very time consuming as the number
of resources and locks increases. The Oracle kernel therefore uses the when-needed
approach and checks for deadlocks whenever someone has waited a long time, presumably
because there is a deadlock.
Resolution of a deadlock requires one holder to release its locks, thereby effectively
aborting its work.


Timeout-Based Deadlock Detection

Each deadlock-detectable lock is put on the deadlock timer queue if it is queued for convert.
A deadlock search starts when the timeout on the convert expires.
The timeout is _lm_dd_interval seconds; the default is 60.
LMD performs the search, one lock at a time.
A deadlock graph trace file is generated.
The dd_ts_server resource (DI,0,0) must be held in EX mode to perform a deadlock search.

Timeout-Based Deadlock Detection


A deadlock detection search attempts to find a cycle. It begins by building a graph from
the converting lock through the blocking processes and then through the locks that they
are waiting on. It may well span more than one node. If a cycle is found, then the solution
is to return an error to one of the processes in the cycle. If the deadlock cycle is contained
entirely within a node, then the last process in the cycle is the one that gets the error. If
the cycle spans nodes, then the process that initiated the search receives the error.
The timeout for deadlock detection to start is current_time + (60 + number_of_nodes
/ 2) seconds.


Deadlock Graph Printout

/users/t920r/admin/t920r/bdump/t920r_1_lmd0_24675.trc
Oracle9i Enterprise Edition Release 9.2.0.1.0 Production
With the Partitioning, Real Application Clusters, OL
JServer Release 9.2.0.1.0 - Production

Instance name: t920r_1


Redo thread mounted by this instance: 0 <none>
Oracle process number: 5
Unix process pid: 24675, image: oracle@sunblade (LMD0)
*** 2002-07-11 09:45:04.187
Global Wait-For-Graph(WFG) at ddTS[0.27] :
BLOCKED 22c432bc 5 [0x6dfe][0x0],[TM] [65549,2] 0
BLOCKER 22c4c19c 5 [0x6dfe][0x0],[TM] [131085,2] 1
BLOCKED 22c6224c 5 [0x6dfd][0x0],[TM] [131085,2] 1
BLOCKER 22c42eac 5 [0x6dfd][0x0],[TM] [65549,2] 0
(The slide annotates the fields of each BLOCKED/BLOCKER line as LOCK, MODE, ID1, ID2, and TYPE.)

Deadlock Graph Printout


When a database is opened, each LMD0 process opens a lock in NULL mode on
resource DI,0,0. Each instance in turn performs a deadlock detection by converting this
lock from NULL to X mode. Each deadlock detection is limited in time
(number_of_nodes_in_cluster / 2 minutes). If a deadlock is not found during this time,
then the deadlock detection is aborted to avoid spending too much time tracing deadlock
graphs, because these can be very lengthy. The deadlock detection is distributed.
When a lock is moved to the CONVERT-QUEUE of a resource (because of trying to
convert to a conflicting mode), this lock is attached to the end of a deadlock queue. This
lock will be a candidate for deadlock detection.
Deadlock detection involves the following three steps:
1. Deadlock search: In this step, an oriented wait-for graph is built. Several nodes can
be involved in the building of this graph if the deadlock spans several nodes.
2. Deadlock validation: If a deadlock is found, then each node that is involved in the
previous search validates each lock in its own subgraph. (These locks must remain
valid, that is, not canceled.)
3. Wait-for graph printing: If the previous step is successful, then the whole graph is
printed.

Deadlock Flow
[Diagram: node 1 and node 2, each with an LMD0 holding the DI-0-0 resource in NL mode; node 1's LMD0 converts its lock to EX to begin deadlock detection while locks L1, L2, L3 wait on its deadlock queue.]

Deadlock Flow
When an enqueue lock enters the convert queue and if it can be deadlocked (that is, if it is
of the type TM, TX, or UL), then the lock information is also put in the deadlock queue.
At this time, a time to deadlock detection, time_to_dd (expressed in seconds for this
lock), is computed as number of active nodes / 2 + _lm_dd_interval, and stored as a
timestamp, which is now + time_to_dd.
LMD0 checks the deadlock queue every five seconds and starts a deadlock search if the
deadlock queue is not empty and if the lock at the head of the deadlock queue is in the
queue for more than time_to_dd. Otherwise, LMD0 moves the lock in the head of the
deadlock queue to the tail and returns to normal activity.
If a deadlock detection starts on node 1, then LMD0 converts its lock on DI,0,0 from
NULL to EXCLUSIVE; in the whole cluster, only one node is allowed to start DD.
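The queue handling described above can be sketched as follows. This is an illustration only: a deque stands in for the kernel's deadlock queue, and the function and field names are invented:

```python
from collections import deque

LM_DD_INTERVAL = 60  # default _lm_dd_interval, in seconds

def time_to_dd(active_nodes, lm_dd_interval=LM_DD_INTERVAL):
    # deadline, in seconds, before the first deadlock search on a lock
    return active_nodes // 2 + lm_dd_interval

def lmd0_check(dd_queue, now):
    """Run every five seconds. Returns True when a deadlock search should
    start from the lock at the head of the deadlock queue; otherwise the
    head is moved to the tail and LMD0 returns to normal activity."""
    if not dd_queue:
        return False
    head = dd_queue[0]
    if now >= head["deadline"]:
        return True
    dd_queue.rotate(-1)  # move the head lock to the tail
    return False
```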


Deadlock Flow
[Diagram: as on the previous slide; node 1's LMD0 holds the DI-0-0 resource in EX mode and takes lock L1 from the head of its deadlock queue.]

Deadlock Flow (continued)


If the DI lock to EX mode conversion is successful, then LMD0 performs the following:
1. Take a lock L1, which is in convert state (otherwise, it would not be in the deadlock
queue) and is owned by a process P from the head of the deadlock queue.
2. Put L1 back in the deadlock queue just before a lock having a timestamp bigger than
L1's timestamp, and then start a deadlock detection from L1. The time_to_dd of L1 is
adjusted to number of active nodes * 10 + _lm_dd_interval. This adjusted
value is used for the second and all subsequent deadlock searches for L1.
3. LMD0 begins to build an oriented graph, with L1 as the base head.
For each lock, a counter is maintained, which reports the number of times deadlock
detection is started from the lock.
So a deadlock is found at most time_to_dd seconds after the beginning of conversion.


Deadlock Flow: One Node


[Diagram: node 1's LMD0 holds the DI-0-0 resource in EX mode and builds the deadlock graph R1 -> (X11, X12, X13) -> R132 -> X1; finding X1 again marks the deadlock.]

Deadlock Flow: One Node


Graph generation ends when a deadlock is found locally (in this slide), or when no
deadlock is found or a remote resource/process is found (following slide).
Notes for Deadlock Graph
(R1): a resource with lock L1, which is owned by X1 and is on the convert queue of R1
(X11, X12, X13): XIDs of the owners of locks on the grant queue or convert queue of R1
that conflict with L1
(R132): a resource on whose convert queue a lock of X13 waits
(X1): X1 is the owner of a lock on the grant queue or convert queue of R132 conflicting
with X13
(Yellow triangle): here the deadlock is found


Deadlock Flow: Two Nodes


[Diagram: node 1 builds the deadlock graph R1 -> (X11, X12, X13) -> R132, finds that the conflicting holder is remote, and its LMD0 sends a KJX_DEADLOCK_IND message to node 2; node 2 continues the graph R1 -> (X11, X12, X13) -> R132 -> X1. Dashed resources are shadow enqueues.]

Deadlock Flow: Two Nodes


Deadlock graph building on node 1 ends after finding a remote resource and after LMD0
returns to its message-processing loop.
Notes for Deadlock Graph
The dashed resources are shadow enqueues of the other nodes' owned enqueues.
At node 1, deadlock graph generation performs the same steps until it examines the
holders of R132 and finds X1 to be on the other node.
Node 1 deadlock detection sends message KJX_DEADLOCK_IND to node 2 to
continue deadlock detection.
Node 2 builds the same graph.
X1 is the owner of one lock of grant_q or convert queue of R132 conflicting with
X13.
Deadlock is found.
Node 2 may have concluded that there is no deadlock: X1 does not own anything else in
the graph, in which case it sends a BACKTRACK message to node 1, effectively asking it to
search further or to conclude that there is no deadlock.
Node 2 may also have found a resource on another node (node 1 or node 3), in which case
(like node 1) it sends the KJX_DEADLOCK_IND message for further remote processing.


Parallel DML (PDML) Deadlocks

Locks that are identified by the transaction identifier (XID) may fail to detect deadlocks involving PDML operations that have the same XID.
A spanning set of a transaction TX is the list of nodes where this transaction takes place.
The coordinator of the PDML transaction publishes the spanning set by using the CGS name service.
When a lock is opened and found to be involved in a PDML, then the IDLM is informed by the API (to perform a global DD).

Parallel DML (PDML) Deadlocks


A spanning set is identified with name = <XID> of TX and value = <spanning set>.
You can use the CGS name service to create or search for a name (the spanning set
identifier).


Deadlock Detection Algorithm

The simple algorithm is enhanced to account for the PDML identifiers.

Deadlock Detection Algorithm: Examples


In this example, transaction X1 converts a lock L on a resource R, then pushes RES(R) on
top of the stack.
While (stack is not empty) {
    pop element from STACK;
    if (element is a RES) {
        push all conflicting LOCKs on the grant queue on top of STACK
        push all conflicting LOCKs on the convert queue, or LOCKs ahead
            on the convert queue, on top of STACK
    } else if (element is a LOCK and the LOCK is remote) {
        send message to remote node to continue;
        save current stack and go back to normal TASK;
    } else {
        push the RES on whose convert queue the current LOCK waits
            on top of STACK
    }
}

A deadlock is found when X1 is found on top of the stack.
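As an illustration only (not kernel code), the single-node form of this stack-driven search might look like the following; remote handling is omitted and all names are invented:

```python
def find_deadlock(start_xid, start_res, conflicts, waits_on):
    """Single-node sketch of the stack-based deadlock search.

    conflicts maps a resource to the XIDs holding conflicting locks on it;
    waits_on maps an XID to the resource its converting lock waits on.
    Returns True if the search gets back to start_xid (a deadlock)."""
    stack = [("RES", start_res)]
    visited = set()
    while stack:
        kind, item = stack.pop()
        if kind == "RES":
            for xid in conflicts.get(item, ()):
                if xid == start_xid:
                    return True          # X1 found on top of the stack: deadlock
                if xid not in visited:
                    visited.add(xid)
                    stack.append(("XID", xid))
        else:  # an XID: follow its converting lock to the blocking resource
            res = waits_on.get(item)
            if res is not None:
                stack.append(("RES", res))
    return False

# X1 converts a lock on R1; X13 conflicts on R1 and itself waits on R132,
# where a conflicting lock is owned by X1: a deadlock
conflicts = {"R1": ["X11", "X12", "X13"], "R132": ["X1"]}
waits_on = {"X13": "R132"}
assert find_deadlock("X1", "R1", conflicts, waits_on)
```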


Deadlock Detection Algorithm: Examples (continued)


In this example, the dd-starting lock is a lock L on the convert queue of a resource R; push
(R, L) on top of the stack. The algorithm is recursive and uses a stack. The stack contains
elements of type RES, TXN, TXN_GLOBAL, TXN_REMOTE, SHADOW_GRANT, or SHADOW_CONVERT.
While (stack is not empty) {
    pop element from STACK;
    switch (type_of_popped_element) {
    case RES:
        L1 = popped lock
        for each lock L on grant queue of RES conflicting with L1 {
            if L belongs to the same XID as the dd-starting lock, then deadlock is found
            if L belongs to a remote node, then push (SHADOW_GRANT, L) on top of stack
            if L is local and not in a global TX, then push (TXN, L) on top of STACK
            if L is local and in a global TX, then push (TXN_GLOBAL, L) on top of STACK
        }
        for each lock L on convert queue conflicting with L1 {
            if L belongs to the same XID as the dd-starting lock, then deadlock is found
            if L belongs to a remote node, then push (SHADOW_CONVERT, L) on top of stack
            if L is local and not in a global TX, then push (TXN, L) on top of STACK
            if L is local and in a global TX, then push (TXN_GLOBAL, L) on top of STACK
        }
        break;
    case TXN:
        L1 = popped lock
        for each lock L with the same XID as L1 {
            if L is on the convert queue of a resource R, then push (RES, R) on top of stack
        }
        break;
    case TXN_GLOBAL:
        L1 = popped lock
        query CGS to find the associated spanning set
        for each node in the spanning set different from the local node {
            push (TXN_REMOTE, L1) on top of stack
        }
        if the local node also belongs to the spanning set, then push (TXN, L1) on top of stack
        break;
    case TXN_REMOTE:
        send message to remote node to continue (message type KJX_DEADLOCK_IND)
        go back to normal TASK
        break;
    case SHADOW_GRANT:
    case SHADOW_CONVERT:
        send message to remote node to continue (message type KJX_DEADLOCK_IND)
        go back to normal TASK    /* DD temporarily stops in this node */
        break;
    }  /* end switch */
}  /* end while */
If the stack is empty {
    if the local node is not the dd-starting node, then send a message to the
        dd-starting node to perform a BACKTRACK
    if the local node is the dd-starting node, then DEADLOCK is not found
}


Deadlock Detection Algorithm: Examples (continued)


When LMD0 in a node receives a KJX_DEADLOCK_IND message asking for
BACKTRACK, deadlock detection resumes with the stack at the appropriate position.
When LMD0 in a node receives a KJX_DEADLOCK_IND message asking to continue DD,
then:
switch (sub-type of message) {
case message_sent_by_SHADOW_GRANT:
case message_sent_by_SHADOW_CONVERT:
    push the involved resource on top of STACK
    break;
case message_sent_by_TXN_REMOTE:    /* GLOBAL transaction */
    for each converting lock L owned by the involved GLOBAL transaction {
        R = resource on whose convert queue L waits
        push (RES, R) on top of STACK
    }
    break;
}
process STACK as described on the previous page

Deadlock Validation Steps

When the stack is popped, a wait-for graph (a list of linked locks keeping track of the DD path) is built at the same time.
When a deadlock is found, deadlock validation occurs.
The validation also identifies the victim lock.
The victim lock is generally the starting deadlock-search lock.

Deadlock Validation Steps


If deadlock is found on a node other than the dd-starting node {
    send a message to the dd-starting node asking for validation
    /* the dd-starting node, when receiving the request for validation,
       will start validation as below */
} else {  /* validation */
    follow the wait-for graph and examine it lock by lock
    { if a lock in the wait-for graph is invalid (canceled), then the whole DD
      is invalidated }
}
if the node of the last lock in the wait-for graph is the local node, or
this node receives a request for a validation and the wait-for graph is
already validated {
    /* the whole wait-for graph is validated; here we must be in the
       dd-starting node */
    if local_node is not the lowest node {
        send the wait-for graph to the lowest node to print
    } else print the whole wait-for graph
} else {
    if local_node is not the lowest node, send the wait-for graph to the
        lowest node to print
    send a message to the node of the last lock in the wait-for graph to continue
    VALIDATE (LMD0 of this node will validate the subgraph with the previous
    code)
}


Code References

ksq.*: Kernel Service enQueues


Summary

In this lesson, you should have learned about:


GES activity in locking resources
LMD0 deadlock detection


Cache Coherency (Part Two)

Blocks/PCM Locks


Objectives

After completing this lesson, you should be able to do


the following:
Describe the global cache service concepts and
components
Outline the history of Cache Fusion
Describe the flow of blocks and their locks in
Cache Fusion


Cache Coherency: Blocks

[Slide diagram: within each node, an instance stack of caches (kcb/kcl),
GRD (GCS), and CGS sits on top of NM and CM, and communicates with the
other nodes through the IPC layer.]

Cache Coherency: Blocks


The GRD consists of Global Cache Services (GCS), which handles the data blocks, and
the Global Enqueue Service (GES), which handles the enqueues and other global
resources.
The term cache coherency is often used to refer to keeping the data buffer caches
coherent across instances, as it does represent the bulk of the cache coherency activity.
This cache coherency is handled by PCM locks. The block cache coherency can be
handled in two ways: disk pings and Cache Fusion. Oracle9i has both methods available.


Block Cache Contention

Block cache contention occurs when two caches want the same resource:
Read/read contention
Write/read contention
Write/write contention

[Slide diagram: the Holder instance ships the resource to the Requestor.]

Block Cache Contention


Contention occurs when instance H holds the resource and instance R requests the
resource. Instance R gets the resource. The complexity of Cache Fusion depends on how
much control of the resource is retained by instance H and the different types of requests
supported, such as current read and consistent read.
In Oracle9i, the resource could be sent via the communication services in all of the
following three cases. Enabling multiblock locks disables this.
Read/Read Contention
Read/read contention is currently not a problem due to shared disk architecture. A data
block from a read-only tablespace can be read by any instance without DLM
intervention. The blocks read this way (for example, from read-only tablespaces) are not
transferred across the caches in the current implementation.
Write/Read Contention
Depending on the read request type, instance H reduces its access rights (downgrades
the lock) on the block and sends a copy to instance R. This was the major change in
Oracle8i.
Write/Write Contention
Instance H reduces its access rights on the block and sends a copy to instance R. This
was the major change in Oracle9i. In earlier releases, it would flush the block to disk.

Earlier Cache Coherency:
Oracle8 Ping Protocol

Checks the instance for a lock
Requests DLM to acquire the lock in the specified mode
If there is a conflict, the master asks the holder to write to disk and
downgrade: a BAST is sent, and an AST is sent on the successful downgrade.
Reads the block from the disk

Oracle8 Ping Protocol


The protocol was also used in Oracle7 and earlier versions. All block data transfer was
via the disks. The DLM kept track of block ownership; that is, either one instance had
exclusive access, or several instances had shared read access. Any read request thus
involved a downgrading from exclusive to shared mode, by:
Flushing the redo log
Writing the block to the disk
Recovery
For recovery of one, several, or all instances, only the log threads of failed instances
apply. The log threads can be processed in any order. The ping protocol effectively
penalizes the steady-state OPS performance in favor of simpler and efficient recovery.


Earlier Cache Coherency:
Oracle8i CR Server

Designed for the write/read contention
Holder constructs the consistent read copy.
CR blocks are shipped across the communication path.
A fairness counter implements the Light Work Rule.

Oracle8i CR Server
The holder of a data block, on receiving a consistent read (CR) request, uses the undo
data (the blocks of which were locally resident in the cache) to construct the block.
Light Work Rule and Fairness Counter
If creating the consistent read version block involves too much work (such as reading
blocks from disk), then the holder sends the block to the requestor, and the requestor
completes the CR fabrication. The holder maintains a fairness counter of CR requests.
After the fairness threshold is reached, the holder downgrades its lock mode.
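The holder-side decision can be sketched as follows. This is a minimal illustrative sketch, not kernel code: the function name, the counter handling, and the threshold value of 4 are assumptions, not the actual internal parameter.

```python
# Sketch of the CR-server decision on the holder side (hypothetical
# names; the threshold value is an assumption for illustration).
FAIRNESS_THRESHOLD = 4

def handle_cr_request(fairness_counter, needs_disk_reads):
    """Return (action, new_counter) for one incoming CR request."""
    if needs_disk_reads:
        # Light Work Rule: building the CR copy is too expensive here,
        # so let the requestor complete the CR fabrication itself.
        return ("send_incomplete_block", fairness_counter)
    fairness_counter += 1
    if fairness_counter >= FAIRNESS_THRESHOLD:
        # Served enough CR requests: downgrade the lock mode so the
        # requestor can read the block itself.
        return ("downgrade_lock", 0)
    return ("send_cr_copy", fairness_counter)
```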


Earlier Cache Coherency:
Oracle8i CR Server

Requesting instance:
Foreground process prepares the buffer.
Sends the message to the master and waits
Gets the CR buffer or a lock to read from disk

Master:
Checks the lock mode
Forwards the request to the holder if X mode is held
Grants a shared lock to the requestor on other modes

Holder:
Sends the CR buffer

Oracle8i CR Server (continued)


The CR Server code was executed by a dedicated process, the Block Server Process
(BSP).
The Oracle8 ping protocol is used in the case of write/write contention, or any request
other than those for a consistent read.


Oracle9i Cache Fusion Protocol

Addresses write/write contention
Eliminates the disk ping protocol; sends current blocks via the
communication path
Handles the recovery of blocks that have been transferred across the cache
Uses the CR server functionality for write/read contention

Oracle9i Cache Fusion Protocol


There are problems when shipping current blocks between instances. Consider a simple
case:
1. Instance A modifies a block, then the block is shipped to instance B. Before any
dirty block is sent, a log flush is made.
2. Instance B modifies the block, then the block is shipped back to instance A.
3. Instance A modifies the block again. No write of the block to disk has occurred in
any steps.
Note
If instance A dies, then its log contains records of modifications with a gap.
Modifications done in instance B are stored in the log of instance B.
For instance A's crash recovery of the block, the two logs must be merged before they
can be applied. The current recovery code does not support this, except for media
recovery. The log merge, even if implemented, would require time and resources that
are proportional to the total number of instances. It does not matter whether instance B
does the crash recovery or not.


GCS (PCM) Locks

PCM locks manage the locking of data blocks in the buffer cache.
PCM locks are internally mapped to a lock element and a block class.
The block classes are described in V$LOCK_ELEMENT, based on X$LE.
The PCM lock state information is stored in data structures called lock
elements.
The LMSn processes handle the PCM locks.

GCS (PCM) Locks


The synchronization cost for instance locks can be high. PCM locks are typically much
more numerous than non-PCM locks. The number of non-PCM locks does not grow as
high as the number of PCM locks. The local enqueues that become global can still be
seen in the V$LOCK view. Some instance locks and PCM locks, however, cannot be
seen in the V$LOCK view.


PCM Lock Attributes

Cache Fusion separates PCM lock attributes into:
Lock modes
Lock roles
Past images

PCM Lock Attributes


Cache fusion changes the use of PCM locks in the Oracle server and relates the locks to
the shipping of blocks through the system via IPC. The objectives are to separate the
modes of locks from the roles that are assigned to the lock holders, and to maintain
knowledge about the versions of past images of blocks throughout the system.


Lock Modes

PCM locks use the following modes:
Exclusive (X)
Shared (S)
Null (N)

Lock mode compatibility is described as:

        X   S   N
    X   -   -   +
    S   -   +   +
    N   +   +   +

Lock Modes
A lock mode describes the access rights to the resource.
The compatibility matrix is clusterwide. For example, if a resource has an S lock on one
instance, then there cannot be an X lock for that resource anywhere else in the cluster.
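The clusterwide compatibility matrix can be expressed as a simple grant check. This is an illustrative sketch; the table and function names are not Oracle's.

```python
# N is compatible with everything, S with S and N, X only with N.
COMPAT = {
    ("X", "X"): False, ("X", "S"): False, ("X", "N"): True,
    ("S", "X"): False, ("S", "S"): True,  ("S", "N"): True,
    ("N", "X"): True,  ("N", "S"): True,  ("N", "N"): True,
}

def can_grant(requested, held_modes):
    """A mode can be granted only if it is compatible with every mode
    currently held anywhere in the cluster."""
    return all(COMPAT[(requested, h)] for h in held_modes)
```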


Lock Roles

Roles can be:
Local: Block is dirty in the local cache.
Global: Block is dirty in a remote cache or several caches.
Roles are for Cache Fusion.

Lock Roles
A lock role describes how the resource is to be handled. The treatment differs if the
block resides in only one cache.


Past Image

Is an indication:
0: It is absent.
1: It is present.
Is present on modified block buffers that are not current

Lock Past Image Attribute


Initially, a block is acquired in a local lock role with no past images. If the block is
modified locally and other instances express interest in the block, then the instance
holding the block keeps a past image (PI) and ships a copy of the block, and then the
role becomes global.
A PI represents the state of a dirtied buffer. Initially, a block is acquired in L role, with
no past images present. The node that modifies the block keeps past images, as the lock
role becomes G, only after another instance expresses interest in this block. A PI block
is used for efficient recovery across the cluster, and can be used to satisfy a CR request,
remote or local.
A PI must be kept by the node until it receives notification from the master that a write
to disk has completed covering that version. The node then logs a Block Written Record
(BWR). The BWR is not necessary for the correctness of recovery, so it need not be
flushed.
When a new current block arrives on a node, a previous PI is kept untouched because it
might be needed by some other node. When a block is pinged out of a node carrying a
past image and the current version, it might or might not be combined to a single PI. At
the time of the ping, the master tells it whether there is a write in progress that will
cover the older past image. If a write is not in progress, then the older PI is replaced by
the existing current block. If a write is in progress, then this merge is not done and the
existing current becomes another PI. There can be an indeterminate number of PIs.

Local Lock Role

Possible lock modes are S or X.
All changes are on the disk version, except for any local changes (mode X).
When requested by the master instance, the holding instance serves a copy
of the block to others.
If the block is globally clean, then this instance's lock role remains
local.
If the block is modified by this instance and passed on dirty, then a past
image is retained and the lock role becomes global.
The lock holder reads from disk if the block is not in the cache.
The lock holder may write the block if the lock is X.

Local Lock Role


The local role states that the block can be handled very similarly to the way it is done in
single instance mode. In local role, the lock mode reads from disks and writes the dirty
block back to disk when it ages out without any further DLM activity.


Global Lock Role

Possible lock modes are N, S, or X.
Implies other instances also had or have the block in global mode.
The block is globally dirty when role G is assigned.
The instance can modify the block further in mode X.
The instance cannot read from disk; it is not known whether the disk copy
is current or not.
The instance serves a copy to others when instructed by the master.
The instance may only write a block in X mode or a PI when directed by
the master.
Write requests must be sent to the master.

Global Lock Role


A global lock role limits the handling of a block, because another instance also has a
dirty version of the block, and the disk version of the block is obsolete.


Block Classes

There are ten classes of ORACLE blocks.
Each ORACLE block is protected by a PCM lock that is described by a lock
element structure.

Block Classes
Class  Description
1      DATA
2      SORT. These are never protected by PCM locks, because they are
       private to one instance.
3      SAVE UNDO BLOCK, used for TBS management
4      SEGMENT HEADER
5      SAVE UNDO SEGMENT HEADER, used for TBS management
6      FREE-LIST
7      EXTENT MAP, used for unlimited extents
8      BITMAP BLOCK for locally managed tablespaces
9      BITMAP INDEX BLOCK for locally managed tablespaces
>=11   If odd, it is an UNDO HEADER, and the block class is
       (RBS_number*2) + 11, used for the transaction table.
       If even, it is an UNDO BLOCK, and the block class is
       (RBS_number*2) + 12, used for undo blocks.
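The undo-class arithmetic from the table can be checked with a small sketch (the helper names are hypothetical):

```python
def undo_block_class(rbs_number, is_header):
    """Block class for undo, per the table above:
    header -> RBS_number*2 + 11 (odd); block -> RBS_number*2 + 12 (even)."""
    return rbs_number * 2 + (11 if is_header else 12)

def rbs_from_class(block_class):
    """Invert the mapping for classes >= 11: odd is an undo header,
    even is an undo block."""
    if block_class < 11:
        raise ValueError("not an undo class")
    if block_class % 2:
        return ("UNDO HEADER", (block_class - 11) // 2)
    return ("UNDO BLOCK", (block_class - 12) // 2)
```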


Lock Elements (LE)

Reside in the SGA
Hold lock state information (converting, granted, and so on)
Are managed by the lock processes to determine the mode of the locks
(exclusive, null, shared, and so on)
Hold a chain of cache buffers that are covered by the lock element
Allow the Oracle database to keep track of cache buffers that must be
written to disk in case a lock element (mode) needs to be downgraded
(X -> N)

Lock Elements
The lock elements (LE) are also known as BL type enqueues.


Allocation of New LE

For blocks other than UNDO:
id1 = BNO | (AFN << 22)
id2 = (AFN >> 10) << 15

For UNDO blocks:
id1 = (BNO / _kcl_undo_grouping) % _kcl_undo_locks
id2 = block class

The LE is identified by <BL, id1, id2>.
Which LMSn to use is given by: (id1 + id2) % (number_of_LMS_procs)

Allocation of a New LE
The block that is to be covered by the LE has an absolute file ID (AFN) and a block
number (BNO).
Note: Cache fusion applies only to blocks other than UNDO.
The default value of _kcl_undo_grouping is 32.
The default value of _kcl_undo_locks is 128. This represents the number of locks
per UNDO segment.
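The slide's formulas can be written out as follows. This is a sketch: `//` is integer division, and the two underscore parameters use the default values given above.

```python
# Defaults from the notes above.
KCL_UNDO_GROUPING = 32   # _kcl_undo_grouping
KCL_UNDO_LOCKS = 128     # _kcl_undo_locks

def le_id_data(afn, bno):
    """<id1, id2> for a non-undo block: absolute file number AFN,
    block number BNO."""
    id1 = bno | (afn << 22)
    id2 = (afn >> 10) << 15
    return id1, id2

def le_id_undo(bno, block_class):
    """<id1, id2> for an undo block."""
    id1 = (bno // KCL_UNDO_GROUPING) % KCL_UNDO_LOCKS
    return id1, block_class

def lms_for(id1, id2, n_lms):
    """Which LMSn process handles this LE."""
    return (id1 + id2) % n_lms
```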


Hash Chain of LE

Every active releasable LE is in one hash chain.

[Slide diagram: an array of hash chain heads, each anchoring a linked
chain of LEs.]

Hash Chain of LE
The number of hash chain heads or buckets (NBH) is the nearest prime lower than
_db_block_buffers.
The hash algorithm for LE is ID1 modulus NBH.
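A sketch of the bucket computation (the trial-division prime search is illustrative, not the kernel's method):

```python
def nearest_prime_below(n):
    """Largest prime strictly lower than n."""
    def is_prime(k):
        if k < 2:
            return False
        return all(k % d for d in range(2, int(k ** 0.5) + 1))
    k = n - 1
    while not is_prime(k):
        k -= 1
    return k

def le_hash_bucket(id1, db_block_buffers):
    """Hash chain head for an LE: id1 modulo NBH, where NBH is the
    nearest prime lower than _db_block_buffers."""
    nbh = nearest_prime_below(db_block_buffers)
    return id1 % nbh
```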


Block to LE Mapping

[Slide flowchart, as pseudocode:]
BEGIN
loop {
    if an LE with the same (id1, id2) is in the hash chain {
        use it for the block; END
    }
    if some LE is on the free list {
        take an LE from the free list and initialize it with (id1, id2)
        link the LE into the hash chain; END
    }
    post LMS to free some LEs
    wait 20 ms on "global cache freelist wait"
}   /* retry only once */

Block to LE Mapping
When LEs need to be freed, the process posts the LMS that is associated with the
<id1, id2> LE. The statistic "global cache freelist waits" is incremented.


Queues of LE for LMS

[Slide diagram: a latch protecting four queues of LEs.]
Down-convert queue: LEs with a BAST
Lazy-close queue: processed when the WRITE is done
Deferred-ping queue: processed on timeout
Long-flush queue: waiting for a log flush

Queues of LE for LMS


Each LMS process has a number of latches equal to gc_latches. Each latch protects
several queues.
The lazy-close queue is also used for clearing blocks and no-buffer operations.


LMSn Free of LE

[Slide flowchart, as pseudocode:]
BEGIN
get the latch of the associated lazy-close queue
choose an LE from the queue
if a buffer is linked to the LE {
    get the buffer's hash list
} else {
    compute (rdba, tsn) from the LE; get the hash list of (rdba, tsn)
}
if the hash latch is not obtained in shared, no-wait mode {
    free the queue latch
    get the hash latch in shared mode, waiting
}
go through the code path of BAST management
END

Cache Fusion Examples: Overview

[Slide diagram: initial state — four instances A, B, C, and D; instance D
is the master; the block is on disk at SCN 1008.]

Cache Fusion Examples: Overview


Initial State
The examples in the following slides show the messages and the resource status changes
in Cache Fusion, and the transfer of blocks between instances and disks. This slide
shows the setup that is used for these examples.
There are four instances, A, B, C, and D, and a shared drive. For simplicity, the
examples use just one block that is initially shown on the disk with a system change
number (SCN) of 1008.
Lock state has a three-letter indication; lock mode is indicated with the letters N, S, or X;
lock role is indicated with the letters L or G; and PI is shown with 0 or 1.
The block that is used throughout these examples has its resource master on instance D.
This is to show the lock messages clearly. If the lock master and the block coincide on a
node, then some optimizations occur, reducing the number of messages or choosing a
different way to get the block.


Cache Fusion Examples: Overview (continued)


The following slides show the final state of the block transition as well as the lock
transitions. The initial state for each example is the final state of the previous
example according to the roadmap:
(1) --+-> (3) ---> (4) --+-> (5) ---> (6)
      |                  |
      +-> (2)            +-> (7) --+-> (8) ---> (9)
                                   |
                                   +-> (10)

Example 11 is a stand-alone example.


Cache Fusion: Example 1

Getting a block from the disk:

[Slide diagram: 1:LReq(S,C) from C to master D; 2:Grant SL0; 3:Read from
disk; 4:Notify — C now holds the block at SCN 1008 as SL0.]

Example 1: Getting a Block from the Disk


Instance C wants to read the block in its current version.
1. Instance C sends a message to the master requesting a shared lock on the block.
The (S,C) indicates Shared, Instance C.
2. Master D grants instance C the lock as SL0.
3. After C receives the lock, it initiates an I/O to read the block from disk.
4. C is notified that the read is complete and now holds the block with SCN 1008.


Cache Fusion: Example 2

Getting a block from the cache:

[Slide diagram: 1:LReq(S,B) to master D; 2:Ping(S,B) to C; 3:Send(SL,SL0)
from C to B; 4:Assume(SL0) — B and C both hold the block at SCN 1008 as
SL0.]

Example 2: Getting a Block from the Cache


Following example 1, the steps for instance B attempting to read the block are:
1. Instance B requests master D for a shared lock. It has no knowledge of where the
block is; it simply asks for the access rights of a shared lock.
2. The lock master at instance D knows that the block is being held in instance C;
therefore it sends a ping message to instance C, instead of granting the lock as it
did in example 1.
3. Instance C sends the block to instance B and indicates that instance B should take
the S lock and the current lock mode and role of instance C, which is SL0.
4. Instance B sends a message to master D that it received the block and will assume
SL0. This message is sent asynchronously, whereas other messages were sent
synchronously.
Optimization in the code may decide that it is less of a load on the whole cluster (or less
latency) to read the block from the disk, instead of sending messages and blocks over
the network.


Cache Fusion: Example 3

Getting a clean block from the cluster for modifications:

[Slide diagram: 1:LReq(X,B) to master D; 2:Ping(X,B) to C; 3:Send(X,Close)
from C, which keeps only a lockless CR copy; 4:Assume — B assumes XL0 and
modifies the block to SCN 1009.]

Example 3: Getting a Clean Block from the Cluster for Modifications


Following example 1, the steps when instance B requires the block in write mode are:
1. Instance B requests master D for an exclusive (X) lock.
2. The lock master knows all the nodes that hold an S lock and sends a ping (X,-)
message to close their locks (that is, to discard their copy), until only one is left.
The lock master then sends a ping(X,B) to C.
3. Instance C sends the block to instance B with lock information and closes its lock;
that is, it discards the block. The block can be held in CR mode; this does not
require a lock, and this is not a PI.
4. Instance B sends a message to master D that it has assumed XL0. It then modifies
the block to SCN 1009.


Cache Fusion: Example 4

Getting a dirty block from the cluster and modifying it:

[Slide diagram: 1:LReq(X,A) to master D; 2:Ping(X,A) to B; 3:Send(XG,NG1)
from B, which keeps a PI at SCN 1009; 4:Assume(XG0,NG1,1009) — A modifies
the block to SCN 1013.]

Example 4: Getting a Dirty Block from the Cluster and Modifying It


Following example 3, the steps for instance A attempting to modify the block are:
1. Instance A sends an X lock request to master D.
2. Master D sends a ping to B to give up the block to instance A.
3. Instance B sends the dirty block to instance A, retains the block, and converts its
lock to NG1. The SG1 mode would be incompatible with the XG0 lock at instance
A. Instance B has a PI at SCN 1009.
4. Instance A informs master D that it got the block and assumed XG0 on the lock. It
then modifies the block to SCN 1013.
This is a write/write contention.
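The mode/role/PI bookkeeping of this kind of transfer can be sketched as a tiny state function. This is a hypothetical representation, with a lock state written as a (mode, role, PI-count) triple matching the slide's XL0/NG1 notation.

```python
def serve_dirty_current(holder_state, requested_mode):
    """holder_state is (mode, role, pi_count), e.g. ("X", "L", 0).
    Shipping a dirty current block makes both locks global and leaves a
    past image behind on the sender; the sender drops to N for an X
    request, or to S for an S request."""
    mode, role, pi_count = holder_state
    new_mode = "N" if requested_mode == "X" else "S"
    new_holder = (new_mode, "G", pi_count + 1)
    requestor = (requested_mode, "G", 0)
    return new_holder, requestor
```

With these triples, Example 4 is `("X","L",0)` serving an X request (holder becomes NG1, requestor XG0), and Example 5 is `("X","G",0)` serving an S request (holder becomes SG1, requestor SG0).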


Cache Fusion: Example 5

Getting a shared copy of the writeable buffer:

[Slide diagram: 1:LReq(S,C) to master D; 2:Ping(S,C) to A; 3:Send(SG,SG1)
from A, which keeps a PI and becomes SG1; 4:Assume(SG0,SG1,1013) — C holds
the block at SCN 1013 as SG0.]

Example 5: Getting a Shared Copy of the Writeable Buffer


Following example 4, the steps for instance C attempting to read the current block are:
1. Instance C sends a share lock request to master D.
2. Master D sends a ping to instance A, saying instance C wants a share copy.
3. Instance A, when it has finished the work on the buffer, flushes the redo log and
sends the block at SCN 1013 to instance C, retains a PI, and converts its lock to
SG1.
4. Instance C gets the block at SCN 1013 and sends a message to the master that it
assumed a lock mode of SG0.


Cache Fusion: Example 6

Getting a shared copy of the dirty shared buffer:

[Slide diagram: 1:LReq(S,B) to master D; 2:Ping(S,B) to A; 3:Send(SG,SG1)
from A; 4:Assume(SG1,SG1,1013) — B holds the block at SCN 1013 as SG1.]

Example 6: Getting a Shared Copy of the Dirty Shared Buffer


Following example 5, instance B wants a shared copy of the block. This differs from
example 2 as the blocks are dirty (the disk copy is out of date) and available in two
caches.
1. Instance B sends master D a request for an S lock.
2. Now master D knows that both A and C have a shared copy of the block. It
chooses one instance and sends a ping message.
3. Instance A sends the block to instance B with lock information.
4. Instance B sends a message that it has assumed the lock in SG1 mode.
The Shared Selection Rule picks an instance that holds the resource in decreasing
preference from this list:
Master, if it has a lock S (shortest message path)
Instance with S mode holding the last PI (most recent nonmaster access)
Shared Local
Most recently granted S
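A sketch of this preference list follows. The field names are hypothetical, and breaking ties among shared-local holders by most recent grant is an assumption made so that the result matches Example 9.

```python
def pick_server(holders, master):
    """Pick the S-mode holder asked to serve the block, per the
    preference list above.  holders: list of dicts with keys
    node, mode, local (True for lock role L), has_last_pi, and
    grant_order (larger = granted more recently)."""
    s_holders = [h for h in holders if h["mode"] == "S"]
    for h in s_holders:                  # 1. the master itself
        if h["node"] == master:
            return h["node"]
    for h in s_holders:                  # 2. S holder with the last PI
        if h["has_last_pi"]:
            return h["node"]
    by_recency = sorted(s_holders, key=lambda h: -h["grant_order"])
    for h in by_recency:                 # 3. shared-local holder
        if h["local"]:
            return h["node"]
    return by_recency[0]["node"]         # 4. most recently granted S
```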


Cache Fusion: Example 7

Writing blocks back to disk:

[Slide diagram: 1:ReqW from B to master D; 2:ReqW forwarded to A; 3:Write
of the SCN 1013 block to disk; 4:Notify; 5:W Notify to the master;
6:Flush PI — PI holders discard their past images.]

Example 7: Writing Blocks Back to Disk


Following example 4, the steps for instance B attempting to write the block are:
1. Instance B sends a write request to master D with the necessary SCN.
2. Master decides the current node or latest holding node for the requested write. In
this case, it sends the write request to A and remembers that it asked A to write the
block.
3. Instance A issues a write to disk.
4. Instance A gets the notification that the write has completed.
5. Instance A notifies the master that the write has completed.
6. On receipt of write notification, master D tells all PI holders to discard their locks
and the block buffer.


Cache Fusion: Example 8

Getting the shared buffer once it is written:

[Slide diagram: the requestor asks master D for an S lock; the master
pings holder A; A sends the block with Send(SL,SL) and downgrades to SL0;
the requestor assumes SL0 at SCN 1013.]

Example 8: Getting the Shared Buffer Once It Is Written


Following example 7, the steps for instance C attempting to read the block, after it has
been written, are:
1. Instance C requests the shared lock from master D.
2. Master D knows instance A holds the lock in XL0, and sends a ping message to
instance A.
3. Instance A sends the block to instance C and downgrades its lock to SL0.
4. Instance C assumes SL0.


Cache Fusion: Example 9

Getting the shared buffer from multiple copies:

[Slide diagram: 1:LReq(S,B) to master D; 2:Ping(S,B) to C; 3:Send(SL,SL)
from C; 4:Assume(SL0,SL0) — A, B, and C all hold the block at SCN 1013 as
SL0.]

Example 9: Getting the Shared Buffer from Multiple Copies


Following example 8, this example shows instance B getting a shared copy of the block.
In this case, both A and C are the candidates. According to the Shared Selection Rule,
instance C (last one to receive the shared lock) gets the ping message and serves the
block.


Cache Fusion: Example 10

Getting the shared buffer from dirty:

[Slide diagram: the requestor asks master D for an S lock; the master
pings A, which downgrades XL0 to SG1, retains a PI, and sends the block at
SCN 1015; the requestor assumes SG0.]

Example 10: Getting the Shared Buffer from Dirty


After writing the block in example 7, instance A further dirties the block. Instance C
attempts to read the block. In this case, instance A downgrades its lock to SG1, retains
the PI, and sends the block to instance C.


Cache Fusion: Example 11

Consistent read request:

[Slide diagram: 1.1:CRreq from B to master D (1.2:NoCRavailable if no
cache holds a usable copy); 2:Make CR request forwarded to C; 3:C creates
a CR copy from its SCN 1025 buffer; 4:C sends the CR image to B.]

Example 11: Consistent Read Request (CR Server)


The previous examples have shown current read for shared or exclusive in all cases. For
consistent read, the CR server process is used. A CR block is without a lock as it is a
local scratch copy by definition.
1. Instance B requires a version 1013 of the block and has no block of higher
version in its own buffer.
a. It sends a CR request to the master.
b. If there is no appropriate block copy in the other caches, then the master
returns the request, indicating that instance B must get the current copy of
the block, and performs rollback. This would then be the same as example 1
earlier.
2. In the slide diagram, instance C has a copy of the block, but this is a later version.
The master instance D sends the request to instance C to ship a 1013 version of
the block.
3. Instance C takes the 1025 buffer, makes a copy, and applies undo on the copy
until it matches 1013.
4. The block is sent. Instance B receives no lock change, and there is no assume
message. If instance C is unable to make the CR copy because it does not have the
undo blocks available, it sends a message to instance B to construct the CR block
itself. The Light Work Rule also causes instance C to flush its 1025 copy to disk,
thus enabling instance B to get a read-current copy (share lock) to construct its
own CR copies.

Views

V$LOCK_ELEMENT: Based on X$LE, shows the status of each PCM lock stored
in the SGA
V$BH: Based on X$BH, shows the status and pings of every buffer

Views
X$BH: see WebIV note 33568.1


Views (continued)
V$LOCK_ELEMENT
lock_element_addr:  raw address for the lock element covering a buffer
indx:               lock element number
class:              block class (1 = data/index, 2 = sort, etc.)
lock_element_name:
flags:              status of the lock element (1 = fusion lock, 2 = no
                    buffer on LE, 4 = has deferred ping, 8 = LE waiting
                    for log flush, 16 = LE is being evicted, 32 = LE has
                    been deactivated, 64 = LE is fixed)
mode_held:          lock mode held (0 = null, 3 = S, 5 = X)
block_count:        number of blocks covered by the PCM lock
releasing:          release flags; nonzero if the PCM lock is being
                    downgraded
acquiring:          acquiring flags; nonzero if the PCM lock is being
                    upgraded
invalid:            nonzero if the PCM lock is invalid; always 0 in
                    V$LOCK_ELEMENT

Release Flags
KCLLEBP     0x01   Process has sent a request to the DLM.
KCLLEAP     0x02   Acquisition pending; the lock operation has been
                   started.
KCLLERECON  0x04   CR request aborted because of reconfiguration.
KCLLEINVAL  0x08   CR request could not be started because of
                   reconfiguration.
KCLLECOMM   0x10   CR request failed because of a timeout.
KCLLENRN    0x20   No recovery needed.
KCLLESUSP   0x40   PI is suspect.
KCLLEHIGH   0x80   Our PI is the highest (can be made current).

Acquire Flags
KCLLEBA     0x01   BAST has been delivered.
KCLLESHR    0x02   Downgrade to SHARE mode.
KCLLECLS    0x04   About to be closed.
KCLLESCP    0x08   Scan completed.
KCLLERP     0x10   Release processing; enables down-convert.
KCLLEDCL    0x20   On the down-convert list.
KCLLEDCS    0x40   Down-convert has been started.
KCLLEREAL   0x80   Real BAST has arrived during a fake BAST.
KCLLEDFR    0x100  BAST has been deferred once.

More detail is in kcl0.h.

Views (continued)
V$BH
file#:               datafile number
block#:              block number
class#:              class of the block
status:              status of the block (free = not in use, xcur =
                     exclusive, scur = shared current, cr = consistent
                     read, read = reading from disk, mrec = mr mode,
                     irec = ir mode)
xnc:                 number of PCM lock conversions
lock_element_addr:   raw lock element address
lock_element_name:
lock_element_class:
dirty:               (Y) block modified
temp:                (Y) temporary block
ping:                (Y) block pinged
stale:               (Y) block is stale
direct:              (Y) direct block
new:                 (Y) new block
objd:                object number
ts#:                 tablespace number

The STATE column of X$BH can contain the following values:
0 = FREE, 1 = EXLCUR, 2 = SHRCUR, 3 = CR, 4 = READING, 5 = MRECOVERY,
6 = IRECOVERY, 7 = WRITING, 8 = PI

Parameters

_LM_LMS
Default value: min(#CPU/4, 10); 0 if cluster_database is false

GC_FILES_TO_LOCKS
Same values as in Oracle8i, but setting this disables Cache Fusion for
the specified files

Summary

In this lesson, you should have learned about:


Cache fusion implementation levels
Flow of locks and blocks in Cache Fusion


Cache Fusion 1

CR Server

Copyright 2003, Oracle. All rights reserved.

Objectives

After completing this lesson, you should be able to do


the following:
Describe Consistent Read (CR) Cache Fusion
Outline the flow of CR request handling

10-289

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-289

Cache Fusion: Consistent Read Blocks


Node

Other
nodes

Instance

Caches

kcb/kcl
GRD(GCS)
CGS

IPC

NM
CM

10-290

Copyright 2003, Oracle. All rights reserved.

Cache Fusion: Consistent Read Blocks


Cache coherency of Consistent Read (CR) blocks was introduced in Oracle8i. The
configuration of the feature and its detailed implementation are different in Oracle9i,
but the functionality is the same.

DSI408: Real Application Clusters Internals I-290

Consistent Read Review

10-291

Current Block: The most recent version of a block


CR Block: A coherent version of a block with only
the committed changes
CRSCN: SCN for block
CR_Xid: Transaction ID for which block is limited
CR_uba: UBA for transaction: kcbdsxid
CRSfl: Snapshot flag
Snap_SCN: SCN of a snapshot of a block from a
particular point in time
Snap_UBA: UBA at time Snap_SCN
Env_Scn: SCN at current time
Env_uba: UBA of the current transaction
Copyright 2003, Oracle. All rights reserved.

Consistent Read Review


Consistent Read is the Oracle implementation of the read committed isolation level.
There are two possibilities:
Statement level: Query results are consistent with respect to the start of the query
(snap_scn = current SCN when the query starts its Execute phase).
Transaction level: Query results are consistent with respect to the beginning of
the transaction (snap_scn = current SCN when the transaction begins).
Consistent Read sees the world in an asymmetric way: A transaction sees only other
transactions' committed changes, but it does see its own uncommitted changes.
CR_Xid, CR_uba, and CRSfl are available as CR stat structures that are associated
with each buffer in the cache (also available in X$BH).
The Snap_UBA is useful for the consistent read problem of a modified block, for
example:
UPDATE T SET status = status+1 WHERE status > 0
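The statement-level and transaction-level snapshot choices described above can be sketched as follows (an illustrative simplification; the function and its parameters are not Oracle code):

```python
# Choose the snapshot SCN for a consistent read.
def snap_scn(level: str, query_start_scn: int, txn_start_scn: int) -> int:
    if level == "statement":
        # Consistent as of the start of the query's Execute phase.
        return query_start_scn
    # Transaction level: consistent as of the start of the transaction.
    return txn_start_scn
```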

DSI408: Real Application Clusters Internals I-291

Getting a CR Buffer

ktrget:
Initializes a buffer cache CR scan request
Calls kcbgtcr for the best resident buffer to start
from to build the CR buffer
Calls ktrgcm to build the CR buffer by applying
undo
Returns CR buffer to the requestor

kcbgtcr:
Scans the hash bucket for the DBA for buffers that
may be used to build a CR buffer
If successful, returns the best candidate
(selected by ktrexf, the examination function)
If not successful, calls kcbget

10-292

Copyright 2003, Oracle. All rights reserved.

Getting a CR Buffer
All queries start by getting a CR version of the block.

DSI408: Real Application Clusters Internals I-292

Getting a CR Buffer

kcbget:
Retries the scan just tried by kcbgtcr
If you find a buffer, you return it now.
If not, then if it is being READ in or there is a current
mode buffer, you wait until it is available and then
rescan the buffer.
If these fail, you cannot use any locally cached
buffers.

If the above fails:


CR server manages the CR request.

10-293

Copyright 2003, Oracle. All rights reserved.

Getting a CR Buffer (continued)


The CR server was a separate background process in Oracle8i. In Oracle9i and later, the
same functionality is part of the LMS process.
Prior to Oracle8i, instead of issuing a CR request, a ping operation was started to get the
current block from disk.

DSI408: Real Application Clusters Internals I-293

Getting a CR Buffer in Oracle9i Release 2

Owner instance

Requesting instance

UNDO

Current
CR

10-294

CR

Copyright 2003, Oracle. All rights reserved.

Getting a CR Buffer in Oracle9i Release 2


This feature has been available since Oracle8i. In contrast, before Oracle8i the current
block and all undo blocks were pinged across to the requesting instance to construct
the CR buffer at its destination.

DSI408: Real Application Clusters Internals I-294

CR Server in Oracle9i Release 2


The requestor, the master, and the holder exchange interconnect
messages:
1. The requestor FG asks the master for a CR copy and the LOCK in
SHARE mode.
If there is no conflicting mode:
2. The master grants the LOCK.
3. An AST for the conversion is delivered to the requestor.
4. The requestor FG reads the block from disk, since the LOCK is
granted.
Otherwise:
3. The master sends the request to the holder LMS, including the
(port,IP) address for the answer.
4. The holder LMS builds the CR block and stops when it is completed
or I/O is required.
5. The holder LMS asks LGWR to flush the REDO.
6,7,8. LGWR writes the REDO to the log and posts LMS.
9. The holder LMS sends the CR buffer to the requestor.

10-295

Copyright 2003, Oracle. All rights reserved.

CR Server in Oracle9i Release 2


There are three instances involved: the requestor instance, the lock master instance, and
the current block owner instance.
The lock is granted if one of the following is true:
Resource held mode is NULL.
Resource held mode is S and there is no holder of an S lock in the master node.
Otherwise, the master forwards the CR request to the holder node.
If the lock is global, then you choose a node to forward the CR request to as follows:
If there is a past image (PI) at the lock master instance, and the PI SCN is greater
than snap-scn, then the master node is this node.
Otherwise, you choose the PI with the smallest SCN among PIs whose SCN is greater than snap-SCN. The owner node of this PI is the node you forward the CR request to. The PI
with smallest SCN is the most interesting one, because you have less UNDO to be
applied.
If there is no PI at all, you choose the node that the current buffer belongs to.
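The grant-or-forward decision above can be sketched as follows (a hypothetical simplification; the function, its parameters, and the (node, pi_scn) representation are illustrative only):

```python
# Decide whether the master grants the lock or forwards the CR
# request, following the rules above.
def handle_cr_request(held_mode, master_holds_s, master_node,
                      pis, current_owner, snap_scn):
    """pis: list of (node, pi_scn) past images in the cluster."""
    # Grant if the resource is held NULL, or held S with no S holder
    # at the master node.
    if held_mode is None or (held_mode == "S" and not master_holds_s):
        return ("grant", None)
    # Otherwise forward: prefer a PI at the master, then the PI with
    # the smallest SCN above snap_scn (least UNDO to apply), then the
    # node owning the current buffer.
    usable = [(node, scn) for node, scn in pis if scn > snap_scn]
    if any(node == master_node for node, _ in usable):
        return ("forward", master_node)
    if usable:
        return ("forward", min(usable, key=lambda p: p[1])[0])
    return ("forward", current_owner)
```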

DSI408: Real Application Clusters Internals I-295

CR Requests

If there is no usable local buffer:


Construct a message to the LMS master node
for the BL resource covering the block
Message contains:
Lock convert request
Message to the CR server for the requested buffer

10-296

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-296

CR Requests

Resource master node will either:


Grant the lock mode
Forward the CR request to PI or CURRENT holder
node

10-297

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-297

Light Work Rule

The LMS process of the node that the CR request
is forwarded to builds the CR buffer by calling
kcbgtcr or ktrget. LMS stops building the CR
buffer and sends what it has when the light work
rule fires:
I/O is required.
A buffer with the same class, same AFN, and same
blockID but with different objectID is found,
signifying a dropped or truncated object.
Write in progress

10-298

Ship that buffer to the requestor: Requestor


completes the CR build.

Copyright 2003, Oracle. All rights reserved.

Light Work Rule


The CR server only does light work, which does not include I/O.
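A hypothetical restatement of the conditions that fire the light work rule (the flags and function are illustrative, not actual kernel code):

```python
# Return the reason the light work rule fires, or None to keep
# building the CR copy on the serving side.
def light_work_reason(io_required, objd_mismatch, write_in_progress):
    if io_required:
        return "I/O required"
    if objd_mismatch:
        return "dropped or truncated object"
    if write_in_progress:
        return "write in progress"
    return None
```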

DSI408: Real Application Clusters Internals I-298

Fairness

LMS (building the CR buffer) also performs a down


convert of the lock covering the buffer, if:
The block is not UNDO and the lock is held in X
mode
There have been too many CR requests for the buffer
since the last change was made to the block. The
holder pings the block to disk. LMS does this if
there are more than _fairness_threshold CR requests.

10-299

_fairness_threshold default value is 4

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-299

Statistics
global cache gets
global cache get time
global cache converts
global cache convert time
global cache cr blocks received
global cache cr block receive time
global cache current blocks received
global cache current block receive time
global cache cr blocks served
global cache cr block build time
global cache cr block flush time
global cache cr block send time
global cache current blocks served
global cache current block pin time
global cache current block flush time
global cache current block send time
global cache freelist waits
global cache defers
global cache convert timeouts
global cache blocks lost
global cache claim blocks lost
global cache blocks corrupt
global cache prepare failures
global cache skip prepare failures

10-300
Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-300

Wait Events
global cache open s
global cache open x
global cache null to s
global cache null to x
global cache s to x
global cache cr request
global cache cr disk request
global cache busy
global cache freelist wait
global cache bg acks
global cache pending ast
global cache retry prepare
global cache cancel wait
global cache cr cancel wait
global cache pred cancel wait
global cache domain validation
global cache assume wait
global cache recovery free wait
global cache recovery quiesce wait
global cache claim wait

10-301

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-301

Fixed Table X$KCLCRST Statistics


REQCR: CR Request
REQCUR: CURRENT Request
REQDATA: DATA Block Request
REQUNDO: UNDO Block Request
REQTX: UNDO Header Request
RESCUR: CURRENT Result
RESPRIV: Only Readable By Requestor Result
RESZERO: Only Readable By 0 XID Result
RESDISK: Read From Disk Result
RESFAIL: Retry Result
RESWAIT:
FAIRDC: Fairness Down Convert
FAIRCL: Fairness Count Cleared
FREEDC: Fairness Down Convert On Free Lock Element
FLUSH: LMS Has To Wait For A Block Flush
FLUSHQ: Request Put On Log Flush Queue
FLUSHF: Log Flush Queue Full
FLUSHMX: Max Log Flush Time
LIGHT: Light Work Rule Signaled
LIGHT1, LIGHT2
ERROR: Some Error Signaled
HINT, NOCUR, PIPING, PIFAIL, WRITEPI

10-302

Copyright 2003, Oracle. All rights reserved.

Fixed Table X$KCLCRST Statistics


An extract of this table is available in the view V$CR_BLOCK_SERVER (V$BSP in
Oracle8i).

DSI408: Real Application Clusters Internals I-302

CR Requestor-Side Algorithm
ktrget:
BEGIN
Call kcbgtcr to get a best buffer.
Call ktrgcm to apply UNDO (if any) to produce a good CR buffer.
END

kcbgtcr:
BEGIN
Increment "consistent gets"; compute and follow the hash bucket;
for each buffer, call ktrexf to find the best buffer.
If the best buffer is found in the local cache, return it;
otherwise, call kcbzib to get the buffer.
END
10-303

Copyright 2003, Oracle. All rights reserved.

CR Requestor-Side Algorithm
The following statistics are incremented by ktrgcm:
cleanouts and rollbacks - consistent read gets is incremented if UNDO is applied to
BUFFER and CLEANOUT is performed.
rollbacks only - consistent read gets is incremented if UNDO is applied to
BUFFER and no CLEANOUT is performed.
cleanouts only - consistent read gets is incremented if no UNDO is applied and
CLEANOUT is performed.
no work - consistent read gets is incremented if no UNDO is applied and no
CLEANOUT is performed.
When UNDO is applied to produce a CR BUFFER, other UNDO blocks should be read.
When CLEANOUT is performed, the TX transaction table must be read.
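The four statistic buckets above reduce to a simple mapping (an illustrative sketch of the bookkeeping, not the ktrgcm code itself):

```python
# Map (undo applied, cleanout performed) to the statistic that
# ktrgcm increments.
def cr_statistic(undo_applied: bool, cleanout_done: bool) -> str:
    if undo_applied and cleanout_done:
        return "cleanouts and rollbacks - consistent read gets"
    if undo_applied:
        return "rollbacks only - consistent read gets"
    if cleanout_done:
        return "cleanouts only - consistent read gets"
    return "no work - consistent read gets"
```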

DSI408: Real Application Clusters Internals I-303

CR Requestor-Side Algorithm
kcbzib for CR request:
BEGIN
Call kcbzgb to get a buffer and set the state of this buffer to
READING.
If the database is not mounted shared, read the block from disk into
the buffer already allocated and increment the statistic "physical
reads".
Otherwise, call kclgclk, asking to convert the LE to SHARED mode
with the KCLCVCR option.
If bit KCBBHFCR is set or LE mode >= requested mode, read the block
from disk into the buffer already allocated and increment "physical
reads"; otherwise, the CR buffer is received and usable.
END

10-304

Copyright 2003, Oracle. All rights reserved.

CR Requestor-Side Algorithm (continued)


Bit KCBBHFCR is set if a timeout occurs during LE conversion.
LE mode >= requested mode, if the DLM conversion succeeded.
Note: Only the CR case is presented here.

DSI408: Real Application Clusters Internals I-304

CR Requestor-Side Algorithm
kclgclk:
BEGIN
Call kclcls for each buffer to see the status of its LE.
If some LE must be converted or opened, call kclscrs to start the
CR request, then call kclwcrs to wait for the CR to complete.
END

kclcls:
BEGIN
Find or locate the LE.
If the LE is in transition, wait 1 sec on "global cache busy" and
retry.
If the DLM requested mode > LE held mode, allocate a lock context,
link the buffer to the LE, and set bit 0x1 of LE->acquiring.
END
10-305

Copyright 2003, Oracle. All rights reserved.

CR Requestor-Side Algorithm (continued)


kclcls indicates whether some LE has to be opened (first time for buffer) or whether
some LE must be converted, because LE held-mode is smaller than CR requested
mode (S).
If an LE is associated with a global lock and the lock already exists (not NEW), then
you also allocate a lock context and link to the LE. You issue a predecessor-block read
for this LE. This is done because you no longer have the PI in your cache and you
cannot read from the disk because the lock is global.
If the LE mode already allows the requested mode, kclgclk returns and lets kcbzib read the buffer.
LE is in transition if acquiring != 0 or releasing != 0.

DSI408: Real Application Clusters Internals I-305

CR Requestor-Side Algorithm
kclscrs:
BEGIN
For each remaining LE: take the LE and set up a CR request.
If the LE lock is not opened yet, call kjbcropen and set bit 2 of
LE->acquiring.
Else, if the LE lock mode is NULL, call kjbcrconvert and set bit 2
of LE->acquiring.
Otherwise, call kjbpredread and set bit 2 of LE->acquiring.
END

10-306

Copyright 2003, Oracle. All rights reserved.

CR Requestor-Side Algorithm (continued)


In the three cases (lock open, lock convert, or predecessor read), you receive either a
buffer or a lock grant with some differences:
For lock-open or lock-convert, you receive a buffer or a grant.
For predecessor-read:
- You receive a grant and the lock role is converted to local if there is no PI for
the buffer in the cluster.
- You receive a buffer containing the highest PI (sent by some node) in the
cluster.
You call kjbpredread when the lock role is global, the LE is already
opened, and you no longer have any PI in your cache.

DSI408: Real Application Clusters Internals I-306

CR Requestor-Side Algorithm
kclwcrs:
BEGIN
For each CR request not yet examined:
If the request type is "open" or "convert" and a buffer was
received, increment "global cache cr blocks received" and "global
cache cr block receive time", and set the CR request status to
"completed".
If the request type is "predread" and a buffer was received,
increment "global cache current blocks received" and "global cache
current block receive time", and set the request status to
"completed".
If bit 2 of LE->acquiring is cleared, the AST has fired and the
lock is granted in S mode; set the request status to "completed".
While some request is not completed, wait 1 sec on "global cache
cr request" and get the next message.
END

10-307

Copyright 2003, Oracle. All rights reserved.

kclwcrs
The description of kclwcrs is simple, and the code path for error management is not
displayed.

DSI408: Real Application Clusters Internals I-307

CR Requestor-Side AST Delivery


Scenario where the LOCK is granted to the FG:
1. The FG on the requestor node locates the LE.
2. The FG sets bit 0x2 of LE->acquiring.
3. The FG submits the CR request along with the lock request,
including (ip,port) information, to the master node.
4. The FG waits on "global cache cr request".
5. The master LMS notifies the requestor LMS that the LOCK is
granted.
6. The requestor LMS unsets bit 0x2 of the LE with the AST callback
provided by the FG.
7. The requestor LMS posts the FG.

10-308

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-308

CR Requestor-Side CR Buffer Delivery


Scenario where the CR buffer is delivered to the FG:
1. The FG on the requestor node locates the LE.
2. The FG sets bit 0x2 of LE->acquiring.
3. The FG submits the CR request along with the lock request,
including (ip,port) information, to the master node.
4. The FG waits on "global cache cr request".
5. The LMS of the master (holder) node builds the CR buffer.
6. The CR buffer is delivered to the FG with (ip,port) information.

10-309

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-309

CR Server-Side Algorithm
BEGIN
If the request is for a CURRENT block, increment REQCUR and
REQ{DATA|UNDO|TX}, and call kcbgtcr with kclexf as the examination
function to retain only the CURRENT block; otherwise, increment
REQCR and REQDATA, and call ktrget to fabricate the CR buffer.
On error from kcbgtcr or ktrget: if the error is KCBOERLWRx,
increment LIGHTx, else increment ERROR; send the ERROR to the
requestor and increment RESFAIL.
Otherwise, if the buffer state is CR, set the request status to
STATPRIV and increment RES{PRIV|ZERO}; else set the request status
to STATCUR and increment RESCUR.
Then: FLUSH LOG, SEND BACK BUFFER, FAIRNESS MANAGEMENT.
END

10-310

Copyright 2003, Oracle. All rights reserved.

CR Server-Side Algorithm
X$KCLCRST.LIGHTn is incremented if the light work rule fires while the CR block is
building, because of the following reasons:
A buffer is found with the same AFN and BLOCKNUM, but the object-id in the
buffer is different from the object-id that is submitted by the requestor (the object
was DROPPED or TRUNCATED after the consistent read started and before it ended).
A wait for WRITE COMPLETE
A wait because the buffer is in READING state
Buffer is suspended and a free buffer is needed
A wait for free buffer wait
A read block from disk to buffer-cache
A wait for space for redo
A wait for ITL
X$KCLCRST.LIGHT1 is incremented if a block is found with bit modification
started set; in this case the process sleeps some seconds, and when it wakes up, the
same process is still modifying the block.
X$KCLCRST.LIGHT2 is incremented if a buffer is in instance RECOVERY state.
This description of kclgcr is simplified.
DSI408: Real Application Clusters Internals I-310

CR Server-Side Algorithm
kclgcr (FLUSH LOG):
BEGIN
If the REDO is on disk, done.
Otherwise, increment X$KCLCRST.FLUSH.
If there is room in the logflush queue, add a new element to the
queue and increment X$KCLCRST.FLUSHQ.
If the queue is full, increment X$KCLCRST.FLUSHF, call kcrfisd, and
wait on "log file sync" (but only once).
END

10-311

Copyright 2003, Oracle. All rights reserved.

kclgcr
FLUSH LOG
Note: There are no more than 255 elements in the logflush queue.
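The FLUSH LOG decision above can be sketched as follows (a simplified illustration; the counter names follow the slide, everything else is hypothetical):

```python
# Decide how a CR request waits for its redo to reach disk.
def flush_log(redo_on_disk, logflush_queue, stats, max_elems=255):
    if redo_on_disk:
        return "done"
    stats["FLUSH"] += 1                 # X$KCLCRST.FLUSH
    if len(logflush_queue) < max_elems:
        logflush_queue.append("request")
        stats["FLUSHQ"] += 1            # X$KCLCRST.FLUSHQ
        return "queued"
    stats["FLUSHF"] += 1                # X$KCLCRST.FLUSHF
    return "sync-flush"  # call kcrfisd; wait on "log file sync" once
```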

DSI408: Real Application Clusters Internals I-311

CR Server-Side Algorithm
BEGIN
Increment LE.FAIRNESS_COUNTER.
If the request is queued in the LOG FLUSH phase, stop (END 1); the
buffer is sent later during LOGFLUSH queue processing.
Otherwise, send the CR buffer to the requestor and update
statistics.
If the LE held mode is EXCLUSIVE, LE.FAIRNESS_COUNTER >=
_fairness_threshold, and the requested block is not UNDO or an UNDO
header, increment X$KCLCRST.FAIRDC and downgrade the LE to SHARE
mode (END 2); otherwise, stop (END 3).

10-312

Copyright 2003, Oracle. All rights reserved.

kclgcr (continued)
SEND BACK BUFFER and FAIRNESS MANAGEMENT.
At END 1 the buffer is not sent; this is done in LOGFLUSH queue processing.
The following statistics are updated after the CR buffer is sent to the requestor:
global cache cr block build time with time spent in ktrget or kcbgtcr
global cache cr block log flush time with time spent in LOG FLUSH phase
global cache cr block send time with time spent in CR block sending
Note: LE.FAIRNESS_COUNTER is reset at each buffer modification.
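The fairness bookkeeping above can be sketched as follows (an illustrative simplification; the LE is modeled as a dict):

```python
FAIRNESS_THRESHOLD = 4  # _fairness_threshold default

# Decide whether serving this CR copy triggers a down-convert.
def fairness_check(le, block_is_undo, stats):
    le["fairness_counter"] += 1
    if (le["held_mode"] == "X"
            and le["fairness_counter"] >= FAIRNESS_THRESHOLD
            and not block_is_undo):
        stats["FAIRDC"] += 1            # X$KCLCRST.FAIRDC
        le["held_mode"] = "S"           # downgrade LE to SHARE mode
        return True
    return False

def on_block_modified(le):
    le["fairness_counter"] = 0          # reset at each modification
```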

DSI408: Real Application Clusters Internals I-312

CR Server-Side Algorithm
kclqchk:
BEGIN
For each element on the LOGFLUSH queue:
Call kcrfisd to check whether the REDO is on disk.
If the REDO is not on disk and the caller asks for a wait, call
kcrfisd to flush the REDO and wait on "log file sync" if the redo
is not on disk, but only once.
When the REDO is on disk, dequeue the element and send the CR
buffer to the requestor.
END

10-313

Copyright 2003, Oracle. All rights reserved.

kclqchk
LOGFLUSH queue processing.
After the CR buffer is sent to the requestor, the following statistics are updated:
global cache cr block build time with time spent in ktrget or kcbgtcr
global cache cr block log flush time with time spent in LOG FLUSH phase
global cache cr block send time with time spent in CR block sending

DSI408: Real Application Clusters Internals I-313

Summary

In this lesson, you should have learned how to:


Describe CR server functionality
Outline CR processing

10-314

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-314

Cache Fusion 2

Current Block: XCUR

Copyright 2003, Oracle. All rights reserved.

Objectives

After completing this lesson, you should be able to


describe the flow of current blocks in Cache Fusion.

11-317

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-317

Cache Fusion: Current Blocks


Node

Other
nodes

Instance

Caches

kcb/kcl
GRD(GCS)
CGS

IPC

NM
CM

11-318

Copyright 2003, Oracle. All rights reserved.

Cache Fusion: Current Blocks


Cache coherency of current (XCUR) blocks was introduced in Oracle9i.

DSI408: Real Application Clusters Internals I-318

PCM Locks and Resources

PCM DLM locks that are owned by the local


instance are allocated and embedded in an LE
structure.
PCM DLM locks that are owned by remote
instances and mastered by the local instance are
allocated in SHARED_POOL.
LE in
kclle structure

LE_ADDR

X$LE
PCM DLM resource in
kjbr structure
X$KJBR

11-319

KJBLLOCKP-0x60
KJBLRESP

PCM DLM lock in


kjbl structure
X$KJBL

KJBRRESP

Copyright 2003, Oracle. All rights reserved.

PCM Locks and Resources


Fields of interest in the kclle structure: kcllerls or releasing; kcllelnm or
name(id1,id2); kcllemode or held-mode; kclleacq or acquiring; kcllelck or
DLM lock.
Fields of interest in the kjbr structure: resname_kjbr[2] or resource name;
grant_q_kjbr or grant queue; convert_q_kjbr or convert queue;
mode_role_kjbr, which is a bitwise merge of grant mode and role,
interpreted as: NULL (0x00), S (0x01), X (0x02); L0 Local (0x00), G0
Global without PI (0x08), G1 Global with PI (0x18).
The field mode_role_kjbl in kjbl is a bitwise merge of grant, request, and lock
mode: 0x00 if grant NULL; 0x01 if grant S; 0x02 if grant X; 0x04 lock has been opened
at master; 0x08 if global role (otherwise local); 0x10 has one or more PI; 0x20 if request
CR; 0x40 if request S; 0x80 if request X.
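The bit layout of mode_role_kjbl described above can be decoded with a small helper (an illustrative sketch only; the field itself comes from X$KJBL):

```python
# Decode the mode_role_kjbl bitwise merge described above.
def decode_mode_role_kjbl(v: int) -> dict:
    grant = {0x00: "NULL", 0x01: "S", 0x02: "X"}
    return {
        "grant": grant.get(v & 0x03, "?"),
        "opened_at_master": bool(v & 0x04),
        "role": "global" if v & 0x08 else "local",
        "has_pi": bool(v & 0x10),
        "request": ("CR" if v & 0x20 else
                    "S" if v & 0x40 else
                    "X" if v & 0x80 else None),
    }
```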

DSI408: Real Application Clusters Internals I-319

Fusion: Long Example

Three instances.
One block is selected and updated.
Instance 2 is the master of the block resource.

Start:
1. SELECT on I3
2. SELECT on I2
3. UPDATE on I2
4. UPDATE on I1
5. SELECT on I3
6. Write on I1
7. SELECT on I3

11-320

Copyright 2003, Oracle. All rights reserved.

Fusion: Long Example


The SQL for each step is one of:
SELECT * FROM emp WHERE empno = ;
UPDATE emp SET sal = sal + 10 WHERE empno = ; COMMIT;
ALTER SYSTEM CHECKPOINT LOCAL;

The empno is chosen differently in each instance to avoid considering transaction
locks and to limit the flow to PCM locks. All rows are in the same block, which has
number 10 and is in file 8 in the subsequent dumps.
number 10 and is in file 8 in the subsequent dumps.
Step Purpose
1    Lock and block acquisition, remote master
2    Lock and block acquisition, local master, shared block
3    Lock conversion, lock downgrade
4    Block fusion, write/write
5    Block fusion, write/read (CR)
6    Write involves locks, discard PI
7    Block fusion write/read, similar to step 5

DSI408: Real Application Clusters Internals I-320

Fusion: Examples (continued)


You use the following SQL statements to monitor locks and resource states in each instance:
1. SELECT state, mode_held, le_addr, class, dbarfil, dbablk,
cr_scn_bas, cr_scn_wrp
FROM x$bh
WHERE obj IN (SELECT data_object_id
FROM dba_objects
WHERE owner='SCOTT'
AND object_name='EMP')
AND class = 1;
2. SELECT name, le_class, le_rls, le_acq, le_mode, le_write,
le_local
FROM x$le
WHERE le_addr IN (SELECT le_addr
FROM x$bh
WHERE obj IN (SELECT data_object_id
FROM dba_objects
WHERE owner='SCOTT'
AND object_name='EMP')
AND class = 1
AND state != 3 );
3. SELECT r.* FROM x$kjbr r
WHERE r.kjbrname LIKE '%[0x200000a][0x0],[BL]%';
4. SELECT l.kjblname, l.kjblrole, l.kjblgrant, l.kjblrequest,
l.kjbllockst, l.kjblresp
FROM x$kjbl l
WHERE l.kjblname LIKE '%[0x200000a][0x0],[BL]%';
A resource name is (id1, id2), with BL for PCM locks. The id1 and id2 for our block are
derived by:
Id1 = blockno || ( fileno << 22)
= 10 || ( 8 << 22)
= 0x200000a
Id2 = ( fileno >> 10) << 15
= ( 8 >> 10 ) << 15
= 0
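The derivation above translates directly into code (a convenience sketch for computing BL resource names from a file and block number):

```python
# Compute the (id1, id2) BL resource name for a file/block pair,
# using the formulas above.
def bl_resource_name(fileno: int, blockno: int) -> str:
    id1 = blockno | (fileno << 22)
    id2 = (fileno >> 10) << 15
    return f"[0x{id1:x}][0x{id2:x}],[BL]"
```

For file 8, block 10 this yields [0x200000a][0x0],[BL], matching the dumps that follow.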

DSI408: Real Application Clusters Internals I-321

Initial State
X$LE
no rows selected
X$BH
no rows selected
X$KJBR
no rows selected
X$KJBL
no rows selected

Instance 1

Instance 2
(Master)

Instance 3

11-322

Copyright 2003, Oracle. All rights reserved.

Initial State
Initially, nothing has been read into cache or locked, so the queries do not return any
rows.
In displaying the X$KJBR.KJBRNAME in subsequent slides, the column has been
truncated to fit. It has the same value as the X$KJBL.KJBLNAME for these examples.

DSI408: Real Application Clusters Internals I-322

Step 1:
Instance 3 Performs SELECT

Instance 1          Instance 2 (Master)          Instance 3

1. CRREQ(S) from instance 3 to the master
2. Grant(SL0) from the master to instance 3
3. Instance 3 reads the block from disk
4. Notify

11-323

Copyright 2003, Oracle. All rights reserved.

Step 1: Instance 3 Performs SELECT


Because there is no lock yet, the master grants SL0 mode to instance 3. Instance 3 then
reads the block from the disk to its buffer cache.

DSI408: Real Application Clusters Internals I-323

Lock Changes in Instance 3


X$BH

Before

no rows selected

X$LE
no rows selected

X$KJBR
no rows selected

X$KJBL
no rows selected

After

X$BH
no rows selected

X$LE
no rows selected

X$KJBR
KJBRRESP KJBRGRANT KJBRNCVL
KJBRROLE KJBRNAME
KJBRMASTER KJBRGRAN KJBRCVTQ KJBRWRIT
-------- --------- --------- ---------- --------------- ---------- -------- -------- -------22FE343C KJUSERPR KJUSERNL
0 [0x200000a]
1 22884D40 00
00

X$KJBL
KJBLNAME
KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST KJBLRESP
------------------------- ---------- --------- --------- ----------- -------[0x200000a][0x0],[BL]
0 KJUSERPR KJUSERNL GRANTED
22FE343C

11-324

Copyright 2003, Oracle. All rights reserved.

Step 1 (continued): Lock Changes in Instance 3


You see the resources that were created and the local locks that were acquired.

DSI408: Real Application Clusters Internals I-324

Lock Changes in Instance 2


X$BH

Before

no rows selected

X$LE
no rows selected

X$KJBR
no rows selected

X$KJBL
no rows selected

X$BH
STATE
MODE_HELD LE_ADDR CLASS
DBARFIL
DBABLK
CR_SCN_BAS CR_SCN_WRP
---------- ---------- -------- ---------- ---------- ---------- ---------- ---------2
0 24FF9030
1
8
10
0
0

X$LE
NAME LE_CLASS
LE_RLS
LE_ACQ
LE_MODE
LE_WRITE
LE_LOCAL
---------- ---------- ---------- ---------- ---------- ---------- ---------33554442
0
0
0
3
0
1

X$KJBR
no rows selected

X$KJBL
KJBLNAME
KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST
KJBLRESP
------------------------- ---------- --------- --------- ------------ -------[0x200000a][0x0],[BL]
0 KJUSERPR KJUSERNL GRANTED
00

11-325

Copyright 2003, Oracle. All rights reserved.

Step 1 (continued): Lock Changes in Instance 2


The X$BH.STATE is 2, which means that the buffer is shared current.

DSI408: Real Application Clusters Internals I-325

After

Step 2:
Instance 2 Performs SELECT
Instance 1          Instance 2 (Master)          Instance 3

1. CRREQ(S) from instance 2 to the master (itself)
2. Grant(SL0)
3. Instance 2 reads the block from disk
4. Notify

11-326

Copyright 2003, Oracle. All rights reserved.

Step 2: Instance 2 Performs SELECT


The master grants SL0 mode to instance 2, because:
There is an S lock on the resource (owned by instance 3).
There is no S lock on the same resource in the master.
Instance 2 then reads the block from the disk to the BUFFER-CACHE.
The behavior changes if there is an S lock on the master or
_cr_grant_local_role is TRUE. In this case, the master forwards the CR request
to an instance owner of the S lock (instance 3). This instance sends the current buffer (as
a lock in S mode) to instance 2.
The default value for _cr_grant_local_role is FALSE.

DSI408: Real Application Clusters Internals I-326

Lock Changes in Instance 2


Before

X$BH
no rows selected

X$LE
no rows selected

X$KJBR
KJBRRESP KJBRGRANT KJBRNCVL
KJBRROLE KJBRNAME
KJBRMASTER KJBRGRAN KJBRCVTQ KJBRWRIT
-------- --------- --------- ---------- --------------- ---------- -------- -------- -------22FE343C KJUSERPR KJUSERNL
0 [0x200000a]
1 22884D40 00
00

X$KJBL
KJBLNAME
KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST KJBLRESP
------------------------- ---------- --------- --------- ----------- -------[0x200000a][0x0],[BL]
0 KJUSERPR KJUSERNL GRANTED
22FE343C

X$BH
STATE
MODE_HELD LE_ADDR
CLASS
DBARFIL
DBABLK CR_SCN_BAS CR_SCN_WRP
---------- ---------- -------- ---------- ---------- ---------- ---------- ---------2
0 253F3A10
1
8
10
0
0

After

X$LE
NAME
LE_CLASS
LE_RLS
LE_ACQ
LE_MODE
LE_WRITE
LE_LOCAL
---------- ---------- ---------- ---------- ---------- ---------- ---------33554442
0
0
0
3
0
1

X$KJBR
KJBRRESP KJBRGRANT KJBRNCVL
KJBRROLE KJBRNAME
KJBRMASTER KJBRGRAN KJBRCVTQ KJBRWRIT
-------- --------- --------- ---------- --------------- ---------- -------- -------- -------22FE343C KJUSERPR KJUSERNL
0 [0x200000a]
1 22884D40 00
00

X$KJBL
KJBLNAME
KJBLROLE KJBLGRANT
------------------------- ---------- --------[0x200000a][0x0],[BL]
0 KJUSERPR
[0x200000a][0x0],[BL]
0 KJUSERPR

11-327

KJBLREQUE
--------KJUSERNL
KJUSERNL

KJBLLOCKST
----------GRANTED
GRANTED

KJBLRESP
-------22FE343C
22FE343C

Copyright 2003, Oracle. All rights reserved.

Step 2 (continued): Lock Changes in Instance 2


The X$BH.STATE is 2, which is shared current.
The X$LE.NAME value 33554442 is 0x200000A.

DSI408: Real Application Clusters Internals I-327

Step 3:
Instance 2 Performs UPDATE
Instance 1          Instance 2 (Master)          Instance 3

1. LREQ(X) from instance 2 to the master (itself)
2. PING(X,Node2) from the master to instance 3
3. Instance 3 makes its buffer CR
4. Instance 3 sends the buffer to the requestor
5. ASSUME(XL0,close) from instance 2 to the master

11-328

Copyright 2003, Oracle. All rights reserved.

Step 3: Instance 2 Performs UPDATE


Instance 2, the requestor, sends an X request to the master (itself).
The Master (instance 2) sends ping X to the S lock holder (instance 3).
Instance 3 converts the buffer state from S CURRENT to CR and closes the lock.
Instance 3 sends the buffer to the requestor (instance 2).
The requestor (instance 2) sends ASSUME to the master (itself) for lock mode and tells
the master that the previous holder (instance 3) has closed the lock.

DSI408: Real Application Clusters Internals I-328

Lock Changes in Instance 2


Before

X$BH
no rows selected

X$LE
no rows selected

X$KJBR
KJBRRESP KJBRGRANT KJBRNCVL
KJBRROLE KJBRNAME
KJBRMASTER KJBRGRAN KJBRCVTQ KJBRWRIT
-------- --------- --------- ---------- --------------- ---------- -------- -------- -------22FD8B24 KJUSERPR KJUSERNL
0 [0x200000a]
1 22882980 00
00

X$KJBL
KJBLNAME
KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST KJBLRESP
------------------------- ---------- --------- --------- ----------- -------[0x200000a][0x0],[BL]
0 KJUSERPR KJUSERNL GRANTED
22FD8B24

X$BH
STATE
MODE_HELD
LE_ADDR
CLASS
DBARFIL
DBABLK CR_SCN_BAS CR_SCN_WRP
---------- ---------- -------- ---------- ---------- ---------- ---------- ---------1
0 253ECED0
1
8
10
0
0

After

X$LE
NAME
LE_CLASS
LE_RLS
LE_ACQ
LE_MODE
LE_WRITE
LE_LOCAL
---------- ---------- ---------- ---------- ---------- ---------- ---------33554442
0
0
0
5
0
1

X$KJBR
KJBRRESP KJBRGRANT KJBRNCVL KJBRROLE
KJBRNAME
KJBRMASTER KJBRGRAN KJBRCVTQ KJBRWRIT
-------- --------- --------- ---------- --------------- ---------- -------- -------- -------22FD8B24 KJUSEREX KJUSERNL 0
[0x200000a]
1 253ECF30 00
00

X$KJBL
KJBLNAME
KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST KJBLRESP
------------------------- ---------- --------- --------- ----------- -------[0x200000a][0x0],[BL]
0 KJUSEREX KJUSERNL GRANTED
22FD8B24

11-329

Copyright 2003, Oracle. All rights reserved.

Step 3 (continued): Lock Changes in Instance 2


The X$BH.STATE shows 1, which is buffer current (X CURRENT).
The X$KJBR.KJBRROLE shows 0, signifying that the lock owned by instance 2 is
XL0, which implies that the lock owned by instance 3 is closed.

DSI408: Real Application Clusters Internals I-329

Lock Changes in Instance 3

X$BH

Before

STATE MODE_HELD LE_ADDR


CLASS
DBARFIL
DBABLK CR_SCN_BAS CR_SCN_WRP
--------- ---------- -------- ---------- ---------- ---------- ---------- ---------2
0 253F2690
1
8
10
0
0

X$LE
NAME LE_CLASS
LE_RLS
LE_ACQ
LE_MODE
LE_WRITE
LE_LOCAL
--------- ---------- ---------- ---------- ---------- ---------- ---------33554442
0
0
0
3
0
1

X$KJBR
no rows selected

X$KJBL
KJBLNAME
KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST KJBLRESP
------------------------- ---------- --------- --------- ----------- -------[0x200000a][0x0],[BL]
0 KJUSERPR KJUSERNL GRANTED
00

X$BH
STATE MODE_HELD LE_ADDR
CLASS
DBARFIL
DBABLK CR_SCN_BAS CR_SCN_WRP
---------- ---------- -------- ---------- ---------- ---------- ---------- ---------3
0 00
1
8
10
1423681
0

X$LE
no rows selected

X$KJBR
no rows selected

X$KJBL
no rows selected

11-330

Copyright 2003, Oracle. All rights reserved.

Step 3 (continued): Lock Changes in Instance 3


The X$BH.STATE changes from 2 to 3 (that is, S changes to CR).
There are no rows for the lock because it has been closed.

DSI408: Real Application Clusters Internals I-330


Step 4: Instance 1 Performs UPDATE

[Diagram: message flow between Instance 1, Instance 2 (Master), and Instance 3]
1 LREQ(X)
2 PING(X)
3 Set lock to NG1
4 Buffer X CURRENT to PI
5 Send block
6 ASSUME(XG0, NG1)

11-331


Step 4: Instance 1 Performs UPDATE


Instance 1, the requestor, sends an X request to the master (instance 2).
The master (instance 2) sends ping X to the X lock holder (itself).
Instance 2 converts the buffer state from the local X CURRENT to PI.
Instance 2 sends the buffer to the requestor (instance 1).
The requestor (instance 1) sends ASSUME to the master (instance 2) for lock mode and
tells the master that instance 1 has a global X lock and instance 2 has a global NULL
lock.
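The Step 4 transitions above can be sketched as a small state machine. This is an illustrative model only (the dict shapes and function name are invented, not Oracle code): the holder's X CURRENT buffer becomes a PI, the block is shipped, and ASSUME records the new global lock states.

```python
# Hypothetical sketch of the Step 4 flow: instance 1 requests X,
# the holder converts its buffer to PI and ships the block, and
# ASSUME records the new global modes at the master.

def ping_for_exclusive(holder, requestor):
    """Transfer the current block from holder to requestor.

    Each instance is modeled as a dict with 'buffer'
    ('X CURRENT', 'PI', or None) and 'lock' (e.g. 'XL0').
    """
    assert holder["buffer"] == "X CURRENT"
    holder["buffer"] = "PI"             # step 3/4: local X CURRENT becomes PI
    requestor["buffer"] = "X CURRENT"   # step 5: block shipped to requestor
    # step 6: ASSUME tells the master the new global lock states
    requestor["lock"] = "XG0"           # global X, no PI on the requestor yet
    holder["lock"] = "NG1"              # NULL mode, global role, one PI
    return requestor["lock"], holder["lock"]

inst1 = {"buffer": None, "lock": None}
inst2 = {"buffer": "X CURRENT", "lock": "XL0"}
print(ping_for_exclusive(inst2, inst1))  # ('XG0', 'NG1')
```

The query output on the following pages matches this model: instance 2 ends up with STATE=8 (PI) and role 24 (G1), instance 1 with STATE=1 (X CURRENT) and role 8 (G0).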

DSI408: Real Application Clusters Internals I-331

Lock Changes in Instance 2


Before

X$BH
STATE  MODE_HELD LE_ADDR     CLASS    DBARFIL     DBABLK CR_SCN_BAS CR_SCN_WRP
------ --------- -------- -------- ---------- ---------- ---------- ----------
     1         0 253ECED0        1          8         10          0          0

X$LE
      NAME   LE_CLASS     LE_RLS     LE_ACQ    LE_MODE   LE_WRITE   LE_LOCAL
---------- ---------- ---------- ---------- ---------- ---------- ----------
  33554442          0          0          0          5          0          1

X$KJBR
KJBRRESP KJBRGRANT KJBRNCVL    KJBRROLE KJBRNAME        KJBRMASTER KJBRGRAN KJBRCVTQ KJBRWRIT
-------- --------- --------- ---------- --------------- ---------- -------- -------- --------
22FD8B24 KJUSEREX  KJUSERNL           0 [0x200000a]              1 253ECF30 00       00

X$KJBL
KJBLNAME                    KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST  KJBLRESP
------------------------- ---------- --------- --------- ----------- --------
[0x200000a][0x0],[BL]              0 KJUSEREX  KJUSERNL  GRANTED     22FD8B24

After

X$BH
STATE  MODE_HELD LE_ADDR     CLASS    DBARFIL     DBABLK CR_SCN_BAS CR_SCN_WRP
------ --------- -------- -------- ---------- ---------- ---------- ----------
     8         0 253ECED0        1          8         10    1423699          0

X$LE
      NAME   LE_CLASS     LE_RLS     LE_ACQ    LE_MODE   LE_WRITE   LE_LOCAL
---------- ---------- ---------- ---------- ---------- ---------- ----------
  33554442          0          0          0          0          0          0

X$KJBR
KJBRRESP KJBRGRANT KJBRNCVL    KJBRROLE KJBRNAME        KJBRMASTER KJBRGRAN KJBRCVTQ KJBRWRIT
-------- --------- --------- ---------- --------------- ---------- -------- -------- --------
22FD8B24 KJUSEREX  KJUSERNL           8 [0x200000a]              1 253ECF30 00       00

X$KJBL
KJBLNAME                    KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST  KJBLRESP
------------------------- ---------- --------- --------- ----------- --------
[0x200000a][0x0],[BL]             24 KJUSERNL  KJUSERNL  GRANTED     22FD8B24
[0x200000a][0x0],[BL]              8 KJUSEREX  KJUSERNL  GRANTED     22FD8B24

11-332


Step 4 (continued): Lock Changes in Instance 2


The X$BH.STATE switches from X CURRENT to PI.
The X$KJBL.KJBLROLE value of 24 is 0x18, that is, 0x08 (global role) + 0x10 (has a PI), indicating the G1 role.

DSI408: Real Application Clusters Internals I-332

Lock Changes in Instance 1

Before

X$BH
no rows selected

X$LE
no rows selected

X$KJBR
no rows selected

X$KJBL
no rows selected

After

X$BH
STATE  MODE_HELD LE_ADDR     CLASS    DBARFIL     DBABLK CR_SCN_BAS CR_SCN_WRP
------ --------- -------- -------- ---------- ---------- ---------- ----------
     1         0 253F8A80        1          8         10          0          0

X$LE
      NAME   LE_CLASS     LE_RLS     LE_ACQ    LE_MODE   LE_WRITE   LE_LOCAL
---------- ---------- ---------- ---------- ---------- ---------- ----------
  33554442          0          0          0          5          0          0

X$KJBR
no rows selected

X$KJBL
KJBLNAME                    KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST  KJBLRESP
------------------------- ---------- --------- --------- ----------- --------
[0x200000a][0x0],[BL]              8 KJUSEREX  KJUSERNL  GRANTED     00

11-333


Step 4 (continued): Lock Changes in Instance 1


The X$BH.STATE is 1, which is current exclusive.

DSI408: Real Application Clusters Internals I-333


Step 5: Instance 3 Performs SELECT

[Diagram: message flow between Instance 3, Instance 2 (Master), and Instance 1]
1 CRREQ(S)
2 Build CR buffer
3 Send CR buffer

11-334


Step 5: Instance 3 Performs SELECT


Instance 3 (the requestor) sends a CRREQ(S) to the master (instance 2).
The master (instance 2) chooses the CR server as follows:
- If the resource role is G0, the master chooses the instance holding the highest PI (instance 2 in this example).
- If the resource role is G1, the master chooses the instance whose PI SCN is closest to the SCN requested in the CRREQ.
- If the resource role is XL0, the master chooses the instance holding the current buffer.
The master (instance 2) forwards the CRREQ to the chosen instance (itself).
The chosen instance (instance 2) builds the CR buffer and ships it to instance 3.
No DLM lock is opened for instance 3.
Note: Step 7, as previously described in the slide on page 5, is very similar to this step and is therefore not shown in detail later.
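The master's CR-server choice can be sketched as a selection function. This is a hedged illustration of the rules as stated above (the function and parameter names are invented; `pis` models the PI holders the master knows about):

```python
# Sketch of the master's CR-server choice, keyed by resource role.
# pis: list of (instance, pi_scn) tuples known to the master.

def choose_cr_server(role, pis, current_holder, requested_scn):
    if role == "G0":
        # take the holder of the highest PI
        return max(pis, key=lambda p: p[1])[0]
    if role == "G1":
        # take the holder whose PI SCN is closest to the requested SCN
        return min(pis, key=lambda p: abs(p[1] - requested_scn))[0]
    # local (XL0) resource: the instance holding the current buffer serves
    return current_holder

print(choose_cr_server("G1", [("inst2", 100), ("inst1", 140)], "inst1", 105))
```

In the example walkthrough, the chosen instance is the master itself, so the CRREQ forwarding is a local operation.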

DSI408: Real Application Clusters Internals I-334

Lock Changes in Instance 3

Before

X$BH
STATE  MODE_HELD LE_ADDR     CLASS    DBARFIL     DBABLK CR_SCN_BAS CR_SCN_WRP
------ --------- -------- -------- ---------- ---------- ---------- ----------
     3         0 00              1          8         10    1423681          0

X$LE
no rows selected

X$KJBR
no rows selected

X$KJBL
no rows selected

After

X$BH
STATE  MODE_HELD LE_ADDR     CLASS    DBARFIL     DBABLK CR_SCN_BAS CR_SCN_WRP
------ --------- -------- -------- ---------- ---------- ---------- ----------
     3         0 00              1          8         10    1423681          0
     3         0 00              1          8         10    1423821          0

X$LE
no rows selected

X$KJBR
no rows selected

X$KJBL
no rows selected

11-335


Step 5 (continued): Lock Changes in Instance 3


An additional CR buffer is now in use on instance 3; it was shipped by the CR server (instance 2).

DSI408: Real Application Clusters Internals I-335

Step 6: Instance 1 Performs WRITE

[Diagram: message flow between Instance 1, Instance 2 (Master), and Instance 3]
1 REQW
2 REQW
3 WRITE
4 NOTIFY
5 WNOTIFY
6 Set role Local to LE & DLM lock
7 Make PI buffer to CR

11-336


Step 6: Instance 1 Performs WRITE


Instance 1 (the requestor) sends a W request (write request from client to master) to the
master (instance 2).
The master (instance 2) registers the SCN of the block to be written (in the DLM
resource) to remember that there is a pending write. The master does not grant another
write; it sends a W request to instance 1, because instance 1 has the highest SCN (the
current block).
Instance 1 writes the buffer by linking it on the ping queue. DBWR performs the write.
Instance 1 sends a W notification to the master (instance 2).
The master (instance 2) sets the Local role on the resource and sends FLUSH_PI to every
instance holding a PI (in this case, itself). An instance that receives this converts its PI
buffer to a CR buffer and releases the associated LE.
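The Step 6 message sequence can be modeled as an ordered log. This is a rough sketch only (the function and tuple shapes are invented; the diagram's NOTIFY to other interested instances is omitted for brevity):

```python
# Sketch of the Step 6 write protocol as an ordered message log.
# Message names follow the slide; instances are plain strings.

def write_protocol(master, current_holder, pi_holders):
    log = []
    log.append(("REQW", current_holder, master))     # 1: client asks master to write
    log.append(("REQW", master, current_holder))     # 2: master picks the highest-SCN holder
    log.append(("WRITE", current_holder))            # 3: DBWR writes via the ping queue
    log.append(("WNOTIFY", current_holder, master))  # 5: write-completion notification
    for inst in pi_holders:
        log.append(("FLUSH_PI", master, inst))       # PI -> CR, release the LE
    return log

for msg in write_protocol("inst2", "inst1", ["inst2"]):
    print(msg)
```

Running the model for the example (instance 1 holds the current block, instance 2 is master and the only PI holder) reproduces the ordering shown in the diagram.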

DSI408: Real Application Clusters Internals I-336

Lock Changes in Instance 2


Before

X$BH
STATE  MODE_HELD LE_ADDR     CLASS    DBARFIL     DBABLK CR_SCN_BAS CR_SCN_WRP
------ --------- -------- -------- ---------- ---------- ---------- ----------
     8         0 253ECED0        1          8         10    1423699          0

X$LE
      NAME   LE_CLASS     LE_RLS     LE_ACQ    LE_MODE   LE_WRITE   LE_LOCAL
---------- ---------- ---------- ---------- ---------- ---------- ----------
  33554442          0          0          0          0          0          0

X$KJBR
KJBRRESP KJBRGRANT KJBRNCVL    KJBRROLE KJBRNAME        KJBRMASTER KJBRGRAN KJBRCVTQ KJBRWRIT
-------- --------- --------- ---------- --------------- ---------- -------- -------- --------
22FD8B24 KJUSEREX  KJUSERNL           8 [0x200000a]              1 253ECF30 00       00

X$KJBL
KJBLNAME                    KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST  KJBLRESP
------------------------- ---------- --------- --------- ----------- --------
[0x200000a][0x0],[BL]             24 KJUSERNL  KJUSERNL  GRANTED     22FD8B24
[0x200000a][0x0],[BL]              8 KJUSEREX  KJUSERNL  GRANTED     22FD8B24

After

X$BH
STATE  MODE_HELD LE_ADDR     CLASS    DBARFIL     DBABLK CR_SCN_BAS CR_SCN_WRP
------ --------- -------- -------- ---------- ---------- ---------- ----------
     3         0 00              1          8         10    1423699          0

X$LE
no rows selected

X$KJBR
KJBRRESP KJBRGRANT KJBRNCVL    KJBRROLE KJBRNAME        KJBRMASTER KJBRGRAN KJBRCVTQ KJBRWRIT
-------- --------- --------- ---------- --------------- ---------- -------- -------- --------
22FD8B24 KJUSEREX  KJUSERNL           0 [0x200000a]              1 253ECF30 00       00

X$KJBL
KJBLNAME                    KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST  KJBLRESP
------------------------- ---------- --------- --------- ----------- --------
[0x200000a][0x0],[BL]              0 KJUSERNL  KJUSERNL  GRANTED     22FD8B24
[0x200000a][0x0],[BL]              0 KJUSEREX  KJUSERNL  GRANTED     22FD8B24

11-337


Step 6 (continued): Lock Changes in Instance 2


The X$BH.STATE goes from 8 to 3, that is, from a PI to a CR buffer.
X$LE shows that no LE locks are covering the CR buffer.
The X$KJBL.KJBLROLE goes to 0, indicating that locks are now local.

DSI408: Real Application Clusters Internals I-337

Lock Changes in Instance 1


Before

X$BH
STATE  MODE_HELD LE_ADDR     CLASS    DBARFIL     DBABLK CR_SCN_BAS CR_SCN_WRP
------ --------- -------- -------- ---------- ---------- ---------- ----------
     1         0 253F8A80        1          8         10          0          0

X$LE
      NAME   LE_CLASS     LE_RLS     LE_ACQ    LE_MODE   LE_WRITE   LE_LOCAL
---------- ---------- ---------- ---------- ---------- ---------- ----------
  33554442          0          0          0          5          0          0

X$KJBR
no rows selected

X$KJBL
KJBLNAME                    KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST  KJBLRESP
------------------------- ---------- --------- --------- ----------- --------
[0x200000a][0x0],[BL]              8 KJUSEREX  KJUSERNL  GRANTED     00

After

X$BH
STATE  MODE_HELD LE_ADDR     CLASS    DBARFIL     DBABLK CR_SCN_BAS CR_SCN_WRP
------ --------- -------- -------- ---------- ---------- ---------- ----------
     1         0 253F8A80        1          8         10          0          0

X$LE
      NAME   LE_CLASS     LE_RLS     LE_ACQ    LE_MODE   LE_WRITE   LE_LOCAL
---------- ---------- ---------- ---------- ---------- ---------- ----------
  33554442          0          0          0          5          0          0

X$KJBR
no rows selected

X$KJBL
KJBLNAME                    KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST  KJBLRESP
------------------------- ---------- --------- --------- ----------- --------
[0x200000a][0x0],[BL]              0 KJUSEREX  KJUSERNL  GRANTED     00

11-338


Step 6 (continued): Lock Changes in Instance 1


The lock becomes X local, as shown in X$KJBL.KJBLROLE.

DSI408: Real Application Clusters Internals I-338


Tables and Views

X$KJBL
Every PCM lock, local or remote
If remote, then associated resource is mastered by
this instance.
Callback routine in kjblftc

X$KJBR
PCM resources mastered by local instance

11-339


Tables and Views


X$KJBL WebIV Note:159906.1
X$KJBR (No WebIV note)

DSI408: Real Application Clusters Internals I-339

Tables and Views (continued)


X$KJBL
Column       Type          Notes
KJBLLOCKP    RAW(4)        PCM lock address
KJBLGRANT    VARCHAR2(9)   lock grant mode
KJBLREQUEST  VARCHAR2(9)   lock request mode if the lock is in CONVERTING state
KJBLROLE     NUMBER        0x18 if G1, 0x8 if G0, 0x0 if local; bit values:
                           0x00 grant NULL; 0x01 grant S; 0x02 grant X;
                           0x04 lock has been opened at master;
                           0x08 global role, otherwise local;
                           0x10 has one or more PIs;
                           0x20 request CR; 0x40 request S; 0x80 request X
KJBLRESP     RAW(4)        masterized on local instance: resource address;
                           masterized by other instances: 0
KJBLNAME     VARCHAR2(30)  resource name: [id1(hex)][id2(hex)],[BL]
KJBLNAME2    VARCHAR2(30)  resource name: id1(decimal),id2(decimal),BL
KJBLQUEUE    NUMBER        0 if on grant-queue, 8 if on convert-queue
KJBLLOCKST   VARCHAR2(64)  lock state: GRANTED, OPENING, CONVERTING
KJBLWRITING  NUMBER        4 if asking for write
KJBLREQWRIT  NUMBER        2 if requesting write
KJBLOWNER    NUMBER        owner instance of this lock
KJBLMASTER   NUMBER        master instance of the resource
KJBLBLOCKED  NUMBER        different from 0 if CONVERTING
KJBLBLOCKER  NUMBER        nonzero if there is a lock L1 at the head of the
                           convert-queue and the grant mode of this lock
                           conflicts with L1's request mode; 0 if the
                           associated resource is not masterized by this
                           instance

X$KJBR
Column       Type          Notes
KJBRRESP     RAW(4)        PCM resource address
KJBRGRANT    VARCHAR2(9)   resource held mode
KJBRNCVL     VARCHAR2(9)   request mode of the lock at the head of the
                           convert-queue (KJUSERNL if nonexistent)
KJBRROLE     NUMBER        mode and role combined bitwise:
                           0x00 if NULL; 0x01 if S; 0x02 if X;
                           0x08 if G0 (global role, no PI);
                           0x18 if G1 (global role, one or more PIs)
KJBRNAME     VARCHAR2(30)  resource name, format [id1][id2],[BL]
KJBRMASTER   NUMBER        master instance (always the local instance)
KJBRGRANTQ   RAW(4)        lock address at head of grant-queue
KJBRCVTQ     RAW(4)        lock address at head of convert-queue
KJBRWRITER   RAW(4)        lock address elected for WRITE
DSI408: Real Application Clusters Internals I-340

Summary

In this lesson, you should have learned how to


describe the flow control of Cache Fusion for the CR
server.

11-341


DSI408: Real Application Clusters Internals I-341

Cache Fusion Recovery


Objectives

After completing this lesson, you should be able to do the following:
Explain Cache Fusion recovery implementation
Examine the recovery/cache interface
Examine the recovery/DLM interface
Describe the basic Cache Fusion recovery algorithm

12-343


DSI408: Real Application Clusters Internals I-343

Non-Cache Fusion OPS and Database Recovery

12-344

The on-disk version of a block is always the starting point for recovery.
Only the changes from a single redo thread must be applied to the disk version.


Non-Cache Fusion OPS and Database Recovery


In a pre-Oracle9i OPS system, when a buffer that is modified by instance A is requested
by another instance B, A must write its dirty buffer to the disk before B can read it. This is
disk-based cache coherency. The algorithm implies that a given block can only be
different in one instance (both the cache and the redo log thereof) from the disk version.
Note: In Cache Fusion, the two statements in the slide no longer hold.

DSI408: Real Application Clusters Internals I-344

Cache Fusion RAC and Database Recovery

12-345

The starting point for recovery of a block is its most recent PI version.
Located someplace in the global cache
On-disk version used only if no PI is available
Redo threads for all failed instances must be merged for instance or crash recovery.


Cache Fusion RAC and Database Recovery


In Oracle9i Cache Fusion, an instance ships the contents of its buffer to the requesting
instance after doing a log force but without writing the block to the disk. The sending
instance's buffer becomes a past image (PI) and cannot be modified further. The
requesting instance now has the current block in exclusive mode. The on-disk version of
the block does not contain the changes that are made by either instance.
Cache Fusion does not affect media recovery, which starts at the restored backup and
applies changes from the merged redo threads of all instances in the RAC cluster.
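The recovery starting point described on the slide (the most recent PI anywhere in the global cache, else the on-disk version) can be expressed as a small helper. This is an illustrative sketch only; the function name and the `{instance: pi_scn}` shape are invented:

```python
# Illustrative rule: recovery of a block starts from its most recent PI
# in any surviving cache, falling back to the on-disk version otherwise.

def recovery_start_version(pi_versions, disk_scn):
    """pi_versions: {instance: pi_scn} found in surviving caches."""
    if pi_versions:
        inst = max(pi_versions, key=pi_versions.get)
        return ("PI", inst, pi_versions[inst])
    return ("disk", None, disk_scn)

print(recovery_start_version({"inst2": 900, "inst3": 950}, 800))
print(recovery_start_version({}, 800))
```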

DSI408: Real Application Clusters Internals I-345

Overview of Fusion Lock States

Lock state: Two letters and a digit (mode, role, PI count)
Example: XG1 is exclusive mode, global role, 1 past image.

Lock Mode   Valid Lock Role, PI Count
X           L0, G0, G1
S           L0, G0, G1
N           G1, G2

12-346


Overview of Fusion Lock States


A PCM Fusion lock has three dimensions: lock mode, lock role, and past-image count.
These dimensions together are used to maintain cache coherency in a fusion environment.
The set of lock modes remains unchanged: Exclusive (X), Shared (S), and Null (N). Lock
roles describe local or global interest in the resource. The past-image count indicates the
number of PI buffers that are maintained in the lock. The set of valid lock states is a subset
of the total combination space: XL0, XG0, XG1, SL0, SG0, SG1, NG1, NG2.
Null (N): No examine or modify rights
Share (S): May examine block
Exclusive (X): May modify and create new version of the block
Local (L): Locally managed lock; block can only be dirty in this cache
Global (G): Globally managed lock; may be dirty in more than one cache; must
coordinate with DLM for write
PI count 0: No past image
PI count 1: Past image present. More than one past image can be present.

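The valid-state subset quoted above can be checked mechanically against the mode/role/PI rules. This small sketch (function name invented) parses each state string and asserts the constraints stated in the text:

```python
# Consistency check of the valid Fusion lock states: each state is
# (mode)(role)(PI count); local locks carry no PI, and NULL-mode locks
# are only meaningful with a global role and at least one PI.

VALID_STATES = {"XL0", "XG0", "XG1", "SL0", "SG0", "SG1", "NG1", "NG2"}

def parse_state(state):
    mode, role, pi = state[0], state[1], int(state[2:])
    assert mode in "XSN" and role in "LG"
    return mode, role, pi

for s in VALID_STATES:
    mode, role, pi = parse_state(s)
    if role == "L":
        assert pi == 0               # a locally managed lock holds no PI
    if mode == "N":
        assert role == "G" and pi >= 1
```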
DSI408: Real Application Clusters Internals I-346

Instance or Crash Recovery

SMON from a surviving instance performs the


thread recovery of the failed instance.
Foreground process performs the recovery when
all instances have failed (crash recovery).
Cache fusion recovery builds on the two-pass log
read recovery mechanisms.
First-pass log read
Recovery claim locking
Second-pass redo application

12-347


Instance or Crash Recovery


The SMON from a surviving instance does thread recovery of the failed instance. If a
foreground process detects instance recovery, it posts SMON; foreground processes no
longer do instance recovery.
Crash recovery may be considered a special case of instance recovery whereby all
instances have failed. In both cases, the threads from failed nodes must be merged. The
distinction is that, in instance recovery, SMON performs the recovery. In crash recovery, a
foreground performs recovery.
Cache fusion recovery builds on the two-pass log read recovery mechanisms.
First-pass log read
- Recovery set
- Block Written Records (BWR)
Recovery claim locking
- IDLM Communication
Second-pass redo application
Note: Recovery claim locking is the RAC component of two-pass log read recovery.

DSI408: Real Application Clusters Internals I-347

SMON Process

SMON performs the instance recovery.


The foreground process performs the crash
recovery.
The PMON or the foreground process performs the
block recovery.

SMON acquires IR enqueue.


This avoids multiple, simultaneous recoveries.
Enqueues are now available before blocks
(optimization).
This allows recovery and remastering to take place
in parallel.

12-348


DSI408: Real Application Clusters Internals I-348

First-Pass Log Read

Reads and merges redo threads of the failed


instance
Creates a hash table of blocks that are not known
to have been written to disk
Uses the Block Written Records
Does not use the buffer cache

12-349

Does not advance the checkpoint SCN


First-Pass Log Read


The first-pass log read reads redo threads of the failed instance and merges the results:
By SCN
RBA of last incremental checkpoint for each thread
Modified blocks are added to recovery set
The recovery set contains the first-dirty and last-dirty version information (SCN, Seq#) of
each block.
The process relies on Block Written Records (BWRs):
BWRs identify blocks in the recovery set that can be removed
All holders of flushed PIs write out a BWR
The first pass creates a hash table of blocks that are not known to have been written to the
disk; the hash table is the input for the second pass.
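The first-pass logic above can be sketched as a single scan over the merged redo. This is a simplified model (the record shapes and function name are invented): changes set or update the per-DBA first-dirty/last-dirty versions, and a BWR at or past the last-dirty version trims the entry.

```python
# Sketch of the first-pass log read: build a recovery set keyed by DBA,
# tracking (first-dirty, last-dirty) versions, and trim it with BWRs.

def first_pass(redo_records):
    """redo_records: SCN-ordered (kind, dba, scn, seq) tuples,
    where kind is 'change' or 'bwr'."""
    recovery_set = {}
    for kind, dba, scn, seq in redo_records:
        if kind == "change":
            entry = recovery_set.setdefault(dba, {"first": (scn, seq)})
            entry["last"] = (scn, seq)
        elif kind == "bwr" and dba in recovery_set:
            # written version covers the last dirty version: no recovery needed
            if (scn, seq) >= recovery_set[dba]["last"]:
                del recovery_set[dba]
    return recovery_set

records = [("change", 10, 100, 1), ("change", 10, 120, 2),
           ("bwr", 10, 120, 2), ("change", 11, 130, 1)]
print(first_pass(records))  # only DBA 11 still needs recovery
```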

DSI408: Real Application Clusters Internals I-349

Block Written Record (BWR)

DBA information is written to the log stream.

No force of log file


Batched set of DBAs
Written version of DBA (SCN and Seq#)
Written by DBWR in lazy fashion
Recovery not needed if BWR version is greater than
the latest PI
Trims the recovery set

BWRs are logged:


When writing a block by the writing instance
When sending a block by the PI holder;
increases the likelihood of finding BWR for
excluding the block in second recovery pass

12-350


Block Written Record (BWR)


The BWR is placed in the redo buffer. It is usually not flushed to the disk immediately
with the disk writes, but deferred until the next redo buffer flush. A lost BWR, because of
the instance crash, could at most result in a few blocks needlessly being examined for the
redo application in the second pass.
Note: BWRs are logged by the owner instance that did the write and all holder instances.
Because every instance that modified the buffer logs a BWR following the writing of the
buffer, first-pass is more likely to find the BWR when any one of these instances fail.

DSI408: Real Application Clusters Internals I-350

BWR Dump

Special redo record:


Flagged no valid redo
List of block DBA, SCN
REDO RECORD - Thread:1 RBA: 0x000020.0000106a.0010
LEN: 0x04d8 VLD: 0x02
SCN: 0x0000.00103593 SUBSCN: 1 12/19/2002 07:28:00
CHANGE #1 MEDIA RECOVERY MARKER SCN:0x0000.00000000
SEQ: 0 OP:23.1
Block Written - afn: 7 rdba: 0x01c0368a(7,13962)
scn: 0x0000.00103475 seq: 0x01 flg:0x04
Block Written - afn: 7 rdba: 0x01c03689(7,13961)
scn: 0x0000.00103475 seq: 0x01 flg:0x04
Block Written - afn: 5 rdba: 0x01402611(5,9745)
scn: 0x0000.0010346d seq: 0x01 flg:0x06
...

12-351


BWR Dump
The dump in the slide is from a redo log file dump done with:
SQL> ALTER SYSTEM DUMP LOGFILE 'filename';

DSI408: Real Application Clusters Internals I-351

Recovery Set

The recovery set is organized in a table hashed by


DBA.
Each hash chain is sorted by increasing first-dirty SCN in a doubly linked list.
Specifies the order in which to acquire instance
recovery locks

12-352

Each block entry stores the first-dirty SCN that is


encountered for the block.
Updates the last-dirty version (SCN, Seq#) for
subsequent block changes.


Recovery Set
The first read of a block's change vector in the redo stream sets the first-dirty and last-dirty SCN values in the recovery set. Subsequent reads from the redo stream that occur on
the same block update the last-dirty SCN value in the recovery set.

DSI408: Real Application Clusters Internals I-352

Recovery Claim Locks

SMON sends a RecoveryClaimLock message to


the IDLM master node for each block entry in the
recovery set.
Each recovery set fusion block maps to a unique
IDLM resource.
If the master node for a resource has failed and
the IDLM remastering has not completed, then the
recovery waits.
Locks granted are used by the IDLM to:
Reconstruct the most restrictive lock that could
have been held by a failed instance
Ship the appropriate copy of the block to the
recovering instance

12-353


Recovery Claim Locks


SMON (in instance recovery) sends a RecoveryClaimLock message to the IDLM
master node for each block entry in the recovery set.
Multiple requests may be batched into one message.
Indicates to the IDLM that recovery takes ownership of the block and lock.
The IDLM response generally consists of a block and a fusion lock grant. If locks are held
in XL or SL modes, then no recovery is needed (hence no IDLM message is sent).

DSI408: Real Application Clusters Internals I-353

IDLM Response to RecoveryClaimLock Message on PCM Resource

12-354

Case 1: Lock open on recovering instance: No lock or NL0 — see next slide.

Case 2:
- Lock open on recovering instance: (X, S) Local 0
- Locks open on other instances: Don't care
- Lock granted on recovery buffer: No lock
- Recovery buffer content: No recovery buffer needed
- Recovery action: No recovery; remove entry from recovery set

Case 3:
- Lock open on recovering instance: (X, S) Global (0, 1)
- Locks open on other instances: Don't care
- Lock granted on recovery buffer: Share (X, S) Global lock; increment PI count in lock state, use zero SCN tag
- Recovery buffer content: Initiate write of current block (see note 1)
- Recovery action: No recovery; release recovery buffer, decrement PI count when block write completes

Case 4: Lock open on recovering instance: (N) Global (1, 2)
a) An (X, S) Global open on another instance:
- Lock granted on recovery buffer: Share NG lock, increment PI count
- Recovery buffer content and action: Same as Case 3, (X, S) Global
b) All (N) Globals on other instances:
- Lock granted on recovery buffer: XG1
- Recovery buffer content: Get contents from highest PI, based on SCN tags. If NG2, toss the higher PI (see note 2)
- Recovery action: Apply redo changes, write out recovery buffer when complete

IDLM Response to RecoveryClaimLock Message on PCM Resource


Note 1: Recovery buffer is used for write notification only (no content) and cannot serve a
past-image.
Note 2: Retains PI that is being written. If lock is NG1, it does not determine if PI is being
written, so it must be retained.

DSI408: Real Application Clusters Internals I-354

No Lock Held by Recovering Instance on the PCM Resource

12-355

Case 1:
- Locks open on other instances: No locks open, or all NL0
- Recovery lock: XL0
- Recovery buffer contents: Read block from disk
- Recovery process action: Apply redo changes, write out recovery buffer when complete

Case 2:
- Locks open on other instances: (X, S) Local0
- Recovery lock: No lock
- Recovery buffer contents: No recovery buffer needed
- Recovery process action: No recovery; remove block entry from recovery set

Case 3:
- Locks open on other instances: (X, S) Global (0, 1)
- Recovery lock: NG1 (with zero SCN tag because this is not a PI)
- Recovery buffer contents: Initiate write of current block; recovery buffer used for write notification only (no content)
- Recovery process action: No recovery; write completion will release recovery buffer and lock as usual

Case 4:
- Locks open on other instances: All (N) Global (1, 2)
- Recovery lock: XG0
- Recovery buffer contents: Get contents from highest PI, based on SCN tags
- Recovery process action: Apply redo changes, write out recovery buffer when complete

DSI408: Real Application Clusters Internals I-355

Recovery Claim Locks

12-356

The shipped block is copied into a recovery buffer


that is covered by the granted lock.
After locks have been acquired on all blocks in the
recovery set, a RecoveryDoneClaiming message
is sent to all DLM master nodes.
After IDLM reconfiguration, only resources that
are locked for recovery are unavailable to the
foreground lock requests.
After a buffer is allocated, an IR buffer cannot be
replaced or aged out except by another recovery
buffer request.


Recovery Claim Locks


After the IDLM completes reconfiguration, only the resources that are locked for recovery
are unavailable to foreground lock requests:
IDLM validates the PCM lock space.
Until RecoveryDoneClaiming message is received, the PCM lock database
remains frozen clusterwide.

DSI408: Real Application Clusters Internals I-356

Recovery Claim Locks

12-357

IR buffers must remain in the cache until they are


released during the second pass of redo
application.
IR locks must be held until the covered block is
fully recovered.
The recovery buffers are held in the recovering
instance's default buffer pool.
Large recovery sets may populate the recovering
instance's buffer cache with nonreusable buffers
Lock down-convert requests for recovery buffers
are serviced after the IR lock is released.


Recovery Claim Locks (continued)


IR buffers must remain in the cache until they are released individually during the second
pass of redo application. The exception to this is the spillover scenario, where the
recovering instance's buffer cache cannot hold the entire recovery set; this is described
later in this lesson.
IR locks must be held until the covered block is fully recovered; user lock operations are
not allowed on partially recovered blocks. The buffer cache ensures that IR buffers are not
reused; LEs are tied to buffers through the buffer cache.
Recovery buffers are held in approximately 50% of the recovering instance's default buffer
pool, that is, in the cold half of the LRU buffer chain.
Large recovery sets may populate the recovering instance's buffer cache with nonreusable
buffers. This impacts foreground requests for buffers that are unrelated to recovery and
degrades the overall performance of the recovering instance.
Lock down-convert requests (BASTs) for recovery buffers are deferred and serviced only
after the IR lock is released. Locked IR buffers are marked in-recovery to the cache
layer with lock holder SMON. SMON releases the lock only when recovery for the block
is complete.
DSI408: Real Application Clusters Internals I-357

Second-Pass Log Read

The redo threads of failed instances are again read


and merged by SCN.
The recovery hash table is looked up to decide if
changes are for a recovery set block.
If a recovery buffer matches its last-dirty version
in the recovery set, recovery is complete.
SMON posts DBWR to write the recovery buffer
and clear the in-recovery state of the buffer.
After write completion:
SMON recovery lock on that buffer becomes XL
PI holders on remote instances are invalidated

12-358


Second-Pass Log Read


Redo threads of failed instances are again read and merged by SCN. For each redo record
in the merged redo stream, the recovery hash table is looked up to decide if the change is
for a recovery set block.
Redo changes are applied to recovery buffers that are guaranteed to be in the cache and the
IR lock on those buffer acquired.
After applying a redo record, if the resulting recovery buffer matches its last-dirty version
(SCN and Seq#) in the recovery set, then the recovery is complete for that block.
SMON requests a write of the recovery buffer.
The block is released for normal operations.
When a recovery buffer write is requested, SMON posts DBWR to write the recovery
buffer and clear the in-recovery state of the buffer.
A recovery buffer can become current only after write completion, unlike a regular
buffer.
The cache layer can resume processing of lock down-converts (BASTs) for this
buffer after it has been made current.
After write completion, the SMON recovery lock on that buffer goes from XG0 to XL0, and
PI holders on remote instances are invalidated by the IDLM master.
Note: Dump a buffer header and identify the in-recovery state field.

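The second-pass completion test (the recovery buffer matches its last-dirty version) can be modeled over the recovery set built in the first pass. This is a hedged sketch with invented shapes; the real pass applies change vectors to actual buffers, whereas here "applying redo" is reduced to tracking the applied version:

```python
# Sketch of the second pass: look up each merged redo record in the
# recovery hash table, apply it to the recovery buffer, and release
# the block once it matches its last-dirty version.

def second_pass(merged_redo, recovery_set):
    """merged_redo: SCN-ordered (dba, scn, seq) tuples.
    Returns the recovered DBAs in completion order."""
    applied, done = {}, []
    for dba, scn, seq in merged_redo:
        if dba not in recovery_set:
            continue                      # change is not for a recovery-set block
        applied[dba] = (scn, seq)         # apply redo to the recovery buffer
        if applied[dba] == recovery_set[dba]["last"]:
            done.append(dba)              # post DBWR, clear "in-recovery"
    return done
```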
DSI408: Real Application Clusters Internals I-358

Second-Pass Log Read

Recovery locks differ from regular PCM locks only


in their response to BASTs.
There is no distinction between recovery and
regular locks at the IDLM level.

When the last recovery buffer is released,


recovered threads are checkpointed and closed.
IR is complete when all dead threads have been
checkpointed and closed.

12-359


Second-Pass Log Read (continued)


When the last recovery buffer is released, recovered threads are checkpointed and closed.
This requires a wait for write completions on the outstanding requests that were issued
during IR lock acquisition.

DSI408: Real Application Clusters Internals I-359

Large Recovery Set


and Partial IR Lock Mode

Buffers and LEs for IR are allocated from the SGA


by using existing mechanisms for allocating
recovery buffers (kcbrra).

For RAC systems that are configured for high


availability, recovery sets are small relative to the
size of the buffer cache of the recovering instance.
The largest recovery set is known at the start of IR.
At the end of the first-pass log read, SMON may
switch to Partial IR lock mode.

12-360


Large Recovery Set and Partial IR Lock Mode


The size of the buffer cache of the recovering instance places a limit on the largest
recovery set that can be completely accommodated (that is, a recovery buffer and lock
allocated for every block in the recovery set at the end of first pass and recovery lock
claim).
For RAC systems that are configured for high availability, recovery sets are normally
small relative to the size of the recovering instance's buffer cache.
Based on the buffer cache size, the largest recovery set that the recovering instance's SGA
can accommodate is known at the start of instance recovery.
For example, assume that M blocks are available in the cache of the recovering instance. If,
at the end of the first-pass log read, the recovery set is greater than M, then SMON
switches to Partial IR lock mode.

DSI408: Real Application Clusters Internals I-360

Large Recovery Set


and Partial IR Lock Mode

12-361

Submit RecoveryClaimLock messages for the


first M blocks in the recovery list.
Begin the second-pass log read and redo
application.
If redo is encountered for a block on the recovery
list, a recovery buffer is paged out and reused.
When the reused list is not empty, the recovery list
no longer represents the optimal order to acquire
recovery buffers.
When recovery and reused lists are empty, SMON
issues a RecoveryDoneClaiming message to the
DLM, allowing it to proceed with lock domain
validation.

Large Recovery Set and Partial IR Lock Mode (continued)


Note the difference between the recovery list and the recovery set. The recovery list is a
doubly linked list of recovery set entries that are sorted by increasing first-dirty SCN.
The first-dirty SCN ordering ensures that these are the first M blocks in the merged redo
stream. Remove these M blocks from the recovery list because the recovery list contains
only recovery set blocks for which a buffer and lock have not been acquired.
The PCM lock database remains frozen, because SMON cannot issue the
RecoveryDoneClaiming message. Apply redo changes to the M recovery buffers.
After the buffer is fully recovered and written to disk, issue another
RecoveryClaimLock message for the head block on the recovery list.
When a recovery buffer is reused (and lock released), its recovery set block entry is put on
a reused list. A RecoveryClaimLock request is made for the new block, which is
removed from the recovery list. When redo is encountered for a reused list block, a buffer
and lock are acquired and the block is taken off the reused list.

DSI408: Real Application Clusters Internals I-361

Large Recovery Set and Partial IR Lock Mode (continued)


When the reused list is not empty, the recovery list no longer represents the optimal order
to acquire recovery buffers. So, when a recovery buffer is released after applying the last
redo change, there is no correct choice for the next block; no lock request is made at this
time. If the reused list becomes empty again, recovery can revert to acquiring locks in
recovery list order when a recovery buffer is allocated.
When both recovery and reused lists are empty, SMON issues a
RecoveryDoneClaiming message to the DLM that allows it to proceed with lock
domain validation.
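The list manipulation described above can be sketched as a toy bookkeeping model. This is illustrative only: the function and variable names are ours, not the kernel's, and buffer-capacity accounting is simplified.

```python
def partial_ir_messages(recovery_list, m, redo_stream):
    """Toy model of Partial IR lock mode (not the kernel implementation).

    recovery_list: block ids sorted by increasing first-dirty SCN.
    m: number of recovery buffers available in the cache.
    redo_stream: second-pass redo as (block, is_last_change) pairs.
    """
    claimed = list(recovery_list[:m])     # RecoveryClaimLock sent for these
    pending = list(recovery_list[m:])     # still on the recovery list
    reused = []                           # entries whose buffer was paged out
    msgs = [("RecoveryClaimLock", b) for b in claimed]

    for block, is_last_change in redo_stream:
        if block in pending:
            # Redo for an unclaimed recovery-list block: page out a claimed
            # buffer (victim choice is unspecified here) and reuse it.
            victim = claimed.pop(0)
            reused.append(victim)
            pending.remove(block)
            claimed.append(block)
            msgs.append(("RecoveryClaimLock", block))
        elif block in reused:
            # Redo for a reused-list block: acquire a buffer and lock again.
            reused.remove(block)
            claimed.append(block)
            msgs.append(("RecoveryClaimLock", block))
        # else: block already claimed; apply redo to its recovery buffer.

        if is_last_change and block in claimed:
            # Buffer fully recovered and written to disk: release it.
            claimed.remove(block)
            if not reused and pending:
                # Recovery-list order is optimal again: claim the head block.
                head = pending.pop(0)
                claimed.append(head)
                msgs.append(("RecoveryClaimLock", head))
            # If the reused list is non-empty there is no correct next
            # choice, so no lock request is made at this time.

    if not (claimed or pending or reused):
        msgs.append(("RecoveryDoneClaiming", None))
    return msgs
```

For example, with recovery_list=[1, 2, 3], m=2, and redo that finishes each block in list order, the claims go out in recovery-list order and RecoveryDoneClaiming is issued once both lists drain.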

DSI408: Real Application Clusters Internals I-362

Lock Database Availability During Recovery

When an instance dies, the IDLM initiates lazy


remastering.
The PCM lock space remains invalid while:
IDLM master nodes discard locks that are held by
dead instances
SMON issues a RecoveryDoneClaiming message

12-363

Most PCM lock operations are frozen.


User operations that do not require interaction
with the IDLM can proceed.

Copyright 2003, Oracle. All rights reserved.

Lock Database Availability During Recovery


Lazy remastering means that a minimal subset of resources is remastered to maintain
consistency of the lock database. This occurs in parallel with the first-pass log read where
the recovery set is constructed.
The entire PCM lock space remains invalid while the IDLM and SMON complete the
following:
IDLM master nodes discard locks that are held by dead instances; the space
reclaimed by this operation is used to remaster the locks held by surviving
instances that were mastered by a dead instance.
SMON issues a RecoveryDoneClaiming message.
While the lock domain is invalid, most PCM lock operations are frozen, making the
database unavailable for users requesting a new or incompatible lock. The following lock
operations are allowed in an invalid lock domain:
Closing of a lock held by the recovering instance to use its buffer for instance recovery.
Lock operations for locally partitioned tablespaces on a surviving node, provided that
a dead instance was not the owner.
Note: User operations that do not require interaction with the IDLM can proceed (for
example, a foreground process holding an XL lock).
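The two exceptions above can be condensed into a small predicate. This is a toy sketch with hypothetical descriptor fields, not an Oracle API:

```python
def pcm_op_allowed_in_invalid_domain(op):
    """Toy predicate: which PCM lock operations may proceed while the
    lock domain is invalid (descriptor fields are hypothetical)."""
    if op["kind"] == "close" and op["holder"] == "recovering_instance":
        return True   # free the lock's buffer for instance recovery
    if (op["kind"] == "local_partition_op"
            and op["node"] == "survivor"
            and not op["owned_by_dead_instance"]):
        return True   # locally partitioned tablespace on a surviving node
    return False      # all other PCM lock operations are frozen
```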

DSI408: Real Application Clusters Internals I-363

Handling BASTs on Recovery Buffers

12-364

While a recovery buffer still requires redo to be


applied, it is flagged with an in-recovery state.
LCK permits a BAST on an in-recovery buffer to be
suspended indefinitely.
When the in-recovery flag is cleared, normal
down-convert processing is resumed.

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-364

IR of Nonfusion Blocks

12-365

During IR, lock acquisition for a nonfusion block is
treated like that for a local (XL or SL) fusion block.
If surviving instances hold S/X locks, the failed
instance could not have had the block dirty.
If there are no surviving locks, the block must be
read from disk to determine if recovery is needed.
Blocks are removed from the recovery set if the
on-disk version is more recent than the last-dirty
version.

Copyright 2003, Oracle. All rights reserved.

IR of Nonfusion Blocks
If there are no surviving locks, the block must be read from disk and compared with the
last-dirty version for the block entry to determine if recovery is necessary.
During IR lock acquisition, an X lock is acquired on the block and it is read from disk. If
the on-disk version is more recent than the last-dirty version, then the block is removed
from the recovery set.
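The decision just described can be summarized in a small helper. This is an illustrative sketch (the function name and return strings are ours); the comparison follows the "more recent than" wording above.

```python
def nonfusion_ir_action(surviving_lock_mode, on_disk_scn=None, last_dirty_scn=None):
    """Toy IR decision for a nonfusion block.

    surviving_lock_mode: 'X', 'S', or None if no surviving instance
    holds a lock on the block.
    """
    if surviving_lock_mode in ("X", "S"):
        # A survivor holds the block: the dead instance cannot have had
        # it dirty, so no recovery is needed.
        return "drop from recovery set"
    # No surviving lock: acquire an X lock, read the block from disk,
    # and compare versions.
    if on_disk_scn > last_dirty_scn:
        return "drop from recovery set"       # disk copy already current
    return "second-pass recovery"
```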

DSI408: Real Application Clusters Internals I-365

IR of Nonfusion Blocks

The IDLM response to RecoveryClaimLock messages


for nonfusion blocks is listed in the following table:
Current Lock Mode             Lock Granted   Recovering Process Action
Exclusive (X) or Share (S)    No Lock        No recovery needed; delete block
                                             entry from recovery set
Null or No Lock               X              Read block from disk; do
                                             second-pass recovery if needed

12-366

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-366

Failures During Instance Recovery

Instance recovery restart


Recovery fails without the death of the recovering
instance.

12-367

Death of the recovering instance


Death of a nonrecovering instance
I/O errors
Block corruption during redo application

Copyright 2003, Oracle. All rights reserved.

Failures During Instance Recovery


Restart is allowed only while the lock domain is invalid. After all IR locks have been
acquired and the RecoveryDoneClaiming message is issued, the lock domain is
validated. BASTs are queuing up on recovery locks, so it is not possible for SMON to
release its locks and restart the recovery. Recovery errors that occur after lock domain
validation must either fail the recovering instance or allow the recovery to complete.
If there is a surviving instance, it grabs the IR enqueue and starts the recovery. Crash
recovery is necessary if all instances are down.
Death of a Nonrecovering Instance
If the failure is during lock acquisition, it is detected during the
RecoveryDoneClaiming message that is broadcast to all IDLM masters. A change
in the lock domain, caused by instance death, is communicated to the recovering SMON
by the IDLM. SMON aborts recovery and releases the IR enqueue. The next live instance
that detects a dubious lock will reattempt the instance recovery.

DSI408: Real Application Clusters Internals I-367

Failures During Instance Recovery (continued)


I/O Errors and Block Corruption
In case of I/O errors, the file is taken offline and IR is restarted. If the I/O error is on a
system tablespace datafile, the recovering instance crashes; eventually, all instances in the
cluster crash. Media recovery is required if the I/O error is not transient.
Online block recovery attempts to clean up corrupted blocks to allow IR to proceed. If
block recovery succeeds, the block should not need further recovery (IR should find
recovery done up to the last-dirty SCN and drop it from the recovery list). If block
recovery fails, the recovering instance crashes and IR is restarted.

DSI408: Real Application Clusters Internals I-368

Memory Contingencies

Fusion recovery needs additional memory from:


SGA of the recovering instance
PGA of the recovering process (SMON for instance
recovery, foreground for crash recovery)

This memory is needed for:


The recovery set
Log buffers
Instance recovery locks

12-369

Copyright 2003, Oracle. All rights reserved.

Memory Contingencies
The recovery set (hash table and block entries) is stored in the PGA of the recovering
process. There must be enough virtual memory to construct the recovery set in PGA to
complete the first pass.
There must be at least one buffer per thread being recovered in the buffer cache for the
first- and second-pass log reads.
LEs correspond to recovery buffers. If a recovery block is not in the cache, then there is no
lock storage associated with it.

DSI408: Real Application Clusters Internals I-369

Code References

The main code routines for recovery are:


kcratr: Thread redo application
kcratr1: Pass one: construct recovery set
kcratr_claim: Claim recovery buffers
kcbrbuf: Get a recovery buffer
This is the Buffer Cache Interface.
Call tree: kclclaim, kclcfusion,
KCL_CONVERT_RECOVERY_LOCK
This is the entry into the IDLM Interface.

kcratr2: Pass two: apply change vectors


If not all buffers were claimed in kcratr_claim,
then kcratr2 calls kcratr_claim recursively.

12-370

Copyright 2003, Oracle. All rights reserved.

Code References
A more detailed list that indicates calling depth:
ktm.c
kcv.c
kct.c
kcra.c

kcrp.c
kcb.c
kcl.c

1. ktmmon - smon loop


1. kcvirv - Instance RecoVery (called by SMON, db is open)
1. kctrec - RECover threads - recover and close threads
1. kcratr - Thread Redo application
1. kcratr1 - Pass one of two pass recovery processing
2. kcratr_claim - Claim recovery buffers
1. kcrpclaim - Claim recovery buffer
1. kcrpsend_claim - send recovery buffer claim message
2. kcbrbuf - get a Recovery Buffer BUFFER CACHE INTERFACE
1. kclclaim - Claim a recovery lock
1. kclcfusion - Claim Fusion lock
1. kclcsfusion - start fusion recovery request
1. KCL_CONVERT_RECOVERY_LOCK IDLM INTERFACE
this is kjbrecoveryopen/kjbrecoverconvert.

DSI408: Real Application Clusters Internals I-370

Code References (continued)


Note, we also issue kjbrecoveryassume when we get the PI.
kcra.c
kcrfr.c
kcrp.c
kcb.c

3. kcratr2 - Pass two of two pass recovery processing
   1. kcrfrgv - get change Vector header/data
   1. kcrpap - APply change vector
   1. kcbtema - Thread recovery Exam and Maybe Apply
   1. kclrdone - Recovery is Done so clean up buffer

DSI408: Real Application Clusters Internals I-371

Summary

In this lesson, you should have learned how to:


Explain Cache Fusion recovery implementation
Examine the recovery/cache interface
Examine the recovery/DLM interface
Describe the basic Cache Fusion recovery
algorithm

12-372

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-372

[Section roadmap diagram: each node runs the SQL Layer, Buffer Cache, CGS,
GES/GCS, and Node Monitor above the Cluster Manager; Section III covers the
Platforms.]

Copyright 2003, Oracle. All rights reserved.

Linux Platform

Copyright 2003, Oracle. All rights reserved.

Objectives

After completing this lesson, you should be able to do


the following:
Outline the distinguishing features of RAC on the
Linux platform
Install, start, and stop RAC on the Linux platform
List the Linux-specific software components

13-377

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-377

Linux RAC Architecture

Hardware
Intel-based hardware
Externally shared SCSI or Fiber Channel disks
Interconnected via NIC
Software
OS versions supported:
RedHat 7.1 (9.0.1 and 9.2)
Suse 7.2 and Suse SLES7 (9.0.1 and 9.2)

Oracle-supplied CM, NM, and Watchdog (different


with each version of Oracle)

13-378

Copyright 2003, Oracle. All rights reserved.

Linux RAC Architecture


RAC on Linux requires the following:
Two or more 32-bit Intel servers, maximum 32 nodes
A separate and dedicated intracluster network among the nodes with NICs. If the
cluster has more than two nodes, then a switch or hub in the intracluster network
might be necessary.
An external shared SCSI disk array or external Fiber Channel disk array with
shared disk partitions
At present, Linux is limited to eight nodes. The limitation is in the interconnect, the
disk system, the CM, the NM, and the Watchdog. These components have different limits.
The component with the lowest limit sets the limit for a RAC system.

DSI408: Real Application Clusters Internals I-378

Storage: Raw Devices

The supportable storage for RAC is raw devices.


Raw devices are usually named /dev/raw[0-9].
Up to 255 raw devices are possible.

The tool that is used to set up and query raw


devices is raw.

To make a SCSI disk partition a raw device:

raw /dev/raw1 /dev/sda3

13-379

Oracle Cluster File System can be used.

Copyright 2003, Oracle. All rights reserved.

raw Command
Usage: raw /dev/raw<N> /dev/<blockdev>
On Redhat, it is /dev/rawctl - raw io control device (it is in /usr/sbin/raw).
On Suse, it is /dev/raw - raw io control device (it is in /usr/local/bin/raw).
In the slide example, sda3 means the third partition of the first SCSI disk.
Note: You can store the commands at /etc/rc.d/boot.local. The commands
are executed immediately after booting. Or, store the commands in a file and execute
that file from boot.local.
For example, rawsetup is a file with all the commands for configuring the raw
devices and /etc/rc.d/boot.local contains the line:
. /etc/init.d/rawsetup
After creating raw partitions, you must give correct permissions on /dev/raw*.

DSI408: Real Application Clusters Internals I-379

Extended Storage

13-380

Logical Volume Manager (LVM), only available on


SuSe
Xraw
Cluster File Systems (CFS)

Copyright 2003, Oracle. All rights reserved.

Extended Storage
LVM
The LVM hides the details about where data is stored: on what hardware as well as
where on that hardware. The management of volume groups and logical volumes can be
done while they are being used by the system. For example, you can increase the size of
a logical volume while it is being mounted; you do not have to unmount.
Cluster File Systems
Linux does not have its own cluster file system. Various third-party suppliers (like
Polyserve) supply a CFS. Oracle supplies its own CFS. This is the only supported option.

DSI408: Real Application Clusters Internals I-380

Linux Cluster Software

Extended with the Oracle-supplied Cluster


Manager (OCMS)
Kernel tuned with parameter settings:
/proc/sys/kernel/shmmax - 2147483647
/proc/sys/fs/file-max - 81920
config_watchdog_nowayout set to Y.

13-381

Copyright 2003, Oracle. All rights reserved.

Linux Cluster Software


OCMS
Unlike the Oracle Real Application Clusters versions on UNIX platforms, you do not rely
on a Linux vendor to provide the clusterware layer (the operating system-dependent
modules or their equivalents). OCMS is included with Oracle9i for Linux.
Kernel Settings
echo 2147483647 > /proc/sys/kernel/shmmax.
The config_watchdog_nowayout parameter cannot be changed dynamically. It
should be changed during installation of the OS.

DSI408: Real Application Clusters Internals I-381

OCMS

13-382

OCMS is included with Oracle for Linux.


OCMS is layered above the operating system and
provides all the clustering services that Oracle
RAC needs to function as a high-availability and
high-scalability solution.
OCMS provides cluster membership services,
global view of clusters, node monitoring, and
cluster reconfiguration.

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-382

OCMS Components

OCMS consists of:

13-383

Watchdog daemon (WDD) in Oracle9i and Oracle8i


Hangcheck module (Oracle9i Release 2)
Node monitor (NM)
Cluster Manager (CM)

The binaries are in:


$ORACLE_HOME/lin_nm/latest

Copyright 2003, Oracle. All rights reserved.

OCMS Components
Version Note
The Linux OCMS is ported from the Windows NT/2000 version.
Oracle version 9.0.x and 8.1.x architecture used an Oracle-written watchdog daemon to
monitor for system hangs, running as a process in user-space.
Oracle9i releases 9.2.0.1 and earlier use the Linux supplied softdog module to reset the
node in case of hangs.
Oracle9i release 9.2.0.2 uses a new Oracle-written, loadable kernel module, hangcheck-timer, that runs in kernel space. The NM and CM functionality is combined into the
oracm background process (no more nm.log).
The older watchdog (Oracle9i release 1 and earlier) could be starved for CPU by heavy
load and high kernel activity, causing many unnecessary node resets (false evictions).

DSI408: Real Application Clusters Internals I-383

WDD, NM, and CM Flow


(Up to version 9.2.0.1)
13-384

[Flow diagram: in user mode, the Oracle instance, the Cluster Manager (instance-level
cluster information), the Node Monitor (node-level cluster information), and the
watchdog daemon are stacked in that order; the watchdog service connects the layers,
and the watchdog daemon drives the watchdog timer in kernel mode.]

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-384

Watchdog Daemon

13-385

The watchdog daemon monitors the NM and the


CM and passes notifications to the watchdog
timer at defined intervals.
Watchdog services are documented at:
/usr/src/linux/Documentation/watchdog.txt
The WDD is replaced by the hangcheck-timer
kernel module as of Oracle release 9.2.0.2.0.

Copyright 2003, Oracle. All rights reserved.

Watchdog Daemon
The important kernel configuration parameter for the watchdog daemon is
config_watchdog_nowayout.
After you create /dev/watchdog by using mknod, you get a watchdog daemon.
That is, subsequently opening the file and then failing to write to it for longer than one
minute results in rebooting the machine.
The watchdog can stop the timer if the process managing it closes the
/dev/watchdog file, provided that the parameter
config_watchdog_nowayout is set to N. The watchdog cannot be stopped after it
has been started if config_watchdog_nowayout is set to Y. On Redhat, it is N by
default, and on SuSe it is Y by default.

DSI408: Real Application Clusters Internals I-385

Hangcheck, NM, and CM Flow


(After version 9.2.0.2)
Oracle instance

Cluster Manager (including Node Monitor)

Oracm maintains both node


status view and Oracle
instance status view.

User mode

The hangcheck-timer monitors


the kernel for hangs, and
resets the node if needed.

Kernel mode

Hangcheck-timer

13-386

Copyright 2003, Oracle. All rights reserved.

Hangcheck, NM, and CM Flow


For version 9.2.0.2 and later.
Hangcheck-timer monitors heartbeats from oracm I/O capable clients. A node reset will
occur when the following is true:
(system hang time) > (hangcheck_tick + hangcheck_margin)

DSI408: Real Application Clusters Internals I-386

Hangcheck Module

Loaded as a kernel module


Specified by the parameter KernelModuleName in
the CMCFG.ORA file

$ cd $ORACLE_HOME/oracm/admin
$ grep KernelModuleName cmcfg.ora
KernelModuleName=hangcheck-timer

13-387

Copyright 2003, Oracle. All rights reserved.

Hangcheck Module
The hangcheck module is implemented from version 9.2.0.2 and later.
This module is not required for the CM operation, but its use is highly recommended.
This module monitors the Linux kernel for long operating system hangs that could
affect the reliability of a RAC node and cause corruption of a RAC database. When such
a hang occurs, this module sends a signal to reset the node.
Node resets are triggered from within the Linux kernel, making them much less affected
by the system load.
The CM on a RAC node can be easily stopped and reconfigured, because its operation is
completely independent of the kernel module.
The features that are provided by the hangcheck-timer module closely resemble the
features found in the implementation of the CM for RAC on the Windows platform, on
which the CM on Linux was based.

DSI408: Real Application Clusters Internals I-387

Node Monitor (NM)

13-388

Maintains a consistent view of the cluster


Reports the node status to the cluster manager
Uses a heartbeat mechanism
Works with WDD and takes action depending on
the type of failure

Copyright 2003, Oracle. All rights reserved.

Node Monitor (NM)


The node monitors on all nodes send heartbeat messages to each other. Each node
maintains a database that contains the status information on other nodes. The NMs in a
cluster mark a node inactive if the node fails to send a heartbeat message within a
defined time interval.
The heartbeat message from the NM on a remote server can fail for the following
reasons:
Termination of the NM on the remote server
Network failure
Heavy load on the remote server
The NM reconfigures the cluster to terminate the isolated nodes, ensuring that the
remaining nodes in the reconfigured cluster continue to function properly.
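The heartbeat timeout check amounts to the following sketch (names and structure are ours, not the NM implementation):

```python
def inactive_nodes(last_heartbeat, now, interval):
    """Toy NM check: a node is marked inactive if it has not sent a
    heartbeat message within the defined time interval.

    last_heartbeat: {node_name: timestamp of last heartbeat received}
    """
    return [n for n, t in last_heartbeat.items() if now - t > interval]
```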

DSI408: Real Application Clusters Internals I-388

Cluster Manager

13-389

The CM maintains the process-level cluster status.


The CM accepts the registration of Oracle
instances to the cluster and provides a consistent
view of Oracle instances.
When an Oracle process that writes to the shared
disk quits abnormally, the CM on the node detects
it and requests WDD to take appropriate action.

Copyright 2003, Oracle. All rights reserved.

Cluster Manager (CM)


If /a:1 is set, and if LMON terminates abnormally, then the CM daemon on the node
detects it and requests the watchdog daemon to stop the node completely. This stops the
node from issuing physical I/O to the shared disk before CM daemons on the other
nodes report the cluster reconfiguration to the Oracle instances on the nodes. This action
prevents database corruption.

DSI408: Real Application Clusters Internals I-389

Linux Port-Specific Code

rdbms/src/generic/osds/skgxpu.c
rdbms/src/generic/osds/sskgxpu.c
libcmdll.so - rdbms/src/port/cm/dll/
Has one-to-one mapping for skgxn functionality

Cluster implementation is similar to NT


implementation.

13-390

Copyright 2003, Oracle. All rights reserved.

Linux Port-Specific Code


The operating system-dependent (OSD) modules are:
skgxp for communicating with the nodes
libskgxn for communicating with the cluster

DSI408: Real Application Clusters Internals I-390

Cluster Manager

CM source code is available at:


rdbms/src/port/cm
rdbms/src/port/nm
rdbms/src/port/wdd

Sharable object libraries:


libcmdll.so
libnmdll.so
libwddapi.so

13-391

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-391

skgxpt and skgxpu

13-392

Oracle version 9.0.1 (and earlier) has TCP/IP


implementation.
Oracle version 9.2 has UDP implementation.
TCP/IP is not supported in version 9.2 (and later
versions).
skgxpu.c, sskgxpu.c are the same as base
version, except for changes in skgxp_ipcluster.
libskgxpu.a includes skgxpu.o and sskgxpu.o.
libskgxpt.a includes skgxpt.o and sskgxpt.o.

Copyright 2003, Oracle. All rights reserved.

skgxpt and skgxpu


In release 9.0.1, libskgxpt.a and libskgxp9.a will have *skgxpt*.o objects
archived.
In release 9.2, only libskgxpu.a has *skgxpu.o objects.

DSI408: Real Application Clusters Internals I-392

Installing RAC on Linux

See WebIV note Step-By-Step Installation of RAC


on Linux 184821.1.
Linking of Oracle:
Version 9.0.1 and earlier uses nmliblist while
linking with Oracle RAC. (nmliblist contained
libcmdll.so)
Version 9.2: libcmdll.so copied to libskgxn9.so

13-393

Copyright 2003, Oracle. All rights reserved.

Installing RAC on Linux


In order for cluster detection to occur during the 9.2.0.1 database installation, you must
configure the older watchdog daemon (WDD). Without this configured, the Installer
will not install RAC and will not rcp the Oracle S/W to the other cluster nodes.
You must still set up the WDD as described in the note 184821.1 as an interim step on
your way up to 9.2.0.2 and the new hangcheck timer-based oracm. Do not run the old
WDD in production.

DSI408: Real Application Clusters Internals I-393

Installing RAC on Linux (continued)


OraCM version 9.2.0.1
Load OS softdog module (all as root). Softdog is included with AS 2.1 (no build
necessary like on RH 7.1):
$> insmod softdog soft_margin=15 nowayout=1
Start watchdog from $OH/oracm/bin.
$> ./watchdogd -g dba -d /dev/null -l 0
Edit cmcfg.ora file.
WatchdogSafetyMargin=1000
WatchdogTimerMargin=1500
Start oracm.
$> ./oracm /a:0
OraCM version 9.2.0.2
9.2.0.2 uses hangcheck-timer. Latest version is 0.4-2 for IA32 and 0.5-1 for IA64.
Download software from:
http://kernel.us.oracle.com/software/
Install correct version of hangcheck-timer based on your kernel release (uname -a):
# rpm -ivh hangcheck-timer-2.4.9-e.10-0.4.0-2.i686.rpm
Configure cmcfg.ora and ocmargs.ora according to note 222746.1. Recommended
cmcfg.ora settings:
MissCount must be set to a large value and must be greater than the sum of
hangcheck_tick + hangcheck_margin. Recommended value is 215 seconds.
Load hangcheck-timer at boot via rc.local:
/sbin/insmod hangcheck-timer hangcheck_tick=30
hangcheck_margin=180
Start up oracm as root:
export ORACLE_HOME=/u01/app/oracle/product/9.2.0
sh $ORACLE_HOME/oracm/bin/ocmstart.sh

DSI408: Real Application Clusters Internals I-394

Installing RAC on Linux (continued)


Example cmcfg.ora
HeartBeat=15000
ClusterName=Oracle Cluster Manager, version 9i
KernelModuleName=hangcheck-timer
PollInterval=1000
MissCount=215
PrivateNodeNames=heartbeat3 heartbeat4
PublicNodeNames=rcbstint3 rcbstint4
ServicePort=9998
CmDiskFile=/ocfsdisk1/quorum/quorumfile
HostName=heartbeat3
Example ocmargs.ora
oracm
norestart 1800
To verify what interface the CM traffic is using:
[rcbstint3 ~]$ netstat -a | grep 9998
udp        0      0 heartbeat3:9998         *:*
[rcbstint4 ~]$ netstat -a | grep 9998
udp        0      0 heartbeat4:9998         *:*

Check hosts file:


$ grep heart /etc/hosts
10.1.1.3    heartbeat3
10.1.1.4    heartbeat4

DSI408: Real Application Clusters Internals I-395

Running RAC on Linux

Scripts for starting and stopping the cluster:


startclu
stopclu
oracm/bin/ocmstart.sh

13-396

ps -efl | egrep 'watchdogd|oranm|oracm'

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-396

Starting CM

Starting OCMS involves the following:


WDD
Configuring NM
Starting NM
Starting CM

13-397

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-397

Starting WDD

Starting WDD:
watchdogd -g dba

13-398

Copyright 2003, Oracle. All rights reserved.

Starting WDD
WDD is used only in Oracle9i before release 9.2.0.2.
Options to the watchdog command are:
-l: If 0, then no resources are registered for monitoring. This can be used while
debugging system configuration problems.
-t <number>: default 1000 ms (range: 0 ms to 3000 ms). This is the time
interval at which the WDD checks the heartbeat messages from its clients.
The default log file is $ORACLE_HOME/oracm/log/wdd.log.

DSI408: Real Application Clusters Internals I-398

Starting NM

The cluster nodes and CmHostName are defined
in $OH/oracm/admin/nmcfg.ora.

You must check that WDD is running first.


oranm </dev/null >$OH/oracm/log/nm.out 2>&1 &

13-399

Copyright 2003, Oracle. All rights reserved.

Start Options in NM
nmcfg.ora parameters:

pollinterval: Sends heartbeat messages at this interval. Default value 1000;


range 10 ms to 180000 ms.

watchdogMarginWait: Specifies the delay between a node failure and the


commencement of Oracle RAC cluster reconfiguration. Default value 70000.

autojoin: If 1, NM joins the cluster when NM starts. If 0, it joins when
CM requests it to join. Default value 0.
Switches for oranm
/?: Prints help text
/v: Verbose mode. Prints detailed info about every activity of the NM.
/s: Prints information about NM network traffic info
/r: Shows help for NM parameters. NM does not start with this option.
/c: Prints messages sent from CM to NM

DSI408: Real Application Clusters Internals I-399

Starting CM

1. Check if WDD and NM have started.


2. Confirm that the host name in CmHostName
parameter of nmcfg.ora is in /etc/hosts.
oracm </dev/null> $OH/oracm/log/cm.out 2>&1 &

13-400

Copyright 2003, Oracle. All rights reserved.

Options for oracm


/?: help text
/a: Defines the action taken when the LMON process or any other Oracle process that
can write to the shared disk terminates abnormally. If action is 0, no action is taken. If
action is 1 (default), the CM requests the WDD to stop the node completely. Set /a to 0.
/v : Prints detailed information on every activity of CM
/d : Prints more trace information for debug

DSI408: Real Application Clusters Internals I-400

Debugging

13-401

For general debugging, use gdb.


For skgxp debugging, use IPC tracing.
sskgxp provides dump routines that can be used
for debugging.
Examine cluster code debug, log files, and out
files.

Copyright 2003, Oracle. All rights reserved.

Debugging
sskgxp_dmpsspt - port: dumps port structure.
sskgxp_dmpsspid

DSI408: Real Application Clusters Internals I-401

Summary

In this lesson, you should have learned how to:


Outline the distinguishing features of RAC on the
Linux platform
Install, start, and stop RAC on the Linux platform
List the Linux-specific software components

13-402

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-402

References

www.sistina.com/lvm
linux.oracle.com

Administrator's Guide for Oracle9i for UNIX


Sys-admin: Scott Forten
Cluster-related: Takiba

13-403

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-403

HP-UX Platform

Copyright 2003, Oracle. All rights reserved.

Objectives

After completing this lesson, you should be able to do


the following:
Outline the distinguishing features of RAC on the
HP-UX platform
Install, start, and stop RAC on the HP-UX platform
List the HP-UXspecific software components

14-405

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-405

HP-UX RAC Architecture

Clusters are called Multi Computer (MC).


Interconnects can be:
LAN, normal Ethernet architecture and protocols
HyperFabric, a proprietary protocol Cluster
Interconnect (CLIC)
Copper-based, fiber-based, or mixed
Direct node-to-node or via switch (hub)

14-406

Depending on the choice of interconnect, up to


eight nodes can be clustered together.

Copyright 2003, Oracle. All rights reserved.

HP-UX Architecture
For more information on HP-UX hardware variations, refer to
http://docs.hp.com/hpux/onlinedocs/B6257-90031/B625790031_top.html.

DSI408: Real Application Clusters Internals I-406

HP-UX Cluster Software

HP cluster services are required for:


MC/Service Guard (MCSG), RAC Edition
Nmapi2: implementation of the SKGXN interface

14-407

Shared volume group services

Copyright 2003, Oracle. All rights reserved.

HP-UX Cluster Software


nmapi2 is HP's implementation of the SKGXN interface, which is located in
/opt/nmapi2/lib//libnmapi2.sl.
The Shared Volume Group Service provides the shared volume group services (for raw
devices) along with the volume group services.

DSI408: Real Application Clusters Internals I-407

HP-UX Port-Specific Code

The following three SKGXP implementations are


present:
TCP: Not recommended or tested
UDP: Most commonly used
lowfat: Provided by HP, used with CLIC

14-408

Copyright 2003, Oracle. All rights reserved.

HP-UX Port-Specific Code


The lowfat SKGXP implementation is the default, in Oracle release 9.2 and later
versions, if the CLIC interface and software are present. Otherwise, the UDP version is used.
The lowfat SKGXP implementation requires a relink because it is supplied to the
customer by HP.

DSI408: Real Application Clusters Internals I-408

SKGXP (UDP Implementation)

14-409

SKGXP provides a failover mechanism from the


primary network to a secondary network.
The primary network is always a CLIC interface. It
is NULL if CLIC is not present.
The secondary interface is the interface that is
bound to the host name (uses gethostbyname).

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-409

SKGXP: Lowfat

HP provides the software directly to customers:


Proprietary protocol by HP
Failover within CLIC interfaces
No failover from CLIC-to-LAN interfaces

14-410

Copyright 2003, Oracle. All rights reserved.

SKGXP: Lowfat
The HP Cluster Interconnect (CLIC) protocol is proprietary and is part of the HyperFabric
cluster system.

DSI408: Real Application Clusters Internals I-410

Installing RAC on HP-UX

See WebIV note Step-By-Step Installation of RAC on


HP-UX 182177.1.

14-411

Copyright 2003, Oracle. All rights reserved.

Installing RAC on HP-UX


The OS-specific steps are:
Configuring the cluster hardware, including OS patches
Installing and configuring disk arrays
Installing and configuring Cluster Interconnect and Public Network Hardware
Creating a cluster
- Modifying the /etc/lvmrc file
- Creating a Shared Logical Volume
- Installing the cluster software
- Forming a one-node cluster, performing basic cluster administration
Finally, install Oracle RAC software.
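The shared-logical-volume step above can be sketched as follows. The volume group name and disk path are hypothetical, and the exact SLVM flags should be verified against the ServiceGuard documentation; the RUN=echo hook allows a dry run before touching real disks.

```shell
# Hypothetical sketch: create a volume group, mark it cluster-aware, and
# activate it in shared mode (HP-UX SLVM). Set RUN=echo for a dry run.
make_shared_vg() {
  vg=$1 disk=$2
  ${RUN:-} pvcreate -f "$disk"      # prepare the physical volume
  ${RUN:-} vgcreate "$vg" "$disk"   # create the volume group
  ${RUN:-} vgchange -c y "$vg"      # mark the VG cluster-aware
  ${RUN:-} vgchange -a s "$vg"      # shared-mode activation (run on each node)
}
```

Example dry run: `RUN=echo make_shared_vg /dev/vg_rac /dev/dsk/c0t1d0`.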


Running RAC on HP-UX

Cluster commands:
cmhaltcl: Stop the cluster.
cmrunnode: Join the node to the cluster.
cmhaltnode: Remove the node from the cluster.
cmviewcl: View the status of the cluster.
cmruncl: Bring up the cluster.
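The commands above can be combined into a small helper for bouncing a single node. The wrapper itself is hypothetical (not part of ServiceGuard), and the RUN=echo hook gives a dry run before acting on a live cluster.

```shell
# Hypothetical helper around the ServiceGuard commands listed above.
# Set RUN=echo for a dry run; leave RUN unset to execute for real.
bounce_node() {
  node=${1:?usage: bounce_node <nodename>}
  ${RUN:-} cmhaltnode -f "$node"   # remove the node from the cluster
  ${RUN:-} cmrunnode "$node"       # join it back
  ${RUN:-} cmviewcl -v             # confirm cluster status
}
```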


Debugging on HP-UX


Do not pick services from NIS.
Ensure that the lock PV is accessible from all the nodes.
Check the required permissions in cmclnodelist.
Check the cluster services.
Ensure that SKGXP is using the same network on all nodes.



Summary

In this lesson, you should have learned about the platform-specific details of RAC on HP-UX.


Tru64 Platform


Objectives

After completing this lesson, you should be able to do the following:
Outline the distinguishing features of RAC on the Tru64 platform
Install, start, and stop RAC on the Tru64 platform
List the Tru64-specific software components


Tru64 RAC Architecture


Memory Channel Interconnect


Native Cluster File System



Shared Disk Systems

LSMs are easy to create and manage (like volume groups and logical volumes).
Distributed Raw Devices (DRD)
Cluster File System (CFS) is layered on top of AdvFS.
Client/server mode: a node can be a client, as well as a server, for different file systems of the cluster.
Node failure is handled by the Device Request Dispatcher, which routes to a different controller for shared disks.


Shared Disk Systems


DRD usage is being replaced by CFS, due to the ease of use of CFS.
Logical Storage Manager (LSM) disks or partitions have an initial offset of 64 KB.


Tru64 Cluster Software

Native cluster file system


Raw devices
Connection Manager
Cluster Application Availability (CAA)
Resource monitoring
Application Restart Capability

Cluster alias
Distributed Lock Manager (DLM)
Expanded process IDs


Tru64 Cluster Software


The native cluster file system includes the /usr and /var file systems.
Both CFS and raw devices can be used.
The cluster alias allows TCP/UDP applications to address the cluster as a single system.
Expanded process IDs have 32-bit values and are unique across a cluster. Each cluster
has a block of numbers that it assigns as PIDs.


Tru64 Port-Specific Code

Node monitor: SKGXN
Interprocess communication: SKGXP
Other platform-specific code in #ifdef A_OSF blocks


Tru64 Port-Specific Code


SKGXN uses the Tru64 Cluster Manager; it is based on the reference implementation, as on Solaris, but instead of using the Oracle CM code for cluster membership, calls are made to the Tru64 DLM and cluster API. The files are archived in the libskgxn9[8].a library.
SKGXP interfaces to the Memory Channel Interconnect.


Node Monitor: SKGXN

The libskgxn9.a library contains the modules skgxn.o and skgxnr.o.
The skgxn0.h source module contains Tru64-specific comments.


Node Monitor: SKGXN


Programs compiled on an earlier version of the operating system work on later versions, even if the libraries have changed.
If the libskgxn9.a library contains skgxns.o, then the RAC option was not installed properly.
The clu_get_info calls get the information about the nodes in the cluster. The code includes cluster_defs.h and the /usr/shlib/libclu.so library from Tru64, version 5.1 and later. The link command line should include -ldlm -lssn -lclu.
The Tru64 DLM library is used for actions such as creating or joining a global
namespace, or finding the condition of a node. Typical calls are: dlm_nsjoin,
dlm_nsleave, dlm_lock, dlm_unlock, dlm_notify, dlm_cvt, and
dlm_get_rsbinfo.
The library used before Oracle9i is libskgxn8.a.


IPC: SKGXP

Tru64 supports two types of IPC:
Low-latency Reliable Data Gram (RDG) implementation in skgxpm. This is the default.
UDP implementation in skgxpu
The TCP implementation (skgxpt) is not supported on Tru64.


IPC: SKGXP
The cluster_interconnects initialization parameter defines which interface is
used.
When set to an IP address, the parameter uses that address and thus disables
processing in the sskgxp module.
When unset, the parameter uses the first available ics0 or mc0 interface (in that
order). ics0 is the name of the memory channel for Tru64, version 5.1 and later.
cluster_interconnects is ignored if the default RDG implementation is used.
The UDP SKGXP code is stored in libskgxpu.a (containing the modules skgxpu.o and sskgxpu.o), which is copied over to libskgxp9.a if the UDP implementation is selected.
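The copy-over described above can be demonstrated in a scratch directory; in real use the archives live in $ORACLE_HOME/lib, a backup of the default is prudent, and the oracle binary must be relinked afterward.

```shell
# Demonstrate the libskgxpu.a -> libskgxp9.a copy-over in a scratch dir.
# Real location: $ORACLE_HOME/lib; a relink of the oracle binary follows.
lib=$(mktemp -d)
echo rdg > "$lib/libskgxp9.a"                   # stand-in for the default archive
echo udp > "$lib/libskgxpu.a"                   # stand-in for the UDP archive
cp "$lib/libskgxp9.a" "$lib/libskgxp9.a.orig"   # keep the RDG default
cp "$lib/libskgxpu.a" "$lib/libskgxp9.a"        # select the UDP IPC
```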


SKGXPM: RDG

Is part of the OS kernel: rdg.mod
Was developed jointly by Compaq and Oracle
Is on memory channel only
Has the same functionality as skgxpu
Is adjustable through subsystem kernel parameters


SKGXPM: RDG
The RDG IPC is one of the most widely tested and proven IPC versions. It is used in the
SAP benchmark for Oracle, release 9.2.
The RDG IPC uses the rdg* kernel calls to create or initialize endpoints. Typical calls
are RdgInit, RdgNodeLookup, RdgEpCreate, RdgEpDestroy,
RdgShutdown, RdgIoCancel, and RdgEpLookup.
The RDG IPC uses the cfg_subsys_query call to find the RDG subsystem information. Link commands should include -lrdg -lcfg.
The RDG subsystem kernel parameters must be set as follows:
max_objs = 5120
msg_size = 32768
max_async_req = 512
rdg_max_auto_msg_wires = 0
rdg_auto_msg_wires = 0
Use sysconfig -q rdg to verify these values (RDG version: RDG V39.24b_BL17_BCGM623Z3).
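A small parser can compare the sysconfig -q rdg output against the required values. The sketch below reads the report from stdin, so it can be fed either live output or a captured file; it assumes lines of the form "name = value", which should be checked against the actual sysconfig output format.

```shell
# Check the RDG subsystem values (read from stdin) against the settings above.
# On a cluster node: sysconfig -q rdg | check_rdg
check_rdg() {
  awk '
    BEGIN {
      want["max_objs"] = 5120; want["msg_size"] = 32768
      want["max_async_req"] = 512
      want["rdg_max_auto_msg_wires"] = 0; want["rdg_auto_msg_wires"] = 0
      bad = 0
    }
    # assumes "name = value" lines; flag any mismatch
    $1 in want && $3 != want[$1] { print $1 " = " $3 " (want " want[$1] ")"; bad = 1 }
    END { exit bad }
  '
}
```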


SKGXPM : RDG (continued)


Setting the environment variable SKGXP_TRACE to 1 enables tracing but can yield far too much data, up to gigabytes in size.
The libskgxpm.a library contains skgxpm.o. It is copied over to libskgxp9.a.
The skgxpm0.h module contains some comments.
The /usr/ccs/lib/librdg.a library is the wrapper for kernel calls from
/usr/opt/TruCluster/sys/rdg.mod.
The /usr/lib/libcfg.a library has the configuration API to query subsystems.


Installing RAC on Tru64

See WebIV note Step-By-Step Installation of RAC on HP/Compaq Tru64 (175480.1).


Installing RAC on Tru64


The OS-specific installation involves:
Checking the hardware
Configuring the cluster, including the shared mounts
Shared mounts are the clusterwide file systems. One more disk is needed for the cluster
quorum disk; this cannot be used for any other purpose.


Debugging on Tru64

Set the SKGXNTRCFLG OS environment variable to TRUE to enable tracing in the SKGXN layer.
The normal SKGXN_TRACE[0-3] routines (skgxn_qry_group, skgxn_print_bitmap, and so on) are available.
Compile with the options -DDEBUG and -DSKGXN_DEBUG for more tracing.


Debugging on Tru64
The value TRUE for SKGXNTRCFLG must be uppercase.
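Setting the flag from the instance owner's shell before startup might look like this; the uppercase value is the only accepted spelling.

```shell
# Enable SKGXN-layer tracing for processes started from this shell.
# The value must be the uppercase string TRUE.
SKGXNTRCFLG=TRUE
export SKGXNTRCFLG
echo "SKGXNTRCFLG=$SKGXNTRCFLG"
```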


Useful Tru64 Commands

ladebug
cfsstat
volprint


Useful Tru64 Commands


Debug
ladebug (similar to dbx but more stable and advanced; also has a good GUI interface): Prints out the line numbers of files when attached to a process
ladebug $ORACLE_HOME/bin/oracle -pid xxxxx
dis <objectfile>: To disassemble code
dis kcl.o
odump -Dl / ldd: For information about shared libraries linked with the executable, section headers, and so on
/usr/local/bin/trace: To trace and log the executable
/usr/local/bin/truss: Same as /usr/local/bin/trace but with better trace output


Useful Tru64 Commands (continued)


CPU and System Information
psrinfo: To retrieve information about the processors and their state
psradm: To take processors offline or bring them online
sizer -v: To find the OS release level
setld -i: To find all the software patches and packages
sysconfig -q <item>: To retrieve information about a subsystem. Items can be, for example, rdg, proc, or ipc.
sysconfig -q ipc
hwmgr: To view or modify any hardware
Cluster Commands
clu_get_info: To view information about the cluster state, cluster ID, and IPC name
cnxshow (older form, from Tru64 4.0): To view information about the cluster state
sysman: For a graphical picture of the cluster
Disk Commands: CFS Related
cfsstat: To view statistics of the CFS subsystem and the internode communication system (ICS) subsystem
cfsmgr: To view the status of CFS or find out who is the server; to get statistics for a particular file system, use cfsmgr -a statistics.
showfdmn -k <domainname>: To view space left in CFS (more accurate than du -k)
mkcdsl: To make a context-dependent symbolic link (available only in clusters and AdvFS)
mkcdsl /usr/testfile
Disk Commands: Others
volprint: To print information about LVs, including their size in 512-byte blocks
showfile: To display the attributes of AdvFS directories and files
advfsstat: To display statistics of AdvFS
asemgr: To maintain DRDs (no command-line interface)


Summary

In this lesson, you should have learned about the platform-specific details of RAC on Tru64.


AIX Platform


Objectives

After completing this lesson, you should be able to do the following:
Outline the distinguishing RAC features on the AIX platform
List the AIX-specific software components


AIX RAC Architecture

The following two cluster configurations are available:
SP clusters
High Availability Cluster Multi-Processing (HACMP) clusters
There is a RAC-supported cluster file system (GPFS) available from IBM.


AIX RAC Architecture


The following two cluster configurations are available:
Shared Nothing cluster
Shared Disk cluster
The Shared Nothing cluster, which is called SP, uses Parallel System Support Programs (PSSP) as the Cluster Manager. Because the Oracle server needs a shared disk, it uses the Virtual Shared Disk (VSD) software to make the disks shared.
The Shared Disk cluster uses HACMP (High Availability Cluster Multi-Processing) as the Cluster Manager.
The SP-series computers are called the P or X series in some versions.


AIX SP Clusters

Highly scalable: up to 128 nodes


High/wide/thin node configurations
Cluster software: PSSP or HACMP
If both PSSP and HACMP are present on the same
machine, Oracle by default uses PSSP. If
PGSD_SUBSYS=grpsvcs, then HACMP is selected.

IPC traffic: High Performance Switch (HPS)


Raw devices: Virtual Shared Disk (VSD), Hashed
Shared Disk (HSD)


AIX SP Clusters
The term Parallel System Support Programs (PSSP) is also used for SP clusters.


AIX HACMP Clusters

Scalability is limited due to Concurrent Logical Volume (CLV) (<= eight nodes).
Nodes: RS/6000 machines
Cluster software: HACMP
IPC traffic: HPS, Ethernet, FDDI
Hard disks must be physically connected to each node.
Raw device: CLV



AIX Cluster Software

AIX allows the operating system kernel to be extended.
The Oracle server makes use of the kext version of the Post/Wait (PW) service:
Provides the facility for generic and IPC PW
Has placeholders for extending the PW service to I/O events and miscellaneous events
Uses the loadext.c facility for loading, unloading, or status check (status_chk) of kernel extensions


AIX Cluster Layer

Commands to check subsystems on AIX:
Group Services:
On PSSP: hags
On HACMP: grpsvcs
Event Management (EM) Services (on HACMP only): emsvcs

SRC commands:

startsrc -s <sname>
stopsrc -s <sname>
lssrc -ls <sname>
lssrc -a
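The subsystem name to pass to the SRC commands follows the rule stated earlier for SP clusters: grpsvcs when PGSD_SUBSYS=grpsvcs (HACMP), otherwise hags (PSSP). The helper below is a hypothetical convenience, not part of AIX.

```shell
# Pick the Group Services subsystem name for lssrc/startsrc/stopsrc.
gs_subsys() {
  if [ "${PGSD_SUBSYS:-}" = grpsvcs ]; then
    echo grpsvcs   # HACMP
  else
    echo hags      # PSSP (the default when both stacks are present)
  fi
}
# On a cluster node: lssrc -ls "$(gs_subsys)"
echo "group services subsystem: $(gs_subsys)"
```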


AIX Port-Specific Code

Object and archive files:
libskgxnr.a contains the NM code.
libha_gs64_r.a has to be linked to the Oracle server.
The implementation uses pthread condition variables to synchronize between the threads.


AIX Port-Specific Code


Before Oracle release 9.2, libskgxnr.a was two object files: skgxn(r).o and sskgxn.o.
libha_gs64_r.a is IBM's client library for Group Services.
On AIX, NM (skgxn) is implemented by Oracle using the grpsvcs API, GSAPI (IBM's group services that run as part of the cluster).
In versions before Oracle release 9.0.1, the files skgxn.o, skgxnr.o, and sskgxn.o were located in $ORACLE_HOME/rdbms/bin. Starting with Oracle release 9.2, these files are archived in libskgxnr.a. When linking, you must use IBM's libha_gs64_r.a to make the group service functions available.
Because the basic services come from the cluster or the OS, you need not perform a preinstallation.


RAC on AIX Stack


[Slide diagram: Node 1 through node n each run an instance (LMON, LCK0) layered over SKGP, SKGXP, SKGXN, and SKGFR; beneath them sit the cluster layer/CM (HAGS Group Services, EM services), the operating system (AIO, VSD/CLV, KEXT, NET), the interconnects Net 1 and Net 2, and the clusterwide disks.]


RAC on AIX Stack


The Oracle instance has the LMON process as the primary communication process. Other
Oracle processes also contain the SKG-routines. For easier understanding, only LMON is
shown in the slide.
EM and HAGS are cluster layer components that implement the vendor-supplied CM.
The Advanced I/O (AIO) component handles the Virtual Shared Disk (VSD) or Concurrent Logical Volume (CLV) storage. This connects the cluster layer to the clusterwide shared disks (connection not shown).
KEXT is the kernel extension.


Node Monitor (NM)

The NM uses the AIX Group Services API (GSAPI).
GSAPI is supported on both HACMP and PSSP platforms.
Logical flow:
The primary member initializes and joins the group, monitors slaves joining the group, and checks the status of the slaves.
Slaves join the group.


Node Monitor (NM)


The flow of the Node Monitor is the same as for other platforms. The list on the
following Notes pages shows the IBM AIX calls that are used in the AIX Group
Services.


Node Monitor (NM) (continued)


NM Flow Logic
Primary Member Primary Thread Logic:
Initializes the connection with Group Services (ha_gs_init) and spawns the GS thread, which waits for responses on the GS socket
Joins the public group, in which it is the sole provider (ha_gs_join)
Publishes its public data (ha_gs_change_state_value)
Subscribes to the RVSD group, if PSSP (ha_gs_subscribe)
Joins the process group, which is joined by ALL primary members mounting the same database (ha_gs_join)
Publishes its private data (ha_gs_send_message)
Spawns the Primary Accept Thread, which is used by slave members to detect a primary member's death
Monitors membership changes (skgxnpstat)
Primary Member GS Thread Logic:
Loops on a select() on the GS socket
Calls ha_gs_dispatch, which calls one of the global callback functions based on the response:
- sskgxn_gs_delayed_error_cb: To process asynchronous error notification
- sskgxn_gs_subscription_cb: To process changes in the subscribed group
- sskgxn_gs_approved_cb: To process any proposal that has been approved in the process group
- sskgxn_gs_announcement_cb
These callbacks in turn call a local callback function based on the current state of SKGXN.
Primary Member Accept Thread Logic:
Creates a UNIX domain socket and loops indefinitely:
Waits on accept()
If a new slave connects, handshakes member information, then adds the connection to the array of slave connections
Checks all slave connections to see if they are alive; if any slave has died, removes its connection from the array
Returns to wait in accept()
Slave Member Logic (Main and Read Threads):
Connects to the primary member socket and handshakes member information
Subscribes to the primary member's process group (ha_gs_subscribe)
Spawns the Slave Read Thread, which blocks on read() on the socket
If read() returns and the error is not EINTR, then exits with an error


Installing RAC on AIX

For information on installing RAC on AIX:
Refer to the WebIV note Step-By-Step Installation of RAC on AIX (199457.1)
Refer to the RAC-Pack public folder:
http://files.oraclecorp.com/content/AllPublic/Workspaces/RAC%20Pack-Public/Technical%20Papers/CookBook%20AIX%20V2_2.pdf


Installing RAC on AIX

Identifying the domain to Group Services:
If PSSP, set:
HA_SYSPAR_NAME=`/usr/lpp/ssp/bin/spget_syspar -n`
If HACMP, set:
HA_DOMAIN_NAME=`/usr/sbin/cluster/utilities/cldomain`
PGSD_SUBSYS=grpsvcs
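The two variants can be wrapped in one helper that emits the assignments for the requested cluster type. The wrapper itself is hypothetical; the paths and variable names are the ones shown above.

```shell
# Emit the Group Services identity settings for a given cluster type.
# On a cluster node: eval "$(gs_env pssp)"  or  eval "$(gs_env hacmp)"
gs_env() {
  case $1 in
    pssp)
      echo 'HA_SYSPAR_NAME=`/usr/lpp/ssp/bin/spget_syspar -n`; export HA_SYSPAR_NAME' ;;
    hacmp)
      echo 'HA_DOMAIN_NAME=`/usr/sbin/cluster/utilities/cldomain`; export HA_DOMAIN_NAME'
      echo 'PGSD_SUBSYS=grpsvcs; export PGSD_SUBSYS' ;;
    *)
      echo "usage: gs_env pssp|hacmp" >&2; return 1 ;;
  esac
}
```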


Debugging on AIX

More named dump routines are available.
Tracing can be done by:
Setting SKGXNTRCFLGS to any nonzero number
Turning skgxn tracing on
Trace macros available:
sskgxn_log(ctx, "", )
sskgxn_trace(ctx, mask, "", )
where ctx is the skgxn context pointer, mask=1, and sskgxn tracing is turned on


Debugging on AIX
There are many more dump routines in addition to the standard X$TRACE/KST and
DIAG. Refer to the source code for a list.


Summary

In this lesson, you should have learned about the platform-specific details of RAC on AIX.


References

Product availability, support, and required IBM patches (incomplete) Web site:
http://ibmhome.us.oracle.com/
Oracle functionality specific to AIX:
http://mercury.us.oracle.com/merc/owa/MCGUI.display_object?objectID=3531
External documentation and notes on AIX, SP, PSSP, and HACMP:
http://www.rs6000.ibm.com/resource/aix_resource/Pubs/


References
Contacts
Oracle on AIX-related: Vijay.Sridharan@oracle.com
System-related issues: File ES1 Ticket
SP Sys Admin: David.Ong@oracle.com
HACMP Sys Admin: John.Tomicich@oracle.com
IBM-specific queries: Dennis Massanari: massanar@us.ibm.com


Other Platforms


Objectives

After completing this lesson, you should be able to do the following:
Outline the distinguishing features of RAC on the Windows, Solaris, and OpenVMS platforms
List the specific software components for these platforms


Objectives
The platforms covered in this lesson are:
Windows
Solaris
OpenVMS


RAC Architecture: Solaris

Cluster is limited to a maximum of four nodes.


RAC Architecture: Solaris


Solaris has a clusterwide file system (GFS). This is not supported by RAC.


RAC Architecture: Windows

Cluster is limited to a maximum of four nodes.

17-451

Copyright 2003, Oracle. All rights reserved.

RAC Architecture: Windows


Microsoft Cluster Server (MSCS) is not required for a RAC implementation. It is required for RAC Guard or RAC high availability.


RAC Architecture: OpenVMS

Native clusterwide file system
Disk cluster configurations can be:
Hardware-shared
Host-shared


Port-Specific Code

The VMS IPC uses the TCP reference implementation.


Installing RAC

For information about installing RAC, refer to:
WebIV note Step-By-Step Installation of RAC on Solaris (175465.1)
WebIV note Step-By-Step Installation of RAC on Windows NT or 2000 (178882.1)
WebIV note Step-By-Step Installation of RAC on OpenVMS (180012.1)


Summary

In this lesson, you should have learned how to:
Outline the distinguishing features of RAC on the Windows, Solaris, and OpenVMS platforms
List the specific software components for these platforms


[Section divider diagram: the RAC software stack on each instance (SQL layer, buffer cache, CGS, GES/GCS, node monitor) over the cluster manager, with Section IV, Debug, highlighted.]


V$ and X$ Views and Events


Objectives

This lesson provides a reference of useful dictionary views and tables.


V$ and GV$ Views

V$ views are instance-specific.
GV$ views retrieve the V$ content from all instance members by using the Parallel Query subsystem.
PARALLEL_MAX_SERVERS must be large enough on all instances.


V$ and GV$ Views


Note: For the purpose of brevity, all views are shown as V$<name>, and it is assumed
that there is also a corresponding GV$<name> (except where noted otherwise).


List of Views
See documentation for column descriptions.
V$ACTIVE_INSTANCES
V$BH
V$CACHE
V$CACHE_LOCK/_TRANSFER
V$CR_BLOCK_SERVER
V$ENQUEUE_LOCK/_STAT
V$FALSE_PING
V$FILE_CACHE_TRANSFER
V$GC_ELEMENT
V$GC_ELEMENTS_WITH_COLLISIONS
V$GCSHVMASTER_INFO
V$GCSPFMASTER_INFO

V$GES_BLOCKING_ENQUEUE
V$GES_CONVERT_LOCAL
V$GES_CONVERT_REMOTE
V$GES_ENQUEUE/_RESOURCE
V$HVMASTER_INFO
V$INSTANCE
V$LIBRARYCACHE
V$LOCK
V$LOCK_ELEMENT/_ACTIVITY
(V$PQ_SESSTAT, V$PX_*)
V$RESOURCE_LIMIT
V$ROWCACHE_PARENT


List of Views
The slide lists the views that are documented in the manuals. Views marked with are
created with the script CATCLUST.SQL. The V$GES_* views are synonyms for
V$DLM_* views and are also created with the script CATCLUST.SQL. Other internal
views are listed in V$FIXED_TABLE and expanded in X$KQFVI/X$KQFVT. Additional
views are:
V$DLM_ALL_LOCKS: Shows every DLM lock in the instance (PCM or not)
V$DLM_CONVERT_LOCAL: See V$GES_CONVERT_LOCAL
V$DLM_CONVERT_REMOTE: See V$GES_CONVERT_REMOTE
V$DLM_LOCKS: Blocked or blocking locks; a subset of V$DLM_ALL_LOCKS
V$DLM_MISC
V$DLM_RESS: See V$GES_RESOURCE
V$DLM_TRAFFIC_CONTROLLER
V$PING
V$FILE_PING
V$TEMP_PING

For columns and meanings, use WebIV folder Server.Internals.General.V$Views.



Old and New Views

Old View -> New View (bigger/better)
V$LOCK_ELEMENT -> V$GC_ELEMENT
V$DLM_CONVERT_LOCAL -> V$GES_CONVERT_LOCAL
V$DLM_CONVERT_REMOTE -> V$GES_CONVERT_REMOTE


Old and New Views


The naming changes from DLM to GRD, non-PCM to GES, and PCM to GCS are
partially reflected in the newer views. The newer views have the proper newer names
and may also have more columns. The older views remain available for backward
compatibility.


V$ Views for Lock Information

V$DLM_ALL_LOCKS: All locks in the DLM


V$DLM_CONVERT_LOCAL: Statistics on local lock
conversions
V$DLM_CONVERT_REMOTE: Statistics on remote
lock conversions
V$DLM_LOCKS: All blocking or blocked locks
V$DLM_MISC: DLM statistics
V$DLM_RESS: All DLM resources
V$RESOURCE_LIMIT: SGA resources


V$ Views for Lock Information


V$DLM_LOCKS is useful in diagnosing RAC hangs because the output is similar to that dumped by lkdebug -O.
V$DLM_RESS has one record for every DLM resource.
V$RESOURCE_LIMIT is useful for determining whether DLM LM_% resources have
been set correctly, by looking at INITIAL, CURRENT, and MAXIMUM.
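A check along those lines could be pasted into SQL*Plus. The helper below just prints the statement; the LIKE filter is an assumption based on the LM_% resource names mentioned above and should be adjusted to the names on your release.

```shell
# Print a V$RESOURCE_LIMIT check ready to paste into SQL*Plus.
# The LIKE 'LM%' filter is an assumption; adjust to your release's names.
resource_limit_sql() {
  cat <<'SQL'
SELECT resource_name, initial_allocation, current_utilization, max_utilization
  FROM v$resource_limit
 WHERE UPPER(resource_name) LIKE 'LM%';
SQL
}
resource_limit_sql
```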


X$ Tables

x$bh
x$kccfe
x$kcfio
x$kclcrst
x$kglst
x$kjbr
x$kjdrhv
x$kjdrpcmhv
x$kjdrpcmpf
x$kjicvt
x$kjirft

x$kqrfp
x$ksimsi
x$ksqeq
x$ksqrs
x$ksqst
x$ksurlmt
x$ksuse
x$ksuxsinst
x$kvit
x$le
x$quiesce


X$ Tables
The X$ tables listed in the slide are the ones used by the V$ views on the previous slide.
WebIV note 208093.1 shows a good relation between V$ views and X$ tables.
WebIV note 22241.1 gives a reasonably complete listing of X$ tables.
An additional useful RAC X$ table is x$kjbrfx.


Events

10704, level 10: ksq (Kernel Service enQueue)
10706, level 10: ksi (Kernel Service Instance Locks)
10254, level 1: Trace Cross-Instance Calls


Events
Triggering events for the DLM:
29700: Enable lock convert statistics
29712-29713: Lock open, convert, cancel, close operations
29714: DLM state object
29715: Reconfiguration
29716: Post wait and AST
29717: GRD or DLM freeze/unfreeze
29718: CGS or DLM CM interface
29720: GES or DLM SCN service
29722: GES or DLM process death


KST and X$TRACE


Objectives

After completing this lesson, you should be able to do the following:
Explain how KST gathers information for X$TRACE
Explain the DIAG architecture



KST: X$TRACE

KST is kernel service tracing.


KST: X$TRACE
Background
The Kernel Service Tracing (KST) facility was an existing component in the VOS layer
that was used by a few components for limited tracing. In Oracle9i, this mechanism has
been reworked to provide simpler yet more powerful interfaces for recording the execution
history of interesting components. This reworked mechanism also provides extensible
interfaces that allow the clients to customize instrumentation to satisfy their tracing needs.
KST output can be examined in the X$TRACE table.


KST Concepts

Kernel service tracing focuses on the execution history of a component.
The SGA provides an in-memory circular buffer to each process:
Each buffer is associated with a unique ID matching the Oracle PID.
Time-ordering of traces is guaranteed.
Trace buffers are released upon process exit.
Trace buffers with the same Oracle PID are reused.


KST Concepts
The KST facility provides a mechanism to log the execution history of a component with
minimum performance impact. This is done by providing an in-memory trace buffer to
each Oracle process, because tracing with an in-memory buffer has less performance
impact than logging traces on disk.
Each Oracle process (whether foreground or background) is assigned its own trace buffer
that is allocated from the SGA. The buffer is accessible by other Oracle processes if any
process dies unexpectedly, increasing the availability of trace information for later
diagnosis.
Circular buffers are used to minimize the memory usage for tracing purposes by removing
stale data. However, users must specify a large enough buffer so that wrapping does not
cause data loss. Note that the faster a process generates tracing data, the larger the buffer
size that must be specified.


KST Concepts (continued)


When a process is created (ksucrp) during instance startup time, a trace buffer is
assigned to this process. Each trace buffer is associated with a unique ID that matches
the Oracle process ID and is never shared among processes. The unique ID guarantees
the trace isolation among processes and the time ordering of tracing within a process.
When a process exits, its trace buffer is still kept in SGA, retaining trace information in
case it is needed for diagnosis of any problem that may occur later. The retention also
reduces the overhead of repeated memory allocation and deallocation of trace buffers, if
processes are created and exited frequently.
Any unused trace buffer is reassigned to a new process whose Oracle PID matches the
assigned buffer ID. If no such buffer exists, it is allocated from SGA.


KST Concepts

Multilevel, event-based tracing:
Supports up to 256 levels
1000 event IDs (10000-10999) available for the RDBMS
256 opcodes to further categorize the traces within an event ID
Always-on minimal tracing
Support for optional trace archiving


KST Concepts (continued)


The KST facility uses event-based tracing with event IDs ranging from 10000 to 10999.
To further control the extent or detail of tracing with the same event ID, 256 levels (0 to
255) can be used.
In addition, KST supports opcode filtering in each trace of the same event ID. This adds a
second dimension to tracing, so that a single event ID is used for a component and each
functionality of the component is categorized by different opcodes. Furthermore, a level
can be used to control details of the tracing that was logged by the facility.
One of the features in the KST facility is the support of always-on minimal tracing.
Trace instrumentation with level 0 is always-on tracing, and all the level-0 traces are
always logged when KST is enabled through the initialization parameter
trace_enabled. Note that the event ID is not required to enable the always-on
feature. However, it can be disabled through the command ALTER TRACING DISABLE
<event-spec> that disables tracing for the specified event at all levels.
The KST facility also provides optional trace archiving to users so that traces in memory
buffer are logged to files during run time when the buffer wraps around. This increases the
amount of data that is available for diagnosis if the size of the allocated buffer is not large
enough to cover tracing for a longer period of execution. This feature is not recommended
for production systems. However, it is very useful for diagnosing problems during
development.
DSI408: Real Application Clusters Internals I-471

Circular Buffer

X$TRACE

SGA
P1
Trace Buffer Process 1
Pn
Trace Buffer Process n

19-472

Copyright 2003, Oracle. All rights reserved.

Circular Buffer
All trace buffers reside in the SGA, and each buffer is assigned to a single Oracle process.
During run time, trace data from each process is logged to its own buffer. Users can query
the contents of the trace buffers and the status of tracing behavior through fixed table
views (X$ tables).
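As a sketch, the per-process buffers can be inspected at run time with a query such as the
following. The column names are taken from the X$TRACE attributes listed later in this
lesson; the PID value is hypothetical.

```sql
-- Trace records logged by Oracle process 13 (hypothetical PID), in logging order
SELECT time, seq#, event, op, data
  FROM x$trace
 WHERE pid = 13
 ORDER BY seq#;
```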

DSI408: Real Application Clusters Internals I-472

Data Structure kstrc

Fixed-size, fixed-format trace records


Metadata section
Trace data section; maximum of 48 bytes

19-473

Trace records populated with KSTRC0, KSTRC1, ..., KSTRC6, and KSTRCX
Formatting callback registered with kstdfcb

Copyright 2003, Oracle. All rights reserved.

Data Structure kstrc


The KST mechanism uses the data structure kstrc of 64 bytes to record a trace. Each
trace record has a fixed format for the header or metadata and trace data.
The metadata contains the time stamp, the sequence number (unique for an instance), the
Oracle process ID, the user session ID, the event ID, and the opcode.
The trace data has a maximum size of 48 bytes. KST supports two types of data for tracing:
Up to six ubig_ora numbers, handled by the KSTRC0 through KSTRC6 macros. The
numeric suffix defines the number of ubig_ora trace values in the argument list.
One specially defined data structure of 48 bytes that fills up the data section of the
record, handled by the KSTRCX macro.
All the macros require the event_id, level, and opcode in addition to the data.
The formatting callback is registered with the kstdfcb function, tying a specific
event_id to a formatting routine. The callback is used only when data is examined in
X$TRACE or written to file. If no callback exists, all trace data of an event_id is output
as six ubig_oras in hexadecimal. A formatting callback should therefore be defined for
any special trace data structure. The formatting code avoids pointer dereferencing, because
invalid pointers would produce illegal values or could crash the trace formatting.
DSI408: Real Application Clusters Internals I-473

Trace Control Interfaces

You can control tracing characteristics with:


Initialization parameters
trace_enabled
Underscore parameters

SQL statements
ALTER TRACING
ALTER SYSTEM SET

19-474

Copyright 2003, Oracle. All rights reserved.

Trace Control Interfaces


Users can specify the controls either through the initialization parameters during instance
startup or SQL statements during run time.
During instance startup, tracing behavior can be configured through initialization
parameters. Only the trace_enabled parameter is visible to customers, enabling or
disabling the tracing mechanism.
Use the ALTER TRACING statement or ALTER SYSTEM SET statement to change the
value of initialization parameters whose scope is dynamic for altering the tracing with
SQL.

DSI408: Real Application Clusters Internals I-474

KST Initialization Parameters


<event-string> = <event-spec>:<level>:<proc-spec>
<event-spec>   = <event>|<event>,<event-spec>
<event>        = ALL|<event-id>|<event-id>-<event-id>
<proc-spec>    = <proc>|<proc>,<proc-spec>
<proc>         = ALL|BGS|FGS|<pid>|<pid>-<pid>|<procname>
<level>        = 0-255

trace_enabled          = {TRUE|FALSE}
_trace_archive         = {TRUE|FALSE}
_trace_events          = <event-string>
_trace_processes       = {<proc-spec>|ALL}
_trace_buffers         = <proc-spec>:<size>
_trace_flush_processes = {<proc-spec>|ALL}
_trace_file_size       = {<integer>|64K}
_trace_options         = {text|binary},{multiple|single}

19-475

Copyright 2003, Oracle. All rights reserved.

KST Initialization Parameters


Initialization parameters that control KST behavior are used during instance startup.
trace_enabled: Turn on/off KST tracing mechanism
_trace_archive: Turn on/off KST trace archiving
_trace_events: Events, level, and processes to be traced
_trace_processes: Which process tracing is enabled
_trace_buffers: Buffer size on per-process basis (default 256:ALL)
_trace_flush_processes: Processes with trace archiving enabled
_trace_file_size: Maximum size for archive/flush trace file
_trace_options: Output in binary or text format, and per-process (multiple) or
per-instance (single) file mode (default: text, multiple)
Note: _trace_events can be specified multiple times within the same block in the
init.ora file. If the parameter is specified again outside the block, its setting is
overwritten by the last entry.
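As a sketch of the grammar above, an init.ora fragment might enable KST as follows.
The event range, level, process specification, and buffer size are illustrative values only,
and the exact quoting conventions may vary by platform and release.

```
trace_enabled          = TRUE
_trace_archive         = FALSE
_trace_events          = "10425-10435:4:BGS"
_trace_buffers         = "ALL:256"
_trace_options         = "text, multiple"
```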

DSI408: Real Application Clusters Internals I-475

KST Initialization Parameters


Parameter                Class     Scope

trace_enabled            Dynamic   Global
_trace_archive           Dynamic   Global
_trace_events            Dynamic   Local
_trace_processes         Static    Local
_trace_buffers           Static    Local
_trace_flush_processes   Dynamic   Local
_trace_file_size         Static    Local
_trace_options           Static    Global

19-476

Copyright 2003, Oracle. All rights reserved.

KST Initialization Parameters (continued)


The class of a parameter defines whether the parameter is static or dynamic. The scope of
a parameter defines the coverage of a parameter in RAC instances. For global parameters,
all instances have the same value of the parameter.
Some parameters can be modified by the ALTER TRACING statement, although they have
the class static.

DSI408: Real Application Clusters Internals I-476

KST Trace Control Interfaces

Use SQL to modify tracing behavior at run time:


ALTER TRACING [ON|OFF]
[ENABLE <event-string>|DISABLE <event-spec>]
[FLUSH <proc-spec>]

19-477

Copyright 2003, Oracle. All rights reserved.

KST Trace Control Interfaces


SQL statements provide users the means to modify the tracing behavior of KST at run time.
ALTER TRACING ON: Enables tracing at run time (trace_enabled must be set to
TRUE for this to take effect)
ALTER TRACING OFF: Disables tracing at run time (regardless of the value of
trace_enabled)
ALTER TRACING ENABLE <event-string>: Enables trace events at run time
ALTER TRACING DISABLE <event-spec>: Disables trace events at run time. This
also disables level-0 tracing for the specified event.
ALTER TRACING FLUSH <proc-spec>: Flushes traces to file immediately. Note
that in the current release (Oracle9i, release 1), flushing is performed in a delayed
mode if multiple-file mode is used.
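A few examples of these statements, using the <event-string> and <proc-spec>
grammar from the earlier slide. The event IDs and levels are illustrative, and the quoting
shown is a sketch rather than authoritative syntax.

```sql
ALTER TRACING ON;
ALTER TRACING ENABLE "10425-10435:4:ALL";  -- DLM events at level 4, all processes
ALTER TRACING DISABLE "10425-10435";       -- also disables level-0 tracing for these events
ALTER TRACING FLUSH "ALL";                 -- flush all trace buffers to file
```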

DSI408: Real Application Clusters Internals I-477

KST Trace Control Interfaces

ALTER SYSTEM SET


trace_enabled
_trace_archive
_trace_flush_processes
_trace_events

Copyright 2003, Oracle. All rights reserved.

KST Trace Control Interfaces (continued)


The SQL command ALTER SYSTEM SET can also be used to alter the trace parameters
that are marked dynamic.
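For example, a dynamic trace parameter might be changed at run time as follows. Note that
underscore parameters are normally enclosed in double quotation marks in ALTER SYSTEM SET;
the event string here is illustrative.

```sql
ALTER SYSTEM SET "_trace_events" = '10401:8:ALL';
ALTER SYSTEM SET trace_enabled = TRUE;
```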

DSI408: Real Application Clusters Internals I-478

KST Fixed Table Views

Dynamic views for tracing characteristics and trace buffers

X$TRACE_EVENTS
EVENT, TRCLEVEL, STATUS, PROCS

X$TRACE
EVENT, OP, TIME, SEQ#, SID, PID, DATA

19-479

Copyright 2003, Oracle. All rights reserved.

KST Fixed Table Views


There are two fixed table views that are related to the KST mechanism. They are used for
online monitoring of tracing characteristics and viewing the contents of the trace buffers in
the SGA.
Attributes for X$TRACE_EVENTS (trace characteristics): EVENT, TRCLEVEL,
STATUS, PROCS
Attributes for X$TRACE (trace buffers in the SGA): EVENT, OP, TIME, SEQ#,
SID, PID, DATA
Note that X$TRACE shows the current trace data in all trace buffers and can be used as an
online tool to view traces and spot problems during run time.
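The views might be queried as follows. This is a sketch: the meaning of the STATUS column
values is an assumption and should be verified against the release in use.

```sql
-- Which events are currently enabled, and for which processes?
-- (non-zero STATUS assumed to mean "enabled")
SELECT event, trclevel, procs
  FROM x$trace_events
 WHERE status != 0;

-- Recent DLM traces (events 10425-10435) across all trace buffers
SELECT time, seq#, pid, event, op, data
  FROM x$trace
 WHERE event BETWEEN 10425 AND 10435
 ORDER BY time;
```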

DSI408: Real Application Clusters Internals I-479

KST Trace Output

Output trace data to file as required
  User request
  Process state dump
  Crash dump in RAC instances
Output in binary or text format
Circular trace files with .trw as extension

Copyright 2003, Oracle. All rights reserved.

KST Trace Output


The trace data that is recorded in memory buffers can be output to files for future reference.
There are three situations in which traces are output to files:
Users can request trace flushing from memory to files either through the initialization
parameter _trace_archive or the SQL statement ALTER TRACING FLUSH.
Trace data is dumped to files along with process state dump when an exception
occurs for any fatal process.
Trace data is dumped to files across all RAC instances when one of the instances
crashes.
The trace data can be written to a file in either binary or text format. If binary format is
used, the data is written in hexadecimal format to the files. If text format is used, all data is
output either as six ubig_oras or in a user-defined format if a formatting callback is
specified for the event ID. Binary format is preferred if performance is a concern during
dumping.
All KST trace files have .trw as their extension to distinguish them from regular process
trace files (which have .trc as their extension). Also, these trace files are circular (similar
to the memory buffers to limit the file size).

DSI408: Real Application Clusters Internals I-480

KST Trace Output

Trace files can be on either a per-process or a per-instance basis.
Name for per-process trace file:
<SID>_<proc_name>_<pid>.trw
Name for per-instance trace file:
trace_<SID>.trw
DIAG writes traces to the per-instance file; otherwise each process outputs its own
traces to files.

19-481

Copyright 2003, Oracle. All rights reserved.

KST Trace Output (continued)


KST trace files are on either a per-process or a per-instance basis. For per-process files,
each process has its own trace output file and the process itself writes its traces from
memory to file. For a per-instance file, there is only a single trace file used for trace output
for all processes of the instance, and DIAG performs trace writing for all processes. In
case of process death, traces of the dead process are dumped to the trace file along with the
process state dump.
Naming convention for per-process trace files:
<SID>_<proc_name>_<pid>.trw

Example: db_lmon_1010.trw for LMON with SID=db


Naming convention for per-instance trace files:
trace_<SID>.trw

Example: trace_db.trw for SID=db


These trace files are created in the directory defined by the initialization parameter
background_dump_dest.

DSI408: Real Application Clusters Internals I-481

KST Trace Output

Trace data can be output in two modes:
  Archiving
  Flushing
KST uses an initialization parameter to enable archiving.
Trace buffers are flushed to the file system with the ALTER TRACING FLUSH statement.

Copyright 2003, Oracle. All rights reserved.

KST Trace Output (continued)


There are two modes of outputting trace data to files:
Archiving mode: Set through _trace_archive. Archiving remains active until it is
turned off.
Flushing mode: Performed when users issue an ALTER TRACING FLUSH statement.
In archiving mode, traces are written to files whenever the number of unarchived traces in
the buffer is half the size of the buffer or the buffer wraps around. However, flushing
occurs only when users issue the SQL statement. Note that flushing is performed in a
delayed mode in Oracle9i, release 1.

DSI408: Real Application Clusters Internals I-482

KST Current Instrumentation

Oracle9i components with trace instrumentation:


DLM layer
IPC layer
Space management layer
Shared servers (MTS)
PQ module
Transaction layer

Level-0 (always-on) tracing is enabled as default.

Copyright 2003, Oracle. All rights reserved.

KST Current Instrumentation


In version 9.0.1, trace instrumentation was done in several kernel components by using the
KST tracing facility.
Event numbers used by various components:
DLM: 10425 to 10435
IPC: 10401
Space management: 10907
Shared servers (MTS): 10249
PQ: 10371
Transaction layer: 10810 to 10812
For RAC production systems, KST tracing is enabled for all events with level 0 as the
default behavior.
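Given these event ranges, the tracing status of the instrumented components can be checked
with a query such as the following (event IDs from the list above; column names as described
earlier in this lesson):

```sql
SELECT event, trclevel, status, procs
  FROM x$trace_events
 WHERE event IN (10401, 10907, 10249, 10371)
    OR event BETWEEN 10425 AND 10435
    OR event BETWEEN 10810 AND 10812;
```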

DSI408: Real Application Clusters Internals I-483

KST Performance


Tracing affects the overall performance.


Customers agree on a 5% to 10% trade-off in
overall performance.
Global tracking uses 58 extra cycles for disabled
events versus 176 extra cycles for enabled events.
About 3% overhead (or less) for enabling tracing
at level 6 or lower.
Overhead is a function of instrumentation.

Copyright 2003, Oracle. All rights reserved.

KST Performance
Tracing definitely affects the overall performance of a system, regardless of any tracing
mechanism or design. The question is: How much performance degradation are users
willing to sacrifice in exchange for enhancing the diagnosability of the system when a
problem occurs?
In general, most customers are willing to accept about 5% to 10% overhead as the trade-off
between diagnosability and system performance.
In version 9.0.1, CPU instruction cycles used by KST tracing were measured. Regardless
of whether a trace event is enabled or not, some extra cycles are used after global tracing is
enabled (trace_enabled is TRUE) because certain cycles are required to perform the
event checking.
When global tracing is enabled, 58 extra cycles are used for event checking of a disabled
event and 176 extra cycles are used for an enabled event.
An average of less than 3% overhead was found when the regression test for RAC was run
with all events enabled at level 6 or less. Note that only a few components use KST
tracing in Oracle9i, release 1. Tracing overhead increases as instrumentation is done in
more RDBMS components.
Note that tracing overhead is a function of instrumentation. The performance may vary in
different releases.
DSI408: Real Application Clusters Internals I-484

KST: Examples

Sample instrumentation
  Sample usage for KSTRC[0-6] in kju.c
  Sample usage for KSTRCX in kjdd.c
  Sample format callback in kji.c (kjdgtfmt)
Sample trace file

Copyright 2003, Oracle. All rights reserved.

KST: Examples
Following are code examples of KSTRC[0-6], KSTRCX, a formatting callback, and
kstdfcb registration. Note that formatting callbacks should be registered in the notifier
function of the component.

DSI408: Real Application Clusters Internals I-485

KST: Examples (continued)


KSTRC[0-6]
/* kjuef - End function, a.k.a. Convert completion funct */
void kjuef(cookie, endstat)
  kjuvoidp cookie;
  kjustat  endstat;
{
  kjatsst *stat;
  text     rbuf[64];
  kjuresn *rn;

  if (cookie)
  {
    stat = (kjatsst *)cookie;
    rn = &stat->resname_kjatsst;
    KJDGTRACEBYTYPE(rn, (ub4)8, KJDGTT_AST, 0, 0,
        ("[AST][kjuef][%s][ast fired]\n",
         (char*)kjqfrn(rbuf, rn)));
    KSTRC3(KJDGTT_LKEVT, KJDGTT_ASTFIRED, 8, KJURN_ID1(rn),
           KJURN_ID2(rn), rn->nam_kjurn[2]);
    stat->ast_fired_kjatsst = TRUE;
    stat->cookie_kjatsst    = cookie;
    stat->endstat_kjatsst   = endstat;
  }
  return;
}

kstdfcb
void kjinfy(nfytype, ctx)
  ub4    nfytype;
  dvoid *ctx;
{
  ...
  else if (nfytype == KSCNOPCR)
  {
    ...
    /* Register KST trace format callback */
    kstdfcb(KJDGTT_LKEVT, (KSTFPTR)kjdgtfmt);
    /* Register KST trace format callback for kjdd layer */
    kstdfcb(KJDGTT_DD, (KSTFPTR)kjddtfmt);
    /* Register KST trace format callback for IPC layer */
    kstdfcb(KJDGTT_IPC, (KSTFPTR)kjdgfmtipc);
    /* Register KST trace format callback for TRFC layer */
    kstdfcb(KJDGTT_TRFC, (KSTFPTR)kjdgfmttrfc);
  }
  ...
}

DSI408: Real Application Clusters Internals I-486

KST: Examples (continued)


KSTRCX
/*
** Validate the deadlock by traversing the clusterwide deadlock graph
*/
STATICF word
kjddvald(bp)
  kjddb *bp;
{
  kjl *lockp;
  kjr *resp;
  ub4  lkver;
  word level = ksepec(OER(KJDGTT_DD));
  kjsolk *sghead = &(kjiudb->dd_stat_kjga.sgh_kjddstat);
  kjsolk *pqhead = &(kjiudb->dd_stat_kjga.prq_kjddstat);
  /* node originating the deadlock search */
  ub2 origin = bp->req_kjddb.dd_master_node_kjxmddi;
  /* node responsible for printing the graph to the trace file */
  ub2 prnode = KJGA_FDTONODE(0); /* the lowest node */
  boolean  dd_invalid = FALSE;
  kjftnid  lk_node = kjiudb->node_id_kjga;
  kjddsg  *pp = KJSOSTRUC(kjsolfs(sghead), kjddsg, link_kjddsg);
  kjddsg  *pp2; /* to check for duplicate locks in the wait for graph */
  kjsolk  *qp;
  boolean  dd_victim = FALSE;
  kjddtrc  trcctx;

  /* Prepare the KST trace record */
  CLRSTRUCT(trcctx);
  trcctx.ddtyp_kjddtrc = ((kjiudb->dd_stat_kjga.txs_kjddstat) ? 1:0);
  KJDEF_SETQUAD(trcctx.time_kjddtrc, kjiudb->dd_stat_kjga.t_kjddstat);
  trcctx.snode_kjddtrc = kjiudb->node_id_kjga;
  /* Log a trace record */
  KSTRCX(KJDGTT_DD, KJDD_DDFND, 5, (void *)&trcctx);
  ...
}

DSI408: Real Application Clusters Internals I-487

KST: Examples (continued)


Formatting Callback
/*
** NAME
**   kjdgtfmt - LK Trace format callback
**
** DESCRIPTION
**   A format callback function for KST trace data
*/
void kjdgtfmt(action, op, data, buf, len)
  uword  action;
  ub1    op;
  dvoid *data;
  char  *buf;
  ub4    len;
{
  ubig_ora *darray = (ubig_ora *)data;
  switch(op)
  {
  case KJDGTT_ASTFIRED:
  {
    text    buf1[64];
    kjuresn rn;
    KJDG_SET_RESN(&rn, darray[0], darray[1], darray[2]);
    (void) sprintf(buf, "kjuef: %s", (char*)kjqfrn(buf1, &rn));
    break;
  }
  case KJDGTT_SYNCCVT:
  {
    text    buf1[64];
    kjuresn rn;
    KJDG_SET_RESN(&rn, darray[0], darray[1], darray[2]);
    (void) sprintf(buf, "kjuscv: %s[lockp " KPPTPTRFMT "][level %d]",
                   (char*)kjqfrn(buf1, &rn), KPPTPTRWRP(darray[3]),
                   (word)darray[4]);
    break;
  }
  ...
}

DSI408: Real Application Clusters Internals I-488

KST Sample Trace File

1
1020304 1 2048 384 32 1
Oracle9i Enterprise Edition Release 9.0.1.0.0 - Production
With the Partitioning and Real Application Clusters options
JServer Release 9.0.2.0.0 - Beta
ORACLE_HOME = /ade/ilam_rdbms_lrg/oracle
System name:    SunOS
Node name:      dlsun1932
Release:        5.6
Version:        Generic_105181-14
Machine:        sun4u
Instance name: lrg
Oracle process number: 13
Unix process pid: 20723, image: oracle@dlsun1932 (TNS V1-V3)
8392EACE:0000000E 5 0 10280  1 0x00000005
83BEBE73:0000000F 5 0 10401 28 KSXPUNMAP: client 1
83BEBE97:00000010 5 0 10401 27 KSXPMAP: client 1 base 0x80048000 size 0x37b8000
83BED062:00000011 5 4 10429  7 MB SO Al: Allocated MBSO 82b5eac4
83BED107:00000012 5 4 10427 10 Init ctx: Initialize ksxp for 1 ports
83BED2C2:00000013 5 4 10401 14 KSXPTIDCRE: tid(1,1,0x83bed2b6)
83D32C1A:00000042 5 4 10429  2 AllocBuf: buf 824bf624, pool 800084b0, size 2080, out(i) 1, out(s) 0
83D32C30:00000043 5 4 10429  2 AllocBuf: buf 824bfe44, pool 800084b0, size 2080, out(i) 2, out(s) 0
83D32C32:00000044 5 4 10429  2 AllocBuf: buf 824c0664, pool 800084b0, size 2080, out(i) 3, out(s) 0
83D32C33:00000045 5 4 10429  2 AllocBuf: buf 824c0e84, pool 800084b0, size 2080, out(i) 4, out(s) 0
83D32C34:00000046 5 4 10429  2 AllocBuf: buf 824c16a4, pool 800084b0, size 2080, out(i) 5, out(s) 0
83D32C46:00000047 5 4 10429  2 AllocBuf: buf 824c1ec4, pool 800084b0, size 2080, out(i) 6, out(s) 0
83D32C47:00000048 5 4 10429  2 AllocBuf: buf 824c26e4, pool 800084b0, size 2080, out(i) 7, out(s) 0
Copyright 2003, Oracle. All rights reserved.

KST Sample Trace File


The very first line of the trace file contains the metadata about trace information:
  Binary or text, indicated by 0 or 1, respectively
  File magic number (4 bytes)
  Version number of trace file (4 bytes)
  File block size (4 bytes)
  Data record size (4 bytes)
  Wrapping (4 bytes)
This is followed by the general information about the tracing process and the machine in
the standard trace file header.
The actual trace data is in the following format:
time stamp, sequence #, process id, level, event #, opcode, data

DSI408: Real Application Clusters Internals I-489

KST Demonstration

Trace control manipulation

19-490

Copyright 2003, Oracle. All rights reserved.

KST Demonstration
Demonstration of the user interfaces for modifying the tracing behavior of the KST
mechanism:
  Initialization parameters
  ALTER TRACING
  ALTER SYSTEM SET
  X$TRACE and X$TRACE_EVENTS
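A demonstration script along these lines might look as follows. This is a sketch that
combines the interfaces covered in this lesson; the event ID, level, and quoting are
illustrative.

```sql
-- Requires trace_enabled = TRUE at instance startup
ALTER TRACING ON;
ALTER TRACING ENABLE "10425:6:ALL";

-- Observe the tracing characteristics and the trace buffers
SELECT event, trclevel, status, procs FROM x$trace_events WHERE event = 10425;
SELECT time, seq#, pid, op, data FROM x$trace WHERE event = 10425;

-- Flush to .trw files, then disable
ALTER TRACING FLUSH "ALL";
ALTER TRACING DISABLE "10425";
ALTER TRACING OFF;
```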

DSI408: Real Application Clusters Internals I-490

DIAG Daemon

[Diagram: two RAC instances. In each instance, processes log traces to per-process trace
buffers in the SGA, and a DIAG daemon reads those buffers. The DIAG daemons of the two
instances communicate with each other.]

Copyright 2003, Oracle. All rights reserved.

DIAG Daemon
The diagram in the slide shows the architecture of the DIAG daemon in a RAC
environment.
Note that there is a difference between DIAGs in RAC and those in a single instance,
although both processes have the same name:
DIAG in a single instance is responsible for trace archiving and flushing only.
DIAG in a RAC instance provides other diagnosability services, in addition to trace
archiving and flushing.

DSI408: Real Application Clusters Internals I-491

DIAG Daemon: Features

DIAG Daemon:
Is an integrated service for all the diagnosability
needs of an instance
Provides a scalable framework for RAC
diagnosability
Works independently from an instance
Relies only on services provided by underlying OS
Is a lightweight daemon process, one per instance

19-492

Copyright 2003, Oracle. All rights reserved.

DIAG Daemon: Features


The design goal of the DIAG process is to be an integrated service for all the
diagnosability needs of a RAC instance. Although several debugging and diagnostic tools
existed in versions before Oracle9i, they did not provide a single interface for a cluster
environment and were not cluster-ready, which made diagnosis across multiple instances
difficult.
The DIAG process is designed to meet the following requirements:
The framework scales in cluster environments in which the number of nodes or
instances vary, accommodates variation, and works seamlessly without interrupting
any service provided.
The framework does not interfere with or affect the normal operation of the system.
In any condition, the framework should not adversely affect the performance of a
system regardless of the state of the system. Therefore, the DIAG daemon does not
use any service or resource from the RDBMS kernel. Optimally, the DIAG process
uses the services that are provided by the underlying OS.
DIAG is a lightweight daemon process that does not affect overall system performance,
although it is integrated with the RDBMS kernel for startup and shutdown and for its need
to access the SGA for trace buffers.

DSI408: Real Application Clusters Internals I-492

DIAG Daemon: Features

DIAG Daemon:
Is highly available and is tolerant of common
failures
Monitors the health of a local RAC instance
Coordinates the collection of diagnosability data
from all the nodes in a RAC server
Services clusterized ORADEBUG
Provides an extensible interface for future projects

19-493

Copyright 2003, Oracle. All rights reserved.

DIAG Daemon: Features (continued)


The DIAG process is resilient to failures, as its goal is to diagnose errors, problems, or
failures that have occurred in the system. DIAG does not share any resource with other
Oracle processes and has no dependency on the RDBMS kernel (except the VOS layer for
bare OS services), which minimizes the possibility of a crash due to other processes. PMON
restarts a new DIAG process to continue its service if the DIAG process dies.
Another feature of the DIAG daemon is to monitor the health of the local RAC instance.
On failure of an essential process, DIAG can capture the system state and other useful
information for later diagnosis, and notify DIAG on the other instances to capture similar
information. This provides a snapshot view of the entire cluster environment. In addition,
the DIAG process serves as the base framework to execute clusterized oradebug
commands in RAC instances.
Improvements and new diagnostic projects will be adopted into the DIAG infrastructure in
future versions. An example of a planned extension is hang management for a
high-availability (HA) configuration; DIAG will be responsible for monitoring the
liveliness of operations of the local RAC instance and performing any necessary recovery
if an operational hang is detected.

DSI408: Real Application Clusters Internals I-493

DIAG Daemon: Design

DIAG process group:


Is analogous to Cluster Group Service (CGS)
group
Provides peer-to-peer communication among
DIAGs
Identifies a master DIAG for synchronization and
coordination
Reconfigures for membership change
Rolls back partial operations when reconfiguring

19-494

Copyright 2003, Oracle. All rights reserved.

DIAG Daemon: Design


The DIAG process group is analogous to the CGS group but is independently registered
with the same cluster node monitor for cluster services. The DIAG process group provides
an abstraction of group services to the registered DIAG processes on different nodes.
These services include communication, synchronization, and coordination among
members on different instances.
There is a single DIAG process group for each database cluster, and only the DIAG
process can register as a member in each instance. In the group, the node with the lowest
node ID (as defined by the node monitor) is elected to be the master of the group. Master
DIAG is responsible for all synchronization and coordination among members. For
example, all multicast messages are first sent to the master DIAG, which then forwards
them to all destination nodes to guarantee global message ordering.
Reconfiguration occurs in the DIAG process group when there is a membership change. If
any member joins or leaves the process group, then all existing members synchronize their
local membership information. This synchronization is also coordinated by the master
DIAG.
In the case of a DIAG group reconfiguration, all ongoing tasks are aborted and rolled back
to a previous consistent state. All tasks are then resubmitted as a new request.
DSI408: Real Application Clusters Internals I-494

DIAG Daemon: Design

Orthogonal to instance:
Does not use latches or locks
Does not use shared resources from the database
kernel
Does not affect the instance and is not affected by
the instance
Does not share the communication channel with
other processes

19-495

Copyright 2003, Oracle. All rights reserved.

DIAG Daemon: Design (continued)


Orthogonality is another key feature in the design of the DIAG daemon. All services
provided by DIAG do not interfere with or allow interference from any operations
performed by other Oracle processes. This creates a protected or isolated domain in the
DIAG process for diagnosability.
To be orthogonal to the instance, the DIAG process does not use any shared resource from
the RDBMS kernel, such as a latch or lock. Also, the implementation does not have any
dependency on RDBMS components, except the VOS layer, which provides an abstraction of
basic OS functionality: the fundamental building block for the DIAG process.
To prevent any interference with the RDBMS kernel, the DIAG daemon creates its own
communication model to isolate itself from potential issues in the shared model of
communication used by other Oracle processes. The DIAG process of each instance owns
its own IPC port for messaging and has a different implementation of message protocol
instead of sharing the common IPC channel provided by CGS. This design provides an
alternative communication channel in case of a problem occurring in CGS because of
database operations.

DSI408: Real Application Clusters Internals I-495

DIAG Daemon: Design

Communication model:
Based on the IPC service from the OSD layer
Owns unique IPC port and message protocol
Supports multicast messaging
Supports memory-mapped copy for large data
transfer

19-496

Copyright 2003, Oracle. All rights reserved.

DIAG Daemon: Design (continued)


Characteristics of the communication model in the DIAG daemon are:
It is based on the preliminary IPC service from the OSD layer to eliminate any
potential problem or contention with the RDBMS kernel.
It has separate communication channels (IPC port and memory-mapped region
privately defined by the DIAG process, instead of those used by the cache fusion
layer) based on the OSD IPC service.
The DIAG process has its own message protocol (flow control and message
semantics) on multicasting and memory-mapped copying.

DSI408: Real Application Clusters Internals I-496

DIAG Daemon: Design

Master DIAG:
Coordinates message ordering
Coordinates DIAG group reconfiguration
Synchronizes all DIAG group communications

19-497

Copyright 2003, Oracle. All rights reserved.

DIAG Daemon: Design (continued)


The master DIAG is located at the node with the lowest node ID defined in the node
monitor of clusterware. Its responsibilities include task synchronization, guarantee of
message ordering for multicasting among DIAGs at different nodes, and performing group
reconfiguration in case membership changes in the DIAG process group.
If the master DIAG leaves or dies, the DIAG process with the next-lowest node ID
becomes the new master. Here it is assumed that the node monitor provides a consistent
view of membership in the DIAG process group among all nodes.
All group-related communications are synchronized through the master DIAG. For
example, a multicast message must be first sent to the master DIAG, which then forwards
the message to the destination DIAGs. When DIAGs receive the message and finish
processing the message, they send an acknowledgment to the message sender. When the
originating DIAG receives acknowledgments from all receivers, it then sends a complete
message to the master DIAG so that the next multicast message can be forwarded from the
master DIAG. Through this protocol, the message ordering can be guaranteed and
synchronization can be achieved. Also, memory-mapped copying can happen only after a
DIAG receives a multicast message and before it sends the acknowledgment back to the
message sender. This is required because no semantics of synchronization (overhead for
this infrequent operation) are enforced for memory-mapped copying among the DIAG
processes.
DSI408: Real Application Clusters Internals I-497

DIAG Daemon: Startup and Shutdown

Instance startup brings up DIAG.
  Second process (after PMON) to start
Instance shutdown terminates DIAG.
Failure resilience
  Restarted by PMON in case of failure

19-498

Copyright 2003, Oracle. All rights reserved.

DIAG Daemon: Startup and Shutdown


Although DIAG works independently from the instance, it is integrated with the RDBMS
kernel so that it can access the SGA for diagnosability purposes. DIAG is the second
process to be brought up during an instance startup. Being the second process to start up, it
can ensure that diagnosability service is available as soon as possible for any potential
startup problem.
DIAG terminates gracefully during normal shutdown of a RAC instance. Ordering is not
important during normal shutdown.
The DIAG process is resilient to failure. Upon discovery of its death, PMON starts a new
DIAG process, enhancing the availability of the diagnosability framework in a RAC
database. Note that DIAG is a nonfatal process for the instance so that its termination, for
any reason, does not affect any operation of the instance.

DSI408: Real Application Clusters Internals I-498

DIAG Daemon: Crash Dumping

Performs a crash dump (clusterwide) by DIAGs upon detecting the death of an essential
Oracle process (FG or BG)
Survives RAC instance crashes:
  Penultimate process to terminate
  Five seconds (adjustable) allowed to dump traces
Flushes KST data to files on demand in RAC

Copyright 2003, Oracle. All rights reserved.

DIAG Daemon: Crash Dumping


Crash dump is one of the most important features of the DIAG daemon. DIAG dumps
KST traces to file and notifies the remote DIAGs after it discovers the death of an essential
Oracle process in the local instance. During the instance cleanup procedure, it is the
penultimate process to be terminated because it needs to perform trace flushing to the file
system. By default, the terminating process, usually PMON, gives five seconds to DIAG
for dumping.
The allowed time to dump traces on shutdown is controlled by the
_ksu_diag_kill_time parameter.
DIAG flushes KST trace data to files on demand with the ALTER TRACING FLUSH
statement. DIAG performs the flushing when per-instance (single) file mode is used.

DSI408: Real Application Clusters Internals I-499

DIAG Daemon: Crash Dumping

Coordinates the dumping of trace buffers on all nodes:
  Notifies peer DIAGs to dump traces
  Instance freeze is not required; the interest is the execution history captured in
  buffers within a time interval that includes the crash moment.
cdmp_<timestamp> is the directory for dumping traces during a crash.

Copyright 2003, Oracle. All rights reserved.

DIAG Daemon: Crash Dumping (continued)


During an instance crash, DIAG sends out a dump message to peer DIAGs in the cluster
and then dumps traces to file.
When a DIAG process receives a dump message, it dumps the local trace data to the file
system so that a snapshot of the entire cluster can be obtained for diagnosis later.
Instance freezing is not required to obtain the snapshot of traces across all instances. The
reason is that all traces with execution history required for diagnosis are already stored in
the memory buffer and are dumped to the file after the DIAG process receives the crash
notification. Traces for the moment of crash are likely to be in the history.
A dump directory named cdmp_<timestamp> is created in the
background_dump_dest location, and all trace dump files are placed in this directory.
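To locate the clusterwide crash dump after the fact, you first need the background_dump_dest setting of each instance; the cdmp_<timestamp> directories are created beneath it. A minimal sketch:

```sql
-- Illustrative only: find where cdmp_<timestamp> directories are created,
-- then list them at the operating system level on each node.
SELECT value
FROM   v$parameter
WHERE  name = 'background_dump_dest';
```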


Summary

In this lesson, you should have learned about:
- KST and X$TRACE
- DIAG architecture

ORADEBUG and Other Debugging Tools

Objectives

After completing this lesson, you should be able to use ORADEBUG for flash freeze,
tracing, and hang analysis.

ORADEBUG

- ORADEBUG is RAC-aware.
- Commands can execute in one or several instances:
  - SETINST to list instances to affect
  - -G or -R to debug in parallel

SQL> ORADEBUG SETINST "ALL"
SQL> ORADEBUG -G "1 2" LKDEBUG -A LOCK

ORADEBUG
You can use the -G or -R options to execute ORADEBUG across instances.
-G means the debugging data and results are written to the trace file of the
executing DIAG daemon at each participating instance.
-R means the same data is returned to the initiating DIAG daemon, which then
writes it to its trace file.
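A hypothetical session illustrating the difference; SYSTEMSTATE is used here only as an example of a named dump, and the instance list is illustrative:

```sql
SQL> ORADEBUG SETINST "ALL"                 -- target every instance
SQL> ORADEBUG -G DEF DUMP SYSTEMSTATE 10    -- output in each instance's DIAG trace file
SQL> ORADEBUG -R DEF DUMP SYSTEMSTATE 10    -- output returned to the local DIAG trace file
```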


ORADEBUG: List of Commands

SQL> ORADEBUG HELP
SETMYPID                                    Debug current process
SETOSPID      <ospid>                       Set OS pid of process to debug
SETORAPID     <orapid> ['force']            Set Oracle pid of process to debug
DUMP          <dump_name> <lvl> [addr]      Invoke named dump
DUMPSGA       [bytes]                       Dump fixed SGA
DUMPLIST                                    Print a list of available dumps
EVENT         <text>                        Set trace event in process
SESSION_EVENT <text>                        Set trace event in session
DUMPVAR       <p|s|uga> <name> [lev]        Print/dump fixed PGA/SGA/UGA variable
SETVAR        <p|s|uga> <name> <value>      Modify a fixed PGA/SGA/UGA variable
PEEK          <addr> <len> [level]          Print/Dump memory
POKE          <addr> <len> <value>          Modify memory
WAKEUP        <orapid>                      Wake up Oracle process
SUSPEND                                     Suspend execution
RESUME                                      Resume execution
FLUSH                                       Flush pending writes to trace file
CLOSE_TRACE                                 Close trace file
TRACEFILE_NAME                              Get name of trace file
LKDEBUG                                     Invoke global enqueue service debug
NSDBX                                       Invoke CGS name-service debug
-G            <Inst-List|def|all>           Parallel oradebug commands prefix
-R            <Inst-List|def|all>           Parallel oradebug prefix (return output)
SETINST       <instance# .. | all>          Set instance list in double quotes
SGATOFILE     <SGA dump dir>                Dump SGA to file; dirname in "-quotes
DMPCOWSGA     <SGA dump dir>                Dump&map SGA as COW; dir in "-quotes
MAPCOWSGA     <SGA dump dir>                Map SGA as COW; dirname in "-quotes
HANGANALYZE   [level]                       Analyze system hang
FFBEGIN                                     Flash Freeze the Instance
FFDEREGISTER                                FF deregister instance from cluster
FFTERMINST                                  Call exit and terminate instance
FFRESUMEINST                                Resume the flash frozen instance
FFSTATUS                                    Flash freeze status of instance
SKDSTTPCS     <ifname> <ofname>             Helps translate PCs to names
WATCH         <address> <len> <self|exist|all|target>  Watch a region of memory
DELETE        <local|global|target> watchpoint <id>    Delete a watchpoint
SHOW          <local|global|target> watchpoints        Show watchpoints
CORE                                        Dump core without crashing process
IPC                                         Dump ipc information
UNLIMIT                                     Unlimit the size of the trace file
PROCSTAT                                    Dump process statistics
CALL          <func> [arg1] ... [argn]      Invoke function with arguments


Flash Freeze

Use ORADEBUG commands to stop the activity in instances in order to examine
SGA content.
- ffbegin: Freezes an instance
- ffderegister: Deregisters an instance from the cluster
- ffterminst: Exits and terminates the instance
- ffresumeinst: Resumes normal running on a frozen instance
- ffstatus: Checks the status of the instance (frozen or not)

Flash Freeze
Flash freeze permits the freezing of an entire instance. This permits taking any of the
normal dumps via ORADEBUG while the instance state is not changing underneath. Other
instances may time out or hang as a result of freezing one instance. Output for flash
freeze commands (including ffstatus) is written to the alert log. When ffbegin is
issued, each process notification is put in the alert log, as is the response from each
process. Likewise, messages appear in the alert log for ffresumeinst.
Use the SETINST command to specify which instances to freeze; the default is the local
instance only.
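A hypothetical freeze/dump/resume session using the commands above; the instance list is illustrative, and the per-process freeze and resume messages appear in the alert log of each participating instance:

```sql
SQL> ORADEBUG SETINST "1 2"    -- default would be the local instance only
SQL> ORADEBUG FFBEGIN          -- freeze the listed instances
SQL> ORADEBUG FFSTATUS         -- confirm the frozen state (see alert log)
SQL> ORADEBUG DUMPSGA          -- take any normal dumps while frozen
SQL> ORADEBUG FFRESUMEINST     -- resume normal running
```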


LKDEBUG

Global Enqueue Service debugger (lock debug):
- Invoked with ORADEBUG LKDEBUG <items>
- ORADEBUG LKDEBUG HELP lists the available commands

LKDEBUG
Output is written to the trace file (except for the help list).
SQL> oradebug lkdebug help
Usage: lkdebug [options]
  -l [r|p] <enqueue pointer>    Enqueue Object
  -r <resource pointer>         Resource Object
  -b <gcs shadow pointer>       GCS shadow Object
  -p <process id>               client pid
  -P <process pointer>          Process Object
  -O <i1> <i2> <types>          Oracle Format resname
  -a <res/lock/proc/pres>       all <res/lock/proc/pres> pointers
  -a <res> [<type>]             all <res> pointers by an optional type
  -a convlock                   all converting enqueue (pointers)
  -a convres                    all res ptr with converting enqueues
  -a name                       list all resource names
  -a hashcount                  list all resource hash bucket counts
  -t                            Traffic controller info
  -s                            summary of all enqueue types
  -k                            GES SGA summary info
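An illustrative sequence built from the options in the help listing above; all output lands in the current trace file, whose path TRACEFILE_NAME reports:

```sql
SQL> ORADEBUG SETMYPID
SQL> ORADEBUG LKDEBUG -s           -- summary of all enqueue types
SQL> ORADEBUG LKDEBUG -a convres   -- resources with converting enqueues
SQL> ORADEBUG TRACEFILE_NAME       -- locate the trace file to read
```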


NSDBX

CGS Name Service debugger:
- Invoked with ORADEBUG NSDBX <items>
- ORADEBUG NSDBX HELP lists the available commands

NSDBX
Output is written to the trace file (except for the help command).
SQL> oradebug nsdbx help
Usage: nsdbx [options]
  -h                                           Help
  -p <owner> <namespace> <key> <val> <nowait>  Publish a name-entry
  -d <owner> <namespace> <key> <nowait>        Delete a name-entry
  -q <namespace> <key>                         Query a namespace
  -an <namespace>                              Print all entries in namespace
  -ae                                          Print all entries
  -as                                          Print all namespaces
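An illustrative sequence built from the options in the help listing above; output goes to the current trace file:

```sql
SQL> ORADEBUG SETMYPID
SQL> ORADEBUG NSDBX -as    -- print all namespaces
SQL> ORADEBUG NSDBX -ae    -- print all name-entries
```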


HANGANALYZE

- Attempts to search through the state objects and dump a hang tree for a hung
  instance or cluster.
- Invoked with ORADEBUG HANGANALYZE <level>
  - For a first pass, use level 3.
- Use SETINST to perform the analysis across multiple instances.

HANGANALYZE
This is similar in intent to what is performed manually through system states.
The level is between 1 and 10. Level 3 is good for a first pass.

Level  Description
1,2    Only HANGANALYZE output, no process dump at all
3      Level 2 + Dump only processes thought to be in a hang (IN_HANG state)
4      Level 3 + Dump leaf nodes (blockers) in wait chains (LEAF, LEAF_NW, IGN_DMP state)
5      Level 4 + Dump all processes involved in wait chains (NLEAF state)
10     Dump all processes (IGN state)

Remember to use SETINST to make it a clusterwide hang analysis.
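A hypothetical clusterwide first pass, combining SETINST with the parallel -G prefix described earlier:

```sql
SQL> ORADEBUG SETINST "ALL"
SQL> ORADEBUG -G DEF HANGANALYZE 3
-- Escalate to a higher level (for example, 5) only if the level-3 hang
-- tree is inconclusive; each instance's DIAG writes its own trace file.
```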


Summary

In this lesson, you should have learned about the following ORADEBUG commands:
- FLASHFREEZE
- HANGANALYZE
- LKDEBUG
- NSDBX


References

ORADEBUG usage notes in WEBIV:
- 149691.1 FlashFreeze
- 175006.1 HANGANALYZE
- 70032.1 ORADEBUG on Windows
- 154670.1 Debug Events for 9iRAC GES and GCS

References
See Note 178683.1 Tracing GSD, SRVCTL, GSDCTL, and SVRCONFIG for details
about tracing on the RAC utilities.

