
DSI408: Real Application Clusters Internals
Electronic Presentation

D16333GC10
Production 1.0
April 2003
D37990

Authors
Xuan Cong-Bui
John P. McHugh
Michael Müller

Technical Contributors and Reviewers
Michael Cebulla
Lex de Haan
Bill Kehoe
Frank Kobylanski
Roderick Manalac
Sundar Matpadi
Sri Subramaniam
Harald van Breederode
Jim Womack

Publisher
Glenn Austin

Copyright © 2003, Oracle. All rights reserved.

This documentation contains proprietary information of Oracle Corporation. It is provided under a license agreement containing restrictions on use and disclosure and is also protected by copyright law. Reverse engineering of the software is prohibited. If this documentation is delivered to a U.S. Government Agency of the Department of Defense, then it is delivered with Restricted Rights and the following legend is applicable:

Restricted Rights Legend

Use, duplication or disclosure by the Government is subject to restrictions for commercial computer software and shall be deemed to be Restricted Rights software under Federal law, as set forth in subparagraph (c)(1)(ii) of DFARS 252.227-7013, Rights in Technical Data and Computer Software (October 1988).

This material or any portion of it may not be copied in any form or by any means without the express prior written permission of the Education Products group of Oracle Corporation. Any other copying is a violation of copyright law and may result in civil and/or criminal penalties.

If this documentation is delivered to a U.S. Government Agency not within the Department of Defense, then it is delivered with Restricted Rights, as defined in FAR 52.227-14, Rights in Data-General, including Alternate III (June 1987).

The information in this document is subject to change without notice. If you find any problems in the documentation, please report them in writing to Worldwide Education Services, Oracle Corporation, 500 Oracle Parkway, Box SB-6, Redwood Shores, CA 94065. Oracle Corporation does not warrant that this document is error-free.

Oracle and all references to Oracle Products are trademarks or registered trademarks of Oracle Corporation.

All other products or company names are used for identification purposes only, and may be trademarks of their respective owners.

DSI408: Real Application Clusters Internals
Volume 1 - Student Guide

D16333GC10
Edition 1.0
April 2003
D37988


DSI408: Real Application Clusters Internals
Volume 2 - Student Guide

D16333GC10
Edition 1.0
April 2003
D37989


Contents
Preface
I  Course Overview: DSI 408: RAC Internals


Prerequisites I-2
Course Overview I-3
Practical Exercises I-5

Section I: Introduction
1  Introduction to RAC
Objectives 1-2
Why Use Parallel Processing? 1-3
Scaleup and Speedup 1-5
Scalability Considerations 1-7
RAC Costs: Synchronization 1-9
RAC Costs: Global Resource Directory 1-10
RAC Costs: Cache Coherency 1-12
RAC Terminology 1-14
Terminology Translations 1-16
Programmer Terminology 1-18
History 1-19
History Overview 1-20
Internalizing Components 1-21
Oracle7 1-22
Oracle8 1-23
Oracle8i 1-24
Oracle9i 1-25
Summary 1-26

2  Introduction to RAC Internals


Objectives 2-2
Simple RAC Diagram 2-3
One RAC Instance 2-4
Internal RAC Instance 2-5
Oracle Code Stack 2-6
RAC Component List 2-7
Module Relation View 2-8
Alternate Module Relation View 2-9
Module, Code Stack, Process 2-10
Operating System Dependencies (OSD) 2-11
Platform-Specific RAC 2-12
OSD Module: Example 2-13
Summary 2-15
References 2-16


Section II: Architecture


3  Cluster Layer: Cluster Monitor


Objectives 3-2
RAC and Cluster Software 3-3
Generic CM Functionality: Distributed Architecture 3-4
Generic CM Functionality: Cluster State 3-5
Generic CM Functionality: Node Failure Detection 3-6
Cluster Layer and Cluster Manager 3-7
Oracle-Supplied CM 3-8
Summary 3-9

4  Cluster Group Services and Node Monitor


Objectives 4-2
RAC and CGS/GMS and NM 4-3
Node Monitor (NM) 4-4
RDBMS SKGXN Membership 4-5
NM Groups 4-6
NM Internals 4-7
Node Membership 4-8
Instance Membership Changes 4-10
NM Membership Death 4-12
Starting an Instance: Traditional 4-13
Starting an Instance: Internal 4-14
Stopping an Instance: Traditional 4-15
Stopping an Instance: Internal 4-16
NM Trace and Debug 4-17
Cluster Group Services (CGS) 4-18
Configuration Control 4-19
Valid Members 4-20
Membership Validation 4-23
Membership Invalidation 4-24
CGS Reconfiguration Types 4-26
CGS Reconfiguration Protocol 4-27
Reconfiguration Steps 4-28
IMR-Initiated Reconfiguration: Example 4-30
Code References 4-32
Summary 4-33

5  RAC Messaging System


Objectives 5-2
RAC and Messaging 5-3
Typical Three-Way Lock Messages 5-4
Asynchronous Traps 5-5
AST and BAST 5-6
Message Buffers 5-7
Message Buffer Queues 5-8

Messaging Deadlocks 5-9


Message Traffic Controller (TRFC) 5-10
TRFC Tickets 5-11
TRFC Flow 5-13
Message Traffic Statistics 5-15
IPC 5-18
IPC Code Stack 5-19
Reference Implementation 5-20
KSXP Wait Interface to KSL 5-21
KSXP Tracing 5-22
KSXP Trace Records 5-23
SKGXP Interface 5-24
Choosing an SKGXP Implementation 5-25
SKGXP Tracing 5-26
Possible Hang Scenarios 5-27
Other Events for IPC Tracing 5-28
Code References 5-29
Summary 5-30
6  System Commit Number


Objectives 6-2
System Commit Number 6-3
Logical Clock and Causality Propagation 6-4
Basics of SCN 6-5
SCN Latching 6-7
Lamport Implementation 6-8
Lamport SCN 6-9
Limitations on SCN Propagation 6-10
max_commit_propagation_delay 6-11
Piggybacking SCN in Messages 6-12
Periodic Synchronization 6-13
SCN Generation in Earlier Versions of Oracle 6-14
Code References 6-15
Summary 6-16

7  Global Resource Directory: Formerly the Distributed Lock Manager


Objectives 7-2
RAC and Global Resource Directory (GRD) 7-3
DLM History 7-4
DLM Concepts: Terminology 7-5
DLM Concepts: Resources 7-6
DLM Concepts: Locks 7-7
DLM Concepts: Processes 7-8
DLM Concepts: Shadow Resources 7-9
DLM Concepts: Copy Locks 7-10
Resource or Lock Mastering 7-11
Basic Resource Structures 7-12

DLM Structures 7-13


Lock Mode Changes 7-16
Simple Lock Changes on a Resource 7-17
Changes on a Resource with Deadlock 7-18
DLM Functions 7-19
DLM Functionality in Global Enqueue Service Daemon (LMD0) 7-20
DLM Functionality in Global Enqueue Service Monitor (LMON) 7-22
DLM Functionality in Global Cache Service Process (LMS) 7-23
DLM Functionality in Other Processes 7-24
Configuring GES Resources 7-25
Configuring GES Locks 7-26
Configuring GCS Resources 7-27
Configuring GCS Locks 7-28
Configuring DLM processes 7-29
Logical to Physical Nodes Mapping 7-30
Buckets to Logical Nodes Mapping 7-31
Mapping for a New Node Joining the Cluster 7-32
Remapping When Node Joins 7-34
Mapping Broadcast by Master Node 7-35
Master Node Determination for GES 7-36
Master Node Determination for GCS 7-37
Dump and Trace of Remastering 7-38
DLM Functions 7-39
kjual Connection to DLM 7-40
kjual Flow 7-42
kjpsod Flow 7-43
DML Enqueue Handling Flow: Example 7-44
Step 1: P1 Locks Table in Share Mode 7-45
Step 2: P2 Locks Table in Share Mode 7-46
Step 3: P2 Does Rollback 7-47
Step 4: P1 Locks Table in Exclusive Mode 7-48
Step 5: P3 Locks Table in Share Mode 7-49
Step 6: P1 Does Rollback 7-50
Steps 1 and 2: Code Flow 7-51
Step 1: kjusuc Flow Detail 7-52
Step 2: kjusuc Flow Detail 7-54
Step 3: Code Flow 7-55
Step 3: kjuscl Flow Detail 7-56
Step 4: Code Flow 7-57
Step 4: kjuscv Flow Detail 7-58
Step 5: kjuscv Flow Detail 7-60
Step 6: kjuscl Flow Detail 7-61
Code References 7-63
Summary 7-64
References and Further Reading 7-65


8  Cache Coherency (Part One): Enqueues/Non-PCM


Objectives 8-2
Cache Coherency: Enqueues 8-3
Enqueue Types 8-6
Enqueue Structure 8-7
Examining Enqueues 8-8
Enqueues and DLM 8-9
Source Tree for Non-PCM Lock Flow 8-10
Lock Modes 8-11
Lock Compatibility 8-12
Deadlock Detection: The Classic Deadlock 8-13
Deadlock Detection: A More General Example 8-15
Deadlock Detection and Resolution 8-16
Timeout-Based Deadlock Detection 8-17
Deadlock Graph Printout 8-18
Deadlock Flow 8-19
Deadlock Flow: One Node 8-21
Deadlock Flow: Two Nodes 8-22
Parallel DML (PDML) Deadlocks 8-23
Deadlock Detection Algorithm 8-24
Deadlock Validation Steps 8-27
Code References 8-28
Summary 8-29

9  Cache Coherency (Part Two): Blocks/PCM Locks


Objectives 9-2
Cache Coherency: Blocks 9-3
Block Cache Contention 9-4
Earlier Cache Coherency: Oracle8 Ping Protocol 9-5
Earlier Cache Coherency: Oracle8i CR Server 9-6
Earlier Cache Coherency: Oracle8i CR Server 9-7
Oracle9i Cache Fusion Protocol 9-8
GCS (PCM) Locks 9-9
PCM Lock Attributes 9-10
Lock Modes 9-11
Lock Roles 9-12
Past Image 9-13
Local Lock Role 9-14
Global Lock Role 9-15
Block Classes 9-16
Lock Elements (LE) 9-17
Allocation of New LE 9-18
Hash Chain of LE 9-19
Block to LE Mapping 9-20
Queues of LE for LMS 9-21
LMSn Free of LE 9-22
Cache Fusion Examples: Overview 9-23

Cache Fusion: Example 1 9-25


Cache Fusion: Example 2 9-26
Cache Fusion: Example 3 9-27
Cache Fusion: Example 4 9-28
Cache Fusion: Example 5 9-29
Cache Fusion: Example 6 9-30
Cache Fusion: Example 7 9-31
Cache Fusion: Example 8 9-32
Cache Fusion: Example 9 9-33
Cache Fusion: Example 10 9-34
Cache Fusion: Example 11 9-35
Views 9-36
Parameters 9-39
Summary 9-40
10 Cache Fusion 1: CR Server
Objectives 10-2
Cache Fusion: Consistent Read Blocks 10-3
Consistent Read Review 10-4
Getting a CR Buffer 10-5
Getting a CR Buffer in Oracle9i Release 2 10-7
CR Server in Oracle9i Release 2 10-8
CR Requests 10-9
Light Work Rule 10-11
Fairness 10-12
Statistics 10-13
Wait Events 10-14
Fixed Table X$KCLCRST Statistics 10-15
CR Requestor-Side Algorithm 10-16
CR Requestor-Side AST Delivery 10-21
CR Requestor-Side CR Buffer Delivery 10-22
CR Server-Side Algorithm 10-23
Summary 10-27
11 Cache Fusion 2: Current Block: XCUR
Objectives 11-2
Cache Fusion: Current Blocks 11-3
PCM Locks and Resources 11-4
Fusion: Long Example 11-5
Initial State 11-7
Step 1: Instance 3 Performs SELECT 11-8
Lock Changes in Instance 3 11-9
Lock Changes in Instance 2 11-10
Step 2: Instance 2 Performs SELECT 11-11
Lock Changes in Instance 2 11-12
Step 3: Instance 2 Performs UPDATE 11-13
Lock Changes in Instance 2 11-14

Lock Changes in Instance 3 11-15


Step 4: Instance 1 Performs UPDATE 11-16
Lock Changes in Instance 2 11-17
Lock Changes in Instance 1 11-18
Step 5: Instance 3 Performs SELECT 11-19
Lock Changes in Instance 3 11-20
Step 6: Instance 1 Performs WRITE 11-21
Lock Changes in Instance 2 11-22
Lock Changes in Instance 1 11-23
Tables and Views 11-24
Summary 11-26
12 Cache Fusion Recovery
Objectives 12-2
NonCache Fusion OPS and Database Recovery 12-3
Cache Fusion RAC and Database Recovery 12-4
Overview of Fusion Lock States 12-5
Instance or Crash Recovery 12-6
SMON Process 12-7
First-Pass Log Read 12-8
Block Written Record (BWR) 12-9
BWR Dump 12-10
Recovery Set 12-11
Recovery Claim Locks 12-12
IDLM Response to RecoveryClaimLock Message on PCM Resource 12-13
No Lock Held by Recovering Instance on the PCM Resource 12-14
Recovery Claim Locks 12-15
Second-Pass Log Read 12-17
Large Recovery Set and Partial IR Lock Mode 12-19
Lock Database Availability During Recovery 12-22
Handling BASTs on Recovery Buffers 12-23
IR of Nonfusion Blocks 12-24
Failures During Instance Recovery 12-26
Memory Contingencies 12-28
Code References 12-29
Summary 12-31
Section III: Platforms
13 Linux Platform
Objectives 13-2
Linux RAC Architecture 13-3
Storage: Raw Devices 13-4
Extended Storage 13-5
Linux Cluster Software 13-6
OCMS 13-7
OCMS Components 13-8

WDD, NM, and CM Flow (Up to version 9.2.0.1) 13-9


Watchdog Daemon 13-10
Hangcheck, NM, and CM Flow (After version 9.2.0.2) 13-11
Hangcheck Module 13-12
Node Monitor (NM) 13-13
Cluster Manager 13-14
Linux Port-Specific Code 13-15
Cluster Manager 13-16
skgxpt and skgxpu 13-17
Installing RAC on Linux 13-18
Running RAC on Linux 13-21
Starting CM 13-22
Starting WDD 13-23
Starting NM 13-24
Starting CM 13-25
Debugging 13-26
Summary 13-27
References 13-28
14 HP-UX Platform
Objectives 14-2
HP-UX RAC Architecture 14-3
HP-UX Cluster Software 14-4
HP-UX Port-Specific Code 14-5
SKGXP (UDP Implementation) 14-6
SKGXP: Lowfat 14-7
Installing RAC on HP-UX 14-8
Running RAC on HP-UX 14-9
Debugging on HP-UX 14-10
Summary 14-11
15 Tru64 Platform
Objectives 15-2
Tru64 RAC Architecture 15-3
Shared Disk Systems 15-4
Tru64 Cluster Software 15-5
Tru64 Port-Specific Code 15-6
Node Monitor: SKGXN 15-7
IPC: SKGXP 15-8
SKGXPM: RDG 15-9
Installing RAC on Tru64 15-11
Debugging on Tru64 15-12

Useful Tru64 Commands 15-13


Summary 15-15
16 AIX Platform
Objectives 16-2
AIX RAC Architecture 16-3
AIX SP Clusters 16-4
AIX HACMP Clusters 16-5
AIX Cluster Software 16-6
AIX Cluster Layer 16-7
AIX Port-Specific Code 16-8
RAC on AIX Stack 16-9
Node Monitor (NM) 16-10
Installing RAC on AIX 16-12
Debugging on AIX 16-14
Summary 16-15
References 16-16
17 Other Platforms
Objectives 17-2
RAC Architecture: Solaris 17-3
RAC Architecture: Windows 17-4
RAC Architecture: OpenVMS 17-5
Port-Specific Code 17-6
Installing RAC 17-7
Summary 17-8
Section IV: Debug
18 V$ and X$ Views and Events
Objectives 18-2
V$ and GV$ Views 18-3
List of Views 18-4
Old and New Views 18-5
V$ Views for Lock Information 18-6
X$ Tables 18-7
Events 18-8
19 KST and X$TRACE
Objectives 19-2
KST: X$TRACE 19-3
KST Concepts 19-4
KST Concepts 19-6
Circular Buffer 19-7

Data Structure kstrc 19-8


Trace Control Interfaces 19-9
KST Initialization Parameters 19-10
KST Trace Control Interfaces 19-12
KST Fixed Table Views 19-14
KST Trace Output 19-15
KST Current Instrumentation 19-18
KST Performance 19-19
KST: Examples 19-20
KST Sample Trace File 19-24
KST Demonstration 19-25
DIAG Daemon 19-26
DIAG Daemon: Features 19-27
DIAG Daemon: Design 19-29
DIAG Daemon: Startup and Shutdown 19-33
DIAG Daemon: Crash Dumping 19-34
Summary 19-36
20 ORADEBUG and Other Debugging Tools
Objectives 20-2
ORADEBUG 20-3
Flash Freeze 20-5
LKDEBUG 20-6
NSDBX 20-7
HANGANALYZE 20-8
Summary 20-9
References 20-10
Appendix A: Practices
Appendix B: Solutions


Course Overview

DSI 408: RAC Internals


Prerequisites

Before taking this course, you should have:
- Taken DSI 401, 402, and 403 so that you know about the server internals on crashes, dumps, transactions, block handling, and recovery systems
- Taken the Real Application Clusters (RAC) administration course so that you know about the external view of RAC
- Performed at least one RAC installation and assisted in at least one RAC debugging case

Prerequisites
The prerequisites ensure that the course is useful to you, instead of being too hard, and that
the instructor need not cover basic material.
You must have your TAO account ready for examining source code.

DSI408: Real Application Clusters Internals I-2

Course Overview

The course includes the following four sections:
- Introduction
- Architecture
- Platforms
- Debug

Subjects that are not covered include:
- Utilities (srvctl, OCFS, HA)
- Performance tuning
- Pre-Oracle9i versions (OPS)


Course Overview
This course contains four sections. It is scheduled to take four days but does not require
one day per section. Most of the time is spent on the Architecture section.
Introduction
The Introduction section provides a summary of the public RAC architecture and its
accurate terminology. An overview of architecture changes between versions is also given.
Architecture
The Architecture section covers the theory of operation of RAC. The RAC code stack is
examined from the bottom up. There are many references to the source code.
Platforms
The Platforms section covers the differences and architectural details of RAC
implementation on different platforms. Installation issues and known gotchas are
included.


Course Overview (continued)


Debug
The Debug section provides a detailed explanation of the trace and dump mechanisms that
are placed inside RAC for fault location. A number of practical exercises use these
mechanisms.
Subjects not Covered
This course does not cover utility modules that are not part of the primary core RAC
functionality. It also does not cover some of the external programs that RAC depends on.
Performance is not covered as a separate topic. The knowledge from this course should be
sufficient to identify performance bottlenecks that are purely relevant to RAC; otherwise,
tuning is the same as for a single instance.
For Oracle Parallel Server (OPS), you should review earlier courses, which point out the differences between RAC and OPS; the RAC knowledge in this course is not applicable to OPS.


Practical Exercises


The course includes practical exercises.


Exercises run on a shared Solaris cluster.


Practical Exercises
The cluster hardware is shared between students and other classes; this prevents practices that involve node shutdown or breaking the interconnect.


[Figure: Diagram mapping the course sections onto the RAC component stack: SQL Layer and Buffer Cache on top of GES/GCS, CGS, the Node Monitor, and the Cluster Manager. Section I (Introduction) and Section II labels indicate which parts of the stack each section covers.]


Introduction to RAC


Objectives

After completing this lesson, you should be able to do the following:
- Review the design objectives of Real Application Clusters (RAC)
- Relate Oracle9i RAC to its predecessors


Why Use Parallel Processing?

- Scaleup: Increased throughput
- Speedup: Increased performance or faster response
- Higher availability
- Support for a greater number of users


Why Use Parallel Processing?


Scaleup: Increased Throughput
Parallel processing breaks a large task into smaller subtasks that can be performed
concurrently. With tasks that grow larger over time, a parallel system that also grows (or
scales up) can maintain a constant time for completing the same task.
Speedup: Increased Performance
For a given task, a parallel system that can scale up improves the response time for
completing the same task.
For decision support system (DSS) applications and parallel queries, parallel
processing decreases the response time.
For online transaction processing (OLTP) applications, speedup cannot be expected
due to the overhead of synchronization. Depending on the precise circumstances, a
decrease in performance can occur.


Why Use Parallel Processing? (continued)


Higher Availability
Because each node running in the parallel system is isolated from other nodes, a single node
failure or crash should not cause other nodes to fail. Other instances in the parallel server
environment remain up and running.
The operating system's failover capabilities and the fault tolerance of the distributed cluster software are important infrastructure components.
Support for a Greater Number of Users
Each node can support several users because each node has its own set of resources, such as
memory, CPU, and so on. As nodes are added to the system, more users can also be added,
allowing the system to continue to scale up.


Scaleup and Speedup

[Figure: Original system: one hardware unit completes 100% of the task in a given time. Cluster system scaleup: added hardware completes up to 200% or 300% of the task in the same time. Cluster system speedup: the task is split (50% each) across hardware units and completes in less time.]

Scaleup and Speedup


Scaleup
Scaleup is the capability of providing continued increases in throughput in the presence of
limited increases in processing capability while keeping the time constant:
Scaleup = (volume_parallel / volume_original) - time for interprocess communication
For example, if 30 users consume close to 100% of the CPU during their normal
processing, adding more users would cause the system to slow down due to contention for
limited CPU cycles. By adding CPUs, however, extra users can be supported without
degrading performance.
Speedup
Speedup is the capability of providing continued increases in speed in the presence of
limited increases in processing capability while keeping the task constant:
Speedup = (time_original / time_parallel) - time for interprocess communication
Speedup results in resource availability for other tasks. For example, if queries normally
take 10 minutes to process, and running in parallel reduces the time to 5 minutes, then
additional queries can run without introducing the contention that might occur if they were
to run concurrently.

Scaleup and Speedup (continued)


Speedup (continued)
Example 1: A particular application might take N seconds to fully scan and produce a summary of a 1 GB table.
With scaleup, if the table doubles in size, then doubling hardware resources should allow
the query to still complete in N seconds.
With speedup, if the table does not grow in size, doubling the hardware resources should
allow the query to complete in N/2 seconds.
Example 2: A particular application might have 100 users, each getting a three-second
response on queries.
With scaleup, if the number of users doubles in size, then doubling hardware resources
should allow response time to remain at three seconds.
With speedup, if the number of users remains the same, doubling the hardware resources
should reduce the response time. This occurs only if the three-second activity can be
broken down into two separate activities that can run independently of each other.
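The two examples above reduce to simple arithmetic. The sketch below ignores the interprocess communication term in the formulas; the function names are illustrative, not from the course:

```python
def scaleup(volume_parallel, volume_original):
    """Scaleup: how much more work completes while time is held constant."""
    return volume_parallel / volume_original

def speedup(time_original, time_parallel):
    """Speedup: how much faster the same, fixed task completes."""
    return time_original / time_parallel

# Example 1, scaleup: the table doubles in size and the hardware doubles,
# so twice the volume still completes in the same N seconds.
print(scaleup(volume_parallel=2.0, volume_original=1.0))   # 2.0

# Example 1, speedup: the table stays 1 GB, the hardware doubles,
# and the scan drops from N to N/2 seconds.
print(speedup(time_original=10.0, time_parallel=5.0))      # 2.0
```

In practice the interprocess communication overhead subtracts from both ratios, which is why perfect 2x figures are rarely achieved.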
A Success Example of Scaleup
The following testimonial is from the internal RAC mailing list. This was a response to
a question about the ease of changing a single instance to an RAC system.
Just yesterday, we tested with a customer a migration from single instance to two-node
RAC on Solaris. They were using Veritas DBE/AC for the cluster system.
These are the steps we took:
1. Node 1 Server running 9i single instance at approx 80% CPU load.
2. Connection through Transparent Application Failover with 40 retries and a delay of
five seconds.
3. Alter shared initialization file to set Cluster Database = true and add extra
parameters for the second node (bdump location and so on).
4. Shut down Database on Node 1.
5. Start up Database on Node 2 using new initialization file.
6. Start up Database on Node 1 using new initialization file.
At this point we had 85% of users on Node 1 and 15% on Node 2.
7. Run a script to disconnect sessions on Node 1 to allow them to load balance across
to Node 2.
At this point we had 50% of users on Node 1 and 50% on Node 2. The database was no
longer highly loaded and we were able to add more (now load-balanced) users.
The application was written in Java and was TAF-aware (i.e., it knew to retry transactions
with certain warning messages). Once we added the second node, the TPMs per Node
remained approximately the same so we had over 1.9 x improvement in TPMs, which was
pretty good scaling.
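Step 3 of the migration amounts to a handful of shared initialization file changes. A minimal sketch, assuming instance names rac1 and rac2 and hypothetical dump paths (the testimonial does not give the actual values):

```
*.cluster_database = true
*.cluster_database_instances = 2
rac1.instance_number = 1
rac2.instance_number = 2
rac1.background_dump_dest = /u01/app/oracle/admin/db/bdump1
rac2.background_dump_dest = /u01/app/oracle/admin/db/bdump2
```

Each instance also needs its own redo thread and undo configuration; the point of the example is only that the single-instance-to-RAC switch is a parameter change, not a rebuild.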


Scalability Considerations

- Hardware: Disk I/O
- Internode communication: High bandwidth and low latency
- Operating system: Number of CPUs (for example, SMP)
- Cache coherency and the Global Cache Service
- Database: Design
- Application: Design


Scalability Considerations
It is important to remember that if any of these six areas are not scalable (no matter how
scalable the other areas are), parallel cluster processing may not be successful.
Hardware scalability: High bandwidth and low latency offer the maximum scalability.
A high amount of remote I/O may prevent system scalability, because remote I/O is
much slower than local I/O.
Bandwidth of the communication interface is the total size of messages that can be
sent per second. Latency of the communication interface is the time required to place
a message on the interconnect. It indicates the number of messages that can be put on
the interconnect per unit of time.
Operating system: Nodes with multiple CPUs and methods of synchronization in the
OS can determine how well the system scales. Symmetric multiprocessing can
process multiple requests to resources concurrently.


Scalability Considerations (continued)


"The processes that manage local resource coordination in a cluster database are identical to the local resource coordination processes in single instance Oracle. This means that row and block level access, space management, system change number (SCN) creation, and data dictionary cache and library cache management are the same in Real Application Clusters as in single instance Oracle. If the resource is modified by more than one instance, then RAC performs further synchronization on a global level to permit shared access to this block across the cluster. Synchronization in this case requires internode messaging as well as the preparation of consistent read versions of the block and the transmission of copies of the block between memory caches within the cluster database." (See Oracle9i Real Application Clusters Concepts, Release 2 (9.2), Part Number A96597-01, Chapter 5, "Real Application Clusters Resource Coordination.")
Database scalability: Database scalability depends on how well the database is
designed (for example, how the data files are arranged, how well the locks are
allocated, and how well the objects are partitioned).
Scalability of the application: Application design is one of the keys to taking
advantage of the other elements of scalability. Regardless of how well the hardware
and database scale, parallel processing does not work as desired if the application
does not scale.
A typical cause for the lack of scalability is one common shared resource that must be
accessed often. This causes the otherwise parallel operations to serialize on this bottleneck.
A high latency in the synchronization increases the cost of synchronization, counteracting
the benefits of parallelization. This is a general limitation and not a RAC-specific
limitation.


RAC Costs: Synchronization

To scale, there is a cost in synchronization:
- Scalability = Synchronization
- Less synchronization = Speedup and scaleup

Synchronization is necessary to maintain cache coherency in RAC.


RAC Costs: Synchronization


Synchronization is a necessary part of parallel processing, but for parallel processing to be
advantageous, the cost of synchronization must be determined.
Synchronization provides the coordination of concurrent tasks and is essential for parallel
processing to maintain data integrity or correctness. Proper locking between disjoint SGAs
(Oracle instances) must be maintained to ensure correct data. This is cache coherency.
Partitioning can help reduce synchronization costs because there are fewer
concurrent tasks (that is, fewer concurrent users modifying the same set of data).
An application that modifies a small set of data can cause a high overhead for
synchronization if performed in disjoint SGAs.
Contention occurs between instances using a single block or row, such as a table with
one row that is used to generate sequence numbers.
Two ways to synchronize:
- Locks: latches, enqueues, locks
- Messages: send/wait for messages
Synchronization = Amount × Cost
- Amount: How often do you need to synchronize?
- Cost: How expensive is it to synchronize?
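The Amount × Cost model above can be made concrete with a small sketch. The per-operation costs below are hypothetical round numbers for illustration only, not Oracle measurements; the point is that an inter-instance message is orders of magnitude more expensive than a local latch, so the same Amount yields a very different total:

```python
def sync_overhead(amount: int, cost_us: float) -> float:
    """Total synchronization overhead = Amount (how often) x Cost (how expensive),
    expressed here in microseconds."""
    return amount * cost_us

# Hypothetical costs: a local latch get vs. an inter-instance message.
local_latch = sync_overhead(amount=100_000, cost_us=1.0)      # 100,000 us total
global_message = sync_overhead(amount=100_000, cost_us=200.0) # 20,000,000 us total
```

Reducing either factor (synchronize less often, or make each synchronization cheaper) reduces the total overhead; Cache Fusion attacks the Cost factor, application partitioning attacks the Amount factor.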
DSI408: Real Application Clusters Internals I-17

Levels of Synchronization
Row level (database):
- Oracle row-locking feature
- Maximizes concurrency
- SCN coherency
Local cache level (intra-instance):
- Every buffer in the cache is protected by logical semaphores (spin latches)
- Access to buffers is synchronized
Global Cache Fusion (inter-instance DLM):
- Every buffer in every cache is tracked by the GCS
- Cache coherency / cache consistency

- Latches: CACHE BUFFERS CHAINS, CACHE BUFFER HANDLES
- Global Resource Directory managed by the Global Cache Service (GCS)
  (the old DLM in pre-9i releases)
- Cache coherency: the synchronization of data in multiple caches so that
  reading a memory location by way of any cache returns the most recent data
  written to that location by way of any other cache. Sometimes called cache
  consistency.

DSI408: Real Application Clusters Internals I-18

Levels of Synchronization: Row Level

[Diagram: within one instance, foreground processes fg1 and fg2 update row1
and row2 in database blocks 100 and 101; the Global Cache (iDLM) is shown but
not involved, because row-level locking is handled in the database itself.]

Enqueues are local locks that serialize access to various resources. This
wait event indicates a wait for a lock that is held by another session (or
sessions) in an incompatible mode to the requested mode. See
<Note:29787.1> (about V$LOCK) for details of which lock modes are
compatible with which. Enqueues are usually represented in the format
"TYPE-ID1-ID2", where:
- "TYPE" is a 2-character text string
- "ID1" is a 4-byte hexadecimal number
- "ID2" is a 4-byte hexadecimal number
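The "TYPE-ID1-ID2" representation can be decoded mechanically. The sketch below is our own illustration (the helper name and the sample TX enqueue values are made up, not taken from a real system); Oracle itself exposes these fields through views such as V$LOCK:

```python
def parse_enqueue_name(name):
    """Split an enqueue name like 'TX-00050021-00000154' into its
    2-character type and the two 4-byte IDs (decoded from hex)."""
    lock_type, id1, id2 = name.split("-")
    if len(lock_type) != 2:
        raise ValueError("enqueue type must be a 2-character string")
    return lock_type, int(id1, 16), int(id2, 16)

# Example: a hypothetical TX (transaction) enqueue.
lock_type, id1, id2 = parse_enqueue_name("TX-00050021-00000154")
```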

DSI408: Real Application Clusters Internals I-19

Levels of Synchronization: Local Cache

[Diagram: foreground processes fg1 and fg2 update row1 and row2 through
buffers in the instance's buffer cache (BCache); access to the cached copies
of blocks 100 and 101 is synchronized within the instance.]

DSI408: Real Application Clusters Internals I-20

Levels of Synchronization: Global Cache

[Diagram: two instances, each with its own buffer cache (BCache), update row1
and row2 in blocks 100 and 101; the updates are coordinated through the
Global Cache (iDLM) and its Global Resource Directory.]

Global resources: inter-instance synchronization mechanisms that provide
cache coherency for Real Application Clusters. The term can refer to both
Global Cache Service (GCS) resources and Global Enqueue Service (GES)
resources.

DSI408: Real Application Clusters Internals I-21

We need a cache

- Serialization is the easiest method to manage concurrency, but conversely
  it costs in terms of system throughput.
- Evolutions of Oracle minimize the set of tasks that are serialized.
- Sequencing operations guarantees consistency of data.
- But: it minimizes the level of concurrency of the system.
- And: the time to complete a sequence of operations depends on the slowest
  element: the disks.

[Diagram: several foreground processes (fg) serialize on access to the
database blocks.]

Given a set of tasks [T1, T2, ..., Tn] that arrive at times [t1 < t2 < ... < tn],
suppose that the system has enough processing units to allow the maximum
potential level of parallelism for these tasks. You can approach the problem
of running all the tasks in minimal time (maximum throughput) in at least two
modes:
1) Execute the tasks sequentially, as they arrive; the last to arrive waits
   until the previous ones have terminated. This does not use the potential
   parallelism of your machine.
   Good: easy to implement. Bad: performance.
2) Implement a lock/wait infrastructure and allow all the tasks to run freely
   until they are blocked by some other task(s). The effective degree of
   parallelism is at its maximum when the set of synchronization points is
   minimal.

DSI408: Real Application Clusters Internals I-22
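The second mode (a lock/wait infrastructure where tasks run freely and block only at synchronization points) can be sketched with ordinary threads. This is a generic illustration, not Oracle code: the lock serializes only the one shared update, and everything else in each task may proceed in parallel:

```python
import threading

def run_parallel_with_locks(num_tasks, increments):
    """Run tasks concurrently; a lock serializes only the shared update,
    which is the single synchronization point of each task."""
    counter = 0
    lock = threading.Lock()

    def task():
        nonlocal counter
        for _ in range(increments):
            with lock:          # block only while touching the shared resource
                counter += 1    # the sole serialized step

    threads = [threading.Thread(target=task) for _ in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter

# Four tasks, each incrementing the shared counter 10,000 times -> 40000.
total = run_parallel_with_locks(4, 10_000)
```

The fewer statements sit inside the `with lock:` block, the smaller the serialized fraction and the higher the effective degree of parallelism.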

Coherency

[Diagram: resource 1,0x100 is held in shared (S) mode by two instances; one
foreground selects row1 with a query that started at SCN 900 and another
selects row2 with a query that started at SCN 1010, each against its own
buffer cache copy of block 100; the on-disk block is at SCN 800. The system
reaches a maximum level of concurrency.]

Ex: ALTER SYSTEM DUMP DATAFILE 5 BLOCK 4690;

ALTER SYSTEM DUMP DATAFILE {'filename'}|{filenumber}
  |--- BLOCK MIN {blockno} BLOCK MAX {blockno} ---|
  |--- BLOCK {blockno} ---------------------------|

Note: the block dump reports the buffer cache copy of the block if the block
is CURRENT/dirty in the current instance.
alter session set events 'immediate trace name BUFFER level <RDBA>';

DSI408: Real Application Clusters Internals I-23

Coherency: Costs of Locks

DSI408: Real Application Clusters Internals I-24

Fixed*/Releasable 1:M Lock Model (Static)

(*) Starting with 9i, the fixed locking mode was removed.

[Diagram: in the 1:M model, one lock element in the Global Cache (iDLM)
covers many database blocks (blocks 100-104), regardless of which of them an
instance actually caches.]

GC_FILES_TO_LOCKS = 1=100:2=0:3=1000:4-5=0EACH
GC_FILES_TO_LOCKS = {file_list=lock_count[!blocks][EACH][:...]}

PCM lock names:
- Type is always BL (because PCM locks are buffer locks).
- ID1 is the block class (described in Classes of Blocks).
- ID2: for fixed locks, ID2 is the lock element (LE) index number obtained by
  hashing the block address (see the GV$LOCK_ELEMENT/GV$GC_ELEMENT fixed
  view); for releasable locks, ID2 is the database address of the block.

Non-PCM locks:
CF      Controlfile Transaction         PF      Password File
CI      Cross-Instance Call Invocation  PR      Process Startup
DF      Datafile                        PS      Parallel Slave Synchronization
DL      Direct Loader Index Creation    Q[A-Z]  Row Cache
DM      Database Mount                  RT      Redo Thread
DX      Distributed Recovery            SC      System Commit Number
FS      File Set                        SM      SMON
IN      Instance Number                 SN      Sequence Number
IR      Instance Recovery               SQ      Sequence Number Enqueue
IS      Instance State                  ST      Space Management Transaction
IV      Library Cache Invalidation      SV      Sequence Number Value
KK      Redo Log Kick                   TA      Transaction Recovery
L[A-P]  Library Cache Lock              TT      Temporary Table
MM      Mount Definition                TX      Transaction
MR      Media Recovery
N[A-Z]  Library Cache Pin
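The GC_FILES_TO_LOCKS syntax above can be expanded mechanically. The parser below is our own sketch, not Oracle code, and it ignores the optional "!blocks" grouping factor; it maps each file number to its lock count and whether the EACH modifier applies:

```python
def parse_gc_files_to_locks(value):
    """Expand 'file_list=lock_count[EACH][:...]' into {file#: (lock_count, each)}.
    The optional '!blocks' grouping factor is not handled in this sketch."""
    result = {}
    for clause in value.split(":"):
        files, count = clause.split("=")
        each = count.endswith("EACH")
        n = int(count[:-4] if each else count)
        for part in files.split(","):
            if "-" in part:              # a file range such as 4-5
                lo, hi = map(int, part.split("-"))
                nums = range(lo, hi + 1)
            else:
                nums = [int(part)]
            for f in nums:
                result[f] = (n, each)
    return result

# File 1 gets 100 locks, files 2, 4, and 5 get 0 (releasable) locks,
# file 3 gets 1000 locks; EACH applies per file in the 4-5 range.
locks = parse_gc_files_to_locks("1=100:2=0:3=1000:4-5=0EACH")
```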

DSI408: Real Application Clusters Internals I-25

False Pinging

[Diagram: lock element LE 23 in the Global Cache (iDLM) covers several data
block addresses; the instance's buffer cache (BCache) holds dirty buffers for
several of the blocks it covers while fg1 is updating.]

When another instance needs access to dba 100, the owning instance must ping
(write out) all the dirty blocks that are covered by the same LE.
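The false-pinging effect follows directly from the 1:M mapping. The modulo hash below is purely illustrative (the real LE hash is internal to Oracle), but it shows the mechanism: with fewer lock elements than blocks, unrelated dirty blocks land on the same LE, so a request for one block forces writes of the others:

```python
def lock_element(dba, num_lock_elements):
    """Illustrative 1:M mapping: many block addresses hash to one LE."""
    return dba % num_lock_elements

def blocks_to_ping(dirty_dbas, requested_dba, num_lock_elements):
    """All dirty blocks covered by the requested block's LE must be written
    (pinged), even if the requester never touches them."""
    le = lock_element(requested_dba, num_lock_elements)
    return [d for d in dirty_dbas if lock_element(d, num_lock_elements) == le]

# Blocks 100, 103, 105, and 123 are dirty; another instance asks for 100.
# With 23 lock elements, dba 123 shares an LE with dba 100 and is falsely
# pinged along with it.
victims = blocks_to_ping([100, 103, 105, 123], 100, 23)
```

The 1:1 releasable model on the next slide removes this effect by giving every cached block its own lock element.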

DSI408: Real Application Clusters Internals I-26

Releasable 1:1 Lock Model (Dynamic)

[Diagram: in the 1:1 model, each cached block has its own lock element
(LE 100, LE 105, ...); the instance's buffer cache holds dba 101, 103, and
105 while fg1 is updating, and only the blocks actually cached are covered.]

break on GC_ELEMENT_NAME
select inst_id, GC_ELEMENT_NAME, CLASS, MODE_HELD
from gv$gc_element where GC_ELEMENT_NAME > 20970000
order by GC_ELEMENT_NAME;

The output lists one row per instance for each element, for example
GC_ELEMENT_NAME values 20971522, 20971523, 20971913, 20971914, 20976209, and
20976210, with their CLASS and MODE_HELD. A GC_ELEMENT_NAME, split as a DBA
(hex), yields file# and block#.

DSI408: Real Application Clusters Internals I-27
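Splitting a GC_ELEMENT_NAME as a data block address follows the standard relative DBA layout (top 10 bits file number, low 22 bits block number); the same decoding is exposed by DBMS_UTILITY.DATA_BLOCK_ADDRESS_FILE and DATA_BLOCK_ADDRESS_BLOCK. A quick sketch:

```python
def split_dba(dba):
    """Split a relative data block address into (file#, block#):
    the top 10 bits are the file number, the low 22 bits the block number."""
    return dba >> 22, dba & 0x3FFFFF

# 20971522 = 0x1400002 -> file 5, block 2
file_no, block_no = split_dba(20971522)
```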

Scalability
Scaleup
Scaleup is the capability to provide continued increases in
throughput in the presence of limited increases in processing
capability while keeping time constant:
Scaleup = (volume parallel) / (volume original)

Speedup
Speedup is the capability to provide continued increases in speed in
the presence of limited increases in processing capability, while
keeping the task constant:
Speedup = (time original) / (time parallel)
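The two definitions above can be computed directly; the sample figures in the comments are illustrative, not measurements:

```python
def scaleup(volume_parallel, volume_original):
    """Scaleup = (volume parallel) / (volume original), time held constant."""
    return volume_parallel / volume_original

def speedup(time_original, time_parallel):
    """Speedup = (time original) / (time parallel), task held constant."""
    return time_original / time_parallel

# A task that took 90 s on one node and 30 s on three nodes shows a
# speedup of 3.0 (linear); tripling the processed volume in the same
# time shows a scaleup of 3.0.
```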


DSI408: Real Application Clusters Internals I-28

RAC Costs: Global Resource Directory

- Single instance: synchronization of concurrent tasks and access to shared
  resources.
- The Global Resource Directory (GRD) records information about how
  resources are used within a cluster database. The Global Cache Service
  (GCS) and Global Enqueue Service (GES) manage the information in this
  directory. Each instance maintains part of the global resource directory
  in its System Global Area (SGA).

RAC Costs: Global Resource Directory


In single-instance environments, locking coordinates access to a common resource, such as
a row in a table. Locking prevents two processes from changing the same resource (or row)
at the same time.
In RAC environments, internode synchronization is critical because it maintains proper
coordination between processes on different nodes, preventing them from changing the
same resource at the same time. Internode synchronization guarantees that each instance
sees the most recent version of a block in its buffer cache.

DSI408: Real Application Clusters Internals I-29

RAC Costs: Global Resource Directory (continued)


Resource coordination within Real Application Clusters occurs at both an instance level
and at a cluster database level. Instance level resource coordination within Real
Application Clusters is referred to as local resource coordination. Cluster level
coordination is referred to as global resource coordination.
The processes that manage local resource coordination in a cluster database are identical to
the local resource coordination processes in single instance Oracle. This means that row
and block level access, space management, system change number (SCN) creation, and
data dictionary cache and library cache management are the same in Real Application
Clusters as in single instance Oracle.
If the resource is modified by more than one instance, then RAC performs further
synchronization on a global level to permit shared access to this block across the cluster.
"Synchronization in this case requires internode messaging as well as the preparation of
consistent read versions of the block and the transmission of copies of the block between
memory caches within the cluster database." (See Oracle9i Real Application Clusters
Concepts Release 2 (9.2), Part Number A96597-01, Chapter 5, Real Application Clusters
Resource Coordination.)
Note: Global Cache Service (GCS) and Global Enqueue Service (GES) do not interfere
with row-level locking and vice versa. Row-level locking is a transaction feature.

DSI408: Real Application Clusters Internals I-30

RAC Costs: Cache Coherency

Cache coherency is the technique of keeping multiple copies of an object
consistent between different Oracle instances.

RAC Costs: Cache Coherency


Maintaining cache coherency is an important part of a cluster. Cache coherency is the
technique of keeping multiple copies of an object consistent between different Oracle
instances (or disjoint caches) on different nodes.
Global cache management ensures that access to a master copy of a data block in an SGA
is coordinated with the copy of the block in other SGAs.
Therefore, the most recent copy of a block in all SGAs contains all changes that are made
to that block by any instance in the system, regardless of whether those changes have been
committed on the transaction level. Full redo protection of the block changes is maintained.

DSI408: Real Application Clusters Internals I-31

RAC Costs: Cache Coherency

[Diagram: three nodes, each running an instance (A, B, and C) with its own
SGA; the GES/GCS components of the three instances communicate with each
other to keep the caches coherent.]

RAC Costs: Cache Coherency (continued)


The cost (or overhead) of cache coherency is the need, before any access to a
specific shared resource, to first check with the other instances whether this
particular access is permitted. The algorithms optimize the need to coordinate
on each and every access, but some overhead is incurred.
The GCS tracks the locations, modes, and roles of data blocks. The GCS therefore also
manages the access privileges of various instances in relation to resources. Oracle uses the
GCS for cache coherency when the current version of a data block is in one instance's
buffer cache and another instance requests that block for modification. If an instance reads
a block in exclusive mode, then in subsequent operations multiple transactions within the
instance can share access to a set of data blocks without using the GCS. This is true,
however, only if the block is not transferred out of the local cache. If the block is
transferred out of the local cache, then the GCS updates the Global Resource
Directory to show that the resource has a global role; whether the resource's
mode converts from exclusive to another mode depends on how other instances
use the resource.

DSI408: Real Application Clusters Internals I-32

RAC Terminology

- Cache coherency
- Resources and locks
- Global and local
- GCS and GES, or PCM and non-PCM
- GRM or DLM
- Node, instance, cluster, and process

RAC Terminology
Cache coherency means that the contents of the caches in different nodes are in a welldefined state with respect to each other. Cache coherency identifies the most up-to-date
copy of a resource, which is also called the master copy. In case of node failure, no vital
information is lost (such as committed transaction state), and atomicity is maintained. This
requires additional logging or copying of data but is not part of the locking system.
A resource is an identifiable entity; that is, it has a name or reference. The entity referred
to is usually a memory region, a disk file, or an abstract entity; the name of the resource is
the resource. A resource can be owned or locked in various states, such as exclusive or
shared.
By definition, any shared resource is lockable. If it is not shared, there is no access
conflict. If it is shared, access conflicts must be resolved, typically with a lock. The terms
lock and resource, although they refer to entirely separate objects, are therefore
(unfortunately) used interchangeably.
A global resource is one that is visible and used throughout the cluster. A local resource
is used by only one instance. It may still have locks to control access by the multiple
processes of the instance, but there is no access to it from outside the instance.
DSI408: Real Application Clusters Internals I-33

RAC Terminology (continued)


Data buffer cache blocks are the most obvious and most heavily used global resource.
There are other data item resources that are global in the cluster, such as transaction
enqueues and database data structures. The data buffer cache blocks are handled by the
Global Cache Service (GCS), also called Parallel Cache Management (PCM). The
non-data-block resources are handled by the Global Enqueue Service (GES), also called
non-Parallel Cache Management (non-PCM).
The Global Resource Manager (GRM) keeps the lock information valid and correct
across the cluster.
From the module skgxn.h:

Node: An individual computer with one or more CPUs, some memory, and access
to disk storage (generally capable of running an instance of OPS).

Cluster: A collection of loosely coupled nodes that support a parallel
Oracle database.

Cluster Membership: The set of active nodes in a cluster. These are the
nodes that are "alive" and have access to shared resources (that is, shared
disk). Nodes that are not in the current cluster membership must not have
access to shared resources.

Instance: Distributed services typically are made up of several identical
components, one on each node of a cluster. One of these components will be
called an "instance." For example, an OPS database will have an Oracle
instance running on each node.

Process: For the purposes of this interface, a process is a unit of
execution. On some operating systems, this may be equivalent to an OS
process. On others, it may be equivalent to an OS thread. A process is
considered terminated when it can no longer execute, pending OS requests are
completed/canceled, and any process-local resources are released.
Note that the older OPS terms are used in the code, but the terms are also valid for RAC.

DSI408: Real Application Clusters Internals I-34

Terminology Translations

- Terminology depends on the speaker:
  - Product managers to sales or marketing
  - Support, technical teams, development
- Terminology depends on the version:
  - Older terms tend to stay in code
  - Variable names and prefixes reflect the older name
  - Newer names reflect newer application or functionality

Terminology Translations
RAC = OPS. OPS is the older term. See the History slide (#19) in this lesson.
Row Cache = Dictionary Cache. Row Cache is the older term. It is the SGA area to cache
database dictionary information. It is a global resource.
Distributed Lock Manager (DLM) = Global Resource Manager (GRM). DLM is the older
term; GRM has slightly more functionality. The terms are used for any locking system that
can handle several processes, typically (but not necessarily) on several nodes.
DLM = IDLM = UDLM. The DLM term is a very general term, but also refers to the
external operating-system-supplied DLM used by Oracle7. IDLM refers to the Integrated
DLM introduced in Oracle8. UDLM is the Universal DLM, that is, the reference
implementation of a DLM made on the Solaris platform. It is often called by its code
reference skgxn-v2.
Some of the RAC processes have retained their old names but are described with a
different purpose:
LMON: Global Enqueue Service Monitor, previously Lock Monitor
LMD: Global Enqueue Service Daemon, previously Lock Monitor Daemon
LMS: Global Cache Service Processes, previously Lock Manager Services
DSI408: Real Application Clusters Internals I-35

Terminology Translations (continued)


Terminology in This Course
This course reflects the mixed usage of similar terms and aligns more with the terminology
of code than with the externalized names.

DSI408: Real Application Clusters Internals I-36

Programmer Terminology

- Client or user: calling code
- Callback: routine to execute when the called program has new information

Programmer Terminology
Inside the code, comments often refer to the programmer's point of view.
Client and user are used interchangeably; both refer to the calling code.
Client code can register interest in a service by giving a pointer to a data
structure that is to be updated, or a routine that is to be called, when the
service has completed the required action.

DSI408: Real Application Clusters Internals I-37

History

- Real Application Clusters (RAC) is the current product.
- RAC has some similarity to Oracle Parallel Server (OPS):
  - Has the same end-user capability: a clustered database
  - Scales better because of better internal handling of cache coherency
  - Has some internal, fundamental changes in the global cache

History
Oracle Parallel Server (OPS) historically had a bad reputation; it was not scalable. Most
applications ran slower on an OPS system than on a single instance. There was a need to
carefully determine which instance performed DML on which tables or (more accurately)
on which blocks. With RAC this need has been eliminated, resulting in true scalability.
Although RAC borrows much code from OPS, the official policy is not to mention that
RAC is an evolved version of OPS. Oracle does not want the bad reputation of OPS to
adversely affect the reputation of RAC in the market. Internally (in the code), the OPS
heritage in RAC is evident.

DSI408: Real Application Clusters Internals I-38

History Overview

- OPS 6 was not in production and was available only on limited platforms.
- OPS 7 was platform generic, relying on an external DLM.
- OPS 8 had the Integrated Distributed Lock Manager.
- OPS 8i had Cache Fusion Stage 1.
- RAC 9i has Cache Fusion Stage 2.
- The database layout for different versions has not changed.

History Overview
Some components have undergone changes in scope and name. The system that ensures
that access to a block is coherent is the Global Cache Manager in Oracle9i. In Oracle8i and
Oracle8, this was the Integrated Distributed Lock Manager. Earlier it was an external
operating-system-supplied service that the Oracle processes called. The Cluster Group
Service of Oracle9i and Oracle8i was the Group Membership Services module in Oracle8
and (before that) part of the external Distributed Lock Manager.
Although there have been many changes to the architecture in the instance, the database
structure has changed only marginally. Separate redo threads and undo spaces are still
used.

DSI408: Real Application Clusters Internals I-39

Internalizing Components

[Diagram: In Oracle7, the RDBMS calls a DLM API (simulated callbacks,
enqueue translation); the DLM, CM, and operating system are external, and no
local state is kept in the instance. In Oracle8, the IDLM (callbacks,
enqueues) is inside the RDBMS and keeps local state in SGA memory; only the
CM and operating system remain external.]

Internalizing Components
The development of RAC has internalized more operating system components for each
version. As an example, the diagram on the slide shows the internalization of the
Distributed Lock Manager (DLM) in the development of Oracle7 to Oracle8. Instead of
calling the external operating system whenever any lock status needed checking by the
DLM API module, the IDLM module in the Oracle server only needs to examine its SGA.
The RDBMS routines did not in principle need to reflect the change.
The earlier versions had the DLM external, which limited the functionality (lowest
common denominator effect) that the Oracle server could rely on, and the need to pass
data to external services. Data transfer used pipes or network communication to the
external processes; control for I/O completion used Asynchronous Trap (AST)
mechanisms, polling mechanisms, or blocked waits. Internal communication inside the
Oracle server (even between the various background processes) can use the common
SGA memory area that includes latches and enqueues.
This is merely illustrative and is not an accurate summary of the changes made.
The Oracle8 to Oracle9i development similarly internalized the GMS interface (that is, the
Node Monitor (NM) functionality), relying on only the Cluster Manager (CM) interface
routines.
DSI408: Real Application Clusters Internals I-40

Oracle7

The differences between a non-OPS server and an OPS-enabled Oracle server
were few:
- Database structure changes
  - Separate redo per instance
  - Separate undo per instance
- Addition of the LCK process in the instance

Oracle7
OPS in Oracle7 consisted of the database structural changes for cluster operation (as in all
versions) and the addition of the LCK process that communicated with the external DLM.
The instances not only coordinated global cache coherency through the DLM but also used
the DLM as the communication channel for registering into the OPS cluster.
The method for sending the SCN or other messages was platform specific.
External DLM
The external DLM usage had the following characteristics:
- It had to be running before any instance started.
- Resources and locks had to be adequately configured.
- Death of the DLM on a node implied death of all its clients on the node.
- OPS/DLM diagnostics had to have port-specific lock dumps.
- Internode parallel query code had to be port specific.

DSI408: Real Application Clusters Internals I-41

Oracle8

First stage in internalizing cluster communications:
- Oracle's own lock manager in the Oracle server
- New communication path for clusterwide messages
- New background processes LMD and LMON
- Cluster state communication through the external Group Membership Service
  (GMS)

Oracle8
The internal DLM meant that resource allocation was inside the Oracle server. Diagnostic
lock dumps no longer needed to be port specific. The Oracle server, version 8 (and later),
started communicating with the cluster services of the operating system. The interface
consisted of the GMS, an Oracle-specified API. The GMS functionality included:
- Supplying each instance with the current set of registered members,
  clusterwide
- Notifying other members when a member joins or leaves
- Automatically deregistering dead processes/instances from their groups
- Interfacing with the node monitor for cluster events

DSI408: Real Application Clusters Internals I-42

Oracle8i

- Cache Fusion Stage 1
  - Read/write blocks sent via the interconnect and not through the disk
  - CR server process BSP
- More cluster communication functions as part of Oracle server code
  - GMS functionality split into Cluster Group Services (CGS) and Node
    Monitor (NM) in skgxn-v2
  - Lock Manager structures in the shared pool

Oracle8i
The Cache Fusion Stage 1 satisfied some types of block requests across the cluster
communication paths (rather than via disk) and made use of the messaging services.
The Oracle8 GMS has been split into OSD and Oracle kernel components. Node monitor
OSD skgxn is extended from monitoring a single client per node to arbitrarily named
process groups. The rest of the GMS functionality is moved into Oracle as CGS. A
distributed name service is added to CGS.
LMON executes most of the CGS functionality:
Joins the skgxn process group representing the instances of the specified group
Connects to other members and performs synchronization to ensure that all of them
have the same view of group membership

DSI408: Real Application Clusters Internals I-43

Oracle9i

- Cache Fusion Stage 2
  - Write/write blocks handled concurrently
  - GCS and GES instead of IDLM
- Enhanced instance availability
  - Instance Member Reconfiguration (IMR)
  - New recovery features
- Enhanced messaging for inter-instance communication

Oracle9i
The remainder of this course is based on Oracle9i.

DSI408: Real Application Clusters Internals I-44

Summary

In this lesson, you should have learned how to:
- Determine whether to use RAC in application design
- Describe RAC improvements over its predecessor

DSI408: Real Application Clusters Internals I-45

Introduction to RAC Internals


Objectives

After completing this lesson, you should be able to do the following:
- Outline the RAC architecture with internal references
- Relate the RAC-related modules to the Oracle code stack

DSI408: Real Application Clusters Internals I-47

Simple RAC Diagram

[Diagram: three nodes, each running an instance (SGA, processes), connected
by a high-speed interconnect and sharing a cluster disk/file system.]

Simple RAC Diagram


The node contains more than just the instance. It includes the operating system, network
stacks for various protocols, disk software, and a number of Oracle noninstance processes:
Listener, Intelligent Agent, and the foreground/shadow server processes.
The instance has its usual complement of background processes (more so with the RAC
configuration). They connect to the disk system, the network, and the high-speed
interconnect.
The cluster disk or file system may be mirrored, RAID-based, SAN/Fiber-based, or JBOD
(just a bunch of disks). If it is a clusterwide file system, it can contain the Oracle home
code. The clusterwide disks can be host-managed (that is, the controller is part of the node)
but are serviced to the cluster and equivalent to clusterwide disks. Local disks are of little
interest to RAC but are used for noncommon files where the common disks are raw disks.
Note: There are some issues with node-specific files of the Intelligent Agent or password
file orapw when using a cluster file system. The solution varies with the platform and the
CFS that are used.

DSI408: Real Application Clusters Internals I-48

One RAC Instance

- SGA contains (but is not limited to): library, row, and buffer caches;
  Global Resource Directory
- RAC-related background processes: LMON, LMD, LMS, LCK, DIAG
- Other background processes: DBW0, PMON, LGWR, SMON, and so on; PQ, jobs,
  dispatchers and servers
- Foreground processes not shown; the CM runs on the node outside the
  instance

[Diagram: a node hosting one instance with its SGA and background processes,
alongside the Cluster Manager (CM).]

One RAC Instance


This is the traditional view of an instance and its background processes. All processes are,
however, the same program (oracle.exe or oracle), just instantiated with different
startup parameters (see source opirip and WebIV Note:33174.1). On Windows, this is
more apparent; there is clearly only one Oracle process showing in the Task Manager, but
with a number of threads.
All caches in the SGA are either global and must be coherent across all instances, or they
are local. The library, row (also called dictionary), and buffer caches are global. The large
and Java pool buffers are local. For RAC, the Global Resource Directory is global in itself
and also used to control the coherency.
The LMON process communicates with its partner process on the remote nodes. Other
processes may have message exchanges with peer processes on the other nodes (for
example, PQ). The LMS and LMD processes, for example, may directly receive requests
from remote processes.
The Cluster Monitor (CM) system communicates with the other CMs on other nodes and is
not part of the Oracle RAC instance. But it is a necessary component.

DSI408: Real Application Clusters Internals I-49

Internal RAC Instance

- kqlm: library cache (fusion)
- kqr: dictionary/row cache
- kcl: buffer cache
- ksi: instance locks
- kjb: Global Cache Service
- kju: Global Enqueue Service
- CGS (kjxg): Cluster Group Services
- NM (skgxn v2): Node Monitor
- IPC (skgxp): interprocess communication

[Diagram: the instance code stack: kql/kqr/kqlm/ksi/kcl on top of the
GCS (kjb) and GES (kju), which sit on CGS (kjxg) and NM (skgxn v2), with
IPC (skgxp) alongside; the NM layer talks to the external CM.]

Internal RAC Instance


This is an internal view of some of the instance code stack and the RAC-relevant sections
and modules.
The NM layer is the communication layer to the CM. The IPC services facilitate other
process-to-process communication between different instances.
The CGS maintains the state of the RAC cluster, knowing which instances are in the
cluster and which are not. Contrast this with node availability.
The GRD is the data structure that stores Global Enqueue and Global Cache objects; it is
aware of every clusterwide resource. Resources are typically a buffer element, like a data
buffer, or a data file, but can also be abstract entities, such as an enqueue or NM resource.
The three buffer caches are used by the various user foreground processes by calling
handling routines (kqlm, kqr, kcl) for allocation, deallocation, and locking. The
handling routines maintain coherency by using kcl. The data buffer cache is the sole user
of the GCS.
Note: Other skg-interfaces, such as skgfr (disk I/O), are not shown.

DSI408: Real Application Clusters Internals I-50

Oracle Code Stack

OCI   Oracle Call Interface
UPI   User Program Interface
OPI   Oracle Program Interface
KK    Kernel Compilation Layer
KX    Kernel Execution Layer
K2    Kernel Distributed Execution Layer
NPI   Network Program Interface
KZ    Kernel Security Layer
KQ    Kernel Query Layer
RPI   Recursive Program Interface
KA    Kernel Access Layer
KD    Kernel Data Layer
KT    Kernel Transaction Layer
KC    Kernel Cache Layer
KS    Kernel Services Layer
KJ    Kernel Lock Management Layer
KG    Kernel Generic Layer
S     Operating System Dependencies

Oracle Code Stack


The first few characters of the routine and structure names indicate which layer in the code
stack they come from.
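The prefix-to-layer convention can be sketched as a longest-prefix lookup. The mapping data comes from the table above; the function itself is our own illustration, not an Oracle utility:

```python
# Prefix-to-layer table, transcribed from the code stack slide.
LAYERS = {
    "oci": "Oracle Call Interface",
    "upi": "User Program Interface",
    "opi": "Oracle Program Interface",
    "npi": "Network Program Interface",
    "rpi": "Recursive Program Interface",
    "kk": "Kernel Compilation Layer",
    "kx": "Kernel Execution Layer",
    "k2": "Kernel Distributed Execution Layer",
    "kz": "Kernel Security Layer",
    "kq": "Kernel Query Layer",
    "ka": "Kernel Access Layer",
    "kd": "Kernel Data Layer",
    "kt": "Kernel Transaction Layer",
    "kc": "Kernel Cache Layer",
    "ks": "Kernel Services Layer",
    "kj": "Kernel Lock Management Layer",
    "kg": "Kernel Generic Layer",
    "s":  "Operating System Dependencies",
}

def layer_of(symbol):
    """Map a routine/structure name to its code-stack layer by trying the
    longest matching prefix first (3, then 2, then 1 characters)."""
    for length in (3, 2, 1):
        layer = LAYERS.get(symbol[:length].lower())
        if layer:
            return layer
    return "unknown"

# kcl... routines belong to the Kernel Cache Layer; skgxp to the OSD layer.
```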

DSI408: Real Application Clusters Internals I-51

RAC Component List

This course examines the following RAC component list:
- Cluster Layer and Cluster Manager (CM)
- Node Monitor (NM)
- Cluster Group Services (CGS)
- Global Cache Service and Global Enqueue Service (GCS and GES)
- Interprocess Communication (IPC)
- Cache Fusion in the GCS
- Cache Fusion Recovery

RAC Component List


This course examines the components listed in the slide. This is the stack, with the most
fundamental module listed first (with some exceptions).

DSI408: Real Application Clusters Internals I-52

Module Relation View

ORACLE

DLM (GRD)

GCS

GES

CGS/IMR

DRM/FR

IPC

KSXP

NM

SKGXN

SKGXP

2-53


Module Relation View


GCS: Global Cache Service, or PCM locks
GES: Global Enqueue Service, or non-PCM locks
DRM/FR: Dynamic Resource Mastering/Fast Reconfiguration. Only partially activated in
a standard Oracle9i Release 2 installation.
IMR: Instance Membership Recovery. LMON handles instance death and split brain (two
networks).
KSXP: Multiplexing service (multithreaded layer). Allows DLM to do a lazy send;
ksxp informs client after send is completed.
NM: Node Monitor. Instances joining and leaving the cluster
IPC: Interprocess Communication. There is usually a choice of underlying protocols to
use, depending on the platform and hardware. The default is UDP (lightweight; consumes
no permanent resources or connections); alternatives include memory-mapped I/O (an
enhanced IPC interface used by Cache Fusion) and port-based communication.
CGS: Cluster Group Service. Handles synchronizing the bitmap. Also a name service for
publishing and querying configuration data. CGS in Oracle9i is changed from earlier
versions to speed up reconfiguration.
DSI408: Real Application Clusters Internals I-53

Alternate Module Relation View

Client
code
PQ
kcl

ksq

KSXP SKGXP

ksi

DLM
CGS

2-54


DSI408: Real Application Clusters Internals I-54

Module, Code Stack, Process

2-55

The same code is present in all foreground and background processes.
Modules may be constrained to run in a specific process.


Module, Code Stack, Process


Although the running Oracle server consists of several processes (both foreground and
background), remember that this is the same program that runs in all processes. Processes
are limited to performing a set of functions, and thus some code is active in only some
processes. Thus there is no LMON program module, but some routines in the KJB source
modules have a comment stating that the function runs only in the LMON process. This can
be confusing when examining code in which one process calls another.
Cross process calls require a message or posting, and execution may have to wait until the
called process starts executing; in other words, a context switch must occur.
On the Windows platform, there is only one process. The various Oracle server processes
are implemented as threads inside this program.

DSI408: Real Application Clusters Internals I-55

Operating System Dependencies


(OSD)

Code that must be separate for each platform is


typically collected in OSD modules.
Generic version: Runs on development system
Reference version: Classic version ported to all
platforms
Platform version: Optimized and specialized;
several versions may exist.

2-56

OSD code is bracketed with #ifdef #endif in


some modules.


Operating System Dependencies (OSD)


This applies to many other Oracle server products or functions but is much more visible
with RAC.
If the platform dependency is small, it may be bracketed by the #ifdef #endif
construction; otherwise, a common routine is called in an OSD module, which is
appropriately rewritten for each platform. Such modules are generic. For example, refer to
the skgxnr.c module.
For some OSD modules, there may be more than one version. For example, the IPC
implementation has a number of protocols to be used. One OSD module with the same
interface is written for each protocol. Only one module is linked to the Oracle server, thus
deciding the IPC protocol to be used.
Where several implementations are possible, a reference module is constructed. This is
runnable on all platforms and is the lowest common denominator. It proves functionality
and is used to verify the correct functionality of the other, specialized versions of the
module. However, it may not actually be used in production.

DSI408: Real Application Clusters Internals I-56

Platform-Specific RAC
Higher layers
SQL, Transaction, Data

Cache KC*
Service KS*

GES and GCS KJ*


Generic Layer KG*
(common functions)
Platform Specific Code
OSD S*

These are kernel


routines, so the names
start with K.
Service routines start
with KS.
OSD routines start with
S or SS.
OSD code is written by
the porting groups.

Operating System
Routines

2-57


Platform-Specific RAC
Many RAC problems are platform specific. The Operating System Dependency (OSD)
layer therefore must be examined for the platform concerned. The subdirectory is called
sosd or osds.
This cannot be examined in TAO with cscope; you need the vobs access.
OSD code is partially available at
/export/home/ssupport/920/rdbms/src/server/osds.

DSI408: Real Application Clusters Internals I-57

OSD Module: Example

SKGXP
2

U
D
P

T
C
P

H
M
P

SKGXP
module,
3 alternative
versions

3
5
4

OS routines

2-58

skgxp.h
Generic interface
skgxp.c
Reference
implementation
sskgxpu.c
UDP implementation,
port-specific
sskgxph.c
HMP implementation,
port specific (HP-UX)


OSD Module: Example


A module that needs to call the operating system must be port specific. Calling an I/O
routine may vary in name, arguments, and other particulars between platforms, even
though they give the same functionality.
The skgxp module has an official upward API (1). Internally, there are some common
functions and one way of achieving the necessary communication function of the SKGXP.
The UDP option, for example, performs the required OS-related calls through the OS API
(3) that send, receive, check status, and so on, by using UDP packets. It also possibly has
some code to hide or simulate functions so that the common set (2) is maintained. The
functions are similar for the other protocol options.
The reference implementation is made to compile and work on all platforms, but the whole
module is additionally rewritten by most platform groups. As explained previously, a
platform group makes several versions by using different protocols. This is selected at link
time by using the appropriate library. The HMP module, shown in this example, is only
available on the HP platform.

DSI408: Real Application Clusters Internals I-58

OSD Module: Example (continued)


Dependencies on the OSD Module
For the skgxp module, some OSD variants have additional interfaces callable from
higher modules. The kcl module, for example, can call for a special memory map pointer
for the HMP protocol. Higher levels in the stack have #ifdef #endif bracketed calls
to the extended sskgxph.

DSI408: Real Application Clusters Internals I-59

Summary

In this lesson, you should have learned about the:


RAC architecture outline with internal references
Relationship between the RAC-related modules
and the Oracle code stack

2-60


DSI408: Real Application Clusters Internals I-60

References

Main sources for general RAC information:


RAC Web site
http://rac.us.oracle.com:7778

RAC Pack repository on OFO


http://files.oraclecorp.com/content/AllPublic/
Workspaces/RAC%20Pack-Public/

WebIV
Check folder Server.HA.RAC

2-61


DSI408: Real Application Clusters Internals I-61

Cluster Layer

Cluster Monitor


Objectives

After completing this lesson, you should be able to:


Describe the generic Cluster Manager (CM)
functionality
Outline the interaction between CM and RAC
cluster layers

3-63


DSI408: Real Application Clusters Internals I-63

RAC and Cluster Software


Node
Instance

Caches

ksi/ksq/kcl
GRD
CGS
NM

I
P
C

Other
nodes
(not
shown)

CM

3-64


Cluster Layer in RAC


The cluster layer is not part of the RAC instance. The Cluster Manager (CM) is part of the
cluster layer.
It has its own communication path with the peer cluster software on other nodes. It can
determine the status of other nodes in the cluster but does not maintain any consistent view.
Most of the synchronization and consistency is handled in the Node Monitor (NM).

DSI408: Real Application Clusters Internals I-64

Generic CM Functionality:
Distributed Architecture

3-65

Local cluster manager daemons


All daemons make up the Cluster Manager
One daemon elected as master node


Generic CM Functionality: Distributed Architecture


Every node in the cluster must have one or more local CM daemons running. The set of all CM
daemons makes up the Cluster Manager. The CM daemons on all nodes communicate with
one another. The CM daemons on all nodes may elect a master node, which is responsible
for managing cluster state transitions.
Upon communication failure, the remaining CM daemons form a new cluster using an
established protocol and re-elect a master if necessary.
The CM and the RAC cluster are distinct entities acting as physically distinct services. The
CM is responsible for cluster consistency. The CM detects and manages cluster state
transitions. The CM coordinates RAC cluster recovery brought about by cluster state
transitions.

DSI408: Real Application Clusters Internals I-65

Generic CM Functionality:
Cluster State

3-66

State change
Cluster Incarnation Number
Cluster Membership List
IDLM Membership List


Generic CM Functionality: Cluster State


A cluster is said to change state when one or more nodes join or leave the cluster. This
transition is complete when the cluster moves from a previous stable configuration to a
new one. Each stable configuration is identified by a number called the cluster incarnation
number. Every state change in the cluster monotonically increases the cluster incarnation
number.
The set of all nodes in a cluster form a cluster membership list. The set of all nodes in the
cluster where the RAC IDLM is running form an IDLM membership list. Every node in a
cluster is identified by a node-ID provided by the CM, which remains unchanged during
the lifetime of a cluster. The IDLM uses this node-ID to identify and distinguish between
members in the IDLM membership list.

DSI408: Real Application Clusters Internals I-66

Generic CM Functionality:
Node Failure Detection

3-67

Node failure detection


Communication failure detection


Generic CM Functionality: Node Failure Detection


To ensure the integrity of the cluster, the CM must detect node failures. The RAC cluster
may suspect node failure (for example, a communication failure with a node), in which case it may:
Freeze activity and expect a message from the CM to start reconfiguration
Inform the CM of an error condition and await reconfiguration notification after a
new stable cluster state is established
If the CM and RAC cluster are to detect the same communication failures, the CM should
monitor cluster health on the same physical circuit used by the RAC cluster (for example,
on HP, use of HMP). Performance considerations may require the CM and RAC cluster to
use separate virtual circuits.
If the CM and RAC cluster are using separate physical circuits, the CM should be aware of
the RAC cluster's physical circuit and monitor for cluster health via the same circuit. The
CM may provide for physical circuit redundancy for failover and performance.
RAC cluster reconfiguration begins after the cluster has reached a new stable state. The
CM must be able to handle nested state transitions and communicate these state
changes to the RAC cluster.
Nested cluster transitions interrupt any in-process RAC cluster reconfiguration.
DSI408: Real Application Clusters Internals I-67

Cluster Layer and Cluster Manager

Node
Instance

NM

RAC cluster registers the


instance in the CM.
Primarily the LMON
process
Secondarily other I/O-capable processes (DBWR, PQ-slaves, and so on)
Obtains Node-ID from
cluster

CM

3-68


Cluster Layer and Cluster Manager


The Cluster Manager is a vendor- or Oracle-provided facility to communicate between all
the nodes in the cluster about node state. The CM uses a different protocol or channel. It
uses heartbeat and sanity checks to validate node status. The RAC processes communicate
directly with each other, but the CM is not the communication channel.
CM is used to monitor the node health, detect the failure of a node, and manage the node
membership in the cluster.
The CM handles nodes, not instances.
Registration and I/O Stop
To cope with the various failure scenarios, such as process termination and broken
communication, several RAC processes use the SKGXN to register in the CM. This is
described in more detail in the lesson titled Cluster Group Services and Node Monitor.

DSI408: Real Application Clusters Internals I-68

Oracle-Supplied CM

3-69

For the Linux and Windows platforms, the CM


software component is part of the Oracle
distribution.
RAC high availability extension functionality
makes use of the CM.


Oracle-Supplied CM
The Oracle-supplied CM is covered in the Linux platform lesson later in this course.
With the Oracle-supplied CM, the integration with the RAC cluster is somewhat closer,
which blurs the distinction.

DSI408: Real Application Clusters Internals I-69

Summary

In this lesson, you should have learned how to


Describe the generic Cluster Manager (CM)
functionality
Outline the interaction between CM and RAC
cluster layers

3-70


DSI408: Real Application Clusters Internals I-70

Cluster Group Services


and Node Monitor


Objectives

After completing this lesson, you should be able to do


the following:
Describe the functionality of the cluster
configuration components
Node Monitor
Cluster Group Services

4-73

Identify the function of cluster configuration


components in dumps and traces


DSI408: Real Application Clusters Internals I-73

RAC and CGS/GMS and NM


Node
Instance

NM: Node Monitor


CGS: Cluster Group
Services

Caches

ksi/ksq/kcl
GRD
CGS/GMS
NM

I
P
C

Other
nodes
(not
shown)

CM

4-74


RAC and CGS/GMS and NM


In Oracle8, the Group Membership Service (GMS) was used. This has changed, and the
functionality is now in the CGS and NM layer.

DSI408: Real Application Clusters Internals I-74

Node Monitor (NM)

Provides node membership information


Notifies clients of any change in membership
status
Members joining
Members leaving

4-75

Provides query facility for management tools


Source reference skgxn.c


Node Monitor (NM)


The Node Monitor provides the interface to other modules for determining cluster
resources status, that is, node membership. It obtains the status of a cluster resource from
the Cluster Manager for remote nodes and provides the status of cluster resources of the
local node to the Cluster Manager.
skgxn has a passive interface; group events are delivered through constant polling by
clients.
The core of skgxn is the Distributed Process Group facility.

DSI408: Real Application Clusters Internals I-75

RDBMS SKGXN Membership

4-76

The purpose of membership is to determine which


instances are in the RAC cluster.
Nonmembers must not access the common
database files.


Group Membership
A process can register with a group on behalf of an instance that includes multiple
processes. It is important that, when the member deregisters from the group, the other
instance processes do not access the shared cluster resources (such as shared disk) after the
remaining group members have been informed of the deregistration. Otherwise, the
deregistered instance may overwrite changes that are made by the surviving instances.
To protect against this situation, the processes of an instance can share the membership of
the process that is registered with the group. These processes register as slave members,
specifying the member ID of the member that registered as a normal (primary) member.
The deregistration of the primary member must not be propagated to the group's other
primary members until all the associated slave members have also deregistered.

DSI408: Real Application Clusters Internals I-76

NM Groups

A process group is a named clusterwide resource.


Processes throughout the cluster register with the
process group by sending their:
IPC port ID and other bootstrap information
Node name and other information for administrative
tools to use

A process can register either as a primary member


or a slave member.
Primary member: Registers on behalf of the
instance
Slave member: Registers using the primary
member's member ID

4-77


NM Groups
On registration, a process provides:
Private member data that can be retrieved only by other members and that consists of
IPC port ID and other bootstrap information
Public member data that can be retrieved by any skgxn client and that consists of
node name and other information for administrative tools to use
Primary members should ensure that all slaves are terminated on deregistering from the
group. Failure to do so is a bug or malfunction. LMON is the primary member of an
instance.
Slave members are all I/O-capable clients.

DSI408: Real Application Clusters Internals I-77

NM Internals

NM interface (skgxn.v2) is the OSD interface for


Generic Node/Process Monitor.
skgxncin: Defines an OSD context and returns a
handle
skgxnreg: Registers with process group as
primary member (LMON)
skgxnsrg: Registers with process group as slave
member
skgxnpstat: Polls/waits for process group status

4-78

skgxn is a passive interface; group events are


delivered through constant polling by clients.


NM Internals
Source Notes
The basic concept is Process Groups. This implementation relies on the UNIX Distributed
Lock Manager (UDLM) architecture, the same as the first version of the DLM that was
external.
skgxnpstat receives group membership changes. The client must call it to get group
changes (passive). The interface itself does not call back; the caller must check the state bit
in the context.
The process must call skgxnpstat to receive any state event changes. An example is
skgxpwait in IPC to receive an event such as I/O completion.
These routines are normally part of a daemon loop (LMON).

DSI408: Real Application Clusters Internals I-78

Node Membership

Each node index is represented by a bit.


0=down, 1=up
When nodes join or depart, the bitmap is rebuilt
and communicated to all members.
The bitmap is stored globally in the cluster with
a resource name of type MM or MN.
skgxnmap
1

4-79


Node Membership
The bitmap is stored globally in the cluster by using the UDLM as a global repository to
store global information, and uses the global notification mechanism of the UDLM. The
global repository stores the bitmap.
The UDLM reserves a storage space for each resource in a Resource Value Block (RVB).
That space is limited to 16 bytes. Multiple resources and RVBs can be used for large
clusters. These are stored in persistent resources. Persistent resources survive crashes and
are recoverable. They are stored in the UDLM space struct kjurvb (see kjuser.h
for more information).

DSI408: Real Application Clusters Internals I-79

Node Membership (continued)


The resource names used by the Solaris reference SKGXN have the format res[0] =
opcode, res[1-*] = "type<grpname>". These are not exposed through
V$LOCK but can be dumped through lkdebug. The types are MM, MN, and MP. For
instance, the reference SKGXN uses two resources to represent a group bitmap and has the
names:
MAP1 : (0x0 "MM<grpname>")
MAP2 : (0x1 "MM<grpname>")
Some of the important resource names are:
JOIN : (0x00000042 "MM<grpname>")
SYNC1 : (0x00000040 "MM<grpname>")
SYNC2 : (0x00000041 "MM<grpname>")
Member : (memno "MN<grpname>")
Private Data : (memno+(n*256) "MN<grpname>")
Public Data : (memno+(n*256) "MP<grpname>")

DSI408: Real Application Clusters Internals I-80

Instance Membership Changes

1. Registration at startup or deregistration at


shutdown
2. Bitmap updated
3. Reread of bitmaps
4. Propagate change to CGS (not shown)
Instance 1

Instance 2

Instance n

LMD0
1
LMON/NM
2

LMON/NM
3

LMON/NM
3

Cluster Layer - CM

4-81


Node Membership Changes


skgxncin is called to initialize or join a cluster when mounting the database. This
initializes a context and calls skgxnreg to register as primary with a process group
(slaves call skgxnsrg). This translates at the lower layers as registration to a particular
group. NM reads the existing bitmap to identify the members, then locates the index to
where the joining node should be, and turns that bit on (a zero-to-one transition). The bitmap
is then invalidated. The bitmap itself is valid; it is the status of it that has changed. This
state change forces a reread.
The only way for existing members to know whether a member has joined or left is when
the bitmap is invalidated (skgxn_mapinv), that is, marked dubious.
At startup or shutdown, several iterations of reading/set bitmap/invalidate are made,
setting state fields as invalid or dubious in the RVB (see skgxnbc bitmap operations
where skgxnbcINV = 4, Invalidate map).
A member joins and invalidates the status flag of the bitmap. The other members see this
event change in their skgxnpstat call. Periodically, NM calls skgxnpstat to read
the bitmap. If the bitmap is invalid, then it initiates reconfiguration, and group membership
has changed via skgxngeRCFG. Reconfiguration at this layer means rebuilding the
bitmap. The status calls are in the LMON loop.
DSI408: Real Application Clusters Internals I-81

NM Membership Changes (continued)


Note: Rebuilding the entire bitmap may involve nested joins or deletes. The NM should
be able to handle this.
In 8.1.7, the DLM can detect reconfiguration because it can now talk to the NM API
directly. In release 8.0, you went through the GMS. The DLM detects reconfiguration
by calling status from NM. The DLM then calls CGS to do incarnation/synchronization.
In 8.0, it was hierarchical NM->GMS->DLM->RDBMS.
This node membership bitmap model is the referenced implementation and has not
changed in Oracle9i.
When you register, you tell NM what group you register to. The NM is responsible for
tracking your membership. The way it keeps track is through UDLM resources (MN, MM).
These resources are global and persistent. Through these resources, the NM can pull up
the right bitmap (in the event that there are multiple databases on the cluster).
Note: On platforms that do not use UDLM (such as Tru64), they may use the same
reference implementation but call out to a different DLM.
Each member maintains a lock on a member's resource. If a node exits, then this
resource becomes invalid.
If an instance is alive, then there is a holder of the MM resource. If the instance exits,
then there is no holder. That is how the NM knows when an instance joins or exits the
group. This may be a lengthy process as the NM goes and checks the MM resource to
see whether it is dubious or not.
After the entire bitmap is rebuilt, it sends an event upward NM->CGS->DLM with a
new bitmap. CGS synchronizes the bitmap as there are transient operations active. CGS
must synchronize the cluster to be sure that all members get the reconfiguration event
and that they all see the same bitmap.
This is how the NM API is implemented on Solaris. MN/MM resources are only used for
the Node Monitor. On Solaris, the UDLM is part of the cluster software. It should not be
confused with the IDLM, which is part of the Oracle kernel.

DSI408: Real Application Clusters Internals I-82

NM Membership Death

4-83


NM Membership Death
Given a bitmap composed of eight nodes, all of which are up, skgxnpstatus is called.
This call in turn calls skgxn_neighbor to determine the right-side neighbor and
skgxn_test_member_alive to determine that neighbor's status, rather than scanning
the entire bitmap. This avoids having all nodes read the entire bitmap. The read is
protected; invalidating the bitmap requires a write lock.
Note: Reconfiguration may not happen simultaneously in all nodes. This is why the CGS
layer above must do the synchronization.

DSI408: Real Application Clusters Internals I-83

Starting an Instance: Traditional

Instance A runs, and


instance B starts:
1. B registers
2. Notification
3. Reconfiguration

Instance A
2
LMD0
LMON
3

CM
Instance B
LMD0

LMON

4-84


Starting an Instance: Traditional


Assume that instance A is running and instance B starts and joins the RAC cluster:
1. Instance B registers with the CM.
2. The CM notifies all instances that the cluster has changed.
3. The instances adjust themselves. This involves reconfiguration of the Cluster Group
Services, which in turn reconfigures the Global Resource Manager.

DSI408: Real Application Clusters Internals I-84

Starting an Instance: Internal

LMON trace, Instance A

Instance A

*** 2002-08-23 17:40:04.496


kjxgmpoll reconfig bitmap: 0 1
*** 2002-08-23 17:40:04.497
kjxgmrcfg: Reconfiguration
started, reason 1

CGS/GMS
NM
Communication via CM
Instance B
CGS/GMS
NM

4-85

Alert log, Instance B


Fri Aug 23 17:40:04 2002
ALTER DATABASE
MOUNT
Fri Aug 23 17:40:04 2002
lmon registered with NM instance id 2 (internal mem no
1)


Starting an Instance: Internal


The node up/down state is communicated via the CM and is not shown.
The NM of instance B registration is with the CM, and thus communicated to the NM of
instance A. The NM in both nodes could communicate via their own IPC link, but
registration is done via the CM, because the instances do not know of each other's
existence before they are running.

DSI408: Real Application Clusters Internals I-85

Stopping an Instance: Traditional

Both instances run,


and instance B stops:
1. Deregistration
2. Notification
3. Reconfiguration

Instance A
2
LMD0
LMON
3
CM
Instance B
LMD0

LMON

4-86


Stopping an Instance: Traditional


Assume that both instances are running and that instance B stops in an orderly manner:
1. Instance B deregisters.
2. The CM sends the notification to the other registered members.
3. The instances adjust themselves. This involves reconfiguration of the Cluster Group
Services, which in turn reconfigures the Global Resource Manager.

DSI408: Real Application Clusters Internals I-86

Stopping an Instance: Internal

LMON trace, Instance A


*** 2002-08-23 17:45:04.596
kjxgmpoll reconfig bitmap: 0
*** 2002-08-23 17:45:04.597
kjxgmrcfg: Reconfiguration
started, reason 1

Instance A
CGS/GMS
NM
Communication via CM
Instance B
CGS/GMS
NM

4-87


Stopping an Instance: Internal


The NM and CM layers cannot detect a sudden instance death. This situation is handled by
the IMR in CGS.

DSI408: Real Application Clusters Internals I-87

NM Trace and Debug

Event 29718 traces calls to the CGS.

4-88


NM Trace and Debug


More details are covered in the debug section.

DSI408: Real Application Clusters Internals I-88

Cluster Group Services (CGS)

Provides a reliable and synchronized view of


cluster instance membership
CGS checks membership validity regularly.
Reconfiguration is stable.
Split-brain scenarios are avoided.

4-89

Distributed Repository is used by GES/GCS.


The CGS functionality is executed by LMON.


Cluster Group Services (CGS)


The complexity of CGS lies in producing the reliable cluster view of member instances.
Apart from checking that members remain valid, it also requires a stable reconfiguration
algorithm and detection of split-brain situations, where the communication path between
nodes is lost.
The distributed repository infrastructure provides:
Process Group Service with skgxn
Synchronization Service
Name Service
Stores IPC port IDs and other data needed for inter-instance communication

DSI408: Real Application Clusters Internals I-89

Configuration Control

Each configuration has an incarnation number.


Incremented with each change in group
membership
Must be the same in all instances when
configuration is complete

For reconfiguration, an additional substate


number identifies the step within the recovery
sequence.
Set to 0
Incremented after each synchronization

4-90


Configuration Control
In the CGS, the most important data values are the incarnation number and the synchronization state.

DSI408: Real Application Clusters Internals I-90

Valid Members

Valid members must:


Perceive themselves to be part of the same
database group (no split-brain group)
Be able to communicate among themselves
(LMON and GES/GCS channels)
Be able to update the control file periodically
Failure of these requirements activates the Instance
Membership Reconfiguration (IMR).

4-91


Valid Members
The CGS checks whether members in the database group are valid. It ensures that all
members are operating on the same configuration.
All members vote, detailing which incarnation they are voting on and a bitmap of
membership as they perceive it to be.
The member that tallies the votes waits for all members of the last incarnation to register
that they have received the reconfiguration.
Instance Membership Reconfiguration (IMR)
This is a component part of the CGS layer. Source is in kjxgr.h and kjxgr.c.

DSI408: Real Application Clusters Internals I-91

Instance Membership Reconfiguration (IMR) (continued)


From kjxgr.h - OPS Instance Membership Recovery Facility
DESCRIPTION
The IMR facility is intended to provide a means for
verifying the connectivity of instances of a database group
and to expedite the removal from the group of members that
are not connected, thus removing potential impediments to
database performance. Connectivity, in this context, can
take a variety of forms. The three forms of connectivity
that are monitored by the initial implementation of this
facility are:
skgxn (group membership)
skgxp (IPC)
disk (database files)
When an instance is considered to be not connected along any
of the monitored channels, the remaining instances perform a
reconfiguration of the group membership. During this
reconfiguration, the instances vote on what they perceive
the membership to be and a single arbiter assesses the votes
and attempts to arrive at an optimal subnetwork, which is
then published as the voting results. Both the voting and
the results publishing are done via the control file. All
members read the results and, based on the bitmap of the
membership in the results, either commit suicide or continue
with the reconfiguration.
This facility also expedites recovery by initiating recovery
upon detection of a loss of connectivity and offers an
opportunity to guarantee database integrity by providing
well-defined points in the recovery process to fence a
member that is perceived to be disconnected. At initial
implementation, however, the facility only guarantees
integrity to the extent that the skgxn group membership
facility guarantees integrity. In other words, the
membership as determined via the voting is synchronized with
the skgxn membership as a means of ensuring that I/O is not
generated by departing members. A more aggressive approach
would be to fence I/O of the departing members, thereby
allowing reconfiguration to complete with a guarantee of
database integrity even while the instance remains active.

DSI408: Real Application Clusters Internals I-92

Instance Membership Reconfiguration (IMR) (continued)


This facility also offers the feature of a limited guarantee
of database integrity by allowing a periodic check of
membership that can be employed to limit the potential for
I/O generation after a hang.
This facility may also be employed to propagate arbitrary
reconfiguration requests to the database group membership.
In the initial implementation, the only reconfiguration
events propagated through this facility are for
communications errors and for a detected member death.

DSI408: Real Application Clusters Internals I-93

Membership Validation

Instance A
Instance B

LMON
CGS (IMR)

Instance C

LMON

CKPT

CGS (IMR)
CKPT

LMON
CGS (IMR)
CKPT

CKPT writes to the control file every three seconds.

Control file
4-94

Copyright 2003, Oracle. All rights reserved.

Membership Validation
The CKPT process updates the control file every three seconds, an operation known as the
heartbeat. CKPT writes into a single block that is unique for each instance; thus no
coordination between instances is required. This block or record is called the checkpoint
progress record and is handled specially. The MAXINSTANCES clause of CREATE DATABASE
controls the number of these records. The heartbeat also occurs in single
instance mode.
LMON sends messages to the other LMON processes. If the send fails or no message is
received within the timeout, then reconfiguration is triggered. The LMON message send
failure detection is controlled by _cgs_send_timeout. The default value is 300
seconds.
Control file update failure is controlled by _controlfile_enqueue_timeout. The
default value is 900 seconds.
Reducing these values could cause false failure detection under heavy load. Using values
that are too large could cause hang-like conditions, where a bad instance member remains
undetected.
Note: Although the description is of a process doing a particular job, the code is part of the
CGS layer.
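The staleness check described above can be sketched in a few lines. This is an illustrative model only (the real logic lives in kjxgr.c); the function name and heartbeat dictionary are invented, while the 300-second default comes from _cgs_send_timeout.

```python
# Toy model of IMR heartbeat checking. CKPT stamps each instance's
# checkpoint progress record every 3 seconds; a member whose stamp is
# older than the timeout is a candidate for reconfiguration.
CGS_SEND_TIMEOUT = 300  # seconds, default of _cgs_send_timeout

def stale_members(last_heartbeat, now, timeout=CGS_SEND_TIMEOUT):
    """Return instance numbers whose heartbeat is older than `timeout`."""
    return sorted(i for i, t in last_heartbeat.items() if now - t > timeout)

# Instance 2 stopped heartbeating ~6 minutes ago; 0 and 1 are current.
print(stale_members({0: 1000.0, 1: 999.0, 2: 650.0}, now=1002.0))
```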
DSI408: Real Application Clusters Internals I-94

Membership Invalidation

Members are evicted if:


A communications link is down
There is a split-brain (more than one subgroup) and
the member is not in the largest subgroup
The member is perceived to be inactive

4-95

An IMR initiated protocol results in an eviction


message: ORA-29740.
Vendor clusterware may perform node eviction.


Membership Invalidation
IMR-initiated eviction of a member is not performed if a group membership change occurs
before the eviction can be executed.
Deciding the Membership
All members attempt to obtain a lock on a control file record (the Result Record) for
updating. The instance that obtains the lock tallies the votes from all members.
The group membership must conform to the decided membership before allowing the
GCS/GES reconfiguration to proceed; a skgxn reconfiguration with the correct
membership must be observed.
Vendor Clusterware
Vendor clusterware may also perform node evictions in the event of a cluster split-brain.
IMR detects a possible split-brain and waits for the vendor clusterware to resolve the
split-brain. If the vendor clusterware does not resolve the split-brain within
_IMR_SPLITBRAIN_RES_WAIT (default value of 600 milliseconds), then IMR
proceeds with evictions.
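The voting step can be illustrated with a small sketch: each member votes with the set of members it can see, and the arbiter keeps the largest subgroup, favoring the one that contains the lowest-numbered member on a tie. This is only a model of the "optimal subnetwork" idea, not the kjxgr.c algorithm; all names are invented.

```python
def resolve_split_brain(votes):
    """votes maps member number -> set of members it can communicate with
    (including itself). Members with the same view form a subgroup; keep
    the largest one, breaking ties toward the subgroup containing the
    lowest-numbered member."""
    groups = {}
    for member, seen in votes.items():
        groups.setdefault(frozenset(seen), set()).add(member)
    return max(groups.values(), key=lambda g: (len(g), -min(g)))

# Instances 0 and 1 see each other; instance 2 is isolated.
print(sorted(resolve_split_brain({0: {0, 1}, 1: {0, 1}, 2: {2}})))
```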

DSI408: Real Application Clusters Internals I-95

Membership Invalidation (continued)


Bug#2209228 (fix 9.0.1.4/9.2) IMR RESOLVES SPLIT-BRAIN CONTRARY TO
CLUSTERWARE RESOLUTION
Bug#2401370 (fix 10.0) SPLIT-BRAIN WAIT IS NOT LONG ENOUGH.
Set _imr_splitbrain_res_wait in milliseconds; for example, for a 10-minute wait,
specify _imr_splitbrain_res_wait=600000.

DSI408: Real Application Clusters Internals I-96

CGS Reconfiguration Types

Group membership change


Initiated by skgxn
Caused by instance starting up or shutting down

Communications error
Initiated by IMR
Caused by communications error to either LMON or
GES/GCS

Detected member death


Initiated by IMR
Caused by instance failing to issue heartbeat to the
control file

4-97


CGS Reconfiguration Types


Note: The skgxn code is the NM.

DSI408: Real Application Clusters Internals I-97

CGS Reconfiguration Protocol

Reconfiguration is initiated by skgxn or IMR.

Six reconfiguration steps ensure that members


have the same view of the group.
One instance coordinates activities with all
instances for CGS steps.
GES/GCS reconfiguration starts when CGS
reconfiguration is complete.

4-98


CGS Reconfiguration Protocol


The reconfiguration is initiated when skgxn (that is, the NM) indicates a change in the
database group or the Instance Membership Recovery (IMR) detects a problem.
Reconfiguration is initially managed by the CGS, then the DLM (GES/GCS)
reconfiguration starts.
The coordinating instance is called the master node. This is usually known at the start of
reconfiguration, but if it is not known, one is nominated (typically the node that triggered
the reconfiguration). The master node hangs until all members send their reconfiguration
or incarnation acknowledgment. skgxnpstat should pick up the reconfiguration event.
CGS can handle nested reconfiguration events.
When the CGS reconfiguration steps are complete, the GES/GCS or IDLM reconfiguration
is started.

DSI408: Real Application Clusters Internals I-98

Reconfiguration Steps

Step 1:
a. Complete pending broadcast with RCFG status.
b. Freeze name service activity.
c. Freeze the lock database

Step 2:
a. Determine valid membership, Instance Membership
Recovery.
b. Synchronize incarnation.
c. Increment incarnation number.

Step 3:
Verify instance name uniqueness.

4-99


Reconfiguration Steps
LMON trace file excerpt
*** 2002-08-23 17:26:01.262
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 1 0.
*** 2002-08-23 17:26:01.266
Name Service frozen
kjxgmcs: Setting state to 1 1.
*** 2002-08-23 17:26:01.367
Obtained RR update lock for sequence 1, RR seq 1
*** 2002-08-23 17:26:01.370
Voting results, upd 0, seq 2, bitmap: 0 1
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 2 2.
Performed the unique instance identification check
kjxgmps: proposing substate 3

DSI408: Real Application Clusters Internals I-99

Reconfiguration Steps

Step 4:
Delete nonlocal name service entries.

Step 5:
a. Republish local name entries.
b. Resubmit pending requests.

Step 6:
a. Publish LMD processes IPC port-ids in the name
service.
b. Unfreeze name service.

Step 7:
Return reconfiguration RCFG event to GES/GCS.

4-100


Reconfiguration Steps (continued)


LMON trace file excerpt (continued)
kjxgmcs: Setting state to 2 3.
Name Service recovery started
Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 2 4.
Multicasted all local name entries for publish
Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 2 5.
Name Service normal
Name Service recovery done
*** 2002-08-23 17:26:01.378
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 2 6.
*** 2002-08-23 17:26:01.474
*** 2002-08-23 17:26:01.474
Reconfiguration started

DSI408: Real Application Clusters Internals I-100

IMR-Initiated Reconfiguration: Example


Broken
communication

Instance A
Instance B

LMON
CGS (IMR)

Instance C

LMON

CKPT

CGS (IMR)
CKPT

LMON
CGS (IMR)
CKPT

Control file
4-101


IMR-Initiated Reconfiguration: Example


The scenario is a broken communication link. Instance C is no longer sending or receiving
LMON messages but otherwise is working normally.
Control File Vote Result Records (CFVRR) contains:

seq#  inst#  bitmap
2     0      110
2     1      110
2     2      001

The CFVRR is stored in the same block as the heartbeat in the control file checkpoint
progress record (see kjxgr.c/h).
Alert log in Instance C
Errors in file
/export/oracle/app/admin/rac/bdump/rac2_lmon_10911.trc:

Instance C is evicted. Its bit does not show up in the other members' list of valid
members, so it must leave the cluster.
ORA-29740: evicted by member 0, group incarnation 3
LMON: terminating instance due to error 29740
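The suicide decision each member makes after reading the vote results can be sketched as a bitmap test. The helper below is hypothetical; the bitmap 0b011 mirrors the example above, where only instances 0 and 1 survive into the new incarnation.

```python
def must_leave(my_instance, result_bitmap):
    """True if this member's bit is absent from the published vote-result
    bitmap, in which case it terminates (as with ORA-29740 above)."""
    return not (result_bitmap >> my_instance) & 1

SURVIVORS = 0b011  # bits for instances 0 and 1; instance 2's bit is clear
print([i for i in range(3) if must_leave(i, SURVIVORS)])
```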

DSI408: Real Application Clusters Internals I-101

LMON Trace Instance A


Commented and slightly edited for brevity.
Failure is detected by LMON.
*** 2002-08-19 15:26:40.360
kjxgrcomerr: Communications reconfig: instance 1 (2,2)

IMR initiates CGS reconfiguration.


kjxgrrcfgchk: Initiating reconfig, reason 3
*** 2002-08-19 15:26:46.469
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 2 0.
*** 2002-08-19 15:26:46.473
Name Service frozen
kjxgmcs: Setting state to 2 1.

The instance which obtained the RR lock tallies the vote result from all nodes and
updates the CFVRR.
*** 2002-08-19 15:26:46.592
Obtained RR update lock for sequence 2, RR seq 2
*** 2002-08-19 15:27:29.198
kjxgfipccb: msg 0x80000001002babe8, mbo
0x80000001002babe0, type 22, ack 0, ref 0, stat 3
kjxgfipccb: Send timed out, stat 3 inst 1, type 22, tkt
(32144,0)
:
:
*** 2002-08-19 15:28:27.526
kjxgrrecp2: Waiting for split-brain resolution, upd 0,
seq 3
*** 2002-08-19 15:28:28.127
Voting results, upd 0, seq 3, bitmap: 0

The evicted instance is terminated.

CGS reconfiguration is proposed.

Evicting mem 1, stat 0x0007 err 0x0002


kjxgmps: proposing substate 2
kjxgmcs: Setting state to 3 2.
Performed the unique instance identification check
:
:
*** 2002-08-19 15:28:37.802

CGS/GES reconfiguration and instance recovery is started by the surviving instance.


Reconfiguration started
Synchronization timeout interval: 600 sec
List of nodes: 0,

DSI408: Real Application Clusters Internals I-102

Code References

4-103

kjxg* : the CGS layer


kgxg* : the NM/CGS (still called GMS) layer
Skgxn.v2


DSI408: Real Application Clusters Internals I-103

Summary

In this lesson, you should have learned about the:


Node Monitor functionality
Reconfiguration sequence

4-104


DSI408: Real Application Clusters Internals I-104

RAC Messaging System


Objectives

After completing this lesson, you should be able to do


the following:
Outline the messaging subsystem architecture
Describe the trace options for IPC layers

5-107


DSI408: Real Application Clusters Internals I-107

RAC and Messaging


Node
Instance

Messages used for:


Lock changes
Data blocks
Cluster
information

Caches

ksi/ksq/kcl
GRD
CGS
NM

I
P
C

Other
nodes
(not
shown)

CM

5-108


RAC and Messaging


Messaging is used by the lock system.
Messaging is used for both intra-instance communication (between the processes of the
same instance on the same node) and inter-instance communication (between processes
on other nodes).
Messaging is used by:
The LMON process to communicate with the other LMON processes
The LMD process to communicate with the other LMD processes
Any process's lock client performing direct send operations

DSI408: Real Application Clusters Internals I-108

Typical Three-Way Lock Messages


1, 2: Direct Send
3:
Memory Copy
4:
Deferred

Instance R

Instance H

1
4

Instance M

5-109


Typical Three-Way Lock Messages


The DLM or GRM/GES functionalities are explained in later lessons.
Assume that instance R needs a block from the H instance and that the resource lock is
managed by the M instance.
1. The requester instance, R, sends a message to the master node, M. This is a critical
message, so this uses Direct Send. The transport protocol returns an
acknowledgment (not shown).
2. The master node, M, sends a command to forward the resource to the holding
instance, H. This, too, is a Direct Send.
3. The holding instance, H, sends the resource to the requesting instance, R. This is a
Memory Copy (memcpy) message, and the resource is received into its
destination memory.
4. The requestor instance, R, sends an acknowledgment for the resource message to
the master node, M. This message is not critical for response, so it is sent Deferred;
that is, it is placed in a queue for sending when convenient.
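The four steps above can be laid out as a small trace-producing sketch; the transport names follow the slide, and everything else (function name, instance labels) is illustrative.

```python
def three_way_transfer(requester="R", master="M", holder="H"):
    """Return the message trace for the three-way lock sequence:
    two critical direct sends, one memory-copy transfer of the block,
    and a final acknowledgment queued for deferred sending."""
    return [
        (requester, master, "direct"),     # 1: lock request to the master
        (master, holder, "direct"),        # 2: forward command to the holder
        (holder, requester, "memcpy"),     # 3: block lands in R's buffer
        (requester, master, "deferred"),   # 4: ack queued, sent when convenient
    ]

print(three_way_transfer())
```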

DSI408: Real Application Clusters Internals I-109

Asynchronous Traps

To communicate the status of lock requests, GES


uses two types of asynchronous traps (ASTs) or
interrupts:
Acquisition AST (AST)
Blocking AST (BAST)

Lock status may reflect late or lost messages.


V$LOCK_ELEMENT (or X$LE) columns MODE_HELD,
RELEASING, or ACQUIRING are non-zero

5-110


Asynchronous Traps
When a process requests a lock on a resource, the GES sends a blocking AST to
notify the processes that currently own locks on that resource in incompatible modes.
Upon notification, owners of the locks can relinquish them to permit access to the
requestor.
When a lock is obtained, an acquisition AST is sent to tell the requester that it now owns
the lock.
To determine whether a blocking AST has been sent by a requestor or whether an
acquisition AST has been sent by the blocker (or owner of the lock), query the fixed
view GV$LOCK_ELEMENT or X$LE and check which bits are set. Examples for
incompatible modes are shared and exclusive modes.
An acquisition AST acts like a wakeup call.

DSI408: Real Application Clusters Internals I-110

AST and BAST

Each IDLM client process has an AST queue.


The following operations take place when an LMD
delivers an AST:
LMD hangs an AST structure in the IDLM client AST
queue.
LMD posts the IDLM client.
IDLM client has to scan the AST queue to process
the delivered AST.

5-111

LMS delivers a BAST to a process that owns a


lock that conflicts with a converting request.


AST and BAST


ASTs are delivered by LMD or LMS to the process that has submitted a lock request.
In the earlier releases of Oracle9i, all messages went to the LMD on the remote node,
which had to repost the message to the actual waiting process.

DSI408: Real Application Clusters Internals I-111

Message Buffers

The two message buffer types are:


KJCCMSG_T_REGULAR
KJCCMSG_T_BATCH

5-112


Message Buffers
Any sender or receiver allocates a message structure (or message buffer) before sending
or receiving a message.
KJCCMSG_T_BATCH is mostly used in reconfiguration or in remastering, or after
delivering a buffer in cache fusion.
There are three pools of messages:
REGULAR: With initial #buffers = processes*2 + 2*10 + 10 + 20
BATCH: With initial #buffer = processes*2 + 2*10 + 10 + 20
RESERVE: With initial #buffer = min(2*processes, 1000)
If the REGULAR pool is exhausted, then more allocations are done from the shared
pool.
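The initial pool sizes can be computed directly from the formulas above; for example, with processes = 150 the REGULAR and BATCH pools start at 350 buffers each and RESERVE at 300. A small helper (hypothetical name) to evaluate them:

```python
def initial_pool_sizes(processes):
    """Evaluate the initial message-buffer counts quoted in the text."""
    regular = processes * 2 + 2 * 10 + 10 + 20   # processes*2 + 50
    batch   = processes * 2 + 2 * 10 + 10 + 20
    reserve = min(2 * processes, 1000)           # capped at 1000
    return {"REGULAR": regular, "BATCH": batch, "RESERVE": reserve}

print(initial_pool_sizes(150))
```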

DSI408: Real Application Clusters Internals I-112

Message Buffer Queues


Allocate

MsgPool

OutstandingQueue
OutstandingQueue

FreeMsgQueue
FreeMsgQueue
Release
Send-done
PendingSendQueue
PendingSendQueue

Direct send

Send
SendQueue
Indirect send

5-113


Message Buffer Queues


Several queues are in place to hold message buffers if they come from the SGA message
pools (like the REGULAR, BATCH, and RESERVE pools). This is done to facilitate
recovery of the message buffers.
OutstandingQueue, FreeMsgQueue, and PendingSendQueue are per process.
SendQueue and MsgPool are per instance. There is a threshold to trigger a process to
start releasing free available message buffers back to the shared message pools.

DSI408: Real Application Clusters Internals I-113

Messaging Deadlocks

5-114

Messaging can cause deadlocks to appear.


To avoid such deadlock situations, Oracle
introduced a Traffic Controller.


Messaging Deadlocks
Messaging can cause deadlocks to appear. If you are waiting to send a message to
acquire a lock and there is another process waiting on the lock that you hold, then you
will not be checking on BASTs and so will not see that you are blocking someone. If
many writers are trying to send messages and no one is reading messages to free up
message buffer space, there can be a deadlock.
Like the interface, the messaging protocol is port specific. A message is typically less
than 128 bytes, so the interconnect must be low latency. In addition, the number of
messages can be high. It typically depends on the number of locks or resources.
Basically the more locks or resources, the higher the traffic. In Oracle8, the number of
message buffers depended on the number of resources; in Oracle7, the number depended
on the number of locks.

DSI408: Real Application Clusters Internals I-114

Message Traffic Controller (TRFC)

5-115

Circumvents the possibility of send deadlocks


Uses the send buffer in kjga


Message Traffic Controller (TRFC)


The TRFC is used to control the DLM traffic between all the nodes in the cluster by
buffering sender sends (in case of network congestion) and making the sender wait on
a send until the network window is big enough. This is managed by using tickets to
control the message flow.

DSI408: Real Application Clusters Internals I-115

TRFC Tickets

5-116

A number of tickets are kept in a pool.


A sender must acquire the ticket(s) before
performing a send operation.
Tickets are returned to the pool by the receiver
after reception.
GV$DLM_TRAFFIC_CONTROLLER shows the status
of the ticketing and the send buffers.


TRFC Tickets
You use flow control to ensure that the remote receivers (LMD or LMS) have just the
right amount of messages to process. New requests from senders wait outside after
releasing the send latch, in case receivers run out of network buffer space. Tickets are
used to determine the network buffer space available.
Clients that want to send first get the required number of tickets from the ticket pool and
then send. The used tickets are released back to the pool by the receivers (LMS or LMD)
according to the remote receiver report of how many messages the remote receiver has
seen. Message sequence numbers of sending nodes and remote nodes are attached to
every message that is sent.
The maximum number of available tickets is a function of the network send buffer size.
If at any time tickets are not available, senders have to buffer the message, allowing
LMD or LMS to send the message on availability of the ticket. A node relies on
messages to come back from the remote node to release tickets for reuse. In most cases
this works, because most of the client requests eventually result in an ACK or ND.

DSI408: Real Application Clusters Internals I-116

TRFC Tickets (continued)


However, in some very specific and rare cases this may not be true. For instance, if
an application makes a large number of asynchronous blocking convert requests
without expecting notifications, you have a case where a request does not result in a
reply for some time. To force a reply from the remote node, you send a null request
to the remote node, forcing the remote node to send a null ACK back. Thus, if the
ticket level dips too low, you send a null request to the remote node.
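The ticketing behavior described across these pages can be condensed into a toy model: a send consumes a ticket, an empty pool forces the message onto a queue for LMD or LMS, the receiver's acknowledgments replenish the pool, and a low ticket level triggers a null request. The 500-ticket start appears later in the text; the low-water threshold here is an invented illustration value, and this is not the kjct implementation.

```python
class TicketPool:
    """Toy TRFC flow-control model."""
    def __init__(self, tickets=500, low_water=10):
        self.tickets = tickets
        self.low_water = low_water
        self.queued = []             # messages waiting for tickets

    def send(self, msg):
        if self.tickets == 0:
            self.queued.append(msg)  # LMD/LMS sends it once tickets return
            return "queued"
        self.tickets -= 1
        # Force a reply when the level dips too low, so tickets come back.
        return "null_req" if self.tickets < self.low_water else "sent"

    def ack(self, n):                # receiver returns n tickets
        self.tickets += n

pool = TicketPool(tickets=2, low_water=1)
print([pool.send(m) for m in ("a", "b", "c")])
```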

DSI408: Real Application Clusters Internals I-117

TRFC Flow
Node 1

Node 2
Tickets are sent back
to requestor side by
attaching the number
of ACK tickets in the
message header.

Msg.
Msg.
Queued messages
waiting for tickets

LMD
LMS

No more
tickets

LMD
LMS

Msg.

sender

Tickets available

Msg.
Tickets depleted,
NULL_REQ message

5-118


TRFC Flow
At the beginning, the number of available tickets is 500. One sent message consumes
one ticket. Each node maintains several counters for each communication partner.
AvailBuf: Number of buffers that are available to receive new messages (buffers
attributed to KSXP interface)
RecMsg: Number of messages received whose type is not TEST,
NULL-REQ, or NULL-ACK
AvailMsg: Number of messages received (all types)
The pseudocode is:
if AvailBuf >= AvailMsg (if there are sufficient buffers)
then AckTickets = AvailMsg
else if RecMsg == AvailMsg (no NULL-REQUEST yet)
then AckTickets = AvailBuf
else if AvailMsg - RecMsg > AvailBuf (too many NULL-REQUEST)
then AckTickets = 0
else AckTickets = AvailBuf - (AvailMsg - RecMsg)

DSI408: Real Application Clusters Internals I-118

TRFC Flow (continued)


Node 2 sends ACK tickets to node 1 to replenish the number of available tickets and
decrement AvailMsg, RecMsg, and AvailBuf with ACK tickets.
For more details, refer to kjcts_sndmsg, kjctr_updatetkt,
kjctr_collecttkt, and kjctcnrs (null request sent).
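The AckTickets pseudocode translates directly into a runnable function. The final branch computes AvailBuf - (AvailMsg - RecMsg), reading the last line of the pseudocode as a subtraction, since that is the only choice that keeps the result between 0 and AvailBuf; variable names follow the counters in the text.

```python
def ack_tickets(avail_buf, rec_msg, avail_msg):
    """Tickets to acknowledge, per the AckTickets pseudocode."""
    if avail_buf >= avail_msg:            # sufficient buffers
        return avail_msg
    if rec_msg == avail_msg:              # no NULL-REQUEST yet
        return avail_buf
    if avail_msg - rec_msg > avail_buf:   # too many NULL-REQUESTs
        return 0
    return avail_buf - (avail_msg - rec_msg)   # assumed subtraction

print(ack_tickets(avail_buf=5, rec_msg=3, avail_msg=6))
```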

DSI408: Real Application Clusters Internals I-119

Message Traffic Statistics

System statistics V$SYSSTAT


gcs messages sent: number of PCM messages
sent
ges messages sent: number of non-PCM
messages sent

5-120

V$DLM_MISC reports statistics on messages of


local instance.


Message Traffic Statistics


V$DLM_MISC is a direct view of x$kjifst.

DSI408: Real Application Clusters Internals I-120

Message Traffic Statistics (continued)


V$DLM_MISC
SQL> select name, value from V$DLM_MISC;
Name                                       Value
------------------------------------------ ----------
messages sent directly                         203662
  Messages sent directly without going through a queue (tickets available)
messages flow controlled                          104
  Messages queued (and to be sent by LMD or LMS)
messages sent indirectly                          148
messages received logical                      178579
  Messages received
flow control messages sent                          0
  Null requests sent + null acknowledgments sent
flow control messages received                      1
gcs msgs received                                1587
  Number of PCM messages received
gcs msgs process time(ms)                        1867
  PCM message processing time (should also include CR build time)
ges msgs received                              177013
  Number of non-PCM messages received
ges msgs process time(ms)                       30485
  Non-PCM message processing time
msgs causing lmd to send msgs                   59070
  LMD receives a message, processes it, and has to send another
  message to end processing
lmd msg send time(ms)                            6104
gcs side channel msgs actual                       16
gcs side channel msgs logical                     154
gcs pings refused                                   0
  When a ping is sent because of a conflict in PCM locks
  (S and X, for example), incremented when the ping cannot be processed
gcs writes refused
  Incremented when a write request is processed and the processing is aborted
gcs error msgs
  Incremented when an error message is received (rare)
gcs out-of-order msgs                               0
gcs immediate (null) converts                      16
  Number of PCM converts done immediately (because compatible);
  resource granted mode was NULL
gcs immediate cr (null) converts                 1177
gcs immediate (compatible) converts                15
  Number of PCM converts done immediately (because compatible);
  resource granted mode was not NULL
DSI408: Real Application Clusters Internals I-121

Message Traffic Statistics (continued)


V$DLM_MISC
SQL> select name, value from V$DLM_MISC;
Name                                       Value
...
gcs immediate cr (compatible) converts             25
gcs blocked converts                               10
gcs queued converts                                 0
gcs blocked cr converts                            16
gcs compatible basts                                2
gcs compatible cr basts (local)                  1212
gcs cr basts to PIs                                 0
dynamically allocated gcs resources                 0
dynamically allocated gcs shadows                   0
gcs recovery claim msgs                             0
gcs write request msgs                              4
gcs flush pi msgs                                   6
gcs write notification msgs                         0
gcs retry convert request                           0
gcs forward cr to pinged instance                   0
gcs cr serve without current lock                   0
msgs sent queued                                  248
  Number of messages dequeued from the queued-messages list
msgs sent queue time (ms)                        9731
msgs sent queued on ksxp                       203910
  Incremented when a message send is completed by ksxp
msgs sent queue time on ksxp (ms)              499892
msgs received queue time (ms)                  311453
msgs received queued                           178600
implicit batch messages sent                        6
implicit batch messages received                   46
gcs refuse xid                                      0
gcs ast xid                                         0
gcs compatible cr basts (global)                    7
messages received actual                       177777
process batch messages sent                         2
process batch messages received                   224
msgs causing lms(s) to send msgs                   21
lms(s) msg send time(ms)                           10

DSI408: Real Application Clusters Internals I-122

IPC

The IPC component:


Handles component-level demultiplexing

5-123

Parallel Query (IPQ)


Cache (data blocks)
DLM (GES)
Internal context

Handles Connection Management or Name Service


Integration
Integrates with the Post/Wait model used in the
Oracle server
Uses asynchronous request management,
including state management

IPC
Because IPC was more synchronous in the releases before Oracle9i, the OPS systems
were more prone to hanging in this component. IPQ used its own interface (SKGXF).

DSI408: Real Application Clusters Internals I-123

IPC Code Stack

IPQ client

Cache client

DLM client
CGS client

KSXP

KSXP: Main IPC


Wait interface
Tracing
Message passing
Memory mapping

SKGXP: OSD-dependent module

SKGXP

5-124


IPC Code
The SKGXP module is the OSD module. The source that is available on tao includes
the reference implementation. This has extensive comments in skgxp.h.

DSI408: Real Application Clusters Internals I-124

Reference Implementation

For internal QA
Simple code for easy portability
Interface example
Uses standard protocols for communication
TCP/IP
UDP

5-125


Reference Implementation
There are several reference implementations because there are several standard
protocols that can be used. These are available for the various ports.
Hardware vendors use the reference implementation as a starting point and replace the
protocol with their own optimized high-speed interconnect software by using their
hardware. This makes it very platform dependent.

DSI408: Real Application Clusters Internals I-125

KSXP Wait Interface to KSL

kslwat

ksl wait
facility
IPC

5-126

Default

IO

skgpwait

Net

ksxpwait

ksldwat

ksnwait

skgxpwait

skgfrwat odm_io

nsevwait


KSXP Wait Interface to KSL


When Oracle processes expect something to happen, they usually update something in
the shared memory and wake up (post) some other Oracle process, and then wait to be
posted back.
Posts are considered unreliable, and there is no direct correlation between
receiving a post and the state change that has occurred.
The wait facility allows processes to synchronize on I/O completion from a
single I/O source or on a local post.

DSI408: Real Application Clusters Internals I-126

KSXP Tracing

Event 10401
Bit flags

5-127

0x01 Minimal in tracefile


0x04 BID tracking
0x08 Slow send debugging
0x10 Dump ksxp trace information to trace file via
ksdwrf instead of KST

KST tracing with _trace_events=10401:8:ALL


KSXP Tracing
For more information, refer to the handling of event 10401 in ksxp.c. KST tracing
is covered in a later module.
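Because the level is a bitmask, flag combinations can be decoded mechanically. The table and helper below restate the slide's flags; the function name is invented.

```python
KSXP_TRACE_FLAGS = {
    0x01: "minimal trace",
    0x04: "BID tracking",
    0x08: "slow send debugging",
    0x10: "dump via ksdwrf",
}

def decode_10401_level(level):
    """List the slide's flag names that are set in an event 10401 level."""
    return [name for bit, name in sorted(KSXP_TRACE_FLAGS.items())
            if level & bit]

print(decode_10401_level(8))      # the level used in the KST example
```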

DSI408: Real Application Clusters Internals I-127

KSXP Trace Records

All KSXP trace records contain the string 'KSXP'.


client says which component is performing the
operation (see ksxpcid.c). Cache = 1, DLM = 2,
IPQ = 3, CGS = 5.
krqh is the pointer to KSXP-level request handle.
srqh is the pointer to SKGXP-level request handle.
srqh is useful in correlating KSXP and SKGXP
tracing.

522683FB:000182BD
6
5 10401 39
KSXPQRCVB: ctx 2ec5a84 client 2 krqh 301c1bc srqh 301c218
buffer 2faca80

5-128


KSXP Trace Records


The label states the record type:
KSXPQRCVB: Queue a receive buffer (shown above)
KSXPWAIT: Message completion
KSXPRCV: Message receive completion
KSXPMCPY: Remote memory copy
KSXPMPRP: Memory update

DSI408: Real Application Clusters Internals I-128

SKGXP Interface

Port Connection Interface

Memory Mapped Interface

5-129

Ports: Communication endpoints


Connections
Request handlers
skgxpcon, skgxpvsnd, skgxpvrcv, skgxpwait
Region
Buffer
Buffer ID (BID)
skgxprgn, skgxpprp, skgxpmcpy


SKGXP Interface
The Port Connection interface is for asynchronous use: the client code submits a number of
requests to the interface and attempts to overlap the completion of these requests with
useful computation. This overlap of communication with computation acts to hide the
latency costs of remote communication.
Ports represent communication endpoints. Connections are used to cache information
regarding communication endpoints. Request handlers represent outstanding requests to
the interface (primarily outstanding message receives and sends).
Synchronization is provided by skgxpwait. Synchronization is integrated with the
standard VOS layer post/wait mechanism, allowing Oracle processes to block
waiting for outstanding network IPC or for a post from another process in the local instance.
The buffer cache uses the memory-mapped interface for cache fusion and parallel query
clients.
Regions are large areas of memory (such as the SGA). Clients that want to receive data
into their region prepare buffers in the region to receive data via the prepare call. The
output of the prepare call is a buffer ID or BID. BIDs are copy-by-value structures that
are transferred to remote instances via the lock manager. The BIDs are then used to
transfer data directly to the prepared buffer of the requesting process in the remote
instance.
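The prepare/BID/copy sequence can be sketched as follows. The registry stands in for the lock manager carrying the copy-by-value BID to the remote instance; all names are invented, and the real interface (skgxprgn, skgxpprp, skgxpmcpy) works on prepared memory regions, not Python objects.

```python
_bid_registry = {}   # stand-in for BID state carried via the lock manager

def prepare(region, offset, length):
    """Prepare a buffer inside a region; return an opaque, copy-by-value BID."""
    bid = ("BID", len(_bid_registry))
    _bid_registry[bid] = (region, offset, length)
    return bid

def remote_copy(bid, data):
    """Place data directly into the prepared buffer named by the BID."""
    region, offset, length = _bid_registry[bid]
    n = min(len(data), length)
    region[offset:offset + n] = data[:n]

sga = bytearray(8)                 # toy 'region'
bid = prepare(sga, offset=2, length=4)
remote_copy(bid, b"abcd")
print(sga)
```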
DSI408: Real Application Clusters Internals I-129

Choosing an SKGXP Implementation

libskgxp.so contains the skgxp that is linked to


Oracle.
libskgxpd.so is a dummy implementation and
writes error messages when called.
Resolves linkage problems in non-RAC systems

5-130


Choosing an SKGXP Implementation


Swap the libskgxp.so library with the library of your choice and relink using the
makefiles. Problems can occur if the files are not accessible via LD_LIBRARY_PATH
or if the file protections are changed. The library that is linked in may be noted in
the LMON trace file.

DSI408: Real Application Clusters Internals I-130

SKGXP Tracing

Event 10402
Bit flags in level

5-131

KSXP_OSDTR_ERROR
KSXP_OSDTR_META
KSXP_OSDTR_SEND
KSXP_OSDTR_RCV
KSXP_OSDTR_WAIT
KSXP_OSDTR_MCPY
KSXP_OSDTR_MUP

0x01
0x02
0x04
0x08
0x10
0x20
0x40


SKGXP Tracing
The levels for the event have changed considerably in Oracle9i Release 2. Examine
source skgxp.h, ksxp.c for details in older versions. In Oracle9i Release 1 (and
earlier), it was:
0x00040000
trace meta functions
0x00080000
trace send
0x00100000
trace receive
0x00200000
trace wait
0x00400000
trace cancel
0x00800000
trace post
0x02000000
trace unusual or error conditions
0x04000000
trace remote memory copies
0x08000000
trace buffer update notifications

DSI408: Real Application Clusters Internals I-131

Possible Hang Scenarios

5-132

If the Node Monitor and the IPC use different


protocols
Temporary drop-out on network


Possible Hang Scenarios


On systems where the IPC traffic and node monitor communication traffic are on
separate networks, a hang may result when the IPC network fails, because the Node
Monitor has no knowledge of the IPC network. On Solaris, version 7, the DLM is the
Node Monitor, and it may or may not send traffic over the same interface that Oracle
uses.
Hangs Related to Send Timeouts or Send Failures
Oracle takes serious action in response to either of these events. The DLM times out on
sends a total of three times (about 10 minutes) before declaring a receiver unreachable.
Because the IPC interface guarantees reliable delivery, either event is taken to mean that
the instance is no longer reachable and should be removed from the cluster. The
instance goes into the reconfiguration state waiting for notification that the instance is
gone. If the timeout or failure was spurious, a hang results. Hangs that show IPC send
timeouts might indicate this condition.
To work around this problem, find the destination node that is thought to have failed and
shut it down.


Other Events for IPC Tracing

29726: DLM IPC trace event. Level 9 and above turns on skgxp tracing.
29718: CGS trace event. Level 10 and above turns on skgxp tracing.
10392: Parallel Query (kxfp). Level 127 turns on skgxp tracing.


Other Events for IPC Tracing


At high trace levels, these events turn on IPC tracing as a side effect of tracing
their functional stack.


Code References

kjc.h: Kernel Lock Manager Communication layer


ksxp.*: Kernel Service X (cross instance) IPC


Summary

In this lesson, you should have learned:


About the messaging components
How to activate tracing of IPC


System Commit Number


Objectives

After completing this lesson, you should be able to do the following:
Explain the function of the System Commit Number (SCN)
Describe SCN propagation schemes


System Commit Number


[Slide diagram: a cluster node running an instance with its local SCN; other nodes (not shown) each hold their own SCN, coordinated through the CM.]


System Commit Number (SCN)


The SCN represents the logical clock of the database. As such, it has to be global in the
RAC. This is not possible without extra hardware, but it can be simulated well enough by
synchronizing the instance-local SCNs. By using the SGA, you can handle process-to-process SCN coordination in a non-RAC environment.


Logical Clock and Causality Propagation

Oracle uses SCNs to order events.
An update commits with an SCN.
Any process that tries to get an SCN at a later time must always receive a greater or equal SCN value.
There is no ambiguity in the order of events and their SCN.
You must synchronize SCNs between instances from time to time in a RAC environment.
The synchronization activity is called causality propagation.


Logical Clock and Causality Propagation


In computation, the association of an event with an absolute real time is not essential; you
need to know only an unambiguous order of events.
In RAC, the causality may suffer:
Assume that process 1 on instance 1 performs an update and commit with SCN.
Process 2 on instance 2, which tries to get the SCN later, is not guaranteed to obtain a
higher or equal SCN value. Sometimes process 2 does not see the committed changes
that were made by process 1 even if a read is done after the committed change.
In practice, this occurrence is rare and the time window where it can occur is very small.


Basics of SCN

SCN wrap: 2 bytes
SCN base: 4 bytes
Monotonically increasing
Current SCN and Snapshot SCN

[Slide diagram: the 6-byte SCN laid out as the SCN wrap above the SCN base.]


Basics of SCN
Much can be said about the SCN and the nature of causality.
The essentials are:
The SCN must always increase and may skip a number of values.
The SCN must be kept in sync between multiple instances.
- In RAC: Between all instances mounting the database
- In distributed databases: All instances that are involved in a distributed
transaction (that is, when using database links)
- Synchronizing means using the highest known SCN. Otherwise it conflicts with
the requirement to increase.
Dependencies (causality) between changes must be maintained (for example, in
multiple changes to the same block by different transactions).
For more information, refer to Note 33015.1.
There is some distinction between the Current SCN that is used for a commit and the
Snapshot SCN that is used for a Consistent Read (CR) operation. The Snapshot SCN is the
highest SCN seen or used by the instance.
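As a sketch, the wrap/base pair can be treated as one 48-bit counter, with the 2-byte wrap sitting above the 4-byte base. This is an illustration of the layout described above, not Oracle's internal code.

```python
# Sketch: a 6-byte SCN as a 2-byte wrap over a 4-byte base. The base
# increments until it overflows 32 bits; the overflow carries into the wrap.
def scn_value(wrap, base):
    assert 0 <= wrap < 2**16 and 0 <= base < 2**32
    return (wrap << 32) | base

def scn_advance(wrap, base, by=1):
    total = scn_value(wrap, base) + by
    return total >> 32, total & 0xFFFFFFFF  # new (wrap, base)

# Crossing the 32-bit boundary bumps the wrap:
print(scn_advance(0, 0xFFFFFFFF))  # (1, 0)
```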


Basics of SCN (continued)


At startup, the SCNs across the nodes are initialized to the database SCN (the highest SCN
recorded at the last shutdown), which is synchronized across the cluster. All nodes have the
same SCN at startup.
The SCN from a kernel standpoint is a service. Before a client can use an SCN or call
CURRENT SCN, GET NEXT SCN, or GET SNAPSHOT SCN routines, it must initialize
the service. That initialization uses the database SCN.


SCN Latching

Updating the 6 bytes of an SCN must be atomic.
Latching modes are supported for compare and swap (CAS) primitives:

CAS Primitive           Latch-Free Access    Access with Latch
None                    Reads                Writes
32-bit CAS primitives   Reads and writes     SCN wrap changes only
64-bit CAS primitives   Reads and writes     Never


SCN Latching
If the operation to update or increment the SCN cannot be performed as an atomic or
single CPU instruction, you must latch or lock the SCN data structure so that the other
processes do not see an invalid SCN.
Latchless CAS operations are controlled by the following initialization parameters:
_disable_latch_free_SCN_writes_via_32cas
The default is False (that is, enabled by default).
_disable_latch_free_SCN_writes_via_64cas
The default is True (that is, disabled by default, even if it is supported on the platform).
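The latch-free write path can be modeled as a compare-and-swap retry loop. The sketch below simulates the CAS primitive in Python; on the real platform it is a single atomic instruction, which is why no latch is needed. This is illustrative only, not Oracle code.

```python
# Sketch of a latch-free "get next SCN" built on compare-and-swap semantics.
class AtomicScn:
    def __init__(self, value=0):
        self._value = value

    def load(self):
        return self._value

    def cas(self, expected, new):
        # Simulates the hardware primitive: atomically store new and
        # report success only if the current value equals expected.
        if self._value == expected:
            self._value = new
            return True
        return False

def next_scn(scn):
    while True:              # retry until our CAS wins the race
        cur = scn.load()
        if scn.cas(cur, cur + 1):
            return cur + 1

s = AtomicScn(700)
print(next_scn(s))  # 701
```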


Lamport Implementation

Assign a time SCN(x) to an event x, such that for any events a and b, if a -> b then SCN(a) <= SCN(b).
Mechanism to assign logical time:
Each instance increments its local SCN between two successive COMMITs.
If instance A sends a message m to instance B, then m also contains instance A's current SCN (SCNA) at the time that m is sent. When instance B receives message m, instance B sets its SCN to max(SCNA, instance B's current SCN).
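The rule on the slide can be sketched directly. The Instance class and the message shape are hypothetical, but the max() step is exactly the Lamport rule described.

```python
# Sketch of the Lamport rule: the sender piggybacks its SCN on every
# message; the receiver takes the maximum of the received and local SCNs.
class Instance:
    def __init__(self, scn=0):
        self.scn = scn

    def commit(self):
        self.scn += 1        # local event: advance the logical clock
        return self.scn

    def send(self):
        return {"scn": self.scn}   # SCN piggybacked on the message

    def receive(self, msg):
        self.scn = max(self.scn, msg["scn"])

a, b = Instance(scn=705), Instance(scn=701)
a.commit()                   # instance A commits at SCN 706
b.receive(a.send())          # instance B catches up
print(b.scn)  # 706
```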


Lamport Implementation
Earlier, Oracle OPS had a choice of SCN propagations, some of them using platform-specific hardware protocols. The Lamport scheme was the reference implementation.


Lamport SCN

Oracle9i RAC uses the Lamport scheme:


Attaches SCNs on each lock message
Guarantees partial ordering only
Preserves causality through periodic pinging of the SC lock
Is more efficient, because each node can generate SCNs simultaneously


Lamport SCN
The Lamport SCN propagation assumes that there is a constant exchange of messages. If
an instance does many commits on blocks where it has cached all data, the SCN will not
change at the other nodes, as there are no messages sent. This is solved with a periodic
SCN update.
The SC global resource or lock is used to communicate the SCN for the periodic update.
Its value field contains the current SCN, and the instance holding the exclusive lock can
update the field. You can think of the SC lock as a dummy lock that is used if the SCN
has not been propagated recently through other lock or message activity.
For more information, refer to kjm.c.
Source References
The message sending routines in kjc.c will insert the current SCN into every message at
scn_kjctmsg. Messages that are received by LMD (9.0) or LMS (9.2) compare and
update the local SCN if the local SCN is lower.
The SCN is shown in message dump/traces.


Limitations on SCN Propagation


[Slide diagram: a timeline of SCN synchronization between instance 1 and instance 2. Both sync at SCN 701; on instance 1, Tx1 starts and commits (SCN 702), followed by Tx3 and Tx7 (SCN 707); on instance 2, Tx2 starts and commits before the next sync, and Tx8 starts and commits afterward (SCN 708). SCN sync points: 701, 702, 707.]

Limitations on SCN Propagation


If the beginning of Tx2 is later than the commit of Tx1 and less than the time delay
max_commit_propagation_delay, then Tx2 may not see the changes that are
made by Tx1.
Note that there is an implicit protocol in the kernel to synchronize the SCN every three
seconds by using LCK piggybacking of the SCN in DLM messages. In case of
communication problems, these messages are subject to the traffic controller.
Problems with SCN synchronization may manifest themselves as ORA-600 [2662] errors
(see note 28929.1).
If Tx2 wants to read a block that is used by Tx1, it builds a CR buffer based on too low an
SCN (701), because the local SCN for that buffer is still valid and they are not
synchronized yet.
If the local low SCN is later than Tx1's commit SCN, Tx2 sees the changes from Tx1. It is
OK to see the change early; it absolutely has to be seen after
max_commit_propagation_delay.
The SCN limitation is only evident in operations that do not cause lock messages to be
exchanged. Between max_commit_propagation_delay timeouts, the SCN is
synchronized via the LCK process and messaging, which are very dependent on the type
of work performed.
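The anomaly reduces to a single comparison between a query's snapshot SCN and the change's commit SCN. The following is a sketch with illustrative names, not Oracle code.

```python
# Sketch: a committed change is visible to a reader only when the reader's
# snapshot SCN has reached the commit SCN. In the slide's timeline, Tx2 on
# instance 2 still runs at SCN 701 while Tx1 committed at 702 on instance 1.
def change_visible(snapshot_scn, commit_scn):
    return snapshot_scn >= commit_scn

print(change_visible(701, 702))  # False: the anomaly window
print(change_visible(707, 702))  # True: after the next SCN sync
```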

max_commit_propagation_delay

max_commit_propagation_delay is the delay time to propagate SCN changes to the other nodes after a commit.
max_commit_propagation_delay is given in centiseconds; the default is 700 (7 seconds).
The SCN is also propagated with every lock message.
The max_commit_propagation_delay parameter has several effects.


max_commit_propagation_delay
With Lamport SCN, every instance maintains locally generated SCNs. When it generates
a new SCN, an instance does not need to synchronize the SCN within the
max_commit_propagation_delay amount of time. Instances can increase their
locally generated SCNs based on global SCNs.
max_commit_propagation_delay < 1 second
Each time LGWR writes to the redo log (that is, with every commit):
- LGWR sends a message to the SCN resource (SC, 0, 0) master to update SCN.
- LGWR sends a message to every active instance to update SCN.
1 second < max_commit_propagation_delay < 7 seconds
Each time LGWR writes to the redo log, it also sends a message to the SCN
Resource Master to update the SCN.
If a Snapshot SCN is required by an instance and more than the
max_commit_propagation_delay time has elapsed since the last
synchronization event, then the process sends a message to the SCN resource master
to update the SCN.
7 seconds < max_commit_propagation_delay
Every three seconds, the LCK process sends a message to the SCN resource master
to update the SCN.
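The three regimes above can be condensed into a lookup. The sketch paraphrases the notes; the behavior strings and the treatment of the exact boundary values are assumptions.

```python
# Sketch: propagation behavior as a function of
# max_commit_propagation_delay (centiseconds). The boundary handling
# here (< 100, < 700, else) is an assumption for illustration.
def propagation_behavior(delay_cs):
    if delay_cs < 100:    # under 1 second
        return "LGWR messages the SC master and every instance on each commit"
    elif delay_cs < 700:  # between 1 and 7 seconds
        return "LGWR messages the SC master on each commit; snapshots refresh on demand"
    else:                 # 7 seconds and above
        return "LCK messages the SC master every 3 seconds"

print(propagation_behavior(700))  # the default setting
```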

Piggybacking SCN in Messages


[Slide diagram: a foreground (FG) process on instance A sends a message whose header carries the sender's SCN in scn_kjctmsg; the LMS process on instance B compares the received SCN (clk_val_kjxreqh) with its local SCN.]

The SCN of the instance sending a message is systematically stored in the message header.


Piggybacking the SCN in Messages


During any message preparation in instance A, the current SCN is added to the message
in the scn_kjctmsg field. On receiving any message, instance B compares the SCN in
field clk_val_kjxreqh to the current SCN. If it is greater, then it updates the local
SCN to the SCN received in the message.


Periodic Synchronization

Every three seconds, LCK0 calls kcsciln.
Called for SCNLCK and SCNSRV only
If Lamport is in use, then pings the SC lock
If not using Lamport, then updates the SCN Server or SC lock resource, depending on the scheme
Periodic synchronization does not occur if max_commit_propagation_delay is less than one second.

[Slide diagram: LCK0 on node 1 sends a KJX_GET_SCN_REQ message (1) to LMD0 on node 2, which answers with a simple ACK (2) that includes the SCN.]


Periodic Synchronization
The LCK0 timeout event, kcsmto, checks whether it is time for an SCN update.


SCN Generation in Earlier Versions of Oracle

The Lamport method was one of several.
The earlier choice of methods was less generic.


SCN Generation in Earlier Versions of Oracle


In Oracle8i:
DLM lock (SCNLCK). SC resource in DLM, slow, uses Lamport if
max_commit_propagation_delay >700 centiseconds
SCN server (SCNSRV). Uses port-specific OSDs for SCN server, uses Lamport if
max_commit_propagation_delay >700
In Oracle8:
SCNLCK (as above)
SCNSRV (as above)
Broadcast on commit (SCNBOC). Used in the DLM lock scheme when
max_commit_propagation_delay <100
Hardware clock (SCNCNT)
In Oracle7:
DLM lock implementation (SCNLCK) using the SC resource in DLM
SCN Server (SCNSRV) was never really implemented
Lamport: Implemented in DLM. This did not possess full causality preservation until
Oracle 7.3.4
Hardware clock: SP2 switch, for example

Code References

kcm.*: Kernel Cache Miscellaneous


kcs.*: Kernel Cache SCN Management
scn.h: Lamport implementation details
sparams.h: Some comments on SCN schemes


Summary

In this lesson, you should have learned how to:


Explain SCN propagation
Describe the purpose of the SCN in lock messages


Global Resource Directory

Formerly the Distributed Lock Manager


Objectives

After completing this lesson, you should be able to do the following:
Describe the Global Resource Directory concepts and components
Describe the global locking model of enqueues
Outline the internal resource allocations


RAC and Global Resource Directory (GRD)


Previously known as the Distributed Lock Manager (DLM)

[Slide diagram: within a node's instance, the caches and the ksi/ksq/kcl clients sit above the GRD/GCS/GES layer, with CGS and NM beneath, communicating over IPC with other nodes (not shown); the CM supports the stack.]


RAC and Global Resources


For particular applications or for historic reasons, the Distributed Lock Manager (DLM)
has many alternative terms that are used to describe it.
The Global Resource Directory (GRD) is the function that manages the locking or
ownership of all resources that are not limited to a single instance in RAC. Generally,
this is the same as a DLM, and the GRD is a DLM implementation, based on the IDLM
of Oracle8i and earlier releases.
The GRD can be considered to consist of Global Cache Services (GCS), which handles
the data blocks, and the Global Enqueue Service (GES), which handles the enqueues
and other global resources.
The terms GRD, GES, and GCS are the preferred terms, but DLM is the pervasive term
in all materials and, therefore, used in this course.


DLM History


Oracle7: External OS-based DLM


Oracle8: Integrated DLM
Oracle8i: Cache fusion 1, the CR problem
Oracle9i: Cache fusion 2


DLM History
The Oracle DLM comes out of the development that is performed primarily on SP2 and
HP DLMs for Oracle7, which were used where the vendors did not provide any DLM.
In Oracle 7, version 3, Digital, Sequent, NCR, and Pyramid used their own DLMs. They
were all different, as were the debugging tools and the output. The particular
functionality that was supported in each case also varied, which made it difficult for
Oracle to implement certain functions on some platforms at certain releases. Group-based locking is an example.
In Oracle7 DLMs, pipes facilitated the communication between the DLM daemons and
the client processes. In Oracle8, clients of the DLM have direct access to the DLM
structures in the SGA. This permits optimization of the communication path by allowing
clients to modify the structures directly and by waiting only on an LMD process to send
messages to remote nodes where remote operations must be performed. Therefore, local
lock operations can be considerably faster.
The DLM has been continuously improved with more views, better deadlock detection,
and changed message paths to eliminate needless context switches. The Cache Fusion
improvements are more of a change in how the client buffer handling routines use the
DLM.

DLM Concepts: Terminology


Resource: Any object accessed by the application


Client: Any process asking for a resource
Lock: An intention of a client on a resource
DLM services: Allow client applications to create,
modify, and delete locks that are shared
DLM database: Stores information on resources,
locks, processes


DLM Concepts: Terminology


Since Oracle8, the DLM database has been integrated in the Oracle SGA (that is, part of
the IDLM).
Directory Node Structures: Area in DLM memory that stores which node is the lock
master for each lock. In Oracle8i, the master node is always the directory node. In
Oracle9i, the dynamic remastering uses a lookup table to map the hashed master key to
the actual master (this is explained later), but it is not named the directory node.

DLM Concepts: Resources

The DLM does not provide the ability to lock the objects themselves.
The DLM provides the resources as the lockable entity.
The client code defines what this resource represents and what protocols are satisfactory to access it. There are two resource types:
PCM resources are for block buffers.
Non-PCM resources are row locks (transaction enqueues), file locks, and instance locks.

[Slide diagram: resource [0x10000f8][0x1],[BL] with its Grant Q, Convert Q, and lock value block.]


Resources
A resource is just a name. Each resource can have a list of locks that are currently
granted to users. This list is called the Grant Q. Similarly there is a Convert Q, which is
a queue of locks that are waiting to be converted. In addition, a resource has a 16-byte
lock value block (LVB) that contains a small amount of data. The LVB is used in some
resources. For example, the PS resource for parallel query slaves uses it to pass the
kxfpqd structure to the other nodes.
The two resource types have different data structures.


DLM Concepts: Locks

A client (user) must get a lock on a resource to be able to use what it represents.
Two types, with different data structures:
Enqueues: Locks on non-PCM resources
Lock Elements: Locks on PCM resources
Locks can be acquired in various modes in accordance with a matrix of compatible modes.

[Slide diagram: a lock (lockp, with PID and GID/DID fields) on the Grant Q of resource [0x10000f8][0x1],[BL].]


Locks
If the "lock before use" rule has not been followed by the Oracle programmer, then that
is a bug. It may not show up as system or data corruption for some time.
The DLM lock modes and the Oracle locking modes are not identical. The locking
matrix for the DLM is covered in later slides. The lock matrix depends on the type of
lock.
Locks are placed on a resource. When a process has a lock on the grant queue of the
resource, it is said to own the resource. Imprecise usage also talks of owning the
lock.
The example in the slide shows a lock on the Grant Q of the resource. The lock may be
either process- or group-owned. If it is process-owned, the PID field shows which
process holds the lock. In the case of group-owned locks, the GID field has a group
number, and the DID field has the Transaction ID (TxID) of the client transaction.


DLM Concepts: Processes

A representation in the DLM of a process that requested or acquired the locks

[Slide diagram: a lock (lockp, with PID and GID/DID) on the Grant Q of resource [0x10000f8][0x1],[BL], pointing to a process entry (procp, with PID).]


Process-Based Versus Session-Based Locking


In a simple implementation, a DLM provides a lock to a process. This works fine when
the process-to-session mapping is maintained. In MTS and XA, however, the session
may migrate or multiple processes may contribute to a transaction. It is preferable to be
able to provide a session-based identifier to control access to the lock. This is what
group-owned locking does. Generally, Oracle provides the transaction ID as the group
ID, and then anyone working on that transaction simply provides that XID and lock
operations are honored.
Domains
Domains are largely redundant in Oracle8 because there is a DLM for each database.
Although present in Oracle8, the domain functionality is largely unused.


DLM Concepts: Shadow Resources

Resources are mastered on a node.
The master node has all resource information, such as full grant queues and convert queues.
The shadow resource exists on any other node that has an interest in this resource; it knows only about locks on its own node.

[Slide diagram: the master node's copy of resource [0x10000f8][0x1],[BL] carries the Grant Q and Convert Q for all nodes; the shadow node's copy carries only the local Grant Q and Convert Q.]


Persistent Resources
The shadow resource exists on any other node that has an interest in a resource, that is,
any node on which a lock is open against that resource.
A persistent resource is maintained in a dubious state in the DLM following the closure
of all locks on it when the processes holding the locks exited abnormally while holding
a lock in PW or EX mode.
Recovery Domain (rdomain)
A recovery domain is the mechanism by which persistent resources can be recovered.
Each persistent resource is linked to a recovery domain. There is one such domain per
database.


DLM Concepts: Copy Locks

When a lock is held on a node other than the master node, the master keeps a copy of the lock locally.

[Slide diagram: the owner node holds the lock (lockp) on its shadow copy of resource [0x10000f8][0x1],[BL]; the master node keeps a corresponding copy lock on the Grant Q of the master copy.]


DLM Concepts: Copy Locks


There is only one copy of the lock for every other node that has an interest in this
resource. The copy lock is held at the highest mode at which the other node holds a lock.
This is the information that the master node requires. The other node maintains all the
other information that is required.
The master node has the master lock, and the local node has the shadow lock.


Resource or Lock Mastering

The DLM maintains information about the locks on all nodes that are interested in a given resource.
Lock mastering is distributed among all nodes in the cluster.
The master node contains the description of the resource and at least the lock on this resource with the highest locking mode.
The master node for a resource is computed by using several arrays: res_hash_val_kjga (for non-PCM resources) and pcm_hv_kjga (for PCM resources).


Resource or Lock Mastering


The DLM mastering algorithm chooses one node to manage the relevant information of
a resource and its locks, on a resource-by-resource basis; this node is referred to as the
master node.
The res_hash_val_kjga and pcm_hv_kjga arrays are updated at
reconfiguration when a node joins or leaves the cluster. The update minimizes resource
migration. Each element of the arrays is a bucket and contains a physical node number.
For non-PCM resources, you hash the resource name to obtain a bucket number bidx
and then look up the master node number with res_hash_val_kjga[bidx].
These arrays are private to each node. The algorithm is covered in detail in later lessons.
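The lookup can be sketched as follows. The hash function and the bucket array contents are made up; only the name-to-bucket-to-node indirection mirrors the res_hash_val_kjga description above.

```python
# Sketch: mapping a non-PCM resource name to its master node through a
# bucket array. The indirection lets reconfiguration remap buckets to
# surviving nodes without rehashing every resource name.
N_BUCKETS = 8
res_hash_val = [0, 1, 2, 0, 1, 2, 0, 1]  # bucket -> physical node number

def master_node(resname):
    bidx = hash(resname) % N_BUCKETS     # hash the resource name
    return res_hash_val[bidx]

name = (0x10000F8, 0x1, "BL")
print(master_node(name) in (0, 1, 2))  # True
```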


Basic Resource Structures

Resource name: Unique name to identify the resource. This is three ub4 numbers, the last interpreted as a character pair.
Value block: Area in memory that is used to store information about the resource
Granted queue: Locks granted on resources
Convert queue: Locks in the process of converting from one mode to another


Basic Resource Structures


Each non-PCM resource is identified in the cluster by its name (for example, struct
kjr).
The name consists of three integers of 4 bytes (ub4 n[3]).
For non-PCM resources or enqueues: n[0] is set to id1, n[1] is set to id2, and n[2]
receives string values, such as DI or LB.
A PCM resource is identified by a name with two integers, with the third integer
character pair implied as BL.
The DLM uses the resource name to compute the resource master node.
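Packing such a name can be sketched as below; the big-endian byte order used for the character pair is an assumption for illustration, not a statement about the kernel's layout.

```python
# Sketch: building the three-ub4 resource name. id1 and id2 fill n[0] and
# n[1]; the two-character type (e.g. "DI", "LB", "BL") is packed into n[2].
import struct

def resource_name(id1, id2, locktype):
    n2 = struct.unpack(">I", locktype.encode("ascii").ljust(4, b"\0"))[0]
    return (id1, id2, n2)

name = resource_name(0x10000F8, 0x1, "BL")
print([hex(x) for x in name])  # ['0x10000f8', '0x1', '0x424c0000']
```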


DLM Structures

PCM (GCS) and non-PCM (GES) resources are kept separate and use separate code paths.
GES:
Resource table: kjr and kjrt
Lock table: kjlt
Processes: kjpt
GCS:
Resource table: kjbr
Lock table: kjbl


DLM Structures
The separation of GES and GCS resource handling is new to Oracle9i. The earlier
versions had more common structures and code paths.
There are differences in these structures between versions 9.0.1 and 9.2.
kjr (partial)
kjurvb    valblk_kjr;           /* the value of the lock */
kjurn     resname_kjr;          /* the resource name */
kjsolk    grant_q_kjr;          /* list of granted resources */
kjsolk    convert_q_kjr;        /* list of resources being converted */
kjsolk    req_q_kjr;            /* list of open reqs when master_node unknown */
kjsolk    scan_q_kjr;           /* For the DLMD to perform move_scan_cvt etc */
ub2       grant_count_kjr[6];   /* count of # of locks at each level */
ub1       granted_bits_kjr;
ub1       entry_kjr;            /* dir, master, local */
kjuvlst   valstate_kjr;         /* state of valblk */
ub2       master_node_kjr;      /* ID of the node mastering the resource */
kjsolk    hash_q_kjr;           /* hash list : hp */
kjsolkl   *hp_kjr;
ub1       options_kjr;          /* same as open option */
ub1       remaster_kjr;
kjulevel  next_cvt_kjr;         /* Global next cvt. mode */

DLM Structures (continued)


Lock manager Resource Table structure
typedef struct kjrt
{
  kjsolkl  *reshash_kjrt;         /* resource hash bucket array */
  ub4      n_reshash_kjrt;
  ub4      *res_bucket_seq_kjrt;
  ksspa    res_cache_kjrt[3];     /* cache of freeable resources */
  ub4      res_cnt_kjrt[3];       /* count on cached resources */
  boolean  clear_cache_kjrt;      /* How should we clear the cache */
  ub4      res_cache_sz_kjrt;     /* size of resource cached */
  sb2      pral_kjrt;             /* Flag indicating preallocation of object
                                     Values: -0-need,1-have,2-don't */
  ksspa    *res_parent_kjrt;      /* parent of resources */
  ksllt    *latch_kjrt;           /* resource freelist latch array */
  ub4      num_lst_kjrt;          /* number resource freelist */
} kjrt;

Lock manager Lock Table
typedef struct kjlt
{
  ksllt  *latch_kjlt;             /* tab latch */
  ksspa  gpar_kjlt;               /* parent of group locks */
  ub2    num_lst_kjlt;            /* number of lock freelist */
} kjlt;

Process table Structure
typedef struct kjpt
{
  ub4      maxproc_kjpt;          /* maximum number of items in table */
  ub4      clnt_kjpt;             /* # local clients */
  ub4      n_prochash_kjpt;
  kjsolkl  *prochash_kjpt;
  ksllt    *latch_kjpt;           /* FreeList Latch */
} kjpt;


DLM Structures (continued)


/* PCM resource structure */
typedef struct kjbr {                 /* 68 bytes on sun4u */
  kjsolk  hash_q_kjbr;                /* hash list : hp */
  ub4     resname_kjbr[2];            /* the resource name */
  kjsolk  scan_q_kjbr;                /* chain to lmd scan q of grantable resources */
  kjsolk  grant_q_kjbr;               /* list of granted resources */
  kjsolk  convert_q_kjbr;             /* list of resources being converted */
  ub4     diskscn_bas_kjbr;           /* scn(base) known to be on disk */
  ub2     diskscn_wrap_kjbr;          /* scn(wrap) known to be on disk */
  ub2     writereqscn_wrap_kjbr;      /* scn(wrap) requested for write */
  ub4     writereqscn_bas_kjbr;       /* scn(base) requested for write */
  struct kjbl *sender_kjbr;           /* lock elected to send block */
  ub2     senderver_kjbr;             /* version# of above lock */
  ub2     writerver_kjbr;             /* version# of lock below */
  struct kjbl *writer_kjbr;           /* lock elected to write block */
  ub1     mode_role_kjbr;             /* one of 'n', 's', 'x' && one of 'l' or 'g' */
  ub1     flags_kjbr;                 /* ignorewip, free etc. */
  ub1     rfpcount_kjbr;              /* refuse ping counter */
  ub1     history_kjbr;               /* resource operation history */
  kxid    xid_kjbr;                   /* split transaction ID */
} kjbr;

/* kjbl - PCM lock structure
** Clients and most of the DLM will use the KJUSER* or KJ_* modes and kscns */
typedef struct kjbl {                 /* 52 bytes on sun4u */
  union {                             /* discriminate lock@master and lock@client */
    struct {                          /* for lock@master */
      kgglk        state_q_kjbl;      /* link to chain to resource */
      kjbopqi      *rqinfo_kjbl;      /* target bid */
      struct kjbr  *resp_kjbl;        /* pointer to my resource */
    } kjbllam;                        /* KJB Lock Lock At Master */
    struct {                          /* for lock@client */
      ub4  disk_base_kjbl;            /* disk version(base) for replay */
      ub2  disk_wrap_kjbl;            /* disk version(wrap) for replay */
      ub1  master_node_kjbl;          /* master instance# */
      ub1  client_flag_kjbl;          /* flags specific to client locks */
      ub2  update_seq_kjbl;           /* last update to master */
    } kjbllac;                        /* KJB Lock Lock At Client */
  } kjblmcd;                          /* KJB Lock Master Client Discriminant */
  void *remote_lockp_kjbl;            /* pointer to client lock or shadow */
  ub2  remote_ver_kjbl;               /* remote lock version# */
  ub2  ver_kjbl;                      /* my version# */
  ub2  msg_seq_kjbl;                  /* client->master seq# */
  ub2  reqid_kjbl;                    /* requestid for convert */
  ub2  creqid_kjbl;                   /* requestid for convert that has been cancelled */
  ub2  pi_wrap_kjbl;                  /* scn(wrap) of highest pi */
  ub4  pi_base_kjbl;                  /* scn(base) of highest pi */
  ub1  mode_role_kjbl;                /* one of 'n', 's', 'x' && one of 'l' or 'g' */
  ub1  state_kjbl;                    /* _L|_R|_W|_S, notify, which q, lock type */
  ub1  node_kjbl;                     /* instance lock belongs to */
  ub1  flags_kjbl;                    /* lock flag bits */
  ub2  rreqid_kjbl;                   /* save the reqid */
  ub2  write_wrap_kjbl;               /* last write request version(wrap) */
  ub4  write_base_kjbl;               /* last write request version(base) */
  ub4  history_kjbl;                  /* lock operation history */
} kjbl;

Lock Mode Changes

[Slide diagram: a state diagram of a lock moving between the GRANT QUEUE and the CONVERT QUEUE. A newly requested lock enters the grant queue; a compatible conversion is performed in place on the grant queue; an incompatible conversion moves the lock to the convert queue; once the conversion is granted, the lock returns to the grant queue.]


Lock Changes
Locks are placed on the resource grant or convert queue. If the lock mode changes, then
it is moved between the queues.
If several locks exist on the grant queue, then they must be compatible. Locks of the
same mode are not necessarily compatible with another of the same mode. The
compatibility matrix of the various locks differs between GES and GCS locks.
Compatible in-place conversions are typically downgrades, converting to a lesser mode.
Some exceptions exist and are covered later.
A lock can leave the convert queue under any of the following conditions:
Process requests the lock termination (that is, removes the lock).
Process cancels the conversion; the lock is moved back to the grant queue in
previous mode.
The requested mode is compatible with the most restrictive lock in the grant queue
and with all the previous modes of the convert queue, and the lock is in the head of
the convert queue. Convert requests are processed first in, first out (FIFO).
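The head-of-queue rule can be sketched with a toy compatibility matrix; only NL, CR, and EX are modeled here, and the real GES and GCS matrices differ.

```python
# Sketch: a conversion at the head of the FIFO convert queue is granted
# only if the requested mode is compatible with every granted mode and
# with the previous modes of all earlier converters. Toy matrix only.
COMPAT = {
    ("NL", "NL"): True, ("NL", "CR"): True, ("NL", "EX"): True,
    ("CR", "NL"): True, ("CR", "CR"): True, ("CR", "EX"): False,
    ("EX", "NL"): True, ("EX", "CR"): False, ("EX", "EX"): False,
}

def can_grant_head(requested, granted_modes, earlier_old_modes):
    others = list(granted_modes) + list(earlier_old_modes)
    return all(COMPAT[(requested, m)] for m in others)

# EX cannot be granted while a CR lock remains on the grant queue:
print(can_grant_head("EX", ["CR", "NL"], []))  # False
print(can_grant_head("EX", ["NL"], []))        # True
```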


Simple Lock Changes on a Resource


[Slide diagram: five snapshots of a resource's grant and convert queues. Legend: NL = Null, CR = Concurrent Read, EX = Exclusive Write. Snapshot 1: A:CR granted. Snapshot 2: A:CR and B:CR granted. Snapshot 3: A:CR, B:CR, and C:CR granted. Snapshot 4: A:NL, B:CR, and C:CR granted. Snapshot 5: A:NL and B:CR granted; C:CR converting to EX on the convert queue.]

Simple Lock Changes on a Resource

Example of a resource getting locks placed on its grant and convert queues:
1. A shareable read lock (Concurrent Read) is granted.
2. Another shareable read lock is granted. The two locks are compatible and
   can reside on the grant queue together.
3. A third shareable read lock is placed on the grant queue.
4. One lock converts to NULL. This conversion can be done in place because
   it is a simple downgrade.
5. Another lock attempts to convert to exclusive write. It has to be placed
   on the convert queue.

DSI408: Real Application Clusters Internals I-168

Changes on a Resource with Deadlock

[Diagram: four snapshots of the resource's queues.
1. Grant: A:CR, B:CR, C:CW
2. Grant: B:CR, C:CW; Convert: A:CR -> EX
3. Grant: C:CW; Convert: A:CR -> EX, B:CR -> PR
4. Grant: C:NL; Convert: A:CR -> EX, B:CR -> PR
Legend: NL = Null, CR = Concurrent Read, PR = Protected Read,
CW = Concurrent Write, EX = Exclusive Write]

Changes on a Resource with Deadlock

The convert queue is a first in, first out (FIFO) queue. This may lead to
deadlock situations.
1. Two shareable read locks and a concurrent write lock are granted.
2. Lock A attempts an upgrade to exclusive write (no other access allowed).
   This mode is incompatible with the modes of B and C, so the lock is
   placed on the convert queue. A note is kept of the old mode, in case the
   conversion is canceled.
3. Lock B attempts an upgrade to protected read mode (no other writers
   allowed). This mode is incompatible with C's mode, so B is also placed
   on the convert queue.
4. Lock C downgrades to NULL (no restrictions on other access). Lock A
   cannot complete its conversion because, even though exclusive write is
   compatible with NULL, it is not compatible with lock B's old shared read
   mode. Lock B could complete its conversion, but it is queued behind
   lock A.
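The head-of-queue rule that produces this deadlock can be sketched as
follows. The struct and function names are invented for illustration; the
matrix is the standard DLM compatibility matrix, used here only to
reproduce the blocking in the example above:

```c
#include <assert.h>

enum mode { NL, CR, CW, PR, PW, EX };

/* Standard DLM compatibility matrix. */
static const int compat[6][6] = {
    { 1, 1, 1, 1, 1, 1 },   /* NL */
    { 1, 1, 1, 1, 1, 0 },   /* CR */
    { 1, 1, 1, 0, 0, 0 },   /* CW */
    { 1, 1, 0, 1, 0, 0 },   /* PR */
    { 1, 1, 0, 0, 0, 0 },   /* PW */
    { 1, 0, 0, 0, 0, 0 },   /* EX */
};

struct cvt { enum mode held, requested; };

/* FIFO rule: only the head of the convert queue may complete, and only
 * if its requested mode is compatible with every granted mode and with
 * the held (old) modes of the converters queued behind it. */
int head_can_convert(const struct cvt *q, int qlen,
                     const enum mode *granted, int ng)
{
    if (qlen == 0)
        return 0;
    for (int i = 0; i < ng; i++)
        if (!compat[q[0].requested][granted[i]])
            return 0;
    for (int i = 1; i < qlen; i++)
        if (!compat[q[0].requested][q[i].held])
            return 0;
    return 1;
}
```

With the slide's final state (grant queue holds C:NL; convert queue holds
A:CR->EX, then B:CR->PR), A at the head is blocked by B's held CR mode,
even though B alone could have converted.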

DSI408: Real Application Clusters Internals I-169

DLM Functions

Distributed: The DLM exists in each instance of the cluster.
- Coordinates requests for, and access to, shared resources between
  different instances
- Keeps an inventory of all locks
- Grants locks and notifies processes when a resource becomes available
- Notifies owners of a lock when other processes request the lock
Fault tolerance: The DLM can survive n-1 node failures.
Deadlock detection: The DLM must be able to detect and report deadlock.

DLM Functions
Interprocess communication is critical to the DLM because it is distributed.
Being distributed permits the DLM to share the load of mastering
(administering) resources. As a result, you may lock a resource on one node
but actually have to communicate with the LMD processes on another node
entirely. Fault tolerance requires that no vital information about locked
resources is lost, irrespective of how many DLM instances fail.
The durability of the database (that is, being able to recover blocks that
are lost in an aborted instance's buffer cache) is not a DLM function, but
global cache handling of blocks still uses the same log-before-write rule
to ensure durability.

DSI408: Real Application Clusters Internals I-170

DLM Functionality in
Global Enqueue Service Daemon (LMD0)

- Performing periodic scanning for move-scan-convert operations
- Performing periodic scanning of the timer queue for locks with expired
  timers
- Performing deadlock detection
- Processing incoming messages for non-PCM locks
There is only one LMD0 in 9.2.

DLM Functionality in Global Enqueue Service Daemon (LMD0)

The DLM or GRD consists of the GES component and the GCS component.
The move-scan-convert operation is a periodic check of whether a lock that
is currently waiting on the convert queue is eligible for the grant queue.
LMD0's loop: kjmdm
If the lock db is frozen:
- Stop any deadlock detection: kjdddei
- Freeze and reset: kjfzfcl
The lock db is either in a frozen or a running state. In the frozen state,
it is not possible to get any locks from the DLM or to create any new
resources. The DLM is frozen during reconfiguration so that the node
failure can be recovered from.
If the lock db is open:
1. Check for converting locks: kjcvscn.
2. Deadlock detection: kjddits/kjddscn.
3. Clean up recovery domains: kjprsem.
4. Update stats: kjxstc.
5. Send flow control messages: kjctssb.

DSI408: Real Application Clusters Internals I-171

DLM Functionality in Global Enqueue Service Daemon (LMD0) (continued)

LMD0 is the core of the DLM. If it were not for the odd unpleasant failure
or reintroduction, it would probably do well without LMON. Nonetheless,
LMD0 handles all lock operations and creation of resources, the detection
of deadlocks, and the sending of messages to other LMD0s.
Statistics are updated only if _lm_statistics is TRUE. In Oracle8i,
statistics for the two views V$DLM_CONVERT_LOCAL and
V$DLM_CONVERT_REMOTE require that event 29700 is also set. You also need
to set timed_statistics to TRUE for timing information to be valid.
Note: The _lm_statistics parameter does not exist in Oracle9.2 or in
Oracle9.0.1. It does exist in Oracle8.1.5 and Oracle8.1.6.

DSI408: Real Application Clusters Internals I-172

DLM Functionality in
Global Enqueue Service Monitor (LMON)

- Publishing the workload of the node (active PQ users, active PQ sessions)
- Processing naming-service requests that are queued by the client
- Polling the Cluster Manager to manage reconfiguration:
  - Instance joining the group
  - Instance leaving the group, by shutdown or node death
- Performing Dynamic Remastering (only if explicitly enabled)
There is only one LMON in 9.2.

DLM Functionality in Global Enqueue Service Monitor (LMON)

Dynamic Remastering (DMR) is not enabled by default in Oracle9i Release 2.
It can be enabled by setting _kcl_local_file_time in version 9.2.0.
LMON's loop: kjfcln
- Listens for local messages: kjcswmg
- Responds to reconfiguration events: kjfcrfg
- Cleans out the GES cache: kjrchc
Reconfiguration is perhaps the most significant of LMON's
responsibilities. It is used during the recovery from a node failure (or
other shutdown of a DLM instance) and during the startup of new DLM
instances. kjfcrfg is the reconfiguration routine.
The DLM caches resource and lock structures and, as already explained, has
freelists on which resources are placed when they are no longer needed.
kjrchc cleans out the DLM cache of resources; it is a housekeeping
operation.

DSI408: Real Application Clusters Internals I-173

DLM Functionality in
Global Cache Service Process (LMS)

- Scanning PCM resources that have grantable converting locks
- Processing the down-convert queue
- Flushing messages, if messages are enqueued and have exceeded
  _side_channel_batch_timeout
- Processing remote messages for PCM locks
The number of LMS processes is fixed by _lm_lms.
The default value is max(#CPU/4, 2).

DLM Functionality in Global Cache Service Process (LMS)

The down-convert queue is handled in kclpbi.
(The number of LMS processes can be dynamic and adjusted by workload if
_lm_dynamic_lms is set to TRUE. But this is not functioning in 9.2, so the
parameter should be left FALSE.)
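The default LMS count formula from the slide, max(#CPU/4, 2), can be
expressed directly. Integer division is an assumption; the function name is
invented for illustration:

```c
#include <assert.h>

/* Sketch of the documented default for the number of LMS processes:
 * max(#CPU / 4, 2). */
int default_lms_count(int ncpus)
{
    int n = ncpus / 4;
    return n > 2 ? n : 2;
}
```

So a 4-CPU node still gets two LMS processes, while a 16-CPU node gets
four.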

DSI408: Real Application Clusters Internals I-174

DLM Functionality in
Other Processes

DIAG process:
- Provides low-overhead in-memory tracing and logging
- Manages and maintains diagnosability across multiple instances
- Helps execute ORADEBUG on all nodes of the RAC cluster
All processes:
- Process PING for BUFFER-CACHE
- Process the deferred queue and the CR log-flush queue
- Adjust the local SCN (Lamport) when receiving DLM messages

DLM Functionality in Other Processes

The PING handling for the buffer cache was done by LCK in previous
versions.
The CR log-flush queue is handled in kclpto.
PMON still does all forms of cleanup after unexpected process death,
including the release of locks and other DLM calls (see kjplhd/kjgxda).

DSI408: Real Application Clusters Internals I-175

Configuring GES Resources

Initial allocation is:
- 64 if cluster_database is not set
- _lm_ress if the parameter is defined
- 1.1 * ( localres + (number_of_instances - 1) *
  localres / number_of_instances ) otherwise
If exhausted, then more resources are allocated in the shared_pool.
ges_ress in V$RESOURCE_LIMIT shows the high water mark.

Configuring GES Resources

GES resources are the non-PCM resources.
The localres value is the sum of local resources, which is calculated by:
localres = processes + dlm_locks + transactions +
           enqueue_resources + db_files + 7 +
           parallel_max_servers * cluster_database_instances +
           parallel_max_servers + cluster_database_instances + 200
To view the usage:
SELECT * FROM V$RESOURCE_LIMIT
WHERE RESOURCE_NAME LIKE 'ges%';
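The sizing arithmetic above can be sketched as plain C. The parameter
names mirror the init.ora parameters in the formula; the functions are
illustrative only, following the notes verbatim:

```c
#include <assert.h>

/* localres: sum of local resource counts, per the formula in the notes.
 * Each argument stands for the init.ora parameter of the same name. */
long localres(long processes, long dlm_locks, long transactions,
              long enqueue_resources, long db_files,
              long parallel_max_servers, long cluster_database_instances)
{
    return processes + dlm_locks + transactions + enqueue_resources +
           db_files + 7 +
           parallel_max_servers * cluster_database_instances +
           parallel_max_servers + cluster_database_instances + 200;
}

/* Initial GES resource allocation when _lm_ress is not set:
 * 1.1 * ( localres + (n-1) * localres / n ). */
long initial_ges_ress(long lres, long n_instances)
{
    double v = lres + (double)(n_instances - 1) * lres / n_instances;
    return (long)(1.1 * v);
}
```

For a two-instance cluster with localres = 1000, the initial allocation
works out to 1.1 * 1500 = 1650 resources.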

DSI408: Real Application Clusters Internals I-176

Configuring GES Locks

Initial allocation is:
- 128 if cluster_database is not set
- _lm_locks if the parameter is defined
- (localres + _enqueue_locks) + (number_of_instances - 1) *
  (localres + _enqueue_locks) / number_of_instances otherwise
If exhausted, then more locks are allocated in the shared_pool.
ges_locks in V$RESOURCE_LIMIT shows the high water mark.

Configuring GES Locks


The localres value is the same as in the previous slide.

DSI408: Real Application Clusters Internals I-177

Configuring GCS Resources

Initial allocation is:
- _gcs_resources if defined
- 2 * _db_block_buffers if primary/secondary instances are configured
  (RAC Guard, failover)
- max(1.1 * _db_block_buffers, 2500) otherwise
If exhausted, then more resources are allocated from the shared_pool in
increments of 1024.
gcs_resource in V$RESOURCE_LIMIT shows the high water mark.

Configuring GCS Resources

GCS resources are the PCM resources.
Note: The parameter for the default value is based on _db_block_buffers
(leading underscore), not db_block_buffers.
To view the usage:
SELECT * FROM V$RESOURCE_LIMIT
WHERE RESOURCE_NAME LIKE 'gcs%';

DSI408: Real Application Clusters Internals I-178

Configuring GCS Locks

Initial allocation is:
- _pcm_shadow_locks if defined
- max(1.1 * _db_block_buffers, 2500) otherwise
If exhausted, then more locks are allocated in the shared_pool in
increments of 1024.
gcs_shadows in V$RESOURCE_LIMIT shows the high water mark.

DSI408: Real Application Clusters Internals I-179

Configuring DLM Processes

Initial allocation is:
- _lm_procs if set
- max( (64 + 256) + (number_of_instances - 1), processes ) otherwise
If exhausted, then more structures are allocated in the shared_pool.
ges_procs in V$RESOURCE_LIMIT shows the high water mark.

DSI408: Real Application Clusters Internals I-180

Logical to Physical Nodes Mapping

- hash_node_kjga maps logical to physical node.
- hash_node_kjga[0] always contains one live node.
- This array is updated in a three-step reconfiguration.

[Diagram: physical nodes N1 through N5, some dead and some live.
hash_node_kjga contains the live physical node numbers (here 2, 3, 5,
-, -), so each logical node number maps to a live physical node.]

DSI408: Real Application Clusters Internals I-181

Buckets to Logical Nodes Mapping

[Diagram: the hash value of a resource name (Resource N, Resource M)
selects a bucket in res_hashed_val_kjga or pcm_hv_kjga. Initially every
bucket contains 0, so all buckets map through hash_node_kjga[0] to the
first instance (N1); N2 through N5 have no buckets yet.]

Buckets to Logical Nodes Mapping

Initially, the res_hashed_val_kjga and pcm_hv_kjga entries all point to
the first hash_node_kjga element, which must be the first instance to
start up.
The number of buckets is set by _lm_res_part; the default value is 1289.
Each element of res_hashed_val_kjga and pcm_hv_kjga is a bucket.
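The two-level lookup described above can be sketched as follows. The array
names mirror the kernel structures; their contents, and the lookup
function itself, are illustrative assumptions:

```c
#include <assert.h>

#define NBUCKETS 1289   /* default _lm_res_part */

/* A resource name hashes to a bucket; the bucket maps to a logical
 * node; hash_node_kjga maps the logical node to a live physical node. */
int physical_master(unsigned hash_value,
                    const int *bucket_to_logical, /* res_hashed_val_kjga
                                                     or pcm_hv_kjga */
                    const int *hash_node)         /* hash_node_kjga */
{
    int logical = bucket_to_logical[hash_value % NBUCKETS];
    return hash_node[logical];
}
```

The indirection is the point: reconfiguration only rewrites the small
bucket and node arrays, never the resource names themselves.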

DSI408: Real Application Clusters Internals I-182

Mapping for a New Node Joining the Cluster

[Diagram: with a second instance alive, hash_node_kjga now contains
2, 4, -, -, -, and the buckets of res_hashed_val_kjga and pcm_hv_kjga
are split between logical nodes 0 and 1 instead of all pointing to 0.]

Mapping for a New Node Joining the Cluster

When a second instance joins the cluster, hash_node_kjga reflects this
state. Then res_hashed_val_kjga and pcm_hv_kjga are updated. Each instance
publishes its weight, which is _gcs_resources if defined (otherwise, it is
_db_block_buffers). In the LMON trace file, you can see:
kjfcpiora: publish my weight 6331
...
res_master_weight for node 0 is 6331
res_master_weight for node 1 is 6331
Total master weight = 12662

DSI408: Real Application Clusters Internals I-183

Mapping for a New Node Joining the Cluster (continued)

The new instances joining the cluster compute the redistribution
differently from the old instances. In the example, node 4 computes values
for res_hashed_val_kjga and pcm_hv_kjga as follows:
- total_weight = sum of the weights of every alive node = 12662
- For each alive node in the cluster, avgpart = (weight_of_node /
  total_weight) * buckets = buckets / 2
- For each node in hash_node_kjga:
  - For i in 0 to (buckets - 1), if bucket i is not yet attributed and the
    current node does not have more than avgpart buckets, then attribute
    the bucket to the current node by setting pcm_hv_kjga[i] to the
    current node and marking the bucket as need_remastering.
  - For i in 0 to (buckets - 1), attribute bucket i to the nodes in a
    round-robin manner and mark the bucket as need_remastering.

DSI408: Real Application Clusters Internals I-184

Remapping When a Node Joins

[Diagram, step 1: buckets in res_hashed_val_kjga and pcm_hv_kjga that
must move are marked U (UNKNOWN) before being reassigned through
hash_node_kjga.]

Mapping When a Node Joins, on an Old Node

Old nodes update the arrays as follows:
- Compute avgpart = buckets / number_of_alive_nodes.
- Take buckets away from dead instances or from instances having more than
  avgpart buckets: for i in 0 to (buckets - 1), pnode =
  res_hashed_val_kjga[i]. If pnode is dead (shut down) or if pnode has
  more than avgpart buckets, then set res_hashed_val_kjga[i] to UNKNOWN.
- Attribute the buckets flagged UNKNOWN to under-allocated nodes: for i in
  0 to (buckets - 1), if the bucket has the UNKNOWN flag, then for k in 0
  to (number of alive nodes - 1), pnode = hash_node_kjga[k]. If pnode has
  fewer than avgpart buckets, then set res_hashed_val_kjga[i] = pnode.
- Apply the same calculation to update pcm_hv_kjga, but avgpart for each
  node is computed as weight(node) / sum_weight(every node).
Non-PCM resources are evenly distributed to every alive node, and PCM
resources are distributed based on the weight (or _db_block_buffers) of
the node.
For more details, refer to kjshashcfg.
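The old-node remapping pass can be sketched as follows. This assumes
uniform weights (so avgpart = buckets / live nodes) and at most 16 nodes;
all names are invented, and the remainder handling is a simplification of
what kjshashcfg actually does:

```c
#include <assert.h>

#define UNKNOWN (-1)

/* Buckets owned by a dead node, or by a node over its fair share
 * (avgpart), are marked UNKNOWN and then handed to under-allocated
 * live nodes.  alive[n] != 0 means node n is up; nnodes <= 16. */
void remap_buckets(int *bucket_owner, int nbuckets,
                   const int *alive, int nnodes)
{
    int nalive = 0, count[16] = { 0 };
    for (int n = 0; n < nnodes; n++)
        if (alive[n]) nalive++;
    int avgpart = nbuckets / nalive;

    /* Pass 1: free buckets of dead or over-allocated owners. */
    for (int i = 0; i < nbuckets; i++) {
        int owner = bucket_owner[i];
        if (owner == UNKNOWN || !alive[owner] || count[owner] >= avgpart)
            bucket_owner[i] = UNKNOWN;
        else
            count[owner]++;
    }
    /* Pass 2: give UNKNOWN buckets to nodes still under avgpart. */
    for (int i = 0; i < nbuckets; i++) {
        if (bucket_owner[i] != UNKNOWN) continue;
        for (int n = 0; n < nnodes; n++)
            if (alive[n] && count[n] < avgpart) {
                bucket_owner[i] = n;
                count[n]++;
                break;
            }
    }
    /* Remainder (nbuckets not divisible): round-robin over live nodes. */
    int rr = 0;
    for (int i = 0; i < nbuckets; i++) {
        if (bucket_owner[i] != UNKNOWN) continue;
        while (!alive[rr % nnodes]) rr++;
        bucket_owner[i] = rr % nnodes;
        rr++;
    }
}
```

Running this with all four buckets on node 0 and two live nodes splits
them evenly; with node 0 dead, everything migrates to node 1.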

DSI408: Real Application Clusters Internals I-185

Mapping Broadcast by Master Node

[Diagram: five nodes.
1: Each node sends hash_node_kjga[0].
2: The master node is determined (here, node 2).
3: The master node ID is sent to every node.
4: The master sends the hash tables to the other nodes.]

Mapping Broadcast by Master Node

The complete mapping table is broadcast to all members:
1. Send hash_node_kjga[0], indicating whether the current node is new or
   old.
2. After receiving every message, determine the lowest surviving node,
   which is elected as the master node (in this example, node 2).
3. Inform everyone which node is the master node.
4. The master node broadcasts pcm_hv_kjga and res_hashed_val_kjga to the
   other nodes in the cluster.
The broadcast is done in step 5 of reconfiguration, and only if the number
of alive nodes in the cluster is at least two.

DSI408: Real Application Clusters Internals I-186

Master Node Determination for GES

- If there is only one node in the cluster, then it is the master node.
- For RT or IR resources, the master node is hash_node_kjga[0].
- Otherwise, let key = sum of the resource name (three integers):
  - For TX enqueues with _lm_tx_delta > 0:
    master node = hash_node_kjga[ (key % 1289) % number_of_live_nodes ]
  - Otherwise:
    master node = res_hashed_val_kjga[ key % length(res_hashed_val_kjga) ]

Master Node Determination for GES


RT is the redo thread global enqueue, IR is the instance recovery serialization global
enqueue, and TX is the transaction enqueue.
The default value of _lm_tx_delta is 16.
The length refers to the number of elements.

DSI408: Real Application Clusters Internals I-187

Master Node Determination for GCS

- If there is only one node in the cluster, then it is the master node.
- Otherwise, let key = sum of the resource name (two integers), and
  master node = pcm_hv_kjga[ key % length(pcm_hv_kjga) ].

Master Node Determination for GCS


The algorithm is slightly different if dynamic resource remastering is active. It is not
active in 9.2.
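The selection rules on the last two pages can be sketched together. The
bucket arrays are stand-ins for res_hashed_val_kjga, pcm_hv_kjga, and
hash_node_kjga; the "key" is the sum of the resource name components, as
described above. Function names and signatures are illustrative:

```c
#include <assert.h>

/* GES master: single node wins outright; TX enqueues (with
 * _lm_tx_delta > 0) hash directly onto live nodes; everything else
 * goes through the res_hashed_val_kjga buckets. */
int ges_master(unsigned key, int is_tx, int lm_tx_delta,
               const int *hash_node, int n_live,
               const int *res_hv, int res_hv_len)
{
    if (n_live == 1)
        return hash_node[0];
    if (is_tx && lm_tx_delta > 0)
        return hash_node[(key % 1289) % n_live];
    return res_hv[key % res_hv_len];
}

/* GCS master: single node wins; otherwise go through pcm_hv_kjga. */
int gcs_master(unsigned key, const int *pcm_hv, int pcm_hv_len,
               const int *hash_node, int n_live)
{
    if (n_live == 1)
        return hash_node[0];
    return pcm_hv[key % pcm_hv_len];
}
```

The TX special case bypasses the bucket tables entirely, which keeps
transaction enqueues evenly spread regardless of bucket weighting.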

DSI408: Real Application Clusters Internals I-188

Dump and Trace of Remastering

- Query X$KJDRHV to see res_hashed_val_kjga.
- Query X$KJDRPCMHV to see pcm_hv_kjga.
- Event 29731, level 14, traces LMON remastering progress.

Dump and Trace of Remastering

Partial DESCRIBE of X$KJDRHV
Name            Type    Description
--------------- ------  ----------------------------------------------
KJDRHVID        NUMBER  bucket ID (from 1 to N)
KJDRHVCMAS      NUMBER  master node that this bucket is attributed to
KJDRHVPMAS      NUMBER  previous master (before reconfiguration)
KJDRHVRMCNT     NUMBER  number of reconfigurations

Partial DESCRIBE of X$KJDRPCMHV
Name            Type    Description
--------------- ------  ----------------------------------------------
KJDRPCMHVID     NUMBER  bucket ID (from 1 to N)
KJDRPCMHVCMAS   NUMBER  master node that this bucket is attributed to
KJDRPCMHVPMAS   NUMBER  previous master (before reconfiguration)
KJDRPCMHVRMCNT  NUMBER  number of reconfigurations
DSI408: Real Application Clusters Internals I-189

DLM Functions

The main DLM client APIs are:
- kjual: Connect to the DLM
- kjpsod: Disconnect from the DLM
- kjusuc: Synchronously open and convert a lock
- kjuscv: Synchronously convert a lock
- kjuscl: Synchronously close a lock
- kjuuc: Asynchronously open and convert a lock
- kjucv: Asynchronously convert a lock

DLM Functions
kjual is called when the Oracle shadow process is started.
kjpsod is called before the Oracle shadow process exits.
The other functions are used to manage only non-PCM resources and locks.

DSI408: Real Application Clusters Internals I-190

kjual: Connection to DLM

Every DLM client (local or remote process) is identified by a kjp
structure, which holds:
- OS process PID
- Process node number
- Process flags (such as DEAD, RMOT, LOCL)
- List of process-created DLM locks
- Queue of pending ASTs for the process
- Various statistics on lock conversion activity
For a local process, the structure is allocated by kjual at process start.
For a remote process, the structure is allocated by LMD when a lock
creation request comes from a remote instance.

kjual: Connection to DLM

Interesting members of the kjp structure are:
ub4 flg_kjp;                 /* process flag                               */
#define KJP_DEAD      0x0001 /* process is dead, pending cleaned up        */
#define KJP_LMON      0x0002 /* process is the DLM-MON                     */
#define KJP_DLMD      0x0004 /* process is DLMD                            */
#define KJP_RMOT      0x0008 /* remote process                             */
#define KJP_LOCL      0x0010 /* local process                              */
#define KJP_IOPENDING 0x0020 /* has i/o pending, don't remove              */
#define KJP_IID       0x0040 /* 'Important' process: death => inst termn   */
#define KJP_DLMS      0x0080 /* process is LMS                             */
#define KJP_DIAG      0x0100 /* process is DIAG                            */
#define KJP_RMRDR     0x0200 /* p. is reading a PT/HV struct, critical sec */

DSI408: Real Application Clusters Internals I-191

kjual: Connection to DLM (continued)

kjsolk  lock_q_kjp;    /* list of locks created by this process */
kjsolk  ast_q_kjp;     /* ast queue                             */
skgpid  pid_kjp;       /* OS pid of process                     */
kjftnid node_kjp;      /* ID of the node the process belongs to */
word    orapnum_kjp;   /* oracle process number                 */
ksupr   *oraproc_kjp;  /* oracle process structure address      */
ub4     loc_lck_cvt_tm_kjp[KJST_CONVTYPE];
                       /* cumulative time of local converts     */
ub4     loc_lck_cvt_ct_kjp[KJST_CONVTYPE];
                       /* cumulative number of local converts   */
ub4     rem_lck_cvt_tm_kjp[KJST_CONVTYPE];
                       /* cumulative time of remote converts    */
ub4     rem_lck_cvt_ct_kjp[KJST_CONVTYPE];
                       /* cumulative number of remote converts  */

DSI408: Real Application Clusters Internals I-192

kjual Flow

[Diagram: client process P1 (Pid-1) connects to the DLM.
1: Allocate and initialize the process descriptor in the Procs table.
2: Update ges_procs in V$RESOURCE_LIMIT.]

DSI408: Real Application Clusters Internals I-193

kjpsod Flow

[Diagram: client process P1 (Pid-1) disconnects from the DLM.
1: Flag the procp structure KJP_DEAD.
2: Clear pending ASTs and put structures back on the freelist.
3: Update ges_procs in V$RESOURCE_LIMIT.]

DSI408: Real Application Clusters Internals I-194

DML Enqueue Handling Flow: Example

In this example, three processes on two nodes work on the EMPLOYEE table:
1. P1 locks the table in share mode.
2. P2 locks the table in share mode.
3. P2 does a rollback.
4. P1 locks the table in exclusive mode.
5. P3 locks the table in share mode.
6. P1 does a rollback.
P1 and P2 are on node 1; P3 is on node 2.
The enqueue for EMPLOYEE is mastered on node 2.

DML Enqueue Handling Flow: Example


The steps in the slide are covered twice in the following slides, focusing first on the lock
states and then on the code references.

DSI408: Real Application Clusters Internals I-195

Step 1: P1 Locks Table in Share Mode

Instance 1
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  0            1          1           KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1 PID   OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ----- ------------ ----------
KJUSERPR  KJUSERPR 65549  2      16190 1            0

Instance 2
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  0            1          1           KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1 PID   OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ----- ------------ ----------
KJUSERPR  KJUSERPR 0      0      13354 0            0

Step 1: P1 Locks Table in Share Mode


The EMPLOYEE table in this example has object ID 0x6dfd, thus the enqueue is
[TM][0x6dfd][0]. The columns RESOURCE_NAME, ON_CONVERT_Q, ON_GRANT_Q,
MASTER_NODE, and NEXT_CVT_LEVEL are from V$DLM_RESS, and the columns
GRANT_LEVEL, REQUEST_LEVEL, TRANSACTION_ID0, TRANSACTION_ID1,
PID, OPEN_OPT_DEADLOCK, and OWNER_NODE are from V$DLM_ALL_LOCKS.
Column names may be abbreviated in the slides.

DSI408: Real Application Clusters Internals I-196

Step 2: P2 Locks Table in Share Mode

Instance 1
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  0            1          1           KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1 PID   OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ----- ------------ ----------
KJUSERPR  KJUSERPR 65551  2      16287 1            0
KJUSERPR  KJUSERPR 65549  2      16190 1            0

Instance 2
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  0            1          1           KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1 PID   OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ----- ------------ ----------
KJUSERPR  KJUSERPR 0      0      13354 0            0

Step 2: P2 Locks Table in Share Mode


There are no changes for instance 2 locks.

DSI408: Real Application Clusters Internals I-197

Step 3: P2 Does Rollback

Instance 1
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  0            1          1           KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1 PID   OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ----- ------------ ----------
KJUSERPR  KJUSERPR 65549  2      16190 1            0

Instance 2
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  0            1          1           KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1 PID   OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ----- ------------ ----------
KJUSERPR  KJUSERPR 0      0      13354 0            0

Step 3: P2 Does Rollback


There are no changes for instance 2. You are effectively in the same state as at step 1.

DSI408: Real Application Clusters Internals I-198

Step 4: P1 Locks Table in Exclusive Mode

Instance 1
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  0            1          1           KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1 PID   OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ----- ------------ ----------
KJUSEREX  KJUSEREX 65549  2      16190 1            0

Instance 2
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  0            1          1           KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1 PID   OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ----- ------------ ----------
KJUSEREX  KJUSEREX 0      0      13354 0            0

Step 4: P1 Locks Table in Exclusive Mode


This causes changes for both instances.

DSI408: Real Application Clusters Internals I-199

Step 5: P3 Locks Table in Share Mode

Instance 1
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  0            1          1           KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1 PID   OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ----- ------------ ----------
KJUSERPR  KJUSERPR 65549  2      16190 1            0

Instance 2
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  1            1          1           KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1 PID   OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ----- ------------ ----------
KJUSEREX  KJUSEREX 0      0      13354 0            0
KJUSERNL  KJUSERPR 131085 2      16199 1            1

Step 5: P3 Locks Table in Share Mode


One lock is in the convert queue (REQUEST_LEVEL is KJUSEREX and
GRANT_LEVEL is KJUSERNL) on instance 2. There is no change in instance 1.

DSI408: Real Application Clusters Internals I-200

Step 6: P1 Does Rollback

Instance 1
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  0            1          1           KJUSERNL

Instance 2
RESOURCE_NAME       ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------- ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]  1            1          1           KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1 PID   OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ----- ------------ ----------
KJUSERNL  KJUSERNL 0      0      13354 0            0
KJUSERPR  KJUSERPR 131085 2      16199 1            1

Step 6: P1 Does Rollback


Instance 1 now has no rows in V$DLM_ALL_LOCKS. V$DLM_ALL_LOCKS is updated
on instance 2, but the lock is not removed. GRANT_LEVEL and REQUEST_LEVEL are
both set to KJUSERNL.

DSI408: Real Application Clusters Internals I-201

Steps 1 and 2: Code Flow

ktaiam   Kernel Transaction Access, Internal Allocate DML lock
ksqgtl   Get an enqueue
ksqcmi   General Get/Convert function (Change Mode Internal)
ksipget  Get a group lock
kjusuc   Synchronous upconvert

Steps 1 and 2: Code Flow

The next few slides show the same steps in greater detail. For each step,
there is an overview of the active code stack, followed by the
corresponding flow detail.
In step 1, P1 locks the table in share mode. In step 2, P2 locks the same
table in share mode. The difference shows up in the kjusuc processing.
ksqgtl: Get an enqueue, type = TM, id1 = table_object_id, id2 = 0,
timeout = infinite. Allocate the enqueue lock and hang the lock on the
appropriate resource before calling ksqcmi.
ksqcmi: General get/convert function. Register the wait event for the
specific enqueue. Compute the XID for the DLM. Set up the option for the
lock get (DEADLOCK detection required). Only the wait event "enqueue" is
registered here, in kjiwev; when kjusuc waits for the AST, the wait event
registered in kjiwev is used.
ksipget: Synchronous interface to the DLM for lock GET. Set up the DLM
resource name. Set the timeout to infinite. Increment "global lock sync
gets". On return from kjusuc, increment "global lock get time".
DSI408: Real Application Clusters Internals I-202

Step 1: kjusuc Flow Detail

[Diagram: P1 in instance 1.
1: Allocate Lock1.
2: Set the lock state to KJL_OPENING.
3: Allocate Res.1; Lock1 is linked to Proc.1 via lock_q_kjp/procp_kjl and
to Res.1 via resp_kjl.
4: Compute the master node.]

Step 1: kjusuc Flow Detail


1: Allocate lock1 and update V$RESOURCE_LIMIT.
2: Set lock state to KJL_OPENING.
3: Allocate resource1 and update V$RESOURCE_LIMIT.
4: Compute Master-node. This uses the algorithm explained earlier.
Because this is the first time the instance has shown interest in this resource, it has to
send a message to the master node.

DSI408: Real Application Clusters Internals I-203

Step 1: kjusuc Flow Detail

[Diagram: instance 1 and instance 2 (the master).
On instance 1: 5: Send KJX_OPEN_CONVERT_DIR_REQ to the master instance.
6: Put Lock1 on the convert queue. 7: Loop waiting for the AST.
On instance 2: 1: Allocate Proc.1 and Lock1. 2: Allocate Res.1.
3: Send KJX_CONV_AST_IND back to the requester.
Back on instance 1: 8: Lock granted. 9: AST to P1. 10: Complete.]

Step 1: kjusuc Flow Detail (continued)

5: Send a message to the master (directory) instance.
Now two activities occur in parallel in the two instances (instance number
in parentheses).
(1) 6: Put the lock on the convert queue, and hang it on the deadlock
queue. This lock type (TM) has an infinite timeout, so it is not attached
to the timer queue, but it is put on the deadlock queue because it can
become part of a deadlock.
(1) 7: Loop waiting for the AST by testing a flag, event = enqueue.
(2) 1: Allocate the process 1 descriptor and lock 1. Lock 1 is in the same
mode.
(2) 2: Because the resource has never been used in instance 2, it must be
created, and then it is linked to lock 1. Because this is the first time
that resource 1 is used in instance 2, the open convert request will be
successful.
(2) 3: Queue and send a message to the requester instance.
Instance 1 has been waiting to continue; the remaining steps happen in
instance 1.
8: Put the lock on the grant queue and remove it from the deadlock queue.
9: Send the AST to the client process by setting its flag.
10: Process the AST, clear KJL_OPENING, and exit.
DSI408: Real Application Clusters Internals I-204

Step 2: kjusuc Flow Detail

[Diagram: P2 in instance 1; Proc.1/Lock1 and Res.1 already exist.
1: Allocate Lock2.
2: Set the lock state to KJL_OPENING, KJL_CONVERTING.
3: Hang Lock2 on the existing Res.1.
4: Complete.]

Step 2: kjusuc Flow Detail

The resource already exists, so processing is simpler. This lock can be
granted immediately, because of the following:
- There is no incompatible lock locally.
- The requested mode is S, which is the same as the held mode. Granting
  another S mode lock does not increase the lock mode.
- There is no need to send a message to the master instance.
1. Allocate lock 2 and update V$RESOURCE_LIMIT.
2. Set the lock state to KJL_OPENING, KJL_CONVERTING.
3. Hang the lock on the existing resource 1.
4. Process the AST; clear KJL_OPENING, KJL_CONVERTING, and exit.

DSI408: Real Application Clusters Internals I-205

Step 3: Code Flow

ktaidm   Kernel Transaction Access, Internal Delete DML lock
ksqrcl   Release an enqueue
ksqcmi   General Get/Convert function (Change Mode Internal)
ksiprls  Release a group lock
kjuscl   Synchronous close

Step 3: Code Flow

In step 3, P2 releases the table share mode lock by doing a rollback.
ksiprls is the synchronous interface to the DLM for CLOSE lock. On return
from kjuscl, increment the "global lock releases" statistic.

DSI408: Real Application Clusters Internals I-206

Step 3: kjuscl Flow Detail

[Diagram: P2 in instance 1.
1: Set Lock2's state to KJL_CLOSING.
2: Remove Lock2 and Proc.2 from Res.1 (Proc.1/Lock1 remain).
3: Free Lock2.
4: Complete.]

Step 3: kjuscl Flow Detail

Because lock 1 is still attached to resource 1, resource 1 cannot be freed.
1. Set the lock state to KJL_CLOSING.
2. Remove lock 2 and process 2 from resource 1.
3. Free lock 2. Update V$RESOURCE_LIMIT.
4. Exit. Because removing lock 2 changes neither the held mode of
   resource 1 nor its request mode, no message is sent to the master node.

Step 4: Code Flow

ktagetg0 (Kernel Transaction Access, Get Generic DML lock)
  -> ksqcnv (Convert an enqueue)
  -> ksqcmi (General Get/Convert function, Change Mode Internal)
  -> ksipcon (Convert a group lock)
  -> kjuscv (Synchronous Convert)

Step 4: Code Flow


In step 4, P1 upgrades the table share mode lock to an exclusive lock.
ksqcnv is given the lock description obtained previously with kjusuc.
ksipcon is the synchronous interface to the DLM for lock conversion. It calls kjuscv with timeout = infinite, increments the global lock sync converts statistic, and updates the global lock convert time statistic.


Step 4: kjuscv Flow Detail

[Diagram: instance 1. P1 with Lock1 on Res.1: 1:Set KJL_CONVERTING, 2:Re-queue (from grant queue to convert queue), 3:Deadlock queue.]

Step 4: kjuscv Flow Detail


Resource 1 and lock 1 are allocated and linked.
Because satisfying this conversion would raise the resource's held mode from S to X, and instance 1 is not the master instance, a message must be sent to the master instance to see whether the conversion is possible.
1. Set the lock state to KJL_CONVERTING.
2. Move the lock from the grant queue to the convert queue of resource 1. Lock 1 is not hung on the timer queue because kjuscv is called with timeout = infinite.
3. Hang lock 1 on the deadlock queue, because lock 1 is deadlockable.
Note that the lock is hung on the timer queue and the deadlock queue only if the lock is local; in other words, the owning instance of the lock is the same as the local instance.


Step 4: kjuscv Flow Detail


[Diagram: instance 1 (P1, Lock1 on Res.1, LMD0) and instance 2 (LMD0, master copy of Res.1 and Lock1). Instance 1: 4:Send KJX_CONVERT_REQ, 5:Loop waiting; instance 2: 1:Convert, 2:Send KJX_CONV_AST_IND; instance 1: 6:Granted, 7:AST, 8:Complete.]

Step 4: kjuscv Flow Detail (continued)


4. Send message to the master (directory) instance.
5. Loop waiting for AST, by testing a flag event = enqueue.
Instance 2
1. Convert lock 1 from S to X immediately, because you are in the master instance
and there is no conflict.
2. Queue and send a message to the requester instance.
Instance 1 has been waiting to continue.
6. Put lock in grant queue and remove it from the deadlock queue.
7. Send AST to client process by setting its flag.
8. Process the AST; clear KJL_CONVERTING and exit.


Step 5: kjusuc Flow Detail


[Diagram: instance 1 and instance 2. P3: 1:Allocate Lock3 and Proc.3, 2:Set KJL_OPENING and KJL_CONVERTING, 3:Queue on the convert queue of Res.1, 4:Queue on the deadlock queue; 5:LMD0 sends KJX_CONV_AST_IND to instance 2.]

Step 5: kjusuc Flow Detail


In step 5, P3 requests a lock on the table in share mode. The code path is the same as for
steps 1 and 2, with processing in kjusuc.
1. Allocate lock 3 and process 3, update V$DLM_RESOURCE_LIMIT.
2. Set the state of lock 3 to KJL_OPENING, KJL_CONVERTING.
3. Put lock 3 in the convert queue for resource 1. Lock 3 is in conflict with lock 1 so
it cannot be granted immediately.
4. Put lock 3 in the deadlock queue.
5. Send a message to see if something has changed in the blocker instance. One
message is sent for every lock on the grant queue of resource 1 and in conflict with
lock 3. One message is also sent for every lock in the convert queue with a
previous mode conflicting with lock 3.


Step 6: kjuscl Flow Detail


[Diagram: instance 1 (P1, Lock1 on Res.1, LMD0) and instance 2 (master copy of Res.1 with Proc.1/Lock1 and Proc.3/Lock3). 1:Set KJL_CLOSING, 2:Change mode, 3:Send KJX_CONVERT_REQ, 4:Release Res.1, 5:Free Lock1, 6:Complete.]

Step 6: kjuscl Flow Detail


In step 6, P1 releases its exclusive table lock by doing a rollback in instance 1.
1. Set lock state to KJL_CLOSING.
2. Convert the lock from X mode to NULL mode.
3. Because converting from X to NULL lowers the held mode of resource 1, a
KJX_CONVERT_REQ message must be sent to the master instance.
4. Release resource 1 because there are no longer any locks on it, and update
V$RESOURCE_LIMIT.
5. Free lock 1 and update V$RESOURCE_LIMIT.
6. Exit.


Step 6: kjuscl Flow Detail


[Diagram: instance 2. On the master copy of Res.1 (Proc.1/Lock1, Proc.3/Lock3): 1:Convert Lock1, 2:Grant Lock3, 3:AST to P3, 4:Complete.]

Step 6: kjuscl Flow Detail (continued)


On receiving the KJX_CONVERT_REQ message from instance 1:
1. Lock 1 is converted from X to NULL.
2. Attempt to grant all locks on the convert queue for resource 1, because lock 1 has
been downgraded to NULL. This therefore grants lock 3.
3. An AST is sent to P3, which is still waiting from step 5.
4. P3 processes the AST, completes its lock acquisition, and exits the DLM, letting
the transaction continue.


Code References

kj*.*: Kernel Lock manager


kcl.*: Kernel Cache Lock background process


Summary

In this lesson, you should have learned about the:


Lock manager architecture
Main functional flow of global locks


References and Further Reading

Oracle8.0 DLM Under the Covers and Beside the Point, by Daniel Semler (1998)

References and Further Reading


Daniel Semler's paper is available under WEBIV reference note 72568.1.


Cache Coherency (Part One)

Enqueues/Non-PCM


Objectives

After completing this lesson, you should be able to do the following:
Describe enqueue types
Follow the locking and deadlock detection algorithms


Cache Coherency: Enqueues


[Diagram: a node running an instance with its caches; the ksi/ksq layers sit above the GRD (GES), CGS, NM, and CM components, which communicate with other nodes over IPC.]
There are over 70 types of enqueues, such as:
CF: Control Files
CI: Cross Instance Call
DM: Mount Lock
LB: Library Cache Lock
IR: Instance Recovery

RAC and Global Resources


The GRD consists of Global Cache Services (GCS), which handles the data blocks, and
Global Enqueue Service (GES), which handles enqueues and other global resources.
The enqueues (representing such things as transactions) have to be kept coherent across
instances.
The global resources covered by GES are the row cache (dictionary cache) and the library
cache.


Alphabetical List of Enqueues


The boldfaced items in the following list are not documented in Oracle9i Real Application
Clusters Deployment and Performance, Appendix A. Most of these items are also listed in
the Database Reference manual under V$LOCK.
AK: DLM Deadlock Detection
BR: Backup Recovery
CF: Controlfile Transaction
CI: Cross-Instance Call Invocation
CU: Bind Enqueue
DF: Datafile
DL: Direct Loader Index Creation
DM: Database Mount
DR: Distributed Recovery
DV: PL/SQL Diana Version
DX: Distributed TX
FS: File Set
HW: Space Management on Specific Segment
IN: Instance Number
IR: Instance Recovery
IS: Instance State
IV: Library Cache Invalidation
JQ: Job Queue
KK: Redo Log Kick
KM: Resource Manager Load
L[A-P]: Library Cache Lock
MM: Mount Definition
MR: Media Recovery
N[A-Z]: Library Cache Pin
OC: Outline Management
OL: Outline Management
PF: Password File
PI: Parallel Slaves
PR: Process Startup
PS: Parallel Slave Synchronization
Q[A-Z]: Row Cache
RT: Redo Thread


Alphabetical List of Enqueues (continued)


SC: System Commit Number
SM: SMON
SN: Sequence Number
SQ: Sequence Number Enqueue
SR: Synchronized Replication
SS: Sort Segment
ST: Space Management Transaction
SV: Sequence Number Value
SW: Resume/Suspend Change
TA: Transaction Recovery / Generic Transaction Enqueue
TM: DML Enqueue
TS: Temporary Segment (also Tablespace)
TT: Temporary Table
TX: Transaction
UL: User-Defined Locks
UN: User Name
US: Undo Segment, Serialization
WL: Being-Written Redo Log
XA: Instance Attribute Lock
XR: CKPT Direct Block Loader
XI: Instance Registration Lock
The list is not complete. Look for ksqget calls in the source code to get more
information.


Enqueue Types

Enqueues are broadly divided into:
Instance: instance mount and recovery; manage SCN
Transaction: locking tables and rows
Library cache, such as cursors
Dictionary cache
Parallel Query
User mode
Most enqueues are used in single and shared instances; a few are relevant to shared instances only.

Enqueue Types
Refer to WebIV Note 1020008.6 for a lock decoding script. The standard supplied
CATBLOCK script creates the view DBA_LOCK and DBA_LOCK_INTERNAL. These DBA
views do not expand the RAC-only enqueues.
User mode enqueues are created and used by applications; they are simple named
resources with no relation to server data structures.


Enqueue Structure

V$LOCK examines which locks are queued on the resources.
[Diagram: resource structure ksqrs, e.g. <TM,432,0>, with three lists hanging off it: Owners, Waiters, and Converters; lock structures ksqlk show the modes, for example S -> X and SX.]

Enqueue Structure
When access is required by a session, a lock structure ksqlk is obtained and a request is
made to gain access to the resource at a specific level (mode). The lock structure is placed
on one of the three linked lists (called the owner, waiter, and converter lists) that hang off
of the resource.


Examining Enqueues

V$LOCK: Locks held
V$ENQUEUE_STAT: Enqueue statistics by type

Examining Enqueues
In V$LOCK, the mode held (LMODE) and request (REQUEST) columns determine whether the
enqueue is an owner, waiter, or converter:

Held     Request  Enqueue is
Nonzero  Zero     Owner
Nonzero  Nonzero  Converter
Zero     Nonzero  Waiter

For V$ENQUEUE_STAT, the average time waited in milliseconds is
CUM_WAIT_TIME / TOTAL_WAIT#.
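The classification rules and the average-wait formula can be sketched as follows (an illustration; the function names are invented, the column semantics follow the text above):

```python
def enqueue_role(lmode, request):
    """Classify a V$LOCK row as owner, converter, or waiter.

    lmode   -- mode held (0 means nothing held)
    request -- mode requested (0 means nothing requested)
    """
    if lmode != 0 and request == 0:
        return "owner"
    if lmode != 0 and request != 0:
        return "converter"
    if lmode == 0 and request != 0:
        return "waiter"
    return "idle"  # neither held nor requested; not expected in V$LOCK

def avg_wait_ms(cum_wait_time, total_wait):
    # V$ENQUEUE_STAT: average time waited in milliseconds
    return cum_wait_time / total_wait if total_wait else 0.0
```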


Enqueues and DLM


Enqueues are requested by clients in the ksq layer. If it
must be a global enqueue, then a similarly named DLM
lock is requested in the kj layer.

Get:     ksqget -> ksipget -> kjusuc
Convert: ksqcnv -> ksipcon -> kjuscv
Release: ksqrcl -> ksiprls -> kjuscl

Local enqueues complete their processing in ksq (through ksqcmi); global enqueues continue through the ksi layer into the kju layer and the DLM.

Enqueues and DLM


Local enqueues have their processing completed in the ksq. Global enqueues are
processed further in ksi, kju, and so on.
Each enqueue resource has a corresponding DLM resource, and each enqueue lock has a
corresponding DLM lock.
Every DLM lock for global enqueue uses group-based locking, even though every process
in an Oracle instance belongs to the same group. The code distinguishes group-based
and process-owned locks, but there is no longer a group concept.
If there is a current transaction, then the transaction identifier (XID) is part of the DLM
lock identification and is then used for deadlock detection.
If there is no current Oracle transaction, then an identifier concatenating the thread number
(2 bytes), Oracle process ID (2 bytes), and ksuseq is used. ksuseq always begins with
a 0 for an Oracle process and is incremented for each identifier.
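As a hypothetical illustration of the identifier described above: the 2-byte widths for the thread number and process ID are from the text, while the 4-byte width for ksuseq and the big-endian layout are assumptions made for the sketch:

```python
import struct

def dlm_lock_owner_id(thread, pid, ksuseq):
    """Pack a process-owned lock identifier: thread number (2 bytes),
    Oracle process ID (2 bytes), and the per-process sequence ksuseq
    (width assumed to be 4 bytes here). ksuseq begins at 0 for an
    Oracle process and is incremented for each identifier."""
    return struct.pack(">HHI", thread, pid, ksuseq)
```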


Source Tree for Non-PCM Lock Flow

[Diagram: client layers KSQ, KGL, KQR, KQLM, and miscellaneous clients call into KSI, which calls KJU.]

Source Tree for Non-PCM Lock Flow


The ksq layer always calls the ksi layer with an XID to create the DLM lock. Other
layers, such as kqr or kqlm, call the ksi layer without an XID (process-owned) and
therefore do not use the deadlock detection feature of DLM.


Lock Modes

Enqueues are resources that are locked in various modes.
The DLM lock modes differ from other modules in naming.

DLM   Value  Local  Granted (Owner)  Other Grants   Used by GCS
NULL  0      NULL   No Access        Anything       yes
CR    1      SS     Read             Read or Write
CW    2      SX     Read or Write    Read or Write
PR    3      S      Read             Read           yes
PW    4      SSX    Read or Write    Read
EX    5      X      Read or Write    No Access      yes

Lock Modes
These are the GES lock modes. The naming differences between the DLM and the kernel
lock mode names result from historical reasons.
For GCS locks, only the NULL, Share, and Exclusive locks are used.


Lock Compatibility

          NL:NL  CR:SS  CW:SX  PR:S  PW:SSX  EX:X
NL:NL     Yes    Yes    Yes    Yes   Yes     Yes
CR:SS     Yes    Yes    Yes    Yes   Yes     No
CW:SX     Yes    Yes    Yes    No    No      No
PR:S      Yes    Yes    No     Yes   No      No
PW:SSX    Yes    Yes    No     No    No      No
EX:X      Yes    No     No     No    No      No

Lock Compatibility
Compatible locks can exist on the grant queue at the same time. The locks on the request
queue are incompatible with the locks on the grant queue and are incompatible with other
locks on the convert queue.
Note that although a PR or S mode is more restrictive, it is not compatible with the lesser
mode CW. This prohibits simple downgrading of the lock mode from PR to CW.
A special case exists for the PR and CW combination. A PR lock on the convert queue can
be compatible with the most restrictive mode lock on the grant queue (for example,
another PR lock) and still not be compatible with a less restrictive lock (the CW lock) on
the grant queue.
The GCS lock modes are NL:NL, PR:S, and EX:X.
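The compatibility matrix can be encoded directly. The sketch below is an illustration, with the Yes/No values of the table transcribed into Python; it is not kernel code:

```python
# DLM lock modes in value order 0..5 (kernel names: NULL, SS, SX, S, SSX, X)
MODES = ["NL", "CR", "CW", "PR", "PW", "EX"]

# compatibility matrix from the table above; rows and columns follow MODES order
COMPAT = [
    # NL    CR    CW     PR     PW     EX
    [True,  True, True,  True,  True,  True],   # NL
    [True,  True, True,  True,  True,  False],  # CR
    [True,  True, True,  False, False, False],  # CW
    [True,  True, False, True,  False, False],  # PR
    [True,  True, False, False, False, False],  # PW
    [True,  False, False, False, False, False], # EX
]

def compatible(held, requested):
    """True if a lock in mode `requested` can sit on the grant queue
    together with a lock already granted in mode `held`."""
    return COMPAT[MODES.index(held)][MODES.index(requested)]
```

The matrix is symmetric, so the roles of held and requested mode can be swapped; note the special case the text calls out, where PR is not compatible with the lesser mode CW.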


Deadlock Detection:
The Classic Deadlock
Timeline:
1. Process 1 locks resource R1 in mode X (OK).
2. Process 2 locks resource R2 in mode X (OK).
3. Process 1 requests resource R2 in mode X and waits.
4. Process 2 requests resource R1 in mode X and waits.
Deadlock.

Deadlock Detection: The Classic Deadlock


The slide shows the classic deadlock scenario. The resources in question could be anything.
In the server, they could be rows, tables, ITL slots, or library cache or row cache locks.
This situation can also occur in a RAC cluster, even where the processes are on separate
nodes.


Deadlock Detection:
The Classic Deadlock
[Diagram: the classic deadlock drawn as a wait-for graph spanning nodes N1 and N2, with master and shadow copies of each resource.
Legend: Nx = node x in a cluster; Px = process x on a node; Lx = lock x; Rym = resource y (master); Rys = resource y (shadow). Edges show a blocked convert request of a lock on a resource, a lock held in a blocking mode on a resource, and a process performing a lock operation on a resource.]

Deadlock Detection: The Classic Deadlock (continued)


The slide shows the classic deadlock as it is viewed by deadlock detection algorithms. In
this case, the two processes are on different nodes, and the resources are distributed with a
master and a shadow of each resource present. It is evident that in a multinode
environment, deadlock detection requires tracking lock converters and blockers from one
node to another. This task is performed by the LMD processes.


Deadlock Detection:
A More General Example
[Diagram: a more general wait-for graph spanning nodes N1 through N4, with processes P1 through P3 on several nodes, locks L1 through L8, and distributed resources R1 through R4 present as masters and shadows.
Legend: Nx = node x in a cluster; Px = process x on a node; Lx = lock x; Rym = resource y (master); Rys = resource y (shadow).]

Deadlock Detection: A More General Example


Whenever processes share resources, deadlock situations can occur. A simple deadlock
scenario occurs when entity A holds resource Y in exclusive mode, entity B holds resource
Z in exclusive mode, and each entity contends for the resource held by the other. If neither
entity is willing to give up its access rights on the held resource, a deadlock has occurred.
The owning entity that the lock manager uses to determine deadlocks is identified by an
ID that is passed to the lock manager during the lock open calls. This may be a process
identified by a PID, or an Oracle transaction identified by a deadlock ID (DID).
The lock manager performs deadlock detection whenever a request is made to convert a
lock and the request cannot be granted in a short period of time. As part of the convert
option in the lock convert call, the user specifies whether a particular lock will participate
in deadlock detection.
Wait-For Graph
In the context of lock operations, a wait-for graph is a graph where nodes are the
participating processes (or transactions) and resources, and the edges are the converting
and held locks.
In a generalized case, this graph involves multiple resources and locks being operated by
many processes and transactions spanning some or all of the nodes of a cluster.
A cycle in the wait-for graph indicates a deadlock situation.

Deadlock Detection and Resolution

Deadlock detection is done at several layers:
ksq resolves local deadlocks (non-RAC).
The DLM in kjd resolves global deadlocks.
Message deadlocks are prevented by the Message Traffic Controller (TRFC).
Oracle deadlock detection is driven by timeouts.

Deadlock Detection and Resolution


Deadlock detection can be performed whenever any lock is requested, or only when needed.
Finding out whether there is a deadlock can become very time consuming as the number
of resources and locks increases. The Oracle kernel therefore uses the when-needed
approach and checks for deadlocks whenever someone has waited a long time, presumably
because there is a deadlock.
Resolution of a deadlock requires one holder to release its locks, thereby effectively
aborting its work.


Timeout-Based Deadlock Detection

Each deadlock-detectable lock is put on the deadlock timer queue if it is queued for convert.
A deadlock search starts when the timeout on the convert expires.
The timeout is _lm_dd_interval seconds; the default is 60.
LMD performs the search, one lock at a time.
A deadlock graph trace file is generated.
The dd_ts_server resource (DI,0,0) must be held in EX mode to perform a deadlock search.

Timeout-Based Deadlock Detection


A deadlock detection search attempts to find a cycle. It begins by building a graph from
the converting lock through the blocking processes and then through the locks that they
are waiting on. It may well span more than one node. If a cycle is found, then the solution
is to return an error to one of the processes in the cycle. If the deadlock cycle is contained
entirely within a node, then the last process in the cycle is the one that gets the error. If
the cycle spans nodes, then the process that initiated the search receives the error.
The timeout for deadlock detection to start is current_time + (60 + number_of_nodes
/ 2) seconds.


Deadlock Graph Printout

/users/t920r/admin/t920r/bdump/t920r_1_lmd0_24675.trc
Oracle9i Enterprise Edition Release 9.2.0.1.0 Production
With the Partitioning, Real Application Clusters, OL
JServer Release 9.2.0.1.0 - Production

Instance name: t920r_1


Redo thread mounted by this instance: 0 <none>
Oracle process number: 5
Unix process pid: 24675, image: oracle@sunblade (LMD0)
*** 2002-07-11 09:45:04.187
Global Wait-For-Graph(WFG) at ddTS[0.27] :
BLOCKED 22c432bc 5 [0x6dfe][0x0],[TM] [65549,2] 0
BLOCKER 22c4c19c 5 [0x6dfe][0x0],[TM] [131085,2] 1
BLOCKED 22c6224c 5 [0x6dfd][0x0],[TM] [131085,2] 1
BLOCKER 22c42eac 5 [0x6dfd][0x0],[TM] [65549,2] 0
(The slide annotates the fields of each BLOCKED/BLOCKER line as LOCK, MODE, ID1, ID2, and TYPE.)

Deadlock Graph Printout


When a database is opened, each LMD0 process opens a lock in NULL mode on
resource DI,0,0. Each instance in turn performs a deadlock detection by converting this
lock from NULL to X mode. Each deadlock detection is limited in time
(number_of_nodes_in_cluster / 2 minutes). If a deadlock is not found during this time,
then the deadlock detection is aborted to avoid spending too much time tracing deadlock
graphs, because these can be very lengthy. The deadlock detection is distributed.
When a lock is moved to the CONVERT-QUEUE of a resource (because of trying to
convert to a conflicting mode), this lock is attached to the end of a deadlock queue. This
lock will be a candidate for deadlock detection.
Deadlock detection involves the following three steps:
1. Deadlock search: In this step, an oriented wait-for graph is built. Several nodes can
be involved in the building of this graph if the deadlock spans several nodes.
2. Deadlock validation: If a deadlock is found, then each node that is involved in the
previous search validates each lock in its own subgraph. (These locks must remain
valid, that is, not canceled.)
3. Wait-for graph printing: If the previous step is successful, then the whole graph is
printed.

Deadlock Flow
[Diagram: node 1 and node 2, each with an LMD0 holding the DI-0-0 resource in NL mode; node 1's LMD0 converts its lock to EX to begin deadlock detection while locks L1, L2, L3 wait on its deadlock queue.]

Deadlock Flow
When an enqueue lock enters the convert queue and if it can be deadlocked (that is, if it is
of the type TM, TX, or UL), then the lock information is also put in the deadlock queue.
At this time, a time to deadlock detection, time_to_dd (expressed in seconds for this
lock), is computed as number of active nodes / 2 + _lm_dd_interval, and stored as a
timestamp, which is now + time_to_dd.
LMD0 checks the deadlock queue every five seconds and starts a deadlock search if the
deadlock queue is not empty and if the lock at the head of the deadlock queue is in the
queue for more than time_to_dd. Otherwise, LMD0 moves the lock in the head of the
deadlock queue to the tail and returns to normal activity.
If a deadlock detection starts on node 1, then LMD0 converts its lock on DI,0,0 from
NULL to EXCLUSIVE; in the whole cluster, only one node is allowed to start DD.
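The queue handling described above can be sketched as follows. This is an illustration only: a deque stands in for the kernel's deadlock queue, and the function and field names are invented:

```python
from collections import deque

LM_DD_INTERVAL = 60  # default _lm_dd_interval, in seconds

def time_to_dd(active_nodes, lm_dd_interval=LM_DD_INTERVAL):
    # deadline, in seconds, before the first deadlock search on a lock
    return active_nodes // 2 + lm_dd_interval

def lmd0_check(dd_queue, now):
    """Run every five seconds. Returns True when a deadlock search should
    start from the lock at the head of the deadlock queue; otherwise the
    head is moved to the tail and LMD0 returns to normal activity."""
    if not dd_queue:
        return False
    head = dd_queue[0]
    if now >= head["deadline"]:
        return True
    dd_queue.rotate(-1)  # move the head lock to the tail
    return False
```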


Deadlock Flow
[Diagram: as on the previous slide; node 1's LMD0 holds the DI-0-0 resource in EX mode and takes lock L1 from the head of its deadlock queue.]

Deadlock Flow (continued)


If the DI lock to EX mode conversion is successful, then LMD0 performs the following:
1. Take a lock L1, which is in convert state (otherwise, it would not be in the deadlock
queue) and is owned by a process P from the head of the deadlock queue.
2. Put L1 back in the deadlock queue just before a lock having a timestamp bigger than
L1's timestamp, and then start a deadlock detection from L1. The time_to_dd of L1 is
adjusted to number of active nodes * 10 + _lm_dd_interval. This adjusted
value is used for the second and all subsequent deadlock searches for L1.
3. LMD0 begins to build an oriented graph, with L1 as the base head.
For each lock, a counter is maintained, which reports the number of times deadlock
detection is started from the lock.
So a deadlock is found at most time_to_dd seconds after the beginning of conversion.


Deadlock Flow: One Node


[Diagram: node 1's LMD0 holds the DI-0-0 resource in EX mode and builds the deadlock graph R1 -> (X11, X12, X13) -> R132 -> X1; finding X1 again marks the deadlock.]

Deadlock Flow: One Node


Graph generation ends when a deadlock is found locally (in this slide), or when no
deadlock is found or a remote resource/process is found (following slide).
Notes for Deadlock Graph
(R1): a resource with lock L1, which is owned by X1 and is on the convert queue of R1
(X11, X12, X13): XIDs of the owners of locks on the grant queue or convert queue of R1
that conflict with L1
(R132): a resource on whose convert queue a lock of X13 waits
(X1): X1 is the owner of a lock on the grant queue or convert queue of R132 conflicting
with X13
(Yellow triangle): here the deadlock is found


Deadlock Flow: Two Nodes


[Diagram: node 1 builds the deadlock graph R1 -> (X11, X12, X13) -> R132, finds that the conflicting holder is remote, and its LMD0 sends a KJX_DEADLOCK_IND message to node 2; node 2 continues the graph R1 -> (X11, X12, X13) -> R132 -> X1. Dashed resources are shadow enqueues.]

Deadlock Flow: Two Nodes


Deadlock graph building on node 1 ends after finding a remote resource and after LMD0
returns to its message-processing loop.
Notes for Deadlock Graph
The dashed resources are shadow enqueues of the other nodes' owned enqueues.
At node 1, deadlock graph generation performs the same steps until it examines the
holders of R132 and finds X1 to be on the other node.
Node 1 deadlock detection sends message KJX_DEADLOCK_IND to node 2 to
continue deadlock detection.
Node 2 builds the same graph.
X1 is the owner of one lock of grant_q or convert queue of R132 conflicting with
X13.
Deadlock is found.
Node 2 may have concluded that there is no deadlock: X1 does not own anything else in
the graph, in which case it sends a BACKTRACK message to node 1, effectively asking it to
search further or to conclude that there is no deadlock.
Node 2 may also have found a resource on another node (node 1 or node 3), in which case
(like node 1) it sends the KJX_DEADLOCK_IND message for further remote processing.


Parallel DML (PDML) Deadlocks

Locks that are identified by the transaction identifier (XID) may fail to detect deadlocks involving PDML operations that have the same XID.
A spanning set of a transaction TX is the list of nodes where this transaction takes place.
The coordinator of the PDML transaction publishes the spanning set by using the CGS name service.
When a lock is opened and found to be involved in a PDML, then the IDLM is informed by the API (to perform a global DD).

Parallel DML (PDML) Deadlocks


A spanning set is identified with name = <XID> of TX and value = <spanning set>.
You can use the CGS name service to create or search for a name (the spanning set
identifier).


Deadlock Detection Algorithm

The simple algorithm is enhanced to account for the PDML identifiers.

Deadlock Detection Algorithm: Examples


In this example, transaction X1 converts a lock L on a resource R, then pushes RES(R) on
top of the stack.
While (stack is not empty) {
    pop element from STACK;
    if (element is a RES) {
        push all conflicting LOCKs on the grant queue on top of STACK
        push all conflicting LOCKs on the convert queue, or LOCKs ahead
            on the convert queue, on top of STACK
    } else if (element is a LOCK and the LOCK is remote) {
        send message to remote node to continue;
        save current stack and go back to normal TASK;
    } else {
        push the RES on whose convert queue the current LOCK waits
            on top of STACK
    }
}

A deadlock is found when X1 is found on top of the stack.
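As an illustration only (not kernel code), the single-node form of this stack-driven search might look like the following; remote handling is omitted and all names are invented:

```python
def find_deadlock(start_xid, start_res, conflicts, waits_on):
    """Single-node sketch of the stack-based deadlock search.

    conflicts maps a resource to the XIDs holding conflicting locks on it;
    waits_on maps an XID to the resource its converting lock waits on.
    Returns True if the search gets back to start_xid (a deadlock)."""
    stack = [("RES", start_res)]
    visited = set()
    while stack:
        kind, item = stack.pop()
        if kind == "RES":
            for xid in conflicts.get(item, ()):
                if xid == start_xid:
                    return True          # X1 found on top of the stack: deadlock
                if xid not in visited:
                    visited.add(xid)
                    stack.append(("XID", xid))
        else:  # an XID: follow its converting lock to the blocking resource
            res = waits_on.get(item)
            if res is not None:
                stack.append(("RES", res))
    return False

# X1 converts a lock on R1; X13 conflicts on R1 and itself waits on R132,
# where a conflicting lock is owned by X1: a deadlock
conflicts = {"R1": ["X11", "X12", "X13"], "R132": ["X1"]}
waits_on = {"X13": "R132"}
assert find_deadlock("X1", "R1", conflicts, waits_on)
```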


Deadlock Detection Algorithm: Examples (continued)


In this example, the dd-starting lock is a lock L on the convert queue of a resource R; push
(R, L) on top of the stack. The algorithm is recursive and uses a stack. The stack contains
elements of type RES, TXN, TXN_GLOBAL, TXN_REMOTE, SHADOW_GRANT, or SHADOW_CONVERT.
While (stack is not empty) {
    pop element from STACK;
    switch (type_of_popped_element) {
    case RES:
        L1 = popped lock
        for each lock L on grant queue of RES conflicting with L1 {
            if L belongs to the same XID as the dd-starting lock, then deadlock is found
            if L belongs to a remote node, then push (SHADOW_GRANT, L) on top of stack
            if L is local and not in a global TX, then push (TXN, L) on top of STACK
            if L is local and in a global TX, then push (TXN_GLOBAL, L) on top of STACK
        }
        for each lock L on convert queue conflicting with L1 {
            if L belongs to the same XID as the dd-starting lock, then deadlock is found
            if L belongs to a remote node, then push (SHADOW_CONVERT, L) on top of stack
            if L is local and not in a global TX, then push (TXN, L) on top of STACK
            if L is local and in a global TX, then push (TXN_GLOBAL, L) on top of STACK
        }
        break;
    case TXN:
        L1 = popped lock
        for each lock L with the same XID as L1 {
            if L is on the convert queue of a resource R, then push (RES, R) on top of stack
        }
        break;
    case TXN_GLOBAL:
        L1 = popped lock
        query CGS to find the associated spanning set
        for each node in the spanning set different from the local node {
            push (TXN_REMOTE, L1) on top of stack
        }
        if the local node also belongs to the spanning set, then push (TXN, L1) on top of stack
        break;
    case TXN_REMOTE:
        send message to remote node to continue (message type KJX_DEADLOCK_IND)
        go back to normal TASK
        break;
    case SHADOW_GRANT:
    case SHADOW_CONVERT:
        send message to remote node to continue (message type KJX_DEADLOCK_IND)
        go back to normal TASK    /* DD temporarily stops in this node */
        break;
    }  /* end switch */
}  /* end while */
If the stack is empty {
    if the local node is not the dd-starting node, then send a message to the
        dd-starting node to perform a BACKTRACK
    if the local node is the dd-starting node, then DEADLOCK is not found
}


Deadlock Detection Algorithm: Examples (continued)


When LMD0 in a node receives a KJX_DEADLOCK_IND message asking for
BACKTRACK, deadlock detection resumes with the stack at the appropriate position.
When LMD0 in a node receives a KJX_DEADLOCK_IND message asking to continue DD,
then:
switch (sub-type of message) {
case message_sent_by_SHADOW_GRANT:
case message_sent_by_SHADOW_CONVERT:
    push the involved resource on top of STACK
    break;
case message_sent_by_TXN_REMOTE:    /* GLOBAL transaction */
    for each converting lock L owned by the involved GLOBAL transaction {
        R = resource on whose convert queue L waits
        push (RES, R) on top of STACK
    }
    break;
}
process STACK as described on the previous page

Deadlock Validation Steps

When the stack is popped, a wait-for graph (a list of linked locks keeping track of the DD path) is built at the same time.
When a deadlock is found, deadlock validation occurs.
The validation also identifies the victim lock.
The victim lock is generally the starting deadlock-search lock.

Deadlock Validation Steps


If deadlock is found on a node other than the dd-starting node {
    send a message to the dd-starting node asking for validation
    /* the dd-starting node, when receiving the request for validation,
       will start validation as below */
} else {  /* validation */
    follow the wait-for graph and examine it lock by lock
    { if a lock in the wait-for graph is invalid (canceled), then the whole DD
      is invalidated }
}
if the node of the last lock in the wait-for graph is the local node, or
this node receives a request for a validation and the wait-for graph is
already validated {
    /* the whole wait-for graph is validated; here we must be in the
       dd-starting node */
    if local_node is not the lowest node {
        send the wait-for graph to the lowest node to print
    } else print the whole wait-for graph
} else {
    if local_node is not the lowest node, send the wait-for graph to the
        lowest node to print
    send a message to the node of the last lock in the wait-for graph to continue
    VALIDATE (LMD0 of this node will validate the subgraph with the previous
    code)
}


Code References

ksq.*: Kernel Service enQueues


Summary

In this lesson, you should have learned about:


GES activity in locking resources
LMD0 deadlock detection


Cache Coherency (Part Two)

Blocks/PCM Locks


Objectives

After completing this lesson, you should be able to do


the following:
Describe the global cache service concepts and
components
Outline the history of Cache Fusion
Describe the flow of blocks and their locks in
Cache Fusion


Cache Coherency: Blocks

[Slide diagram: within each node, an instance stack of caches (kcb/kcl),
GRD (GCS), and CGS sits on top of NM and CM, and communicates with the
other nodes through the IPC layer.]

Cache Coherency: Blocks


The GRD consists of Global Cache Services (GCS), which handles the data blocks, and
the Global Enqueue Service (GES), which handles the enqueues and other global
resources.
The term cache coherency is often used to refer to keeping the data buffer caches
coherent across instances, as it does represent the bulk of the cache coherency activity.
This cache coherency is handled by PCM locks. The block cache coherency can be
handled in two ways: disk pings and Cache Fusion. Oracle9i has both methods available.


Block Cache Contention

Block cache contention occurs when two caches want the same resource:
Read/read contention
Write/read contention
Write/write contention

[Slide diagram: the Holder instance ships the resource to the Requestor.]

Block Cache Contention


Contention occurs when instance H holds the resource and instance R requests the
resource. Instance R gets the resource. The complexity of Cache Fusion depends on how
much control of the resource is retained by instance H and the different types of requests
supported, such as current read and consistent read.
In Oracle9i, the resource could be sent via the communication services in all of the
following three cases. Enabling multiblock locks disables this.
Read/Read Contention
Read/read contention is currently not a problem due to shared disk architecture. A data
block from a read-only tablespace can be read by any instance without DLM
intervention. The blocks read this way (for example, from read-only tablespaces) are not
transferred across the caches in the current implementation.
Write/Read Contention
Depending on the read request type, instance H reduces its access rights (downgrades
the lock) on the block and sends a copy to instance R. This was the major change in
Oracle8i.
Write/Write Contention
Instance H reduces its access rights on the block and sends a copy to instance R. This
was the major change in Oracle9i. In earlier releases, it would flush the block to disk.

Earlier Cache Coherency:
Oracle8 Ping Protocol

Checks the instance for a lock
Requests DLM to acquire the lock in the specified mode
If there is a conflict, the master asks the holder to write to disk and
downgrade: a BAST is sent, and an AST is sent on the successful downgrade.
Reads the block from the disk

Oracle8 Ping Protocol


The protocol was also used in Oracle7 and earlier versions. All block data transfer was
via the disks. The DLM kept track of block ownership; that is, either one instance had
exclusive access, or several instances had shared read access. Any read request thus
involved a downgrading from exclusive to shared mode, by:
Flushing the redo log
Writing the block to the disk
Recovery
For recovery of one, several, or all instances, only the log threads of failed instances
apply. The log threads can be processed in any order. The ping protocol effectively
penalizes the steady-state OPS performance in favor of simpler and efficient recovery.


Earlier Cache Coherency:
Oracle8i CR Server

Designed for the write/read contention
Holder constructs the consistent read copy.
CR blocks are shipped across the communication path.
A fairness counter implements the Light Work Rule.

Oracle8i CR Server
The holder of a data block, on receiving a consistent read (CR) request, uses the undo
data (the blocks of which were locally resident in the cache) to construct the block.
Light Work Rule and Fairness Counter
If creating the consistent read version block involves too much work (such as reading
blocks from disk), then the holder sends the block to the requestor, and the requestor
completes the CR fabrication. The holder maintains a fairness counter of CR requests.
After the fairness threshold is reached, the holder downgrades its lock mode.
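The holder-side decision can be sketched as follows. This is a minimal illustrative sketch, not kernel code: the function name, the counter handling, and the threshold value of 4 are assumptions, not the actual internal parameter.

```python
# Sketch of the CR-server decision on the holder side (hypothetical
# names; the threshold value is an assumption for illustration).
FAIRNESS_THRESHOLD = 4

def handle_cr_request(fairness_counter, needs_disk_reads):
    """Return (action, new_counter) for one incoming CR request."""
    if needs_disk_reads:
        # Light Work Rule: building the CR copy is too expensive here,
        # so let the requestor complete the CR fabrication itself.
        return ("send_incomplete_block", fairness_counter)
    fairness_counter += 1
    if fairness_counter >= FAIRNESS_THRESHOLD:
        # Served enough CR requests: downgrade the lock mode so the
        # requestor can read the block itself.
        return ("downgrade_lock", 0)
    return ("send_cr_copy", fairness_counter)
```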


Earlier Cache Coherency:
Oracle8i CR Server

Requesting instance:
Foreground process prepares the buffer.
Sends the message to the master and waits
Gets the CR buffer or a lock to read from disk

Master:
Checks the lock mode
Forwards the request to the holder if X mode is held
Grants a shared lock to the requestor on other modes

Holder:
Sends the CR buffer

Oracle8i CR Server (continued)


The CR Server code was executed by a dedicated process, the Block Server Process
(BSP).
The Oracle8 ping protocol is used in the case of write/write contention, or any request
other than those for a consistent read.


Oracle9i Cache Fusion Protocol

Addresses write/write contention
Eliminates the disk ping protocol; sends current blocks via the
communication path
Handles the recovery of blocks that have been transferred across the cache
Uses the CR server functionality for write/read contention

Oracle9i Cache Fusion Protocol


There are problems when shipping current blocks between instances. Consider a simple
case:
1. Instance A modifies a block, then the block is shipped to instance B. Before any
dirty block is sent, a log flush is made.
2. Instance B modifies the block, then the block is shipped back to instance A.
3. Instance A modifies the block again. No write of the block to disk has occurred in
any steps.
Note
If instance A dies, then its log contains records of modifications with a gap.
Modifications done in instance B are stored in the log of instance B.
For instance A's crash recovery of the block, the two logs must be merged before they
can be applied. The current recovery code does not support this, except for media
recovery. The log merge, even if implemented, would require time and resources that
are proportional to the total number of instances. It does not matter whether instance B
does the crash recovery or not.


GCS (PCM) Locks

PCM locks manage the locking of data blocks in the buffer cache.
PCM locks are internally mapped to a lock element and a block class.
The block classes are described in V$LOCK_ELEMENT, based on X$LE.
The PCM lock state information is stored in data structures called lock
elements.
The LMSn processes handle the PCM locks.

GCS (PCM) Locks


The synchronization cost for instance locks can be high. PCM locks are typically much
more numerous than non-PCM locks. The number of non-PCM locks does not grow as
high as the number of PCM locks. The local enqueues that become global can still be
seen in the V$LOCK view. Some instance locks and PCM locks, however, cannot be
seen in the V$LOCK view.


PCM Lock Attributes

Cache Fusion separates PCM lock attributes into:
Lock modes
Lock roles
Past images

PCM Lock Attributes


Cache fusion changes the use of PCM locks in the Oracle server and relates the locks to
the shipping of blocks through the system via IPC. The objectives are to separate the
modes of locks from the roles that are assigned to the lock holders, and to maintain
knowledge about the versions of past images of blocks throughout the system.


Lock Modes

PCM locks use the following modes:
Exclusive (X)
Shared (S)
Null (N)

Lock mode compatibility is described as:

        X   S   N
    X   -   -   +
    S   -   +   +
    N   +   +   +

Lock Modes
A lock mode describes the access rights to the resource.
The compatibility matrix is clusterwide. For example, if a resource has an S lock on one
instance, then there cannot be an X lock for that resource anywhere else in the cluster.
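The clusterwide compatibility matrix can be expressed as a simple grant check. This is an illustrative sketch; the table and function names are not Oracle's.

```python
# N is compatible with everything, S with S and N, X only with N.
COMPAT = {
    ("X", "X"): False, ("X", "S"): False, ("X", "N"): True,
    ("S", "X"): False, ("S", "S"): True,  ("S", "N"): True,
    ("N", "X"): True,  ("N", "S"): True,  ("N", "N"): True,
}

def can_grant(requested, held_modes):
    """A mode can be granted only if it is compatible with every mode
    currently held anywhere in the cluster."""
    return all(COMPAT[(requested, h)] for h in held_modes)
```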


Lock Roles

Roles can be:
Local: Block is dirty in the local cache.
Global: Block is dirty in a remote cache or several caches.
Roles are for Cache Fusion.

Lock Roles
A lock role describes how the resource is to be handled. The treatment differs if the
block resides in only one cache.


Past Image

Is an indication:
0: It is absent.
1: It is present.
Is present on modified block buffers that are not current

Lock Past Image Attribute


Initially, a block is acquired in a local lock role with no past images. If the block is
modified locally and other instances express interest in the block, then the instance
holding the block keeps a past image (PI) and ships a copy of the block, and then the
role becomes global.
A PI represents the state of a dirtied buffer. Initially, a block is acquired in L role, with
no past images present. The node that modifies the block keeps past images, as the lock
role becomes G, only after another instance expresses interest in this block. A PI block
is used for efficient recovery across the cluster, and can be used to satisfy a CR request,
remote or local.
A PI must be kept by the node until it receives notification from the master that a write
to disk has completed covering that version. The node then logs a Block Written Record
(BWR). The BWR is not necessary for the correctness of recovery, so it need not be
flushed.
When a new current block arrives on a node, a previous PI is kept untouched because it
might be needed by some other node. When a block is pinged out of a node carrying a
past image and the current version, it might or might not be combined to a single PI. At
the time of the ping, the master tells it whether there is a write in progress that will
cover the older past image. If a write is not in progress, then the older PI is replaced by
the existing current block. If a write is in progress, then this merge is not done and the
existing current becomes another PI. There can be an indeterminate number of PIs.

Local Lock Role

Possible lock modes are S or X.
All changes are on the disk version, except for any local changes (mode X).
When requested by the master instance, the holding instance serves a copy
of the block to others.
If the block is globally clean, then this instance's lock role remains
local.
If the block is modified by this instance and passed on dirty, then a past
image is retained and the lock role becomes global.
The lock holder reads from disk if the block is not in the cache.
The lock holder may write the block if the lock is X.

Local Lock Role


The local role states that the block can be handled very similarly to the way it is done in
single instance mode. In local role, the lock mode reads from disks and writes the dirty
block back to disk when it ages out without any further DLM activity.


Global Lock Role

Possible lock modes are N, S, or X.
Implies other instances also had or have the block in global mode.
The block is globally dirty when role G is assigned.
The instance can modify the block further in mode X.
The instance cannot read from disk; it is not known whether the disk copy
is current or not.
The instance serves a copy to others when instructed by the master.
The instance may only write a block in X mode or a PI when directed by
the master.
Write requests must be sent to the master.

Global Lock Role


A global lock role limits the handling of a block, because another instance also has a
dirty version of the block, and the disk version of the block is obsolete.


Block Classes

There are ten classes of ORACLE blocks.
Each ORACLE block is protected by a PCM lock that is described by a lock
element structure.

Block Classes
Class  Description
1      DATA
2      SORT. These are never protected by PCM locks, because they are
       private to one instance.
3      SAVE UNDO BLOCK, used for TBS management
4      SEGMENT HEADER
5      SAVE UNDO SEGMENT HEADER, used for TBS management
6      FREE-LIST
7      EXTENT MAP, used for unlimited extents
8      BITMAP BLOCK for locally managed tablespaces
9      BITMAP INDEX BLOCK for locally managed tablespaces
>=11   If odd, it is an UNDO HEADER, and the block class is
       (RBS_number*2) + 11, used for the transaction table.
       If even, it is an UNDO BLOCK, and the block class is
       (RBS_number*2) + 12, used for undo blocks.
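The undo-class arithmetic from the table can be checked with a small sketch (the helper names are hypothetical):

```python
def undo_block_class(rbs_number, is_header):
    """Block class for undo, per the table above:
    header -> RBS_number*2 + 11 (odd); block -> RBS_number*2 + 12 (even)."""
    return rbs_number * 2 + (11 if is_header else 12)

def rbs_from_class(block_class):
    """Invert the mapping for classes >= 11: odd is an undo header,
    even is an undo block."""
    if block_class < 11:
        raise ValueError("not an undo class")
    if block_class % 2:
        return ("UNDO HEADER", (block_class - 11) // 2)
    return ("UNDO BLOCK", (block_class - 12) // 2)
```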


Lock Elements (LE)

Reside in the SGA
Hold lock state information (converting, granted, and so on)
Are managed by the lock processes to determine the mode of the locks
(exclusive, null, shared, and so on)
Hold a chain of cache buffers that are covered by the lock element
Allow the Oracle database to keep track of cache buffers that must be
written to disk in case a lock element (mode) needs to be downgraded
(X -> N)

Lock Elements
The lock elements (LE) are also known as BL type enqueues.


Allocation of New LE

For blocks other than UNDO:
id1 = BNO | (AFN << 22)
id2 = (AFN >> 10) << 15

For UNDO blocks:
id1 = (BNO / _kcl_undo_grouping) % _kcl_undo_locks
id2 = block class

The LE is identified by <BL, id1, id2>.
Which LMSn to use is given by: (id1 + id2) % (number_of_LMS_procs)

Allocation of a New LE
The block that is to be covered by the LE has an absolute file ID (AFN) and a block
number (BNO).
Note: Cache fusion applies only to blocks other than UNDO.
The default value of _kcl_undo_grouping is 32.
The default value of _kcl_undo_locks is 128. This represents the number of locks
per UNDO segment.
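The slide's formulas can be written out as follows. This is a sketch: `//` is integer division, and the two underscore parameters use the default values given above.

```python
# Defaults from the notes above.
KCL_UNDO_GROUPING = 32   # _kcl_undo_grouping
KCL_UNDO_LOCKS = 128     # _kcl_undo_locks

def le_id_data(afn, bno):
    """<id1, id2> for a non-undo block: absolute file number AFN,
    block number BNO."""
    id1 = bno | (afn << 22)
    id2 = (afn >> 10) << 15
    return id1, id2

def le_id_undo(bno, block_class):
    """<id1, id2> for an undo block."""
    id1 = (bno // KCL_UNDO_GROUPING) % KCL_UNDO_LOCKS
    return id1, block_class

def lms_for(id1, id2, n_lms):
    """Which LMSn process handles this LE."""
    return (id1 + id2) % n_lms
```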


Hash Chain of LE

Every active releasable LE is in one hash chain.

[Slide diagram: an array of hash chain heads, each anchoring a linked
chain of LEs.]

Hash Chain of LE
The number of hash chain heads or buckets (NBH) is the nearest prime lower than
_db_block_buffers.
The hash algorithm for LE is ID1 modulus NBH.
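A sketch of the bucket computation (the trial-division prime search is illustrative, not the kernel's method):

```python
def nearest_prime_below(n):
    """Largest prime strictly lower than n."""
    def is_prime(k):
        if k < 2:
            return False
        return all(k % d for d in range(2, int(k ** 0.5) + 1))
    k = n - 1
    while not is_prime(k):
        k -= 1
    return k

def le_hash_bucket(id1, db_block_buffers):
    """Hash chain head for an LE: id1 modulo NBH, where NBH is the
    nearest prime lower than _db_block_buffers."""
    nbh = nearest_prime_below(db_block_buffers)
    return id1 % nbh
```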


Block to LE Mapping

[Slide flowchart, as pseudocode:]
BEGIN
loop {
    if an LE with the same (id1, id2) is in the hash chain {
        use it for the block; END
    }
    if some LE is on the free list {
        take an LE from the free list and initialize it with (id1, id2)
        link the LE into the hash chain; END
    }
    post LMS to free some LEs
    wait 20 ms on "global cache freelist wait"
}   /* retry only once */

Block to LE Mapping
When LEs need to be freed, the process posts the LMS that is associated with the
<id1, id2> LE. The statistic "global cache freelist waits" is incremented.


Queues of LE for LMS

[Slide diagram: a latch protecting four queues of LEs.]
Down-convert queue: LEs with a BAST
Lazy-close queue: processed when the WRITE is done
Deferred-ping queue: processed on timeout
Long-flush queue: waiting for a log flush

Queues of LE for LMS


Each LMS process has a number of latches equal to gc_latches. Each latch protects
several queues.
The lazy-close queue is also used for clearing blocks and no-buffer operations.


LMSn Free of LE

[Slide flowchart, as pseudocode:]
BEGIN
get the latch of the associated lazy-close queue
choose an LE from the queue
if a buffer is linked to the LE {
    get the buffer's hash list
} else {
    compute (rdba, tsn) from the LE; get the hash list of (rdba, tsn)
}
if the hash latch is not obtained in shared, no-wait mode {
    free the queue latch
    get the hash latch in shared mode, waiting
}
go through the code path of BAST management
END

Cache Fusion Examples: Overview

[Slide diagram: initial state — four instances A, B, C, and D; instance D
is the master; the block is on disk at SCN 1008.]

Cache Fusion Examples: Overview


Initial State
The examples in the following slides show the messages and the resource status changes
in Cache Fusion, and the transfer of blocks between instances and disks. This slide
shows the setup that is used for these examples.
There are four instances, A, B, C, and D, and a shared drive. For simplicity, the
examples use just one block that is initially shown on the disk with a system change
number (SCN) of 1008.
Lock state has a three-letter indication; lock mode is indicated with the letters N, S, or X;
lock role is indicated with the letters L or G; and PI is shown with 0 or 1.
The block that is used throughout these examples has its resource master on instance D.
This is to show the lock messages clearly. If the lock master and the block coincide on a
node, then some optimizations occur, reducing the number of messages or choosing a
different way to get the block.


Cache Fusion Examples: Overview (continued)


The following slides show the final state of the block transition as well as the lock
transitions. The initial state for each example is the final state of the previous
example according to the roadmap:
(1) --+-> (3) ---> (4) --+-> (5) ---> (6)
      |                  |
      +-> (2)            +-> (7) --+-> (8) ---> (9)
                                   |
                                   +-> (10)

Example 11 is a stand-alone example.


Cache Fusion: Example 1

Getting a block from the disk:

[Slide diagram: 1:LReq(S,C) from C to master D; 2:Grant SL0; 3:Read from
disk; 4:Notify — C now holds the block at SCN 1008 as SL0.]

Example 1: Getting a Block from the Disk


Instance C wants to read the block in its current version.
1. Instance C sends a message to the master requesting a shared lock on the block.
The (S,C) indicates Shared, Instance C.
2. Master D grants instance C the lock as SL0.
3. After C receives the lock, it initiates an I/O to read the block from disk.
4. C is notified that the read is complete and now holds the block with SCN 1008.


Cache Fusion: Example 2

Getting a block from the cache:

[Slide diagram: 1:LReq(S,B) to master D; 2:Ping(S,B) to C; 3:Send(SL,SL0)
from C to B; 4:Assume(SL0) — B and C both hold the block at SCN 1008 as
SL0.]

Example 2: Getting a Block from the Cache


Following example 1, the steps for instance B attempting to read the block are:
1. Instance B requests master D for a shared lock. It has no knowledge of where the
block is; it simply asks for the access rights of a shared lock.
2. The lock master at instance D knows that the block is being held in instance C;
therefore it sends a ping message to instance C, instead of granting the lock as it
did in example 1.
3. Instance C sends the block to instance B and indicates that instance B should take
the S lock and the current lock mode and role of instance C, which is SL0.
4. Instance B sends a message to master D that it received the block and will assume
SL0. This message is sent asynchronously, whereas other messages were sent
synchronously.
Optimization in the code may decide that it is less of a load on the whole cluster (or less
latency) to read the block from the disk, instead of sending messages and blocks over
the network.


Cache Fusion: Example 3

Getting a clean block from the cluster for modifications:

[Slide diagram: 1:LReq(X,B) to master D; 2:Ping(X,B) to C; 3:Send(X,Close)
from C, which keeps only a lockless CR copy; 4:Assume — B assumes XL0 and
modifies the block to SCN 1009.]

Example 3: Getting a Clean Block from the Cluster for Modifications


Following example 1, the steps when instance B requires the block in write mode are:
1. Instance B requests master D for an exclusive (X) lock.
2. The lock master knows all the nodes that hold an S lock and sends a ping (X,-)
message to close their locks (that is, to discard their copy), until only one is left.
The lock master then sends a ping(X,B) to C.
3. Instance C sends the block to instance B with lock information and closes its lock;
that is, it discards the block. The block can be held in CR mode; this does not
require a lock, and this is not a PI.
4. Instance B sends a message to master D that it has assumed XL0. It then modifies
the block to SCN 1009.


Cache Fusion: Example 4

Getting a dirty block from the cluster and modifying it:

[Slide diagram: 1:LReq(X,A) to master D; 2:Ping(X,A) to B; 3:Send(XG,NG1)
from B, which keeps a PI at SCN 1009; 4:Assume(XG0,NG1,1009) — A modifies
the block to SCN 1013.]

Example 4: Getting a Dirty Block from the Cluster and Modifying It


Following example 3, the steps for instance A attempting to modify the block are:
1. Instance A sends an X lock request to master D.
2. Master D sends a ping to B to give up the block to instance A.
3. Instance B sends the dirty block to instance A, retains the block, and converts its
lock to NG1. The SG1 mode would be incompatible with the XG0 lock at instance
A. Instance B has a PI at SCN 1009.
4. Instance A informs master D that it got the block and assumed XG0 on the lock. It
then modifies the block to SCN 1013.
This is a write/write contention.
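The mode/role/PI bookkeeping of this kind of transfer can be sketched as a tiny state function. This is a hypothetical representation, with a lock state written as a (mode, role, PI-count) triple matching the slide's XL0/NG1 notation.

```python
def serve_dirty_current(holder_state, requested_mode):
    """holder_state is (mode, role, pi_count), e.g. ("X", "L", 0).
    Shipping a dirty current block makes both locks global and leaves a
    past image behind on the sender; the sender drops to N for an X
    request, or to S for an S request."""
    mode, role, pi_count = holder_state
    new_mode = "N" if requested_mode == "X" else "S"
    new_holder = (new_mode, "G", pi_count + 1)
    requestor = (requested_mode, "G", 0)
    return new_holder, requestor
```

With these triples, Example 4 is `("X","L",0)` serving an X request (holder becomes NG1, requestor XG0), and Example 5 is `("X","G",0)` serving an S request (holder becomes SG1, requestor SG0).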


Cache Fusion: Example 5

Getting a shared copy of the writeable buffer:

[Slide diagram: 1:LReq(S,C) to master D; 2:Ping(S,C) to A; 3:Send(SG,SG1)
from A, which keeps a PI and becomes SG1; 4:Assume(SG0,SG1,1013) — C holds
the block at SCN 1013 as SG0.]

Example 5: Getting a Shared Copy of the Writeable Buffer


Following example 4, the steps for instance C attempting to read the current block are:
1. Instance C sends a share lock request to master D.
2. Master D sends a ping to instance A, saying instance C wants a share copy.
3. Instance A, when it has finished the work on the buffer, flushes the redo log and
sends the block at SCN 1013 to instance C, retains a PI, and converts its lock to
SG1.
4. Instance C gets the block at SCN 1013 and sends a message to the master that it
assumed a lock mode of SG0.


Cache Fusion: Example 6

Getting a shared copy of the dirty shared buffer:

[Slide diagram: 1:LReq(S,B) to master D; 2:Ping(S,B) to A; 3:Send(SG,SG1)
from A; 4:Assume(SG1,SG1,1013) — B holds the block at SCN 1013 as SG1.]

Example 6: Getting a Shared Copy of the Dirty Shared Buffer


Following example 5, instance B wants a shared copy of the block. This differs from
example 2 as the blocks are dirty (the disk copy is out of date) and available in two
caches.
1. Instance B sends master D a request for an S lock.
2. Now master D knows that both A and C have a shared copy of the block. It
chooses one instance and sends a ping message.
3. Instance A sends the block to instance B with lock information.
4. Instance B sends a message that it has assumed the lock in SG1 mode.
The Shared Selection Rule picks an instance that holds the resource in decreasing
preference from this list:
Master, if it has a lock S (shortest message path)
Instance with S mode holding the last PI (most recent nonmaster access)
Shared Local
Most recently granted S
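A sketch of this preference list follows. The field names are hypothetical, and breaking ties among shared-local holders by most recent grant is an assumption made so that the result matches Example 9.

```python
def pick_server(holders, master):
    """Pick the S-mode holder asked to serve the block, per the
    preference list above.  holders: list of dicts with keys
    node, mode, local (True for lock role L), has_last_pi, and
    grant_order (larger = granted more recently)."""
    s_holders = [h for h in holders if h["mode"] == "S"]
    for h in s_holders:                  # 1. the master itself
        if h["node"] == master:
            return h["node"]
    for h in s_holders:                  # 2. S holder with the last PI
        if h["has_last_pi"]:
            return h["node"]
    by_recency = sorted(s_holders, key=lambda h: -h["grant_order"])
    for h in by_recency:                 # 3. shared-local holder
        if h["local"]:
            return h["node"]
    return by_recency[0]["node"]         # 4. most recently granted S
```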


Cache Fusion: Example 7

Writing blocks back to disk:

[Slide diagram: 1:ReqW from B to master D; 2:ReqW forwarded to A; 3:Write
of the SCN 1013 block to disk; 4:Notify; 5:W Notify to the master;
6:Flush PI — PI holders discard their past images.]

Example 7: Writing Blocks Back to Disk


Following example 4, the steps for instance B attempting to write the block are:
1. Instance B sends a write request to master D with the necessary SCN.
2. Master decides the current node or latest holding node for the requested write. In
this case, it sends the write request to A and remembers that it asked A to write the
block.
3. Instance A issues a write to disk.
4. Instance A gets the notification that the write has completed.
5. Instance A notifies the master that the write has completed.
6. On receipt of write notification, master D tells all PI holders to discard their locks
and the block buffer.


Cache Fusion: Example 8

Getting the shared buffer once it is written:

[Slide diagram: the requestor asks master D for an S lock; the master
pings holder A; A sends the block with Send(SL,SL) and downgrades to SL0;
the requestor assumes SL0 at SCN 1013.]

Example 8: Getting the Shared Buffer Once It Is Written


Following example 7, the steps for instance C attempting to read the block, after it has
been written, are:
1. Instance C requests the shared lock from master D.
2. Master D knows instance A holds the lock in XL0, and sends a ping message to
instance A.
3. Instance A sends the block to instance C and downgrades its lock to SL0.
4. Instance C assumes SL0.


Cache Fusion: Example 9

Getting the shared buffer from multiple copies:

[Slide diagram: 1:LReq(S,B) to master D; 2:Ping(S,B) to C; 3:Send(SL,SL)
from C; 4:Assume(SL0,SL0) — A, B, and C all hold the block at SCN 1013 as
SL0.]

Example 9: Getting the Shared Buffer from Multiple Copies


Following example 8, this example shows instance B getting a shared copy of the block.
In this case, both A and C are the candidates. According to the Shared Selection Rule,
instance C (last one to receive the shared lock) gets the ping message and serves the
block.


Cache Fusion: Example 10

Getting the shared buffer from dirty:

[Slide diagram: the requestor asks master D for an S lock; the master
pings A, which downgrades XL0 to SG1, retains a PI, and sends the block at
SCN 1015; the requestor assumes SG0.]

Example 10: Getting the Shared Buffer from Dirty


After writing the block in example 7, instance A further dirties the block. Instance C
attempts to read the block. In this case, instance A downgrades its lock to SG1, retains
the PI, and sends the block to instance C.


Cache Fusion: Example 11

Consistent read request:

[Slide diagram: 1.1:CRreq from B to master D (1.2:NoCRavailable if no
cache holds a usable copy); 2:Make CR request forwarded to C; 3:C creates
a CR copy from its SCN 1025 buffer; 4:C sends the CR image to B.]

Example 11: Consistent Read Request (CR Server)


The previous examples have shown current read for shared or exclusive in all cases. For
consistent read, the CR server process is used. A CR block is without a lock as it is a
local scratch copy by definition.
1. Instance B requires a version 1013 of the block and has no block of higher
version in its own buffer.
a. It sends a CR request to the master.
b. If there is no appropriate block copy in the other caches, then the master
returns the request, indicating that instance B must get the current copy of
the block, and performs rollback. This would then be the same as example 1
earlier.
2. In the slide diagram, instance C has a copy of the block, but this is a later version.
The master instance D sends the request to instance C to ship a 1013 version of
the block.
3. Instance C takes the 1025 buffer, makes a copy, and applies undo on the copy
until it matches 1013.
4. The block is sent. Instance B receives no lock change, and there is no assume
message. If instance C is unable to make the CR copy because it does not have the
undo blocks available, it sends a message to instance B to construct the CR block
itself. The Light Work Rule also causes instance C to flush its 1025 copy to disk,
thus enabling instance B to get a read-current copy (share lock) to construct its
own CR copies.

Views

V$LOCK_ELEMENT: Based on X$LE, shows the status of each PCM lock stored
in the SGA
V$BH: Based on X$BH, shows the status and pings of every buffer

Views
X$BH: see WebIV note 33568.1


Views (continued)
V$LOCK_ELEMENT
lock_element_addr:  raw address for the lock element covering a buffer
indx:               lock element number
class:              block class (1 = data/index, 2 = sort, etc.)
lock_element_name:
flags:              status of the lock element (1 = fusion lock, 2 = no
                    buffer on LE, 4 = has deferred ping, 8 = LE waiting
                    for log flush, 16 = LE is being evicted, 32 = LE has
                    been deactivated, 64 = LE is fixed)
mode_held:          lock mode held (0 = null, 3 = S, 5 = X)
block_count:        number of blocks covered by the PCM lock
releasing:          release flags; nonzero if the PCM lock is being
                    downgraded
acquiring:          acquiring flags; nonzero if the PCM lock is being
                    upgraded
invalid:            nonzero if the PCM lock is invalid; always 0 in
                    V$LOCK_ELEMENT

Release Flags
KCLLEBP     0x01   Process has sent a request to the DLM.
KCLLEAP     0x02   Acquisition pending; the lock operation has been
                   started.
KCLLERECON  0x04   CR request aborted because of reconfiguration.
KCLLEINVAL  0x08   CR request could not be started because of
                   reconfiguration.
KCLLECOMM   0x10   CR request failed because of a timeout.
KCLLENRN    0x20   No recovery needed.
KCLLESUSP   0x40   PI is suspect.
KCLLEHIGH   0x80   Our PI is the highest (can be made current).

Acquire Flags
KCLLEBA     0x01   BAST has been delivered.
KCLLESHR    0x02   Downgrade to SHARE mode.
KCLLECLS    0x04   About to be closed.
KCLLESCP    0x08   Scan completed.
KCLLERP     0x10   Release processing; enables down-convert.
KCLLEDCL    0x20   On the down-convert list.
KCLLEDCS    0x40   Down-convert has been started.
KCLLEREAL   0x80   Real BAST has arrived during a fake BAST.
KCLLEDFR    0x100  BAST has been deferred once.

More detail is in kcl0.h.

Views (continued)
V$BH
file#:               datafile number
block#:              block number
class#:              class of the block
status:              status of the block (free = not in use, xcur =
                     exclusive, scur = shared current, cr = consistent
                     read, read = reading from disk, mrec = mr mode,
                     irec = ir mode)
xnc:                 number of PCM lock conversions
lock_element_addr:   raw lock element address
lock_element_name:
lock_element_class:
dirty:               (Y) block modified
temp:                (Y) temporary block
ping:                (Y) block pinged
stale:               (Y) block is stale
direct:              (Y) direct block
new:                 (Y) new block
objd:                object number
ts#:                 tablespace number

The STATE column of X$BH can contain the following values:
0 = FREE, 1 = EXLCUR, 2 = SHRCUR, 3 = CR, 4 = READING, 5 = MRECOVERY,
6 = IRECOVERY, 7 = WRITING, 8 = PI

Parameters

_LM_LMS
Default value: min(#CPU/4, 10); 0 if cluster_database is false

GC_FILES_TO_LOCKS
Same values as in Oracle8i, but setting this disables Cache Fusion for
the specified files

Summary

In this lesson, you should have learned about:


Cache fusion implementation levels
Flow of locks and blocks in Cache Fusion


Cache Fusion 1

CR Server

Copyright 2003, Oracle. All rights reserved.

Objectives

After completing this lesson, you should be able to do


the following:
Describe Consistent Read (CR) Cache Fusion
Outline the flow of CR request handling

10-289

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-289

Cache Fusion: Consistent Read Blocks


Node

Other
nodes

Instance

Caches

kcb/kcl
GRD(GCS)
CGS

IPC

NM
CM

10-290

Copyright 2003, Oracle. All rights reserved.

Cache Fusion: Consistent Read Blocks


Cache coherency of Consistent Read (CR) blocks was introduced in Oracle8i. The
configuration of the feature and its detailed implementation are different in Oracle9i,
but the functionality is the same.

DSI408: Real Application Clusters Internals I-290

Consistent Read Review

10-291

Current Block: The most recent version of a block


CR Block: A coherent version of a block with only
the committed changes
CRSCN: SCN for block
CR_Xid: Transaction ID for which block is limited
CR_uba: UBA for transaction: kcbdsxid
CRSfl: Snapshot flag
Snap_SCN: SCN of a snapshot of a block from a
particular point in time
Snap_UBA: UBA at time Snap_SCN
Env_Scn: SCN at current time
Env_uba: UBA of the current transaction
Copyright 2003, Oracle. All rights reserved.

Consistent Read Review


Consistent Read is the Oracle implementation of the read committed isolation level.
There are two possibilities:
Statement level: Query results are consistent with respect to the start of the query
(snap_scn = current SCN when the query starts its Execute phase).
Transaction level: Query results are consistent with respect to the beginning of
the transaction (snap_scn = current SCN when the transaction begins).
Consistent Read sees the world in an asymmetric way: A transaction sees only other
transactions' committed changes, but it does see its own uncommitted changes.
CR_Xid, CR_uba, and CRSfl are available as CR stat structures that are associated
with each buffer in the cache (also available in X$BH).
The Snap_UBA is useful for the consistent read problem of a modified block, for
example:
UPDATE T SET status = status+1 WHERE status > 0
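The statement-level and transaction-level snapshot choices described above can be sketched as follows (an illustrative simplification; the function and its parameters are not Oracle code):

```python
# Choose the snapshot SCN for a consistent read.
def snap_scn(level: str, query_start_scn: int, txn_start_scn: int) -> int:
    if level == "statement":
        # Consistent as of the start of the query's Execute phase.
        return query_start_scn
    # Transaction level: consistent as of the start of the transaction.
    return txn_start_scn
```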

DSI408: Real Application Clusters Internals I-291

Getting a CR Buffer

ktrget:
Initializes a buffer cache CR scan request
Calls kcbgtcr for the best resident buffer to start
from to build the CR buffer
Calls ktrgcm to build the CR buffer by applying
undo
Returns CR buffer to the requestor

kcbgtcr:
Scans the hash bucket for the DBA for buffers that
may be used to build a CR buffer
If successful, returns the best candidate
(selected by ktrexf, the examination function)
If not successful, calls kcbget

10-292

Copyright 2003, Oracle. All rights reserved.

Getting a CR Buffer
All queries start by getting a CR version of the block.

DSI408: Real Application Clusters Internals I-292

Getting a CR Buffer

kcbget:
Retries the scan just tried by kcbgtcr
If you find a buffer, you return it now.
If not, then if it is being READ in or there is a current
mode buffer, you wait until it is available and then
rescan the buffer.
If these fail, you cannot use any locally cached
buffers.

If the above fails:


CR server manages the CR request.

10-293

Copyright 2003, Oracle. All rights reserved.

Getting a CR Buffer (continued)


The CR server was a separate background process in Oracle8i. In Oracle9i and later, the
same functionality is part of the LMS process.
Prior to Oracle8i, instead of issuing a CR request, a ping operation was started to get the
current block from disk.

DSI408: Real Application Clusters Internals I-293

Getting a CR Buffer in Oracle9i Release 2

Owner instance

Requesting instance

UNDO

Current
CR

10-294

CR

Copyright 2003, Oracle. All rights reserved.

Getting a CR Buffer in Oracle9i Release 2


This feature has been available since Oracle8i. In contrast, before Oracle8i the current
block and all undo blocks were pinged across to the requesting instance to construct
the CR buffer at its destination.

DSI408: Real Application Clusters Internals I-294

CR Server in Oracle9i Release 2


The requestor, the master, and the holder exchange interconnect
messages:
1. The requestor FG asks the master for a CR copy and the LOCK in
SHARE mode.
If there is no conflicting mode:
2. The master grants the LOCK.
3. An AST for the conversion is delivered to the requestor.
4. The requestor FG reads the block from disk, since the LOCK is
granted.
Otherwise:
3. The master sends the request to the holder LMS, including the
(port,IP) address for the answer.
4. The holder LMS builds the CR block and stops when it is completed
or I/O is required.
5. The holder LMS asks LGWR to flush the REDO.
6,7,8. LGWR writes the REDO to the log and posts LMS.
9. The holder LMS sends the CR buffer to the requestor.

10-295

Copyright 2003, Oracle. All rights reserved.

CR Server in Oracle9i Release 2


There are three instances involved: the requestor instance, the lock master instance, and
the current block owner instance.
The lock is granted if one of the following is true:
Resource held mode is NULL.
Resource held mode is S and there is no holder of an S lock in the master node.
Otherwise, the master forwards the CR request to the holder node.
If the lock is global, then you choose a node to forward the CR request to as follows:
If there is a past image (PI) at the lock master instance, and the PI SCN is greater
than snap-scn, then the master node is this node.
Otherwise, you choose the PI with the smallest SCN among PIs whose SCN is greater than snap-SCN. The owner node of this PI is the node you forward the CR request to. The PI
with smallest SCN is the most interesting one, because you have less UNDO to be
applied.
If there is no PI at all, you choose the node that the current buffer belongs to.
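The grant-or-forward decision above can be sketched as follows (a hypothetical simplification; the function, its parameters, and the (node, pi_scn) representation are illustrative only):

```python
# Decide whether the master grants the lock or forwards the CR
# request, following the rules above.
def handle_cr_request(held_mode, master_holds_s, master_node,
                      pis, current_owner, snap_scn):
    """pis: list of (node, pi_scn) past images in the cluster."""
    # Grant if the resource is held NULL, or held S with no S holder
    # at the master node.
    if held_mode is None or (held_mode == "S" and not master_holds_s):
        return ("grant", None)
    # Otherwise forward: prefer a PI at the master, then the PI with
    # the smallest SCN above snap_scn (least UNDO to apply), then the
    # node owning the current buffer.
    usable = [(node, scn) for node, scn in pis if scn > snap_scn]
    if any(node == master_node for node, _ in usable):
        return ("forward", master_node)
    if usable:
        return ("forward", min(usable, key=lambda p: p[1])[0])
    return ("forward", current_owner)
```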

DSI408: Real Application Clusters Internals I-295

CR Requests

If there is no usable local buffer:


Construct a message to the LMS master node
for the BL resource covering the block
Message contains:
Lock convert request
Message to the CR server for the requested buffer

10-296

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-296

CR Requests

Resource master node will either:


Grant the lock mode
Forward the CR request to PI or CURRENT holder
node

10-297

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-297

Light Work Rule

The LMS process of the node that the CR request
is forwarded to builds the CR buffer by calling
kcbgtcr or ktrget. LMS stops building the CR
buffer and sends what it has when the light work
rule fires:
I/O is required.
A buffer with the same class, same AFN, and same
blockID but with different objectID is found,
signifying a dropped or truncated object.
Write in progress

10-298

Ship that buffer to the requestor: Requestor


completes the CR build.

Copyright 2003, Oracle. All rights reserved.

Light Work Rule


The CR server only does light work, which does not include I/O.
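A hypothetical restatement of the conditions that fire the light work rule (the flags and function are illustrative, not actual kernel code):

```python
# Return the reason the light work rule fires, or None to keep
# building the CR copy on the serving side.
def light_work_reason(io_required, objd_mismatch, write_in_progress):
    if io_required:
        return "I/O required"
    if objd_mismatch:
        return "dropped or truncated object"
    if write_in_progress:
        return "write in progress"
    return None
```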

DSI408: Real Application Clusters Internals I-298

Fairness

LMS (building the CR buffer) also performs a down


convert of the lock covering the buffer, if:
The block is not UNDO and the lock is held in X
mode
There have been too many CR requests for the buffer
since the last change was made to the block. The
holder pings the block to disk. LMS does this if
there are more than _fairness_threshold CR requests.

10-299

_fairness_threshold default value is 4

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-299

Statistics
global cache gets
global cache get time
global cache converts
global cache convert time
global cache cr blocks received
global cache cr block receive time
global cache current blocks received
global cache current block receive time
global cache cr blocks served
global cache cr block build time
global cache cr block flush time
global cache cr block send time
global cache current blocks served
global cache current block pin time
global cache current block flush time
global cache current block send time
global cache freelist waits
global cache defers
global cache convert timeouts
global cache blocks lost
global cache claim blocks lost
global cache blocks corrupt
global cache prepare failures
global cache skip prepare failures

10-300
Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-300

Wait Events
global cache open s
global cache open x
global cache null to s
global cache null to x
global cache s to x
global cache cr request
global cache cr disk request
global cache busy
global cache freelist wait
global cache bg acks
global cache pending ast
global cache retry prepare
global cache cancel wait
global cache cr cancel wait
global cache pred cancel wait
global cache domain validation
global cache assume wait
global cache recovery free wait
global cache recovery quiesce wait
global cache claim wait

10-301

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-301

Fixed Table X$KCLCRST Statistics


REQCR: CR Request
REQCUR: CURRENT Request
REQDATA: DATA Block Request
REQUNDO: UNDO Block Request
REQTX: UNDO Header Request
RESCUR: CURRENT Result
RESPRIV: Only Readable By Requestor Result
RESZERO: Only Readable By 0 XID Result
RESDISK: Read From Disk Result
RESFAIL: Retry Result
RESWAIT:
FAIRDC: Fairness Down Convert
FAIRCL: Fairness Count Cleared
FREEDC: Fairness Down Convert On Free Lock Element
FLUSH: LMS Has To Wait For A Block Flush
FLUSHQ: Request Put On Log Flush Queue
FLUSHF: Log Flush Queue Full
FLUSHMX: Max Log Flush Time
LIGHT: Light Work Rule Signaled
LIGHT1, LIGHT2
ERROR: Some Error Signaled
HINT, NOCUR, PIPING, PIFAIL, WRITEPI

10-302

Copyright 2003, Oracle. All rights reserved.

Fixed Table X$KCLCRST Statistics


An extract of this table is available in the view V$CR_BLOCK_SERVER (V$BSP in
Oracle8i).

DSI408: Real Application Clusters Internals I-302

CR Requestor-Side Algorithm
ktrget:
BEGIN
Call kcbgtcr to get a best buffer.
Call ktrgcm to apply UNDO (if any) to produce a good CR buffer.
END

kcbgtcr:
BEGIN
Increment "consistent gets"; compute and follow the hash bucket;
for each buffer, call ktrexf to find the best buffer.
If the best buffer is found in the local cache, return it;
otherwise, call kcbzib to get the buffer.
END
10-303

Copyright 2003, Oracle. All rights reserved.

CR Requestor-Side Algorithm
The following statistics are incremented by ktrgcm:
cleanouts and rollbacks - consistent read gets is incremented if UNDO is applied to
BUFFER and CLEANOUT is performed.
rollbacks only - consistent read gets is incremented if UNDO is applied to
BUFFER and no CLEANOUT is performed.
cleanouts only - consistent read gets is incremented if no UNDO is applied and
CLEANOUT is performed.
no work - consistent read gets is incremented if no UNDO is applied and no
CLEANOUT is performed.
When UNDO is applied to produce a CR BUFFER, other UNDO blocks should be read.
When CLEANOUT is performed, the TX transaction table must be read.
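The four statistic buckets above reduce to a simple mapping (an illustrative sketch of the bookkeeping, not the ktrgcm code itself):

```python
# Map (undo applied, cleanout performed) to the statistic that
# ktrgcm increments.
def cr_statistic(undo_applied: bool, cleanout_done: bool) -> str:
    if undo_applied and cleanout_done:
        return "cleanouts and rollbacks - consistent read gets"
    if undo_applied:
        return "rollbacks only - consistent read gets"
    if cleanout_done:
        return "cleanouts only - consistent read gets"
    return "no work - consistent read gets"
```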

DSI408: Real Application Clusters Internals I-303

CR Requestor-Side Algorithm
kcbzib for CR request:
BEGIN
Call kcbzgb to get a buffer and set the state of this buffer to
READING.
If the database is not mounted shared, read the block from disk into
the buffer already allocated and increment the statistic "physical
reads".
Otherwise, call kclgclk, asking to convert the LE to SHARED mode
with the KCLCVCR option.
If bit KCBBHFCR is set or LE mode >= requested mode, read the block
from disk into the buffer already allocated and increment "physical
reads"; otherwise, the CR buffer is received and usable.
END

10-304

Copyright 2003, Oracle. All rights reserved.

CR Requestor-Side Algorithm (continued)


Bit KCBBHFCR is set if a timeout occurs during LE conversion.
LE mode >= requested mode, if the DLM conversion succeeded.
Note: Only the CR case is presented here.

DSI408: Real Application Clusters Internals I-304

CR Requestor-Side Algorithm
kclgclk:
BEGIN
Call kclcls for each buffer to see the status of its LE.
If some LE must be converted or opened, call kclscrs to start the
CR request, then call kclwcrs to wait for the CR to complete.
END

kclcls:
BEGIN
Find or locate the LE.
If the LE is in transition, wait 1 sec on "global cache busy" and
retry.
If the DLM requested mode > LE held mode, allocate a lock context,
link the buffer to the LE, and set bit 0x1 of LE->acquiring.
END
10-305

Copyright 2003, Oracle. All rights reserved.

CR Requestor-Side Algorithm (continued)


kclcls indicates whether some LE has to be opened (first time for buffer) or whether
some LE must be converted, because LE held-mode is smaller than CR requested
mode (S).
If an LE is associated with a global lock and the lock already exists (not NEW), then
you also allocate a lock context and link to the LE. You issue a predecessor-block read
for this LE. This is done because you no longer have the PI in your cache and you
cannot read from the disk because the lock is global.
If the LE mode already allows the requested mode, kclgclk returns and lets kcbzib read the buffer.
LE is in transition if acquiring != 0 or releasing != 0.

DSI408: Real Application Clusters Internals I-305

CR Requestor-Side Algorithm
kclscrs:
BEGIN
For each remaining LE: take the LE and set up a CR request.
If the LE lock is not opened yet, call kjbcropen and set bit 2 of
LE->acquiring.
Else, if the LE lock mode is NULL, call kjbcrconvert and set bit 2
of LE->acquiring.
Otherwise, call kjbpredread and set bit 2 of LE->acquiring.
END

10-306

Copyright 2003, Oracle. All rights reserved.

CR Requestor-Side Algorithm (continued)


In the three cases (lock open, lock convert, or predecessor read), you receive either a
buffer or a lock grant with some differences:
For lock-open or lock-convert, you receive a buffer or a grant.
For predecessor-read:
- You receive a grant and the lock role is converted to local if there is no PI for
the buffer in the cluster.
- You receive a buffer containing the highest PI (sent by some node) in the
cluster.
You call kjbpredread when the lock role is global, the LE is already
opened, and you no longer have any PI in your cache.

DSI408: Real Application Clusters Internals I-306

CR Requestor-Side Algorithm
kclwcrs:
BEGIN
For each CR request not yet examined:
If the request type is "open" or "convert" and a buffer was
received, increment "global cache cr blocks received" and "global
cache cr block receive time", and set the CR request status to
"completed".
If the request type is "predread" and a buffer was received,
increment "global cache current blocks received" and "global cache
current block receive time", and set the request status to
"completed".
If bit 2 of LE->acquiring is cleared, the AST has fired and the
lock is granted in S mode; set the request status to "completed".
While some request is not completed, wait 1 sec on "global cache
cr request" and get the next message.
END

10-307

Copyright 2003, Oracle. All rights reserved.

kclwcrs
The description of kclwcrs is simple, and the code path for error management is not
displayed.

DSI408: Real Application Clusters Internals I-307

CR Requestor-Side AST Delivery


Scenario where the LOCK is granted to the FG:
1. The FG on the requestor node locates the LE.
2. The FG sets bit 0x2 of LE->acquiring.
3. The FG submits the CR request along with the lock request,
including (ip,port) information, to the master node.
4. The FG waits on "global cache cr request".
5. The master LMS notifies the requestor LMS that the LOCK is
granted.
6. The requestor LMS unsets bit 0x2 of the LE with the AST callback
provided by the FG.
7. The requestor LMS posts the FG.

10-308

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-308

CR Requestor-Side CR Buffer Delivery


Scenario where the CR buffer is delivered to the FG:
1. The FG on the requestor node locates the LE.
2. The FG sets bit 0x2 of LE->acquiring.
3. The FG submits the CR request along with the lock request,
including (ip,port) information, to the master node.
4. The FG waits on "global cache cr request".
5. The LMS of the master (holder) node builds the CR buffer.
6. The CR buffer is delivered to the FG with (ip,port) information.

10-309

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-309

CR Server-Side Algorithm
BEGIN
If the request is for a CURRENT block, increment REQCUR and
REQ{DATA|UNDO|TX}, and call kcbgtcr with kclexf as the examination
function to retain only the CURRENT block; otherwise, increment
REQCR and REQDATA, and call ktrget to fabricate the CR buffer.
On error from kcbgtcr or ktrget: if the error is KCBOERLWRx,
increment LIGHTx, else increment ERROR; send the ERROR to the
requestor and increment RESFAIL.
Otherwise, if the buffer state is CR, set the request status to
STATPRIV and increment RES{PRIV|ZERO}; else set the request status
to STATCUR and increment RESCUR.
Then: FLUSH LOG, SEND BACK BUFFER, FAIRNESS MANAGEMENT.
END

10-310

Copyright 2003, Oracle. All rights reserved.

CR Server-Side Algorithm
X$KCLCRST.LIGHTn is incremented if the light work rule fires while the CR block is
building, because of the following reasons:
A buffer is found with the same AFN and BLOCKNUM, but the object-id in the
buffer is different from the object-id that is submitted by the requestor (the object
was DROPPED or TRUNCATED after the consistent read started and before it ended).
A wait for WRITE COMPLETE
A wait because the buffer is in READING state
Buffer is suspended and a free buffer is needed
A wait for free buffer wait
A read block from disk to buffer-cache
A wait for space for redo
A wait for ITL
X$KCLCRST.LIGHT1 is incremented if a block is found with bit modification
started set; in this case the process sleeps some seconds, and when it wakes up, the
same process is still modifying the block.
X$KCLCRST.LIGHT2 is incremented if a buffer is in instance RECOVERY state.
This description of kclgcr is simplified.
DSI408: Real Application Clusters Internals I-310

CR Server-Side Algorithm
kclgcr (FLUSH LOG):
BEGIN
If the REDO is on disk, done.
Otherwise, increment X$KCLCRST.FLUSH.
If there is room in the logflush queue, add a new element to the
queue and increment X$KCLCRST.FLUSHQ.
If the queue is full, increment X$KCLCRST.FLUSHF, call kcrfisd, and
wait on "log file sync" (but only once).
END

10-311

Copyright 2003, Oracle. All rights reserved.

kclgcr
FLUSH LOG
Note: There are no more than 255 elements in the logflush queue.
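The FLUSH LOG decision above can be sketched as follows (a simplified illustration; the counter names follow the slide, everything else is hypothetical):

```python
# Decide how a CR request waits for its redo to reach disk.
def flush_log(redo_on_disk, logflush_queue, stats, max_elems=255):
    if redo_on_disk:
        return "done"
    stats["FLUSH"] += 1                 # X$KCLCRST.FLUSH
    if len(logflush_queue) < max_elems:
        logflush_queue.append("request")
        stats["FLUSHQ"] += 1            # X$KCLCRST.FLUSHQ
        return "queued"
    stats["FLUSHF"] += 1                # X$KCLCRST.FLUSHF
    return "sync-flush"  # call kcrfisd; wait on "log file sync" once
```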

DSI408: Real Application Clusters Internals I-311

CR Server-Side Algorithm
BEGIN
Increment LE.FAIRNESS_COUNTER.
If the request is queued in the LOG FLUSH phase, stop (END 1); the
buffer is sent later during LOGFLUSH queue processing.
Otherwise, send the CR buffer to the requestor and update
statistics.
If the LE held mode is EXCLUSIVE, LE.FAIRNESS_COUNTER >=
_fairness_threshold, and the requested block is not UNDO or an UNDO
header, increment X$KCLCRST.FAIRDC and downgrade the LE to SHARE
mode (END 2); otherwise, stop (END 3).

10-312

Copyright 2003, Oracle. All rights reserved.

kclgcr (continued)
SEND BACK BUFFER and FAIRNESS MANAGEMENT.
At END 1 the buffer is not sent; this is done in LOGFLUSH queue processing.
The following statistics are updated after the CR buffer is sent to the requestor:
global cache cr block build time with time spent in ktrget or kcbgtcr
global cache cr block log flush time with time spent in LOG FLUSH phase
global cache cr block send time with time spent in CR block sending
Note: LE.FAIRNESS_COUNTER is reset at each buffer modification.
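The fairness bookkeeping above can be sketched as follows (an illustrative simplification; the LE is modeled as a dict):

```python
FAIRNESS_THRESHOLD = 4  # _fairness_threshold default

# Decide whether serving this CR copy triggers a down-convert.
def fairness_check(le, block_is_undo, stats):
    le["fairness_counter"] += 1
    if (le["held_mode"] == "X"
            and le["fairness_counter"] >= FAIRNESS_THRESHOLD
            and not block_is_undo):
        stats["FAIRDC"] += 1            # X$KCLCRST.FAIRDC
        le["held_mode"] = "S"           # downgrade LE to SHARE mode
        return True
    return False

def on_block_modified(le):
    le["fairness_counter"] = 0          # reset at each modification
```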

DSI408: Real Application Clusters Internals I-312

CR Server-Side Algorithm
kclqchk:
BEGIN
For each element on the LOGFLUSH queue:
Call kcrfisd to check whether the REDO is on disk.
If the REDO is not on disk and the caller asks for a wait, call
kcrfisd to flush the REDO and wait on "log file sync" if the redo
is not on disk, but only once.
When the REDO is on disk, dequeue the element and send the CR
buffer to the requestor.
END

10-313

Copyright 2003, Oracle. All rights reserved.

kclqchk
LOGFLUSH queue processing.
After the CR buffer is sent to the requestor, the following statistics are updated:
global cache cr block build time with time spent in ktrget or kcbgtcr
global cache cr block log flush time with time spent in LOG FLUSH phase
global cache cr block send time with time spent in CR block sending

DSI408: Real Application Clusters Internals I-313

Summary

In this lesson, you should have learned how to:


Describe CR server functionality
Outline CR processing

10-314

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-314

Cache Fusion 2

Current Block: XCUR

Copyright 2003, Oracle. All rights reserved.

Objectives

After completing this lesson, you should be able to


describe the flow of current blocks in Cache Fusion.

11-317

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-317

Cache Fusion: Current Blocks


Node

Other
nodes

Instance

Caches

kcb/kcl
GRD(GCS)
CGS

IPC

NM
CM

11-318

Copyright 2003, Oracle. All rights reserved.

Cache Fusion: Current Blocks


Cache coherency of current (XCUR) blocks was introduced in Oracle9i.

DSI408: Real Application Clusters Internals I-318

PCM Locks and Resources

PCM DLM locks that are owned by the local


instance are allocated and embedded in an LE
structure.
PCM DLM locks that are owned by remote
instances and mastered by the local instance are
allocated in SHARED_POOL.
LE in
kclle structure

LE_ADDR

X$LE
PCM DLM resource in
kjbr structure
X$KJBR

11-319

KJBLLOCKP-0x60
KJBLRESP

PCM DLM lock in


kjbl structure
X$KJBL

KJBRRESP

Copyright 2003, Oracle. All rights reserved.

PCM Locks and Resources


Fields of interest in the kclle structure: kcllerls or releasing; kcllelnm or
name(id1,id2); kcllemode or held-mode; kclleacq or acquiring; kcllelck or
DLM lock.
Fields of interest in the kjbr structure: resname_kjbr[2] or resource name;
grant_q_kjbr or grant queue; convert_q_kjbr or convert queue;
mode_role_kjbr, which is a bitwise merge of grant mode and role,
interpreted as: NULL (0x00), S (0x01), X (0x02); L0 Local (0x00), G0
Global without PI (0x08), G1 Global with PI (0x18).
The field mode_role_kjbl in kjbl is a bitwise merge of grant, request, and lock
mode: 0x00 if grant NULL; 0x01 if grant S; 0x02 if grant X; 0x04 lock has been opened
at master; 0x08 if global role (otherwise local); 0x10 has one or more PI; 0x20 if request
CR; 0x40 if request S; 0x80 if request X.
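The bit layout of mode_role_kjbl described above can be decoded with a small helper (an illustrative sketch only; the field itself comes from X$KJBL):

```python
# Decode the mode_role_kjbl bitwise merge described above.
def decode_mode_role_kjbl(v: int) -> dict:
    grant = {0x00: "NULL", 0x01: "S", 0x02: "X"}
    return {
        "grant": grant.get(v & 0x03, "?"),
        "opened_at_master": bool(v & 0x04),
        "role": "global" if v & 0x08 else "local",
        "has_pi": bool(v & 0x10),
        "request": ("CR" if v & 0x20 else
                    "S" if v & 0x40 else
                    "X" if v & 0x80 else None),
    }
```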

DSI408: Real Application Clusters Internals I-319

Fusion: Long Example

Three instances.
One block is selected and updated.
Instance 2 is the master of the block resource.

Start:
1. SELECT on I3
2. SELECT on I2
3. UPDATE on I2
4. UPDATE on I1
5. SELECT on I3
6. Write on I1
7. SELECT on I3

11-320

Copyright 2003, Oracle. All rights reserved.

Fusion: Long Example


The SQL for each step is one of:
SELECT * FROM emp WHERE empno = ;
UPDATE emp SET sal = sal + 10 WHERE empno = ; COMMIT;
ALTER SYSTEM CHECKPOINT LOCAL;

The empno is chosen differently in each instance to avoid considering transaction
locks and to limit the flow to PCM locks. All rows are in the same block, which has
number 10 and is in file 8 in the subsequent dumps.
number 10 and is in file 8 in the subsequent dumps.
Step Purpose
1    Lock and block acquisition, remote master
2    Lock and block acquisition, local master, shared block
3    Lock conversion, lock downgrade
4    Block fusion, write/write
5    Block fusion, write/read (CR)
6    Write involves locks, discard PI
7    Block fusion write/read, similar to step 5

DSI408: Real Application Clusters Internals I-320

Fusion: Examples (continued)


You use the following SQL statements to monitor locks and resource states in each instance:
1. SELECT state, mode_held, le_addr, class, dbarfil, dbablk,
cr_scn_bas, cr_scn_wrp
FROM x$bh
WHERE obj IN (SELECT data_object_id
FROM dba_objects
WHERE owner='SCOTT'
AND object_name='EMP')
AND class = 1;
2. SELECT name, le_class, le_rls, le_acq, le_mode, le_write,
le_local
FROM x$le
WHERE le_addr IN (SELECT le_addr
FROM x$bh
WHERE obj IN (SELECT data_object_id
FROM dba_objects
WHERE owner='SCOTT'
AND object_name='EMP')
AND class = 1
AND state != 3 );
3. SELECT r.* FROM x$kjbr r
WHERE r.kjbrname LIKE '%[0x200000a][0x0],[BL]%';
4. SELECT l.kjblname, l.kjblrole, l.kjblgrant, l.kjblrequest,
l.kjbllockst, l.kjblresp
FROM x$kjbl l
WHERE l.kjblname LIKE '%[0x200000a][0x0],[BL]%';
A resource name is (id1, id2), with BL for PCM locks. The id1 and id2 for our block are
derived by:
Id1 = blockno || ( fileno << 22)
= 10 || ( 8 << 22)
= 0x200000a
Id2 = ( fileno >> 10) << 15
= ( 8 >> 10 ) << 15
= 0
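The derivation above translates directly into code (a convenience sketch for computing BL resource names from a file and block number):

```python
# Compute the (id1, id2) BL resource name for a file/block pair,
# using the formulas above.
def bl_resource_name(fileno: int, blockno: int) -> str:
    id1 = blockno | (fileno << 22)
    id2 = (fileno >> 10) << 15
    return f"[0x{id1:x}][0x{id2:x}],[BL]"
```

For file 8, block 10 this yields [0x200000a][0x0],[BL], matching the dumps that follow.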

DSI408: Real Application Clusters Internals I-321

Initial State
X$LE
no rows selected
X$BH
no rows selected
X$KJBR
no rows selected
X$KJBL
no rows selected

Instance 1

Instance 2
(Master)

Instance 3

11-322

Copyright 2003, Oracle. All rights reserved.

Initial State
Initially, nothing has been read into cache or locked, so the queries do not return any
rows.
In displaying the X$KJBR.KJBRNAME in subsequent slides, the column has been
truncated to fit. It has the same value as the X$KJBL.KJBLNAME for these examples.

DSI408: Real Application Clusters Internals I-322

Step 1:
Instance 3 Performs SELECT

Instance 1          Instance 2 (Master)          Instance 3

1. CRREQ(S) from instance 3 to the master
2. Grant(SL0) from the master to instance 3
3. Instance 3 reads the block from disk
4. Notify

11-323

Copyright 2003, Oracle. All rights reserved.

Step 1: Instance 3 Performs SELECT


Because there is no lock yet, the master grants SL0 mode to instance 3. Instance 3 then
reads the block from the disk to its buffer cache.

DSI408: Real Application Clusters Internals I-323

Lock Changes in Instance 3


X$BH

Before

no rows selected

X$LE
no rows selected

X$KJBR
no rows selected

X$KJBL
no rows selected

After

X$BH
no rows selected

X$LE
no rows selected

X$KJBR
KJBRRESP KJBRGRANT KJBRNCVL
KJBRROLE KJBRNAME
KJBRMASTER KJBRGRAN KJBRCVTQ KJBRWRIT
-------- --------- --------- ---------- --------------- ---------- -------- -------- -------22FE343C KJUSERPR KJUSERNL
0 [0x200000a]
1 22884D40 00
00

X$KJBL
KJBLNAME
KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST KJBLRESP
------------------------- ---------- --------- --------- ----------- -------[0x200000a][0x0],[BL]
0 KJUSERPR KJUSERNL GRANTED
22FE343C

11-324

Copyright 2003, Oracle. All rights reserved.

Step 1 (continued): Lock Changes in Instance 3


You see the resources that were created and the local locks that were acquired.

DSI408: Real Application Clusters Internals I-324

Lock Changes in Instance 2


X$BH

Before

no rows selected

X$LE
no rows selected

X$KJBR
no rows selected

X$KJBL
no rows selected

X$BH
STATE
MODE_HELD LE_ADDR CLASS
DBARFIL
DBABLK
CR_SCN_BAS CR_SCN_WRP
---------- ---------- -------- ---------- ---------- ---------- ---------- ---------2
0 24FF9030
1
8
10
0
0

X$LE
NAME LE_CLASS
LE_RLS
LE_ACQ
LE_MODE
LE_WRITE
LE_LOCAL
---------- ---------- ---------- ---------- ---------- ---------- ---------33554442
0
0
0
3
0
1

X$KJBR
no rows selected

X$KJBL
KJBLNAME
KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST
KJBLRESP
------------------------- ---------- --------- --------- ------------ -------[0x200000a][0x0],[BL]
0 KJUSERPR KJUSERNL GRANTED
00

11-325

Copyright 2003, Oracle. All rights reserved.

Step 1 (continued): Lock Changes in Instance 2


The X$BH.STATE is 2, which means that the buffer is shared current.

DSI408: Real Application Clusters Internals I-325

After

Step 2:
Instance 2 Performs SELECT
Instance 1          Instance 2 (Master)          Instance 3

1. CRREQ(S) from instance 2 to the master (itself)
2. Grant(SL0)
3. Instance 2 reads the block from disk
4. Notify

11-326

Copyright 2003, Oracle. All rights reserved.

Step 2: Instance 2 Performs SELECT


The master grants SL0 mode to instance 2, because:
There is an S lock on the resource (owned by instance 3).
There is no S lock on the same resource in the master.
Instance 2 then reads the block from the disk to the BUFFER-CACHE.
The behavior changes if there is an S lock on the master or
_cr_grant_local_role is TRUE. In this case, the master forwards the CR request
to an instance owner of the S lock (instance 3). This instance sends the current buffer (as
a lock in S mode) to instance 2.
The default value for _cr_grant_local_role is FALSE.

DSI408: Real Application Clusters Internals I-326

Lock Changes in Instance 2


Before

X$BH
no rows selected

X$LE
no rows selected

X$KJBR
KJBRRESP KJBRGRANT KJBRNCVL
KJBRROLE KJBRNAME
KJBRMASTER KJBRGRAN KJBRCVTQ KJBRWRIT
-------- --------- --------- ---------- --------------- ---------- -------- -------- -------22FE343C KJUSERPR KJUSERNL
0 [0x200000a]
1 22884D40 00
00

X$KJBL
KJBLNAME
KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST KJBLRESP
------------------------- ---------- --------- --------- ----------- -------[0x200000a][0x0],[BL]
0 KJUSERPR KJUSERNL GRANTED
22FE343C

X$BH
STATE
MODE_HELD LE_ADDR
CLASS
DBARFIL
DBABLK CR_SCN_BAS CR_SCN_WRP
---------- ---------- -------- ---------- ---------- ---------- ---------- ---------2
0 253F3A10
1
8
10
0
0

After

X$LE
NAME
LE_CLASS
LE_RLS
LE_ACQ
LE_MODE
LE_WRITE
LE_LOCAL
---------- ---------- ---------- ---------- ---------- ---------- ---------33554442
0
0
0
3
0
1

X$KJBR
KJBRRESP KJBRGRANT KJBRNCVL
KJBRROLE KJBRNAME
KJBRMASTER KJBRGRAN KJBRCVTQ KJBRWRIT
-------- --------- --------- ---------- --------------- ---------- -------- -------- -------22FE343C KJUSERPR KJUSERNL
0 [0x200000a]
1 22884D40 00
00

X$KJBL
KJBLNAME
KJBLROLE KJBLGRANT
------------------------- ---------- --------[0x200000a][0x0],[BL]
0 KJUSERPR
[0x200000a][0x0],[BL]
0 KJUSERPR

11-327

KJBLREQUE
--------KJUSERNL
KJUSERNL

KJBLLOCKST
----------GRANTED
GRANTED

KJBLRESP
-------22FE343C
22FE343C

Copyright 2003, Oracle. All rights reserved.

Step 2 (continued): Lock Changes in Instance 2


The X$BH.STATE is 2, which is shared current.
The X$LE.NAME value 33554442 is 0x200000A.

DSI408: Real Application Clusters Internals I-327

Step 3:
Instance 2 Performs UPDATE
Instance 1          Instance 2 (Master)          Instance 3

1. LREQ(X) from instance 2 to the master (itself)
2. PING(X,Node2) from the master to instance 3
3. Instance 3 makes its buffer CR
4. Instance 3 sends the buffer to the requestor
5. ASSUME(XL0,close) from instance 2 to the master

11-328

Copyright 2003, Oracle. All rights reserved.

Step 3: Instance 2 Performs UPDATE


Instance 2, the requestor, sends an X request to the master (itself).
The Master (instance 2) sends ping X to the S lock holder (instance 3).
Instance 3 converts the buffer state from S CURRENT to CR and closes the lock.
Instance 3 sends the buffer to the requestor (instance 2).
The requestor (instance 2) sends ASSUME to the master (itself) for lock mode and tells
the master that the previous holder (instance 3) has closed the lock.

DSI408: Real Application Clusters Internals I-328

Lock Changes in Instance 2


Before

X$BH
no rows selected

X$LE
no rows selected

X$KJBR
KJBRRESP KJBRGRANT KJBRNCVL
KJBRROLE KJBRNAME
KJBRMASTER KJBRGRAN KJBRCVTQ KJBRWRIT
-------- --------- --------- ---------- --------------- ---------- -------- -------- -------22FD8B24 KJUSERPR KJUSERNL
0 [0x200000a]
1 22882980 00
00

X$KJBL
KJBLNAME
KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST KJBLRESP
------------------------- ---------- --------- --------- ----------- -------[0x200000a][0x0],[BL]
0 KJUSERPR KJUSERNL GRANTED
22FD8B24

X$BH
STATE
MODE_HELD
LE_ADDR
CLASS
DBARFIL
DBABLK CR_SCN_BAS CR_SCN_WRP
---------- ---------- -------- ---------- ---------- ---------- ---------- ---------1
0 253ECED0
1
8
10
0
0

After

X$LE
NAME
LE_CLASS
LE_RLS
LE_ACQ
LE_MODE
LE_WRITE
LE_LOCAL
---------- ---------- ---------- ---------- ---------- ---------- ---------33554442
0
0
0
5
0
1

X$KJBR
KJBRRESP KJBRGRANT KJBRNCVL KJBRROLE
KJBRNAME
KJBRMASTER KJBRGRAN KJBRCVTQ KJBRWRIT
-------- --------- --------- ---------- --------------- ---------- -------- -------- -------22FD8B24 KJUSEREX KJUSERNL 0
[0x200000a]
1 253ECF30 00
00

X$KJBL
KJBLNAME
KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST KJBLRESP
------------------------- ---------- --------- --------- ----------- -------[0x200000a][0x0],[BL]
0 KJUSEREX KJUSERNL GRANTED
22FD8B24

11-329

Copyright 2003, Oracle. All rights reserved.

Step 3 (continued): Lock Changes in Instance 2


The X$BH.STATE shows 1, which is buffer current (X CURRENT).
The X$KJBR.KJBRROLE shows 0, signifying that the lock owned by instance 2 is
XL0, which implies that the lock owned by instance 3 is closed.

DSI408: Real Application Clusters Internals I-329

Lock Changes in Instance 3

X$BH

Before

STATE MODE_HELD LE_ADDR


CLASS
DBARFIL
DBABLK CR_SCN_BAS CR_SCN_WRP
--------- ---------- -------- ---------- ---------- ---------- ---------- ---------2
0 253F2690
1
8
10
0
0

X$LE
NAME LE_CLASS
LE_RLS
LE_ACQ
LE_MODE
LE_WRITE
LE_LOCAL
--------- ---------- ---------- ---------- ---------- ---------- ---------33554442
0
0
0
3
0
1

X$KJBR
no rows selected

X$KJBL
KJBLNAME
KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST KJBLRESP
------------------------- ---------- --------- --------- ----------- -------[0x200000a][0x0],[BL]
0 KJUSERPR KJUSERNL GRANTED
00

X$BH
STATE MODE_HELD LE_ADDR
CLASS
DBARFIL
DBABLK CR_SCN_BAS CR_SCN_WRP
---------- ---------- -------- ---------- ---------- ---------- ---------- ---------3
0 00
1
8
10
1423681
0

X$LE
no rows selected

X$KJBR
no rows selected

X$KJBL
no rows selected

11-330

Copyright 2003, Oracle. All rights reserved.

Step 3 (continued): Lock Changes in Instance 3


The X$BH.STATE changes from 2 to 3 (that is, S changes to CR).
There are no rows for the lock because it has been closed.

DSI408: Real Application Clusters Internals I-330


Step 4: Instance 1 Performs UPDATE

[Diagram: message flow between Instance 1, Instance 2 (Master), and Instance 3]
1 LREQ(X)
2 PING(X)
3 Set lock to NG1
4 Buffer X CURRENT to PI
5 Send block
6 ASSUME(XG0, NG1)

11-331


Step 4: Instance 1 Performs UPDATE


Instance 1, the requestor, sends an X request to the master (instance 2).
The master (instance 2) sends ping X to the X lock holder (itself).
Instance 2 converts the buffer state from the local X CURRENT to PI.
Instance 2 sends the buffer to the requestor (instance 1).
The requestor (instance 1) sends ASSUME to the master (instance 2) for lock mode and
tells the master that instance 1 has a global X lock and instance 2 has a global NULL
lock.
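The Step 4 transitions above can be sketched as a small state machine. This is an illustrative model only (the dict shapes and function name are invented, not Oracle code): the holder's X CURRENT buffer becomes a PI, the block is shipped, and ASSUME records the new global lock states.

```python
# Hypothetical sketch of the Step 4 flow: instance 1 requests X,
# the holder converts its buffer to PI and ships the block, and
# ASSUME records the new global modes at the master.

def ping_for_exclusive(holder, requestor):
    """Transfer the current block from holder to requestor.

    Each instance is modeled as a dict with 'buffer'
    ('X CURRENT', 'PI', or None) and 'lock' (e.g. 'XL0').
    """
    assert holder["buffer"] == "X CURRENT"
    holder["buffer"] = "PI"             # step 3/4: local X CURRENT becomes PI
    requestor["buffer"] = "X CURRENT"   # step 5: block shipped to requestor
    # step 6: ASSUME tells the master the new global lock states
    requestor["lock"] = "XG0"           # global X, no PI on the requestor yet
    holder["lock"] = "NG1"              # NULL mode, global role, one PI
    return requestor["lock"], holder["lock"]

inst1 = {"buffer": None, "lock": None}
inst2 = {"buffer": "X CURRENT", "lock": "XL0"}
print(ping_for_exclusive(inst2, inst1))  # ('XG0', 'NG1')
```

The query output on the following pages matches this model: instance 2 ends up with STATE=8 (PI) and role 24 (G1), instance 1 with STATE=1 (X CURRENT) and role 8 (G0).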

DSI408: Real Application Clusters Internals I-331

Lock Changes in Instance 2


Before

X$BH
STATE  MODE_HELD LE_ADDR     CLASS    DBARFIL     DBABLK CR_SCN_BAS CR_SCN_WRP
------ --------- -------- -------- ---------- ---------- ---------- ----------
     1         0 253ECED0        1          8         10          0          0

X$LE
      NAME   LE_CLASS     LE_RLS     LE_ACQ    LE_MODE   LE_WRITE   LE_LOCAL
---------- ---------- ---------- ---------- ---------- ---------- ----------
  33554442          0          0          0          5          0          1

X$KJBR
KJBRRESP KJBRGRANT KJBRNCVL    KJBRROLE KJBRNAME        KJBRMASTER KJBRGRAN KJBRCVTQ KJBRWRIT
-------- --------- --------- ---------- --------------- ---------- -------- -------- --------
22FD8B24 KJUSEREX  KJUSERNL           0 [0x200000a]              1 253ECF30 00       00

X$KJBL
KJBLNAME                    KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST  KJBLRESP
------------------------- ---------- --------- --------- ----------- --------
[0x200000a][0x0],[BL]              0 KJUSEREX  KJUSERNL  GRANTED     22FD8B24

After

X$BH
STATE  MODE_HELD LE_ADDR     CLASS    DBARFIL     DBABLK CR_SCN_BAS CR_SCN_WRP
------ --------- -------- -------- ---------- ---------- ---------- ----------
     8         0 253ECED0        1          8         10    1423699          0

X$LE
      NAME   LE_CLASS     LE_RLS     LE_ACQ    LE_MODE   LE_WRITE   LE_LOCAL
---------- ---------- ---------- ---------- ---------- ---------- ----------
  33554442          0          0          0          0          0          0

X$KJBR
KJBRRESP KJBRGRANT KJBRNCVL    KJBRROLE KJBRNAME        KJBRMASTER KJBRGRAN KJBRCVTQ KJBRWRIT
-------- --------- --------- ---------- --------------- ---------- -------- -------- --------
22FD8B24 KJUSEREX  KJUSERNL           8 [0x200000a]              1 253ECF30 00       00

X$KJBL
KJBLNAME                    KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST  KJBLRESP
------------------------- ---------- --------- --------- ----------- --------
[0x200000a][0x0],[BL]             24 KJUSERNL  KJUSERNL  GRANTED     22FD8B24
[0x200000a][0x0],[BL]              8 KJUSEREX  KJUSERNL  GRANTED     22FD8B24

11-332


Step 4 (continued): Lock Changes in Instance 2


The X$BH.STATE switches from X CURRENT to PI.
The X$KJBL.KJBLROLE value of 24 is 0x18, that is, 0x08 (global role) + 0x10 (has a PI), indicating the G1 role.

DSI408: Real Application Clusters Internals I-332

Lock Changes in Instance 1

Before

X$BH
no rows selected

X$LE
no rows selected

X$KJBR
no rows selected

X$KJBL
no rows selected

After

X$BH
STATE  MODE_HELD LE_ADDR     CLASS    DBARFIL     DBABLK CR_SCN_BAS CR_SCN_WRP
------ --------- -------- -------- ---------- ---------- ---------- ----------
     1         0 253F8A80        1          8         10          0          0

X$LE
      NAME   LE_CLASS     LE_RLS     LE_ACQ    LE_MODE   LE_WRITE   LE_LOCAL
---------- ---------- ---------- ---------- ---------- ---------- ----------
  33554442          0          0          0          5          0          0

X$KJBR
no rows selected

X$KJBL
KJBLNAME                    KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST  KJBLRESP
------------------------- ---------- --------- --------- ----------- --------
[0x200000a][0x0],[BL]              8 KJUSEREX  KJUSERNL  GRANTED     00

11-333


Step 4 (continued): Lock Changes in Instance 1


The X$BH.STATE is 1, which is current exclusive.

DSI408: Real Application Clusters Internals I-333


Step 5: Instance 3 Performs SELECT

[Diagram: message flow between Instance 3, Instance 2 (Master), and Instance 1]
1 CRREQ(S)
2 Build CR buffer
3 Send CR buffer

11-334


Step 5: Instance 3 Performs SELECT


Instance 3 (the requestor) sends a CRREQ(S) to the master (instance 2).
The master (instance 2) chooses the CR server as follows:
- If the resource role is G0, the master chooses the instance holding the highest PI (instance 2 in this example).
- If the resource role is G1, the master chooses the instance whose PI SCN is closest to the SCN requested in the CRREQ.
- If the resource role is XL0, the master chooses the instance holding the current buffer.
The master (instance 2) forwards the CRREQ to the chosen instance (itself).
The chosen instance (instance 2) builds the CR buffer and ships it to instance 3.
No DLM lock is opened for instance 3.
Note: Step 7, as previously described in the slide on page 5, is very similar to this step and is therefore not shown in detail later.
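The master's CR-server choice can be sketched as a selection function. This is a hedged illustration of the rules as stated above (the function and parameter names are invented; `pis` models the PI holders the master knows about):

```python
# Sketch of the master's CR-server choice, keyed by resource role.
# pis: list of (instance, pi_scn) tuples known to the master.

def choose_cr_server(role, pis, current_holder, requested_scn):
    if role == "G0":
        # take the holder of the highest PI
        return max(pis, key=lambda p: p[1])[0]
    if role == "G1":
        # take the holder whose PI SCN is closest to the requested SCN
        return min(pis, key=lambda p: abs(p[1] - requested_scn))[0]
    # local (XL0) resource: the instance holding the current buffer serves
    return current_holder

print(choose_cr_server("G1", [("inst2", 100), ("inst1", 140)], "inst1", 105))
```

In the example walkthrough, the chosen instance is the master itself, so the CRREQ forwarding is a local operation.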

DSI408: Real Application Clusters Internals I-334

Lock Changes in Instance 3

Before

X$BH
STATE  MODE_HELD LE_ADDR     CLASS    DBARFIL     DBABLK CR_SCN_BAS CR_SCN_WRP
------ --------- -------- -------- ---------- ---------- ---------- ----------
     3         0 00              1          8         10    1423681          0

X$LE
no rows selected

X$KJBR
no rows selected

X$KJBL
no rows selected

After

X$BH
STATE  MODE_HELD LE_ADDR     CLASS    DBARFIL     DBABLK CR_SCN_BAS CR_SCN_WRP
------ --------- -------- -------- ---------- ---------- ---------- ----------
     3         0 00              1          8         10    1423681          0
     3         0 00              1          8         10    1423821          0

X$LE
no rows selected

X$KJBR
no rows selected

X$KJBL
no rows selected

11-335


Step 5 (continued): Lock Changes in Instance 3


An additional CR buffer is now in use on instance 3; it was shipped by the CR server (instance 2).

DSI408: Real Application Clusters Internals I-335

Step 6: Instance 1 Performs WRITE

[Diagram: message flow between Instance 1, Instance 2 (Master), and Instance 3]
1 REQW
2 REQW
3 WRITE
4 NOTIFY
5 WNOTIFY
6 Set role Local to LE & DLM lock
7 Make PI buffer to CR

11-336


Step 6: Instance 1 Performs WRITE


Instance 1 (the requestor) sends a W request (write request from client to master) to the
master (instance 2).
The master (instance 2) registers the SCN of the block to be written (in the DLM
resource) to remember that there is a pending write. The master does not grant another
write; it sends a W request to instance 1, because instance 1 has the highest SCN (the
current block).
Instance 1 writes the buffer by linking it on the ping queue. DBWR performs the write.
Instance 1 sends a W notification to the master (instance 2).
The master (instance 2) sets the Local role on the resource and sends FLUSH_PI to every
instance holding a PI (in this case, itself). An instance that receives this converts its PI
buffer to a CR buffer and releases the associated LE.
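The Step 6 message sequence can be modeled as an ordered log. This is a rough sketch only (the function and tuple shapes are invented; the diagram's NOTIFY to other interested instances is omitted for brevity):

```python
# Sketch of the Step 6 write protocol as an ordered message log.
# Message names follow the slide; instances are plain strings.

def write_protocol(master, current_holder, pi_holders):
    log = []
    log.append(("REQW", current_holder, master))     # 1: client asks master to write
    log.append(("REQW", master, current_holder))     # 2: master picks the highest-SCN holder
    log.append(("WRITE", current_holder))            # 3: DBWR writes via the ping queue
    log.append(("WNOTIFY", current_holder, master))  # 5: write-completion notification
    for inst in pi_holders:
        log.append(("FLUSH_PI", master, inst))       # PI -> CR, release the LE
    return log

for msg in write_protocol("inst2", "inst1", ["inst2"]):
    print(msg)
```

Running the model for the example (instance 1 holds the current block, instance 2 is master and the only PI holder) reproduces the ordering shown in the diagram.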

DSI408: Real Application Clusters Internals I-336

Lock Changes in Instance 2


Before

X$BH
STATE  MODE_HELD LE_ADDR     CLASS    DBARFIL     DBABLK CR_SCN_BAS CR_SCN_WRP
------ --------- -------- -------- ---------- ---------- ---------- ----------
     8         0 253ECED0        1          8         10    1423699          0

X$LE
      NAME   LE_CLASS     LE_RLS     LE_ACQ    LE_MODE   LE_WRITE   LE_LOCAL
---------- ---------- ---------- ---------- ---------- ---------- ----------
  33554442          0          0          0          0          0          0

X$KJBR
KJBRRESP KJBRGRANT KJBRNCVL    KJBRROLE KJBRNAME        KJBRMASTER KJBRGRAN KJBRCVTQ KJBRWRIT
-------- --------- --------- ---------- --------------- ---------- -------- -------- --------
22FD8B24 KJUSEREX  KJUSERNL           8 [0x200000a]              1 253ECF30 00       00

X$KJBL
KJBLNAME                    KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST  KJBLRESP
------------------------- ---------- --------- --------- ----------- --------
[0x200000a][0x0],[BL]             24 KJUSERNL  KJUSERNL  GRANTED     22FD8B24
[0x200000a][0x0],[BL]              8 KJUSEREX  KJUSERNL  GRANTED     22FD8B24

After

X$BH
STATE  MODE_HELD LE_ADDR     CLASS    DBARFIL     DBABLK CR_SCN_BAS CR_SCN_WRP
------ --------- -------- -------- ---------- ---------- ---------- ----------
     3         0 00              1          8         10    1423699          0

X$LE
no rows selected

X$KJBR
KJBRRESP KJBRGRANT KJBRNCVL    KJBRROLE KJBRNAME        KJBRMASTER KJBRGRAN KJBRCVTQ KJBRWRIT
-------- --------- --------- ---------- --------------- ---------- -------- -------- --------
22FD8B24 KJUSEREX  KJUSERNL           0 [0x200000a]              1 253ECF30 00       00

X$KJBL
KJBLNAME                    KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST  KJBLRESP
------------------------- ---------- --------- --------- ----------- --------
[0x200000a][0x0],[BL]              0 KJUSERNL  KJUSERNL  GRANTED     22FD8B24
[0x200000a][0x0],[BL]              0 KJUSEREX  KJUSERNL  GRANTED     22FD8B24

11-337


Step 6 (continued): Lock Changes in Instance 2


The X$BH.STATE goes from 8 to 3, that is, from a PI to a CR buffer.
X$LE shows that no LE locks are covering the CR buffer.
The X$KJBL.KJBLROLE goes to 0, indicating that locks are now local.

DSI408: Real Application Clusters Internals I-337

Lock Changes in Instance 1


Before

X$BH
STATE  MODE_HELD LE_ADDR     CLASS    DBARFIL     DBABLK CR_SCN_BAS CR_SCN_WRP
------ --------- -------- -------- ---------- ---------- ---------- ----------
     1         0 253F8A80        1          8         10          0          0

X$LE
      NAME   LE_CLASS     LE_RLS     LE_ACQ    LE_MODE   LE_WRITE   LE_LOCAL
---------- ---------- ---------- ---------- ---------- ---------- ----------
  33554442          0          0          0          5          0          0

X$KJBR
no rows selected

X$KJBL
KJBLNAME                    KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST  KJBLRESP
------------------------- ---------- --------- --------- ----------- --------
[0x200000a][0x0],[BL]              8 KJUSEREX  KJUSERNL  GRANTED     00

After

X$BH
STATE  MODE_HELD LE_ADDR     CLASS    DBARFIL     DBABLK CR_SCN_BAS CR_SCN_WRP
------ --------- -------- -------- ---------- ---------- ---------- ----------
     1         0 253F8A80        1          8         10          0          0

X$LE
      NAME   LE_CLASS     LE_RLS     LE_ACQ    LE_MODE   LE_WRITE   LE_LOCAL
---------- ---------- ---------- ---------- ---------- ---------- ----------
  33554442          0          0          0          5          0          0

X$KJBR
no rows selected

X$KJBL
KJBLNAME                    KJBLROLE KJBLGRANT KJBLREQUE KJBLLOCKST  KJBLRESP
------------------------- ---------- --------- --------- ----------- --------
[0x200000a][0x0],[BL]              0 KJUSEREX  KJUSERNL  GRANTED     00

11-338


Step 6 (continued): Lock Changes in Instance 1


The lock becomes X local, as shown in X$KJBL.KJBLROLE.

DSI408: Real Application Clusters Internals I-338


Tables and Views

X$KJBL
Every PCM lock, local or remote
If remote, then associated resource is mastered by
this instance.
Callback routine in kjblftc

X$KJBR
PCM resources mastered by local instance

11-339


Tables and Views


X$KJBL WebIV Note:159906.1
X$KJBR (No WebIV note)

DSI408: Real Application Clusters Internals I-339

Tables and Views (continued)


X$KJBL
Column       Type          Notes
KJBLLOCKP    RAW(4)        PCM lock address
KJBLGRANT    VARCHAR2(9)   lock grant mode
KJBLREQUEST  VARCHAR2(9)   lock request mode if the lock is in CONVERTING state
KJBLROLE     NUMBER        0x18 if G1, 0x8 if G0, 0x0 if local; bit values:
                           0x00 grant NULL; 0x01 grant S; 0x02 grant X;
                           0x04 lock has been opened at master;
                           0x08 global role, otherwise local;
                           0x10 has one or more PIs;
                           0x20 request CR; 0x40 request S; 0x80 request X
KJBLRESP     RAW(4)        masterized on local instance: resource address;
                           masterized by other instances: 0
KJBLNAME     VARCHAR2(30)  resource name: [id1(hex)][id2(hex)],[BL]
KJBLNAME2    VARCHAR2(30)  resource name: id1(decimal),id2(decimal),BL
KJBLQUEUE    NUMBER        0 if on grant-queue, 8 if on convert-queue
KJBLLOCKST   VARCHAR2(64)  lock state: GRANTED, OPENING, CONVERTING
KJBLWRITING  NUMBER        4 if asking for write
KJBLREQWRIT  NUMBER        2 if requesting write
KJBLOWNER    NUMBER        owner instance of this lock
KJBLMASTER   NUMBER        master instance of the resource
KJBLBLOCKED  NUMBER        different from 0 if CONVERTING
KJBLBLOCKER  NUMBER        nonzero if there is a lock L1 at the head of the
                           convert-queue and the grant mode of this lock
                           conflicts with L1's request mode; 0 if the
                           associated resource is not masterized by this
                           instance

X$KJBR
Column       Type          Notes
KJBRRESP     RAW(4)        PCM resource address
KJBRGRANT    VARCHAR2(9)   resource held mode
KJBRNCVL     VARCHAR2(9)   request mode of the lock at the head of the
                           convert-queue (KJUSERNL if nonexistent)
KJBRROLE     NUMBER        mode and role combined bitwise:
                           0x00 if NULL; 0x01 if S; 0x02 if X;
                           0x08 if G0 (global role, no PI);
                           0x18 if G1 (global role, one or more PIs)
KJBRNAME     VARCHAR2(30)  resource name, format [id1][id2],[BL]
KJBRMASTER   NUMBER        master instance (always the local instance)
KJBRGRANTQ   RAW(4)        lock address at head of grant-queue
KJBRCVTQ     RAW(4)        lock address at head of convert-queue
KJBRWRITER   RAW(4)        lock address elected for WRITE
DSI408: Real Application Clusters Internals I-340

Summary

In this lesson, you should have learned how to


describe the flow control of Cache Fusion for the CR
server.

11-341


DSI408: Real Application Clusters Internals I-341

Cache Fusion Recovery


Objectives

After completing this lesson, you should be able to do the following:
Explain Cache Fusion recovery implementation
Examine the recovery/cache interface
Examine the recovery/DLM interface
Describe the basic Cache Fusion recovery algorithm

12-343


DSI408: Real Application Clusters Internals I-343

Non-Cache Fusion OPS and Database Recovery

12-344

The on-disk version of a block is always the starting point for recovery.
Only the changes from a single redo thread must be applied to the disk version.


Non-Cache Fusion OPS and Database Recovery


In a pre-Oracle9i OPS system, when a buffer that is modified by instance A is requested
by another instance B, A must write its dirty buffer to the disk before B can read it. This is
disk-based cache coherency. The algorithm implies that a given block can only be
different in one instance (both the cache and the redo log thereof) from the disk version.
Note: In Cache Fusion, the two statements in the slide no longer hold.

DSI408: Real Application Clusters Internals I-344

Cache Fusion RAC and Database Recovery

12-345

The starting point for recovery of a block is its most recent PI version.
Located someplace in the global cache
On-disk version used only if no PI is available
Redo threads for all failed instances must be merged for instance or crash recovery.


Cache Fusion RAC and Database Recovery


In Oracle9i Cache Fusion, an instance ships the contents of its buffer to the requesting
instance after doing a log force but without writing the block to the disk. The sending
instance's buffer becomes a past image (PI) and cannot be modified further. The
requesting instance now has the current block in exclusive mode. The on-disk version of
the block does not contain the changes that are made by either instance.
Cache Fusion does not affect media recovery, which starts at the restored backup and
applies changes from the merged redo threads of all instances in the RAC cluster.
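The recovery starting point described on the slide (the most recent PI anywhere in the global cache, else the on-disk version) can be expressed as a small helper. This is an illustrative sketch only; the function name and the `{instance: pi_scn}` shape are invented:

```python
# Illustrative rule: recovery of a block starts from its most recent PI
# in any surviving cache, falling back to the on-disk version otherwise.

def recovery_start_version(pi_versions, disk_scn):
    """pi_versions: {instance: pi_scn} found in surviving caches."""
    if pi_versions:
        inst = max(pi_versions, key=pi_versions.get)
        return ("PI", inst, pi_versions[inst])
    return ("disk", None, disk_scn)

print(recovery_start_version({"inst2": 900, "inst3": 950}, 800))
print(recovery_start_version({}, 800))
```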

DSI408: Real Application Clusters Internals I-345

Overview of Fusion Lock States

Lock state: Two letters and a digit (mode, role, PI count)
Example: XG1 is exclusive mode, global role, 1 past image.

Lock Mode   Valid Lock Role, PI Count
X           L0, G0, G1
S           L0, G0, G1
N           G1, G2

12-346


Overview of Fusion Lock States


A PCM Fusion lock has three dimensions: lock mode, lock role, and past-image count.
These dimensions together are used to maintain cache coherency in a fusion environment.
The set of lock modes remains unchanged: Exclusive (X), Shared (S), and Null (N). Lock
roles describe local or global interest in the resource. The past-image count indicates the
number of PI buffers that are maintained in the lock. The set of valid lock states is a subset
of the total combination space: XL0, XG0, XG1, SL0, SG0, SG1, NG1, NG2.
Null (N): No examine or modify rights
Share (S): May examine block
Exclusive (X): May modify and create new version of the block
Local (L): Locally managed lock; block can only be dirty in this cache
Global (G): Globally managed lock; may be dirty in more than one cache; must
coordinate with DLM for write
PI count 0: No past image
PI count 1: Past image present. More than one past image can be present.

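The valid-state subset quoted above can be checked mechanically against the mode/role/PI rules. This small sketch (function name invented) parses each state string and asserts the constraints stated in the text:

```python
# Consistency check of the valid Fusion lock states: each state is
# (mode)(role)(PI count); local locks carry no PI, and NULL-mode locks
# are only meaningful with a global role and at least one PI.

VALID_STATES = {"XL0", "XG0", "XG1", "SL0", "SG0", "SG1", "NG1", "NG2"}

def parse_state(state):
    mode, role, pi = state[0], state[1], int(state[2:])
    assert mode in "XSN" and role in "LG"
    return mode, role, pi

for s in VALID_STATES:
    mode, role, pi = parse_state(s)
    if role == "L":
        assert pi == 0               # a locally managed lock holds no PI
    if mode == "N":
        assert role == "G" and pi >= 1
```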
DSI408: Real Application Clusters Internals I-346

Instance or Crash Recovery

SMON from a surviving instance performs the


thread recovery of the failed instance.
Foreground process performs the recovery when
all instances have failed (crash recovery).
Cache fusion recovery builds on the two-pass log
read recovery mechanisms.
First-pass log read
Recovery claim locking
Second-pass redo application

12-347


Instance or Crash Recovery


The SMON from a surviving instance does thread recovery of the failed instance. If a
foreground process detects instance recovery, it posts SMON; foreground processes no
longer do instance recovery.
Crash recovery may be considered a special case of instance recovery whereby all
instances have failed. In both cases, the threads from failed nodes must be merged. The
distinction is that, in instance recovery, SMON performs the recovery. In crash recovery, a
foreground performs recovery.
Cache fusion recovery builds on the two-pass log read recovery mechanisms.
First-pass log read
- Recovery set
- Block Written Records (BWR)
Recovery claim locking
- IDLM Communication
Second-pass redo application
Note: Recovery claim locking is the RAC component of two-pass log read recovery.

DSI408: Real Application Clusters Internals I-347

SMON Process

SMON performs the instance recovery.


The foreground process performs the crash
recovery.
The PMON or the foreground process performs the
block recovery.

SMON acquires IR enqueue.


This avoids multiple, simultaneous recoveries.
Enqueues are now available before blocks
(optimization).
This allows recovery and remastering to take place
in parallel.

12-348


DSI408: Real Application Clusters Internals I-348

First-Pass Log Read

Reads and merges redo threads of the failed


instance
Creates a hash table of blocks that are not known
to have been written to disk
Uses the Block Written Records
Does not use the buffer cache

12-349

Does not advance the checkpoint SCN


First-Pass Log Read


The first-pass log read reads redo threads of the failed instance and merges the results:
By SCN
RBA of last incremental checkpoint for each thread
Modified blocks are added to recovery set
The recovery set contains the first-dirty and last-dirty version information (SCN, Seq#) of
each block.
The process relies on Block Written Records (BWRs):
BWRs identify blocks in the recovery set that can be removed
All holders of flushed PIs write out a BWR
The first pass creates a hash table of blocks that are not known to have been written to the
disk; the hash table is the input for the second pass.
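The first-pass logic above can be sketched as a single scan over the merged redo. This is a simplified model (the record shapes and function name are invented): changes set or update the per-DBA first-dirty/last-dirty versions, and a BWR at or past the last-dirty version trims the entry.

```python
# Sketch of the first-pass log read: build a recovery set keyed by DBA,
# tracking (first-dirty, last-dirty) versions, and trim it with BWRs.

def first_pass(redo_records):
    """redo_records: SCN-ordered (kind, dba, scn, seq) tuples,
    where kind is 'change' or 'bwr'."""
    recovery_set = {}
    for kind, dba, scn, seq in redo_records:
        if kind == "change":
            entry = recovery_set.setdefault(dba, {"first": (scn, seq)})
            entry["last"] = (scn, seq)
        elif kind == "bwr" and dba in recovery_set:
            # written version covers the last dirty version: no recovery needed
            if (scn, seq) >= recovery_set[dba]["last"]:
                del recovery_set[dba]
    return recovery_set

records = [("change", 10, 100, 1), ("change", 10, 120, 2),
           ("bwr", 10, 120, 2), ("change", 11, 130, 1)]
print(first_pass(records))  # only DBA 11 still needs recovery
```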

DSI408: Real Application Clusters Internals I-349

Block Written Record (BWR)

DBA information is written to the log stream.

No force of log file


Batched set of DBAs
Written version of DBA (SCN and Seq#)
Written by DBWR in lazy fashion
Recovery not needed if BWR version is greater than
the latest PI
Trims the recovery set

BWRs are logged:


When writing a block by the writing instance
When sending a block by the PI holder;
increases the likelihood of finding BWR for
excluding the block in second recovery pass

12-350


Block Written Record (BWR)


The BWR is placed in the redo buffer. It is usually not flushed to the disk immediately
with the disk writes, but deferred until the next redo buffer flush. A lost BWR, because of
the instance crash, could at most result in a few blocks needlessly being examined for the
redo application in the second pass.
Note: BWRs are logged by the owner instance that did the write and all holder instances.
Because every instance that modified the buffer logs a BWR following the writing of the
buffer, first-pass is more likely to find the BWR when any one of these instances fail.

DSI408: Real Application Clusters Internals I-350

BWR Dump

Special redo record:


Flagged no valid redo
List of block DBA, SCN
REDO RECORD - Thread:1 RBA: 0x000020.0000106a.0010
LEN: 0x04d8 VLD: 0x02
SCN: 0x0000.00103593 SUBSCN: 1 12/19/2002 07:28:00
CHANGE #1 MEDIA RECOVERY MARKER SCN:0x0000.00000000
SEQ: 0 OP:23.1
Block Written - afn: 7 rdba: 0x01c0368a(7,13962)
scn: 0x0000.00103475 seq: 0x01 flg:0x04
Block Written - afn: 7 rdba: 0x01c03689(7,13961)
scn: 0x0000.00103475 seq: 0x01 flg:0x04
Block Written - afn: 5 rdba: 0x01402611(5,9745)
scn: 0x0000.0010346d seq: 0x01 flg:0x06
...

12-351


BWR Dump
The dump in the slide is from a redo log file dump done with:
SQL> ALTER SYSTEM DUMP LOGFILE 'filename';

DSI408: Real Application Clusters Internals I-351

Recovery Set

The recovery set is organized in a table hashed by


DBA.
Each hash chain is sorted by increasing first-dirty SCN in a doubly linked list.
Specifies the order in which to acquire instance
recovery locks

12-352

Each block entry stores the first-dirty SCN that is


encountered for the block.
Updates the last-dirty version (SCN, Seq#) for
subsequent block changes.


Recovery Set
The first read of a block's change vector in the redo stream sets the first-dirty and last-dirty SCN values in the recovery set. Subsequent reads from the redo stream that occur on
the same block update the last-dirty SCN value in the recovery set.

DSI408: Real Application Clusters Internals I-352

Recovery Claim Locks

SMON sends a RecoveryClaimLock message to


the IDLM master node for each block entry in the
recovery set.
Each recovery set fusion block maps to a unique
IDLM resource.
If the master node for a resource has failed and
the IDLM remastering has not completed, then the
recovery waits.
Locks granted are used by the IDLM to:
Reconstruct the most restrictive lock that could
have been held by a failed instance
Ship the appropriate copy of the block to the
recovering instance

12-353


Recovery Claim Locks


SMON (in instance recovery) sends a RecoveryClaimLock message to the IDLM
master node for each block entry in the recovery set.
Multiple requests may be batched into one message.
Indicates to the IDLM that recovery takes ownership of the block and lock.
The IDLM response generally consists of a block and a fusion lock grant. If locks are held
in XL or SL modes, then no recovery is needed (hence no IDLM message is sent).

DSI408: Real Application Clusters Internals I-353

IDLM Response to RecoveryClaimLock Message on PCM Resource

12-354

Case 1: Lock open on recovering instance: No lock or NL0 — see next slide.

Case 2:
- Lock open on recovering instance: (X, S) Local 0
- Locks open on other instances: Don't care
- Lock granted on recovery buffer: No lock
- Recovery buffer content: No recovery buffer needed
- Recovery action: No recovery; remove entry from recovery set

Case 3:
- Lock open on recovering instance: (X, S) Global (0, 1)
- Locks open on other instances: Don't care
- Lock granted on recovery buffer: Share (X, S) Global lock; increment PI count in lock state, use zero SCN tag
- Recovery buffer content: Initiate write of current block (see note 1)
- Recovery action: No recovery; release recovery buffer, decrement PI count when block write completes

Case 4: Lock open on recovering instance: (N) Global (1, 2)
a) An (X, S) Global open on another instance:
- Lock granted on recovery buffer: Share NG lock, increment PI count
- Recovery buffer content and action: Same as Case 3, (X, S) Global
b) All (N) Globals on other instances:
- Lock granted on recovery buffer: XG1
- Recovery buffer content: Get contents from highest PI, based on SCN tags. If NG2, toss the higher PI (see note 2)
- Recovery action: Apply redo changes, write out recovery buffer when complete

IDLM Response to RecoveryClaimLock Message on PCM Resource


Note 1: Recovery buffer is used for write notification only (no content) and cannot serve a
past-image.
Note 2: Retains PI that is being written. If lock is NG1, it does not determine if PI is being
written, so it must be retained.

DSI408: Real Application Clusters Internals I-354

No Lock Held by Recovering Instance on the PCM Resource

12-355

Case 1:
- Locks open on other instances: No locks open, or all NL0
- Recovery lock: XL0
- Recovery buffer contents: Read block from disk
- Recovery process action: Apply redo changes, write out recovery buffer when complete

Case 2:
- Locks open on other instances: (X, S) Local0
- Recovery lock: No lock
- Recovery buffer contents: No recovery buffer needed
- Recovery process action: No recovery; remove block entry from recovery set

Case 3:
- Locks open on other instances: (X, S) Global (0, 1)
- Recovery lock: NG1 (with zero SCN tag because this is not a PI)
- Recovery buffer contents: Initiate write of current block; recovery buffer used for write notification only (no content)
- Recovery process action: No recovery; write completion will release recovery buffer and lock as usual

Case 4:
- Locks open on other instances: All (N) Global (1, 2)
- Recovery lock: XG0
- Recovery buffer contents: Get contents from highest PI, based on SCN tags
- Recovery process action: Apply redo changes, write out recovery buffer when complete

DSI408: Real Application Clusters Internals I-355

Recovery Claim Locks

12-356

The shipped block is copied into a recovery buffer


that is covered by the granted lock.
After locks have been acquired on all blocks in the
recovery set, a RecoveryDoneClaiming message
is sent to all DLM master nodes.
After IDLM reconfiguration, only resources that
are locked for recovery are unavailable to the
foreground lock requests.
After a buffer is allocated, an IR buffer cannot be
replaced or aged out except by another recovery
buffer request.


Recovery Claim Locks


After the IDLM completes reconfiguration, only the resources that are locked for recovery
are unavailable to foreground lock requests:
IDLM validates the PCM lock space.
Until RecoveryDoneClaiming message is received, the PCM lock database
remains frozen clusterwide.

DSI408: Real Application Clusters Internals I-356

Recovery Claim Locks

12-357

IR buffers must remain in the cache until they are


released during the second pass of redo
application.
IR locks must be held until the covered block is
fully recovered.
The recovery buffers are held in the recovering
instance's default buffer pool.
Large recovery sets may populate the recovering
instance's buffer cache with nonreusable buffers
Lock down-convert requests for recovery buffers
are serviced after the IR lock is released.


Recovery Claim Locks (continued)


IR buffers must remain in the cache until they are released individually during the second
pass of redo application. The exception to this is the spillover scenario, where the
recovering instance's buffer cache cannot hold the entire recovery set; this is described
later in this lesson.
IR locks must be held until the covered block is fully recovered; user lock operations are
not allowed on partially recovered blocks. The buffer cache ensures that IR buffers are not
reused; LEs are tied to buffers through the buffer cache.
Recovery buffers are held in approximately 50% of the recovering instance's default buffer
pool, that is, in the cold half of the LRU buffer chain.
Large recovery sets may populate the recovering instance's buffer cache with nonreusable
buffers. This impacts foreground requests for buffers that are unrelated to recovery and
degrades the overall performance of the recovering instance.
Lock down-convert requests (BASTs) for recovery buffers are deferred and serviced only
after the IR lock is released. Locked IR buffers are marked in-recovery to the cache
layer with lock holder SMON. SMON releases the lock only when recovery for the block
is complete.
DSI408: Real Application Clusters Internals I-357

Second-Pass Log Read

The redo threads of failed instances are again read


and merged by SCN.
The recovery hash table is looked up to decide if
changes are for a recovery set block.
If a recovery buffer matches its last-dirty version
in the recovery set, recovery is complete.
SMON posts DBWR to write the recovery buffer
and clear the in-recovery state of the buffer.
After write completion:
SMON recovery lock on that buffer becomes XL
PI holders on remote instances are invalidated

12-358


Second-Pass Log Read


Redo threads of failed instances are again read and merged by SCN. For each redo record
in the merged redo stream, the recovery hash table is looked up to decide if the change is
for a recovery set block.
Redo changes are applied to recovery buffers that are guaranteed to be in the cache and the
IR lock on those buffer acquired.
After applying a redo record, if the resulting recovery buffer matches its last-dirty version
(SCN and Seq#) in the recovery set, then the recovery is complete for that block.
SMON requests a write of the recovery buffer.
The block is released for normal operations.
When a recovery buffer write is requested, SMON posts DBWR to write the recovery
buffer and clear the in-recovery state of the buffer.
A recovery buffer can become current only after write completion, unlike a regular
buffer.
The cache layer can resume processing of lock down-converts (BASTs) for this
buffer after it has been made current.
After write completion, the SMON recovery lock on that buffer goes from XG0 to XL0, and
PI holders on remote instances are invalidated by the IDLM master.
Note: Dump a buffer header and identify the in-recovery state field.

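The second-pass completion test (the recovery buffer matches its last-dirty version) can be modeled over the recovery set built in the first pass. This is a hedged sketch with invented shapes; the real pass applies change vectors to actual buffers, whereas here "applying redo" is reduced to tracking the applied version:

```python
# Sketch of the second pass: look up each merged redo record in the
# recovery hash table, apply it to the recovery buffer, and release
# the block once it matches its last-dirty version.

def second_pass(merged_redo, recovery_set):
    """merged_redo: SCN-ordered (dba, scn, seq) tuples.
    Returns the recovered DBAs in completion order."""
    applied, done = {}, []
    for dba, scn, seq in merged_redo:
        if dba not in recovery_set:
            continue                      # change is not for a recovery-set block
        applied[dba] = (scn, seq)         # apply redo to the recovery buffer
        if applied[dba] == recovery_set[dba]["last"]:
            done.append(dba)              # post DBWR, clear "in-recovery"
    return done
```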
DSI408: Real Application Clusters Internals I-358

Second-Pass Log Read

Recovery locks differ from regular PCM locks only


in their response to BASTs.
There is no distinction between recovery and
regular locks at the IDLM level.

When the last recovery buffer is released,


recovered threads are checkpointed and closed.
IR is complete when all dead threads have been
checkpointed and closed.

12-359


Second-Pass Log Read (continued)


When the last recovery buffer is released, recovered threads are checkpointed and closed.
This requires a wait for write completions on the outstanding requests that were issued
during IR lock acquisition.

DSI408: Real Application Clusters Internals I-359

Large Recovery Set


and Partial IR Lock Mode

Buffers and LEs for IR are allocated from the SGA


by using existing mechanisms for allocating
recovery buffers (kcbrra).

For RAC systems that are configured for high


availability, recovery sets are small relative to the
size of the buffer cache of the recovering instance.
The largest recovery set is known at the start of IR.
At the end of the first-pass log read, SMON may
switch to Partial IR lock mode.

12-360


Large Recovery Set and Partial IR Lock Mode


The size of the buffer cache of the recovering instance places a limit on the largest
recovery set that can be completely accommodated (that is, a recovery buffer and lock
allocated for every block in the recovery set at the end of first pass and recovery lock
claim).
For RAC systems that are configured for high availability, recovery sets are normally
small relative to the size of the recovering instance's buffer cache.
Based on the buffer cache size, the largest recovery set that the recovering instance's SGA
can accommodate is known at the start of instance recovery.
For example, assume that M blocks are available in the cache of the recovering instance. If,
at the end of the first-pass log read, the recovery set is greater than M, then SMON
switches to Partial IR lock mode.

DSI408: Real Application Clusters Internals I-360

Large Recovery Set


and Partial IR Lock Mode

12-361

Submit RecoveryClaimLock messages for the


first M blocks in the recovery list.
Begin the second-pass log read and redo
application.
If redo is encountered for a block on the recovery
list, a recovery buffer is paged out and reused.
When the reused list is not empty, the recovery list
no longer represents the optimal order to acquire
recovery buffers.
When recovery and reused lists are empty, SMON
issues a RecoveryDoneClaiming message to the
DLM, allowing it to proceed with lock domain
validation.

Large Recovery Set and Partial IR Lock Mode (continued)


Note the difference between the recovery list and the recovery set. The recovery list is a
doubly linked list of recovery set entries that are sorted by increasing first-dirty SCN.
The first-dirty SCN ordering ensures that these are the first M blocks in the merged redo
stream. Remove these M blocks from the recovery list because the recovery list contains
only recovery set blocks for which a buffer and lock have not been acquired.
The PCM lock database remains frozen, because SMON cannot issue the
RecoveryDoneClaiming message. Apply redo changes to the M recovery buffers.
After the buffer is fully recovered and written to disk, issue another
RecoveryClaimLock message for the head block on the recovery list.
When a recovery buffer is reused (and lock released), its recovery set block entry is put on
a reused list. A RecoveryClaimLock request is made for the new block, which is
removed from the recovery list. When redo is encountered for a reused list block, a buffer
and lock are acquired and the block is taken off the reused list.

DSI408: Real Application Clusters Internals I-361

Large Recovery Set and Partial IR Lock Mode (continued)


When the reused list is not empty, the recovery list no longer represents the optimal order
to acquire recovery buffers. So, when a recovery buffer is released after applying the last
redo change, there is no correct choice for the next block; no lock request is made at this
time. If the reused list becomes empty again, recovery can revert to acquiring locks in
recovery list order when a recovery buffer is allocated.
When both recovery and reused lists are empty, SMON issues a
RecoveryDoneClaiming message to the DLM that allows it to proceed with lock
domain validation.
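The list manipulation described above can be sketched as a toy bookkeeping model. This is illustrative only: the function and variable names are ours, not the kernel's, and buffer-capacity accounting is simplified.

```python
def partial_ir_messages(recovery_list, m, redo_stream):
    """Toy model of Partial IR lock mode (not the kernel implementation).

    recovery_list: block ids sorted by increasing first-dirty SCN.
    m: number of recovery buffers available in the cache.
    redo_stream: second-pass redo as (block, is_last_change) pairs.
    """
    claimed = list(recovery_list[:m])     # RecoveryClaimLock sent for these
    pending = list(recovery_list[m:])     # still on the recovery list
    reused = []                           # entries whose buffer was paged out
    msgs = [("RecoveryClaimLock", b) for b in claimed]

    for block, is_last_change in redo_stream:
        if block in pending:
            # Redo for an unclaimed recovery-list block: page out a claimed
            # buffer (victim choice is unspecified here) and reuse it.
            victim = claimed.pop(0)
            reused.append(victim)
            pending.remove(block)
            claimed.append(block)
            msgs.append(("RecoveryClaimLock", block))
        elif block in reused:
            # Redo for a reused-list block: acquire a buffer and lock again.
            reused.remove(block)
            claimed.append(block)
            msgs.append(("RecoveryClaimLock", block))
        # else: block already claimed; apply redo to its recovery buffer.

        if is_last_change and block in claimed:
            # Buffer fully recovered and written to disk: release it.
            claimed.remove(block)
            if not reused and pending:
                # Recovery-list order is optimal again: claim the head block.
                head = pending.pop(0)
                claimed.append(head)
                msgs.append(("RecoveryClaimLock", head))
            # If the reused list is non-empty there is no correct next
            # choice, so no lock request is made at this time.

    if not (claimed or pending or reused):
        msgs.append(("RecoveryDoneClaiming", None))
    return msgs
```

For example, with recovery_list=[1, 2, 3], m=2, and redo that finishes each block in list order, the claims go out in recovery-list order and RecoveryDoneClaiming is issued once both lists drain.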

DSI408: Real Application Clusters Internals I-362

Lock Database Availability During Recovery

When an instance dies, the IDLM initiates lazy


remastering.
The PCM lock space remains invalid while:
IDLM master nodes discard locks that are held by
dead instances
SMON issues a RecoveryDoneClaiming message

12-363

Most PCM lock operations are frozen.


User operations that do not require interaction
with the IDLM can proceed.

Copyright 2003, Oracle. All rights reserved.

Lock Database Availability During Recovery


Lazy remastering means that a minimal subset of resources is remastered to maintain
consistency of the lock database. This occurs in parallel with the first-pass log read where
the recovery set is constructed.
The entire PCM lock space remains invalid while the IDLM and SMON complete the
following:
IDLM master nodes discard locks that are held by dead instances; the space
reclaimed by this operation is used to remaster the locks held by surviving
instances that were mastered by a dead instance.
SMON issues a RecoveryDoneClaiming message.
While the lock domain is invalid, most PCM lock operations are frozen, making the
database unavailable for users requesting a new or incompatible lock. The following lock
operations are allowed in an invalid lock domain:
Closing of a lock held by the recovering instance to use its buffer for instance recovery.
Lock operations for locally partitioned tablespaces on a surviving node, provided that
a dead instance was not the owner.
Note: User operations that do not require interaction with the IDLM can proceed (for
example, a foreground process holding an XL lock).
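The two exceptions above can be condensed into a small predicate. This is a toy sketch with hypothetical descriptor fields, not an Oracle API:

```python
def pcm_op_allowed_in_invalid_domain(op):
    """Toy predicate: which PCM lock operations may proceed while the
    lock domain is invalid (descriptor fields are hypothetical)."""
    if op["kind"] == "close" and op["holder"] == "recovering_instance":
        return True   # free the lock's buffer for instance recovery
    if (op["kind"] == "local_partition_op"
            and op["node"] == "survivor"
            and not op["owned_by_dead_instance"]):
        return True   # locally partitioned tablespace on a surviving node
    return False      # all other PCM lock operations are frozen
```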

DSI408: Real Application Clusters Internals I-363

Handling BASTs on Recovery Buffers

12-364

While a recovery buffer still requires redo to be


applied, it is flagged with an in-recovery state.
LCK permits a BAST on an in-recovery buffer to be
suspended indefinitely.
When the in-recovery flag is cleared, normal
down-convert processing is resumed.

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-364

IR of Nonfusion Blocks

12-365

During IR, lock acquisition for a nonfusion block is
treated like that for a local (XL or SL) fusion block.
If surviving instances hold S/X locks, the failed
instance could not have had the block dirty.
If there are no surviving locks, the block must be
read from disk to determine if recovery is needed.
Blocks are removed from the recovery set if the
on-disk version is more recent than the last-dirty
version.

Copyright 2003, Oracle. All rights reserved.

IR of Nonfusion Blocks
If there are no surviving locks, the block must be read from disk and compared with the
last-dirty version for the block entry to determine if recovery is necessary.
During IR lock acquisition, an X lock is acquired on the block and it is read from disk. If
the on-disk version is more recent than the last-dirty version, then the block is removed
from the recovery set.
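The decision just described can be summarized in a small helper. This is an illustrative sketch (the function name and return strings are ours); the comparison follows the "more recent than" wording above.

```python
def nonfusion_ir_action(surviving_lock_mode, on_disk_scn=None, last_dirty_scn=None):
    """Toy IR decision for a nonfusion block.

    surviving_lock_mode: 'X', 'S', or None if no surviving instance
    holds a lock on the block.
    """
    if surviving_lock_mode in ("X", "S"):
        # A survivor holds the block: the dead instance cannot have had
        # it dirty, so no recovery is needed.
        return "drop from recovery set"
    # No surviving lock: acquire an X lock, read the block from disk,
    # and compare versions.
    if on_disk_scn > last_dirty_scn:
        return "drop from recovery set"       # disk copy already current
    return "second-pass recovery"
```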

DSI408: Real Application Clusters Internals I-365

IR of Nonfusion Blocks

The IDLM response to RecoveryClaimLock messages


for nonfusion blocks is listed in the following table:
Current Lock Mode             Lock Granted   Recovering Process Action
Exclusive (X) or Share (S)    No Lock        No recovery needed; delete block
                                             entry from recovery set
Null or No Lock               X              Read block from disk; do
                                             second-pass recovery if needed

12-366

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-366

Failures During Instance Recovery

Instance recovery restart


Recovery fails without the death of the recovering
instance.

12-367

Death of the recovering instance


Death of a nonrecovering instance
I/O errors
Block corruption during redo application

Copyright 2003, Oracle. All rights reserved.

Failures During Instance Recovery


Restart is allowed only while the lock domain is invalid. After all IR locks have been
acquired and the RecoveryDoneClaiming message is issued, the lock domain is
validated. BASTs are queuing up on recovery locks, so it is not possible for SMON to
release its locks and restart the recovery. Recovery errors that occur after lock domain
validation must either fail the recovering instance or allow the recovery to complete.
If there is a surviving instance, it grabs the IR enqueue and starts the recovery. Crash
recovery is necessary if all instances are down.
Death of a Nonrecovering Instance
If the failure is during lock acquisition, it is detected during the
RecoveryDoneClaiming message that is broadcast to all IDLM masters. A change
in the lock domain, caused by instance death, is communicated to the recovering SMON
by the IDLM. SMON aborts recovery and releases the IR enqueue. The next live instance
that detects a dubious lock will reattempt the instance recovery.

DSI408: Real Application Clusters Internals I-367

Failures During Instance Recovery (continued)


I/O Errors and Block Corruption
In case of I/O errors, the file is taken offline and IR is restarted. If the I/O error is on a
system tablespace datafile, the recovering instance crashes; eventually, all instances in the
cluster crash. Media recovery is required if the I/O error is not transient.
Online block recovery attempts to clean up corrupted blocks to allow IR to proceed. If
block recovery succeeds, the block should not need further recovery (IR should find
recovery done up to the last-dirty SCN and drop it from the recovery list). If block
recovery fails, the recovering instance crashes and IR is restarted.

DSI408: Real Application Clusters Internals I-368

Memory Contingencies

Fusion recovery needs additional memory from:


SGA of the recovering instance
PGA of the recovering process (SMON for instance
recovery, foreground for crash recovery)

This memory is needed for:


The recovery set
Log buffers
Instance recovery locks

12-369

Copyright 2003, Oracle. All rights reserved.

Memory Contingencies
The recovery set (hash table and block entries) is stored in the PGA of the recovering
process. There must be enough virtual memory to construct the recovery set in PGA to
complete the first pass.
There must be at least one buffer per thread being recovered in the buffer cache for the
first- and second-pass log reads.
LEs correspond to recovery buffers. If a recovery block is not in the cache, then there is no
lock storage associated with it.

DSI408: Real Application Clusters Internals I-369

Code References

The main code routines for recovery are:


kcratr: Thread redo application
kcratr1: Pass one: construct recovery set
kcratr_claim: Claim recovery buffers
kcbrbuf: Get a recovery buffer
This is the Buffer Cache Interface.
Call tree: kclclaim, kclcfusion,
KCL_CONVERT_RECOVERY_LOCK
This is the entry into the IDLM Interface.

kcratr2: Pass two: apply change vectors


If not all buffers were claimed in kcratr_claim,
then kcratr2 calls kcratr_claim recursively.

12-370

Copyright 2003, Oracle. All rights reserved.

Code References
A more detailed list that indicates calling depth:
ktm.c
kcv.c
kct.c
kcra.c

kcrp.c
kcb.c
kcl.c

1. ktmmon - smon loop


1. kcvirv - Instance RecoVery (called by SMON, db is open)
1. kctrec - RECover threads - recover and close threads
1. kcratr - Thread Redo application
1. kcratr1 - Pass one of two pass recovery processing
2. kcratr_claim - Claim recovery buffers
1. kcrpclaim - Claim recovery buffer
1. kcrpsend_claim - send recovery buffer claim message
2. kcbrbuf - get a Recovery Buffer BUFFER CACHE INTERFACE
1. kclclaim - Claim a recovery lock
1. kclcfusion - Claim Fusion lock
1. kclcsfusion - start fusion recovery request
1. KCL_CONVERT_RECOVERY_LOCK IDLM INTERFACE
this is kjbrecoveryopen/kjbrecoverconvert.

DSI408: Real Application Clusters Internals I-370

Code References (continued)


Note, we also issue kjbrecoveryassume when we get the PI.
kcra.c
kcrfr.c
kcrp.c
kcb.c

3. kcratr2 - Pass two of two pass recovery processing
   1. kcrfrgv - get change Vector header/data
   1. kcrpap - APply change vector
   1. kcbtema - Thread recovery Exam and Maybe Apply
   1. kclrdone - Recovery is Done so clean up buffer

DSI408: Real Application Clusters Internals I-371

Summary

In this lesson, you should have learned how to:


Explain Cache Fusion recovery implementation
Examine the recovery/cache interface
Examine the recovery/DLM interface
Describe the basic Cache Fusion recovery
algorithm

12-372

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-372

[Section roadmap diagram: each node runs the SQL Layer, Buffer Cache, CGS,
GES/GCS, and Node Monitor above the Cluster Manager; Section III covers the
Platforms.]

Copyright 2003, Oracle. All rights reserved.

Linux Platform

Copyright 2003, Oracle. All rights reserved.

Objectives

After completing this lesson, you should be able to do


the following:
Outline the distinguishing features of RAC on the
Linux platform
Install, start, and stop RAC on the Linux platform
List the Linux-specific software components

13-377

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-377

Linux RAC Architecture

Hardware
Intel-based hardware
Externally shared SCSI or Fiber Channel disks
Interconnected via NIC
Software
OS versions supported:
RedHat 7.1 (9.0.1 and 9.2)
Suse 7.2 and Suse SLES7 (9.0.1 and 9.2)

Oracle-supplied CM, NM, and Watchdog (different


with each version of Oracle)

13-378

Copyright 2003, Oracle. All rights reserved.

Linux RAC Architecture


RAC on Linux requires the following:
Two or more 32-bit Intel servers, maximum 32 nodes
A separate and dedicated intracluster network among the nodes with NICs. If the
cluster has more than two nodes, then a switch or hub in the intracluster network
might be necessary.
An external shared SCSI disk array or external Fiber Channel disk array with
shared disk partitions
At present, Linux is limited to eight nodes. The limitation is in the interconnect, the
disk system, the CM, the NM, and the Watchdog. These components have different limits.
The component with the lowest limit sets the limit for a RAC system.

DSI408: Real Application Clusters Internals I-378

Storage: Raw Devices

The supportable storage for RAC is raw devices.


Raw devices are usually named /dev/raw[0-9].
Up to 255 raw devices are possible.

The tool that is used to set up and query raw


devices is raw.

To make a SCSI disk partition a raw device:

raw /dev/raw1 /dev/sda3

13-379

Oracle Cluster File System can be used.

Copyright 2003, Oracle. All rights reserved.

raw Command
Usage: raw /dev/raw<N> /dev/<blockdev>
On Redhat, it is /dev/rawctl - raw io control device (it is in /usr/sbin/raw).
On Suse, it is /dev/raw - raw io control device (it is in /usr/local/bin/raw).
In the slide example, sda3 means the third partition of the first SCSI disk.
Note: You can store the commands at /etc/rc.d/boot.local. The commands
are executed immediately after booting. Or, store the commands in a file and execute
that file from boot.local.
For example, rawsetup is a file with all the commands for configuring the raw
devices and /etc/rc.d/boot.local contains the line:
. /etc/init.d/rawsetup
After creating raw partitions, you must give correct permissions on /dev/raw*.

DSI408: Real Application Clusters Internals I-379

Extended Storage

13-380

Logical Volume Manager (LVM), only available on


SuSe
Xraw
Cluster File Systems (CFS)

Copyright 2003, Oracle. All rights reserved.

Extended Storage
LVM
The LVM hides the details about where data is stored: on what hardware as well as
where on that hardware. The management of volume groups and logical volumes can be
done while they are being used by the system. For example, you can increase the size of
a logical volume while it is being mounted; you do not have to unmount.
Cluster File Systems
Linux does not have its own cluster file system. Various third-party suppliers (like
Polyserve) supply a CFS. Oracle supplies its own CFS. This is the only supported option.

DSI408: Real Application Clusters Internals I-380

Linux Cluster Software

Extended with the Oracle-supplied Cluster


Manager (OCMS)
Kernel tuned with parameter settings:
/proc/sys/kernel/shmmax - 2147483647
/proc/sys/fs/file-max - 81920
config_watchdog_nowayout set to Y.

13-381

Copyright 2003, Oracle. All rights reserved.

Linux Cluster Software


OCMS
Unlike the Oracle Real Application Clusters versions on UNIX platforms, you do not rely
on a Linux vendor to provide the clusterware layer (the operating system-dependent
modules or their equivalents). OCMS is included with Oracle9i for Linux.
Kernel Settings
echo 2147483647 > /proc/sys/kernel/shmmax.
The config_watchdog_nowayout parameter cannot be changed dynamically. It
should be changed during installation of the OS.

DSI408: Real Application Clusters Internals I-381

OCMS

13-382

OCMS is included with Oracle for Linux.


OCMS is layered above the operating system and
provides all the clustering services that Oracle
RAC needs to function as a high-availability and
high-scalability solution.
OCMS provides cluster membership services,
global view of clusters, node monitoring, and
cluster reconfiguration.

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-382

OCMS Components

OCMS consists of:

13-383

Watchdog daemon (WDD) in Oracle9i and Oracle8i


Hangcheck module (Oracle9i Release 2)
Node monitor (NM)
Cluster Manager (CM)

The binaries are in:


$ORACLE_HOME/lin_nm/latest

Copyright 2003, Oracle. All rights reserved.

OCMS Components
Version Note
The Linux OCMS is ported from the Windows NT/2000 version.
Oracle version 9.0.x and 8.1.x architecture used an Oracle-written watchdog daemon to
monitor for system hangs, running as a process in user-space.
Oracle9i releases 9.2.0.1 and earlier use the Linux supplied softdog module to reset the
node in case of hangs.
Oracle9i release 9.2.0.2 uses a new Oracle-written, loadable kernel module, hangcheck-timer, that runs in kernel space. The NM and CM functionality is combined into the
oracm background process (no more nm.log).
The older watchdog (Oracle9i release 1 and earlier) could be starved for CPU by heavy
load and high kernel activity, causing many unnecessary node resets (false evictions).

DSI408: Real Application Clusters Internals I-383

WDD, NM, and CM Flow


(Up to version 9.2.0.1)
13-384

[Flow diagram: in user mode, the Oracle instance, the Cluster Manager (instance-level
cluster information), the Node Monitor (node-level cluster information), and the
watchdog daemon are stacked in that order; the watchdog service connects the layers,
and the watchdog daemon drives the watchdog timer in kernel mode.]

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-384

Watchdog Daemon

13-385

The watchdog daemon monitors the NM and the


CM and passes notifications to the watchdog
timer at defined intervals.
Watchdog services are documented at:
/usr/src/linux/Documentation/watchdog.txt
The WDD is replaced by the hangcheck-timer
kernel module as of Oracle release 9.2.0.2.0.

Copyright 2003, Oracle. All rights reserved.

Watchdog Daemon
The important kernel configuration parameter for the watchdog daemon is
config_watchdog_nowayout.
After you create /dev/watchdog by using mknod, you get a watchdog daemon.
That is, subsequently opening the file and then failing to write to it for longer than one
minute results in rebooting the machine.
The watchdog can stop the timer if the process managing it closes the
/dev/watchdog file, provided that the parameter
config_watchdog_nowayout is set to N. The watchdog cannot be stopped after it
has been started if config_watchdog_nowayout is set to Y. On Redhat, it is N by
default, and on SuSe it is Y by default.

DSI408: Real Application Clusters Internals I-385

Hangcheck, NM, and CM Flow


(After version 9.2.0.2)
Oracle instance

Cluster Manager (including Node Monitor)

Oracm maintains both node


status view and Oracle
instance status view.

User mode

The hangcheck-timer monitors


the kernel for hangs, and
resets the node if needed.

Kernel mode

Hangcheck-timer

13-386

Copyright 2003, Oracle. All rights reserved.

Hangcheck, NM, and CM Flow


For version 9.2.0.2 and later.
Hangcheck-timer monitors heartbeats from oracm I/O capable clients. A node reset will
occur when the following is true:
(system hang time) > (hangcheck_tick + hangcheck_margin)

DSI408: Real Application Clusters Internals I-386

Hangcheck Module

Loaded as a kernel module


Specified by the parameter KernelModuleName in
the CMCFG.ORA file

$ cd $ORACLE_HOME/oracm/admin
$ grep KernelModuleName cmcfg.ora
KernelModuleName=hangcheck-timer

13-387

Copyright 2003, Oracle. All rights reserved.

Hangcheck Module
The hangcheck module is implemented from version 9.2.0.2 and later.
This module is not required for the CM operation, but its use is highly recommended.
This module monitors the Linux kernel for long operating system hangs that could
affect the reliability of a RAC node and cause corruption of a RAC database. When such
a hang occurs, this module sends a signal to reset the node.
Node resets are triggered from within the Linux kernel, making them much less affected
by the system load.
The CM on a RAC node can be easily stopped and reconfigured, because its operation is
completely independent of the kernel module.
The features that are provided by the hangcheck-timer module closely resemble the
features found in the implementation of the CM for RAC on the Windows platform, on
which the CM on Linux was based.

DSI408: Real Application Clusters Internals I-387

Node Monitor (NM)

13-388

Maintains a consistent view of the cluster


Reports the node status to the cluster manager
Uses a heartbeat mechanism
Works with WDD and takes action depending on
the type of failure

Copyright 2003, Oracle. All rights reserved.

Node Monitor (NM)


The node monitors on all nodes send heartbeat messages to each other. Each node
maintains a database that contains the status information on other nodes. The NMs in a
cluster mark a node inactive if the node fails to send a heartbeat message within a
defined time interval.
The heartbeat message from the NM on a remote server can fail for the following
reasons:
Termination of the NM on the remote server
Network failure
Heavy load on the remote server
The NM reconfigures the cluster to terminate the isolated nodes, ensuring that the
remaining nodes in the reconfigured cluster continue to function properly.
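The heartbeat timeout check amounts to the following sketch (names and structure are ours, not the NM implementation):

```python
def inactive_nodes(last_heartbeat, now, interval):
    """Toy NM check: a node is marked inactive if it has not sent a
    heartbeat message within the defined time interval.

    last_heartbeat: {node_name: timestamp of last heartbeat received}
    """
    return [n for n, t in last_heartbeat.items() if now - t > interval]
```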

DSI408: Real Application Clusters Internals I-388

Cluster Manager

13-389

The CM maintains the process-level cluster status.


The CM accepts the registration of Oracle
instances to the cluster and provides a consistent
view of Oracle instances.
When an Oracle process that writes to the shared
disk quits abnormally, the CM on the node detects
it and requests WDD to take appropriate action.

Copyright 2003, Oracle. All rights reserved.

Cluster Manager (CM)


If /a:1 is set, and if LMON terminates abnormally, then the CM daemon on the node
detects it and requests the watchdog daemon to stop the node completely. This stops the
node from issuing physical I/O to the shared disk before CM daemons on the other
nodes report the cluster reconfiguration to the Oracle instances on the nodes. This action
prevents database corruption.

DSI408: Real Application Clusters Internals I-389

Linux Port-Specific Code

rdbms/src/generic/osds/skgxpu.c
rdbms/src/generic/osds/sskgxpu.c
libcmdll.so - rdbms/src/port/cm/dll/
Has one-to-one mapping for skgxn functionality

Cluster implementation is similar to NT


implementation.

13-390

Copyright 2003, Oracle. All rights reserved.

Linux Port-Specific Code


The operating system-dependent (OSD) modules are:
skgxp for communicating with the nodes
libskgxn for communicating with the cluster

DSI408: Real Application Clusters Internals I-390

Cluster Manager

CM source code is available at:


rdbms/src/port/cm
rdbms/src/port/nm
rdbms/src/port/wdd

Sharable object libraries:


libcmdll.so
libnmdll.so
libwddapi.so

13-391

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-391

skgxpt and skgxpu

13-392

Oracle version 9.0.1 (and earlier) has TCP/IP


implementation.
Oracle version 9.2 has UDP implementation.
TCP/IP is not supported in version 9.2 (and later
versions).
skgxpu.c, sskgxpu.c are the same as base
version, except for changes in skgxp_ipcluster.
libskgxpu.a includes skgxpu.o and sskgxpu.o.
libskgxpt.a includes skgxpt.o and sskgxpt.o.

Copyright 2003, Oracle. All rights reserved.

skgxpt and skgxpu


In release 9.0.1, libskgxpt.a and libskgxp9.a will have *skgxpt*.o objects
archived.
In release 9.2, only libskgxpu.a has *skgxpu.o objects.

DSI408: Real Application Clusters Internals I-392

Installing RAC on Linux

See WebIV note Step-By-Step Installation of RAC


on Linux 184821.1.
Linking of Oracle:
Version 9.0.1 and earlier uses nmliblist while
linking with Oracle RAC. (nmliblist contained
libcmdll.so)
Version 9.2: libcmdll.so copied to libskgxn9.so

13-393

Copyright 2003, Oracle. All rights reserved.

Installing RAC on Linux


In order for cluster detection to occur during the 9.2.0.1 database installation, you must
configure the older watchdog daemon (WDD). Without this configured, the Installer
will not install RAC and will not rcp the Oracle S/W to the other cluster nodes.
You must still set up the WDD as described in the note 184821.1 as an interim step on
your way up to 9.2.0.2 and the new hangcheck timer-based oracm. Do not run the old
WDD in production.

DSI408: Real Application Clusters Internals I-393

Installing RAC on Linux (continued)


OraCM version 9.2.0.1
Load OS softdog module (all as root). Softdog is included with AS 2.1 (no build
necessary like on RH 7.1):
$> insmod softdog soft_margin=15 nowayout=1
Start watchdog from $OH/oracm/bin.
$> ./watchdogd -g dba -d /dev/null -l 0
Edit cmcfg.ora file.
WatchdogSafetyMargin=1000
WatchdogTimerMargin=1500
Start oracm.
$> ./oracm /a:0
OraCM version 9.2.0.2
9.2.0.2 uses hangcheck-timer. Latest version is 0.4-2 for IA32 and 0.5-1 for IA64.
Download software from:
http://kernel.us.oracle.com/software/
Install correct version of hangcheck-timer based on your kernel release (uname -a):
# rpm -ivh hangcheck-timer-2.4.9-e.10-0.4.0-2.i686.rpm
Configure cmcfg.ora and ocmargs.ora according to note 222746.1. Recommended
cmcfg.ora settings:
MissCount must be set to a large value and must be greater than the sum of
hangcheck_tick + hangcheck_margin. Recommended value is 215 seconds.
Load hangcheck-timer at boot via rc.local:
/sbin/insmod hangcheck-timer hangcheck_tick=30
hangcheck_margin=180
Start up oracm as root:
export ORACLE_HOME=/u01/app/oracle/product/9.2.0
sh $ORACLE_HOME/oracm/bin/ocmstart.sh

DSI408: Real Application Clusters Internals I-394

Installing RAC on Linux (continued)


Example cmcfg.ora
HeartBeat=15000
ClusterName=Oracle Cluster Manager, version 9i
KernelModuleName=hangcheck-timer
PollInterval=1000
MissCount=215
PrivateNodeNames=heartbeat3 heartbeat4
PublicNodeNames=rcbstint3 rcbstint4
ServicePort=9998
CmDiskFile=/ocfsdisk1/quorum/quorumfile
HostName=heartbeat3
Example ocmargs.ora
oracm
norestart 1800
To verify what interface the CM traffic is using:
[rcbstint3 ~]$ netstat -a | grep 9998
udp        0      0 heartbeat3:9998         *:*
[rcbstint4 ~]$ netstat -a | grep 9998
udp        0      0 heartbeat4:9998         *:*

Check hosts file:


$ grep heart /etc/hosts
10.1.1.3    heartbeat3
10.1.1.4    heartbeat4

DSI408: Real Application Clusters Internals I-395

Running RAC on Linux

Scripts for starting and stopping the cluster:


startclu
stopclu
oracm/bin/ocmstart.sh

13-396

ps -efl | egrep 'watchdogd|oranm|oracm'

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-396

Starting CM

Starting OCMS involves the following:


WDD
Configuring NM
Starting NM
Starting CM

13-397

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-397

Starting WDD

Starting WDD:
watchdogd -g dba

13-398

Copyright 2003, Oracle. All rights reserved.

Starting WDD
WDD is used only in Oracle9i before release 9.2.0.2.
Options to the watchdog command are:
-l: If 0, then no resources are registered for monitoring. This can be used while
debugging system configuration problems.
-t <number>: default 1000 ms (range: 0 ms to 3000 ms). This is the time
interval at which the WDD checks the heartbeat messages from its clients.
The default log file is $ORACLE_HOME/oracm/log/wdd.log.

DSI408: Real Application Clusters Internals I-398

Starting NM

The cluster nodes and CmHostName are defined
in $OH/oracm/admin/nmcfg.ora.

You must check that WDD is running first.


oranm </dev/null >$OH/oracm/log/nm.out 2>&1 &

13-399

Copyright 2003, Oracle. All rights reserved.

Start Options in NM
nmcfg.ora parameters:

pollinterval: Sends heartbeat messages at this interval. Default value 1000;


range 10 ms to 180000 ms.

watchdogMarginWait: Specifies the delay between a node failure and the


commencement of Oracle RAC cluster reconfiguration. Default value 70000.

autojoin: If 1, NM joins the cluster when NM starts. If 0, it joins when
CM requests it to join. Default value 0.
Switches for oranm
/?: Prints help text
/v: Verbose mode. Prints detailed info about every activity of the NM.
/s: Prints information about NM network traffic info
/r: Shows help for NM parameters. NM does not start with this option.
/c: Prints messages sent from CM to NM

DSI408: Real Application Clusters Internals I-399

Starting CM

1. Check if WDD and NM have started.


2. Confirm that the host name in CmHostName
parameter of nmcfg.ora is in /etc/hosts.
oracm </dev/null> $OH/oracm/log/cm.out 2>&1 &

13-400

Copyright 2003, Oracle. All rights reserved.

Options for oracm


/?: help text
/a: Defines the action taken when the LMON process or any other Oracle process that
can write to the shared disk terminates abnormally. If action is 0, no action is taken. If
action is 1 (default), the CM requests the WDD to stop the node completely. Set /a to 0.
/v : Prints detailed information on every activity of CM
/d : Prints more trace information for debug

DSI408: Real Application Clusters Internals I-400

Debugging

13-401

For general debugging, use gdb.


For skgxp debugging, use IPC tracing.
sskgxp provides dump routines that can be used
for debugging.
Examine cluster code debug, log files, and out
files.

Copyright 2003, Oracle. All rights reserved.

Debugging
sskgxp_dmpsspt - port: dumps port structure.
sskgxp_dmpsspid

DSI408: Real Application Clusters Internals I-401

Summary

In this lesson, you should have learned how to:


Outline the distinguishing features of RAC on the
Linux platform
Install, start, and stop RAC on the Linux platform
List the Linux-specific software components

13-402

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-402

References

www.sistina.com/lvm
linux.oracle.com

Administrator's Guide for Oracle9i for UNIX


Sys-admin: Scott Forten
Cluster-related: Takiba

13-403

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-403

HP-UX Platform

Copyright 2003, Oracle. All rights reserved.

Objectives

After completing this lesson, you should be able to do


the following:
Outline the distinguishing features of RAC on the
HP-UX platform
Install, start, and stop RAC on the HP-UX platform
List the HP-UXspecific software components

14-405

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-405

HP-UX RAC Architecture

Clusters are called Multi Computer (MC).


Interconnects can be:
LAN, normal Ethernet architecture and protocols
HyperFabric, a proprietary protocol Cluster
Interconnect (CLIC)
Copper-based, fiber-based, or mixed
Direct node-to-node or via switch (hub)

14-406

Depending on the choice of interconnect, up to


eight nodes can be clustered together.

Copyright 2003, Oracle. All rights reserved.

HP-UX Architecture
For more information on HP-UX hardware variations, refer to
http://docs.hp.com/hpux/onlinedocs/B6257-90031/B625790031_top.html.

DSI408: Real Application Clusters Internals I-406

HP-UX Cluster Software

HP cluster services are required for:


MC/Service Guard (MCSG), RAC Edition
Nmapi2: implementation of the SKGXN interface

14-407

Shared volume group services

Copyright 2003, Oracle. All rights reserved.

HP-UX Cluster Software


nmapi2 is HP's implementation of the SKGXN interface, which is located in
/opt/nmapi2/lib//libnmapi2.sl.
The Shared Volume Group Service provides the shared volume group services (for raw
devices) along with the volume group services.

DSI408: Real Application Clusters Internals I-407

HP-UX Port-Specific Code

The following three SKGXP implementations are


present:
TCP: Not recommended or tested
UDP: Most commonly used
lowfat: Provided by HP, used with CLIC

14-408

Copyright 2003, Oracle. All rights reserved.

HP-UX Port-Specific Code


The lowfat SKGXP implementation is the default, in Oracle release 9.2 and later
versions, if the CLIC interface and software are present. Otherwise, the UDP version is used.
The lowfat SKGXP implementation requires a relink because it is supplied to the
customer by HP.

DSI408: Real Application Clusters Internals I-408

SKGXP (UDP Implementation)

14-409

SKGXP provides a failover mechanism from the


primary network to a secondary network.
The primary network is always a CLIC interface. It
is NULL if CLIC is not present.
The secondary interface is the interface that is
bound to the host name (uses gethostbyname).

Copyright 2003, Oracle. All rights reserved.

DSI408: Real Application Clusters Internals I-409

SKGXP: Lowfat

HP provides the software directly to customers:


Proprietary protocol by HP
Failover within CLIC interfaces
No failover from CLIC-to-LAN interfaces

14-410

Copyright 2003, Oracle. All rights reserved.

SKGXP: Lowfat
The HP Cluster Interconnect (CLIC) protocol is proprietary and is part of the HyperFabric
cluster system.

DSI408: Real Application Clusters Internals I-410

Installing RAC on HP-UX

See WebIV note Step-By-Step Installation of RAC on


HP-UX 182177.1.

14-411

Copyright 2003, Oracle. All rights reserved.

Installing RAC on HP-UX


The OS-specific steps are:
Configuring the cluster hardware, including OS patches
Installing and configuring disk arrays
Installing and configuring Cluster Interconnect and Public Network Hardware
Creating a cluster
- Modifying the /etc/lvmrc file
- Creating a Shared Logical Volume
- Installing the cluster software
- Forming a one-node cluster, performing basic cluster administration
Finally, install Oracle RAC software.
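The shared-logical-volume step above can be sketched as follows. The volume group name and disk path are hypothetical, and the exact SLVM flags should be verified against the ServiceGuard documentation; the RUN=echo hook allows a dry run before touching real disks.

```shell
# Hypothetical sketch: create a volume group, mark it cluster-aware, and
# activate it in shared mode (HP-UX SLVM). Set RUN=echo for a dry run.
make_shared_vg() {
  vg=$1 disk=$2
  ${RUN:-} pvcreate -f "$disk"      # prepare the physical volume
  ${RUN:-} vgcreate "$vg" "$disk"   # create the volume group
  ${RUN:-} vgchange -c y "$vg"      # mark the VG cluster-aware
  ${RUN:-} vgchange -a s "$vg"      # shared-mode activation (run on each node)
}
```

Example dry run: `RUN=echo make_shared_vg /dev/vg_rac /dev/dsk/c0t1d0`.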


Running RAC on HP-UX

Cluster commands:
cmhaltcl: Stop the cluster.
cmrunnode: Join the node to the cluster.
cmhaltnode: Remove the node from the cluster.
cmviewcl: View the status of the cluster.
cmruncl: Bring up the cluster.
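The commands above can be combined into a small helper for bouncing a single node. The wrapper itself is hypothetical (not part of ServiceGuard), and the RUN=echo hook gives a dry run before acting on a live cluster.

```shell
# Hypothetical helper around the ServiceGuard commands listed above.
# Set RUN=echo for a dry run; leave RUN unset to execute for real.
bounce_node() {
  node=${1:?usage: bounce_node <nodename>}
  ${RUN:-} cmhaltnode -f "$node"   # remove the node from the cluster
  ${RUN:-} cmrunnode "$node"       # join it back
  ${RUN:-} cmviewcl -v             # confirm cluster status
}
```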


Debugging on HP-UX


Do not pick services from NIS.
Ensure that the lock PV is accessible from all the nodes.
Check the required permissions in cmclnodelist.
Check the cluster services.
Ensure that SKGXP is using the same network on all nodes.



Summary

In this lesson, you should have learned about the platform-specific details of RAC on HP-UX.


Tru64 Platform


Objectives

After completing this lesson, you should be able to do the following:
Outline the distinguishing features of RAC on the Tru64 platform
Install, start, and stop RAC on the Tru64 platform
List the Tru64-specific software components


Tru64 RAC Architecture


Memory Channel Interconnect


Native Cluster File System



Shared Disk Systems

LSMs are easy to create and manage (like volume groups and logical volumes).
Distributed Raw Devices (DRD)
Cluster File System (CFS) is layered on top of AdvFS.
Client/server mode: a node can be a client, as well as a server, for different file systems of the cluster.
Node failure is handled by the Device Request Dispatcher, which routes to a different controller for shared disks.


Shared Disk Systems


DRD usage is being replaced by CFS, due to the ease of use of CFS.
Logical Storage Manager (LSM) disks or partitions have an initial offset of 64 KB.


Tru64 Cluster Software

Native cluster file system


Raw devices
Connection Manager
Cluster Application Availability (CAA)
Resource monitoring
Application Restart Capability

Cluster alias
Distributed Lock Manager (DLM)
Expanded process IDs


Tru64 Cluster Software


The native cluster file system includes the /usr and /var file systems.
Both CFS and raw devices can be used.
The cluster alias allows TCP/UDP applications to address the cluster as a single system.
Expanded process IDs have 32-bit values and are unique across a cluster. Each cluster
has a block of numbers that it assigns as PIDs.


Tru64 Port-Specific Code

Node monitor: SKGXN
Interprocess communication: SKGXP
Other platform-specific code in #ifdef A_OSF blocks


Tru64 Port-Specific Code


SKGXN uses the Tru64 Cluster Manager; it is based on the reference implementation, as on Solaris, but instead of using the Oracle CM code for cluster membership, calls are made to the Tru64 DLM and cluster API. The files are archived in the libskgxn9[8].a library.
SKGXP interfaces to the Memory Channel Interconnect.


Node Monitor: SKGXN

The libskgxn9.a library contains the modules skgxn.o and skgxnr.o.
The skgxn0.h source module contains Tru64-specific comments.


Node Monitor: SKGXN


Programs compiled on an earlier version of the operating system work on later versions, even if the libraries have changed.
If the libskgxn9.a library contains skgxns.o, then the RAC option was not installed properly.
The clu_get_info calls get the information about the nodes in the cluster. The code includes cluster_defs.h and the /usr/shlib/libclu.so library from Tru64, version 5.1 and later. The link command line should include -ldlm -lssn -lclu.
The Tru64 DLM library is used for actions such as creating or joining a global
namespace, or finding the condition of a node. Typical calls are: dlm_nsjoin,
dlm_nsleave, dlm_lock, dlm_unlock, dlm_notify, dlm_cvt, and
dlm_get_rsbinfo.
The library used before Oracle9i is libskgxn8.a.


IPC: SKGXP

Tru64 supports two types of IPC:
Low-latency Reliable Data Gram (RDG) implementation in skgxpm. This is the default.
UDP implementation in skgxpu
The TCP implementation (skgxpt) is not supported on Tru64.


IPC: SKGXP
The cluster_interconnects initialization parameter defines which interface is
used.
When set to an IP address, the parameter uses that address and thus disables
processing in the sskgxp module.
When unset, the parameter uses the first available ics0 or mc0 interface (in that
order). ics0 is the name of the memory channel for Tru64, version 5.1 and later.
cluster_interconnects is ignored if the default RDG implementation is used.
The UDP SKGXP code is stored in libskgxpu.a (containing the modules skgxpu.o and sskgxpu.o), which is copied over to libskgxp9.a if the UDP implementation is selected.
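The copy-over described above can be demonstrated in a scratch directory; in real use the archives live in $ORACLE_HOME/lib, a backup of the default is prudent, and the oracle binary must be relinked afterward.

```shell
# Demonstrate the libskgxpu.a -> libskgxp9.a copy-over in a scratch dir.
# Real location: $ORACLE_HOME/lib; a relink of the oracle binary follows.
lib=$(mktemp -d)
echo rdg > "$lib/libskgxp9.a"                   # stand-in for the default archive
echo udp > "$lib/libskgxpu.a"                   # stand-in for the UDP archive
cp "$lib/libskgxp9.a" "$lib/libskgxp9.a.orig"   # keep the RDG default
cp "$lib/libskgxpu.a" "$lib/libskgxp9.a"        # select the UDP IPC
```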


SKGXPM: RDG

Is part of the OS kernel: rdg.mod
Was developed jointly by Compaq and Oracle
Is on memory channel only
Has the same functionality as skgxpu
Is adjustable through subsystem kernel parameters


SKGXPM: RDG
The RDG IPC is one of the most widely tested and proven IPC versions. It is used in the
SAP benchmark for Oracle, release 9.2.
The RDG IPC uses the rdg* kernel calls to create or initialize endpoints. Typical calls
are RdgInit, RdgNodeLookup, RdgEpCreate, RdgEpDestroy,
RdgShutdown, RdgIoCancel, and RdgEpLookup.
The RDG IPC uses the cfg_subsys_query call to find the RDG subsystem information. Link commands should include -lrdg -lcfg.
The RDG subsystem kernel parameters must be set as follows:
max_objs = 5120
msg_size = 32768
max_async_req = 512
rdg_max_auto_msg_wires = 0
rdg_auto_msg_wires = 0
Use sysconfig -q rdg to verify these values (RDG version: RDG V39.24b_BL17_BCGM623Z3).
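A small parser can compare the sysconfig -q rdg output against the required values. The sketch below reads the report from stdin, so it can be fed either live output or a captured file; it assumes lines of the form "name = value", which should be checked against the actual sysconfig output format.

```shell
# Check the RDG subsystem values (read from stdin) against the settings above.
# On a cluster node: sysconfig -q rdg | check_rdg
check_rdg() {
  awk '
    BEGIN {
      want["max_objs"] = 5120; want["msg_size"] = 32768
      want["max_async_req"] = 512
      want["rdg_max_auto_msg_wires"] = 0; want["rdg_auto_msg_wires"] = 0
      bad = 0
    }
    # assumes "name = value" lines; flag any mismatch
    $1 in want && $3 != want[$1] { print $1 " = " $3 " (want " want[$1] ")"; bad = 1 }
    END { exit bad }
  '
}
```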


SKGXPM : RDG (continued)


Setting the environment variable SKGXP_TRACE to 1 enables tracing but can yield far too much data, up to gigabytes in size.
The libskgxpm.a library contains skgxpm.o. It is copied over to libskgxp9.a.
The skgxpm0.h module contains some comments.
The /usr/ccs/lib/librdg.a library is the wrapper for kernel calls from
/usr/opt/TruCluster/sys/rdg.mod.
The /usr/lib/libcfg.a library has the configuration API to query subsystems.


Installing RAC on Tru64

See WebIV note Step-By-Step Installation of RAC on HP/Compaq Tru64 (175480.1).


Installing RAC on Tru64


The OS-specific installation involves:
Checking the hardware
Configuring the cluster, including the shared mounts
Shared mounts are the clusterwide file systems. One more disk is needed for the cluster
quorum disk; this cannot be used for any other purpose.


Debugging on Tru64

Set the SKGXNTRCFLG OS environment variable to TRUE to enable tracing in the SKGXN layer.
The normal SKGXN_TRACE[0-3] routines (skgxn_qry_group, skgxn_print_bitmap, and so on) are available.
Compile with the options -DDEBUG and -DSKGXN_DEBUG for more tracing.


Debugging on Tru64
The value TRUE for SKGXNTRCFLG must be uppercase.
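Setting the flag from the instance owner's shell before startup might look like this; the uppercase value is the only accepted spelling.

```shell
# Enable SKGXN-layer tracing for processes started from this shell.
# The value must be the uppercase string TRUE.
SKGXNTRCFLG=TRUE
export SKGXNTRCFLG
echo "SKGXNTRCFLG=$SKGXNTRCFLG"
```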


Useful Tru64 Commands

ladebug
cfsstat
volprint


Useful Tru64 Commands


Debug
ladebug (similar to dbx but more stable and advanced; also has a good GUI interface): Prints out the line numbers of files when attached to a process
ladebug $ORACLE_HOME/bin/oracle -pid xxxxx
dis <objectfile>: To disassemble code
dis kcl.o
odump -Dl / ldd: For information about shared libraries linked with the executable, section headers, and so on
/usr/local/bin/trace: To trace and log the executable
/usr/local/bin/truss: Same as /usr/local/bin/trace but with better trace output


Useful Tru64 Commands (continued)


CPU and System Information
psrinfo: To retrieve information about the processors and their state
psradm: To take processors offline or bring them online
sizer -v: To find the OS release level
setld -i: To find all the software patches and packages
sysconfig -q <item>: To retrieve information about a subsystem. Items can be, for example, rdg, proc, or ipc.
sysconfig -q ipc
hwmgr: To view or modify any hardware
Cluster Commands
clu_get_info: To view information about the cluster state, cluster ID, and IPC name
cnxshow (older form, from Tru64 4.0): To view information about the cluster state
sysman: For a graphical picture of the cluster
Disk Commands: CFS Related
cfsstat: To view statistics of the CFS subsystem and the internode communication system (ICS) subsystem
cfsmgr: To view the status of CFS or find out who is the server; to get statistics for a particular file system, use cfsmgr -a statistics.
showfdmn -k <domainname>: To view space left in CFS (more accurate than du -k)
mkcdsl: To make a context-dependent symbolic link (available only in clusters and AdvFS)
mkcdsl /usr/testfile
Disk Commands: Others
volprint: To print information about LVs, including their size in 512-byte blocks
showfile: To display the attributes of AdvFS directories and files
advfsstat: To display statistics of AdvFS
asemgr: To maintain DRDs (no command-line interface)


Summary

In this lesson, you should have learned about the platform-specific details of RAC on Tru64.


AIX Platform


Objectives

After completing this lesson, you should be able to do the following:
Outline the distinguishing RAC features on the AIX platform
List the AIX-specific software components


AIX RAC Architecture

The following two cluster configurations are available:
SP clusters
High Availability Cluster Multi-Processing (HACMP) clusters
There is a RAC-supported cluster file system (GPFS) available from IBM.


AIX RAC Architecture


The following two cluster configurations are available:
Shared Nothing cluster
Shared Disk cluster
The Shared Nothing cluster, which is called SP, uses Parallel System Support Programs (PSSP) as the Cluster Manager. Because the Oracle server needs a shared disk, it uses the Virtual Shared Disk (VSD) software to make the disks shared.
The Shared Disk cluster uses HACMP (High Availability Cluster Multi-Processing) as the Cluster Manager.
The SP-series computers are called the P or X series in some versions.


AIX SP Clusters

Highly scalable: up to 128 nodes


High/wide/thin node configurations
Cluster software: PSSP or HACMP
If both PSSP and HACMP are present on the same
machine, Oracle by default uses PSSP. If
PGSD_SUBSYS=grpsvcs, then HACMP is selected.

IPC traffic: High Performance Switch (HPS)


Raw devices: Virtual Shared Disk (VSD), Hashed
Shared Disk (HSD)


AIX SP Clusters
The term Parallel System Support Programs (PSSP) is also used for SP clusters.


AIX HACMP Clusters

Scalability is limited due to Concurrent Logical Volume (CLV) (<= eight nodes).
Nodes: RS/6000 machines
Cluster software: HACMP
IPC traffic: HPS, Ethernet, FDDI
Hard disks must be physically connected to each node.
Raw device: CLV



AIX Cluster Software

AIX allows the operating system kernel to be extended.
The Oracle server makes use of the kext version of the Post/Wait (PW) service:
Provides the facility for generic and IPC PW
Has placeholders for extending the PW service to I/O events and miscellaneous events
Uses the loadext.c facility for loading, unloading, or status check (status_chk) of kernel extensions


AIX Cluster Layer

Commands to check subsystems on AIX:
Group Services:
On PSSP: hags
On HACMP: grpsvcs
Event Management (EM) Services (on HACMP only): emsvcs

SRC commands:

startsrc -s <sname>
stopsrc -s <sname>
lssrc -ls <sname>
lssrc -a
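The subsystem name to pass to the SRC commands follows the rule stated earlier for SP clusters: grpsvcs when PGSD_SUBSYS=grpsvcs (HACMP), otherwise hags (PSSP). The helper below is a hypothetical convenience, not part of AIX.

```shell
# Pick the Group Services subsystem name for lssrc/startsrc/stopsrc.
gs_subsys() {
  if [ "${PGSD_SUBSYS:-}" = grpsvcs ]; then
    echo grpsvcs   # HACMP
  else
    echo hags      # PSSP (the default when both stacks are present)
  fi
}
# On a cluster node: lssrc -ls "$(gs_subsys)"
echo "group services subsystem: $(gs_subsys)"
```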


AIX Port-Specific Code

Object and archive files:
libskgxnr.a contains the NM code.
libha_gs64_r.a has to be linked to the Oracle server.
The implementation uses pthread condition variables to synchronize between the threads.


AIX Port-Specific Code


Before Oracle release 9.2, libskgxnr.a was two object files: skgxn(r).o and sskgxn.o.
libha_gs64_r.a is IBM's client library for Group Services.
On AIX, NM (skgxn) is implemented by Oracle using the grpsvcs API, GSAPI (IBM's group services that run as part of the cluster).
In versions before Oracle release 9.0.1, the files skgxn.o, skgxnr.o, and sskgxn.o were located in $ORACLE_HOME/rdbms/bin. Starting with Oracle release 9.2, these files are archived in libskgxnr.a. When linking, you must use IBM's libha_gs64_r.a to make the group service functions available.
Because the basic services come from the cluster or the OS, you need not perform a preinstallation.


RAC on AIX Stack


[Slide diagram: Node 1 through node n each run an instance (LMON, LCK0) layered over SKGP, SKGXP, SKGXN, and SKGFR; beneath them sit the cluster layer/CM (HAGS Group Services, EM services), the operating system (AIO, VSD/CLV, KEXT, NET), the interconnects Net 1 and Net 2, and the clusterwide disks.]


RAC on AIX Stack


The Oracle instance has the LMON process as the primary communication process. Other
Oracle processes also contain the SKG-routines. For easier understanding, only LMON is
shown in the slide.
EM and HAGS are cluster layer components that implement the vendor-supplied CM.
The Advanced I/O (AIO) component handles the Virtual Shared Disk (VSD) or Concurrent Logical Volume (CLV) storage. This connects the cluster layer to the clusterwide shared disks (connection not shown).
KEXT is the kernel extension.


Node Monitor (NM)

The NM uses the AIX Group Services API (GSAPI).
GSAPI is supported on both HACMP and PSSP platforms.
Logical flow:
The primary member initializes and joins the group, monitors slaves joining the group, and checks the status of the slaves.
Slaves join the group.


Node Monitor (NM)


The flow of the Node Monitor is the same as for other platforms. The list on the
following Notes pages shows the IBM AIX calls that are used in the AIX Group
Services.


Node Monitor (NM) (continued)


NM Flow Logic
Primary Member Primary Thread Logic:
Initializes the connection with Group Services (ha_gs_init) and spawns the GS thread, which waits for responses on the GS socket
Joins the public group, in which it is the sole provider (ha_gs_join)
Publishes its public data (ha_gs_change_state_value)
Subscribes to the RVSD group, if PSSP (ha_gs_subscribe)
Joins the process group, which is joined by ALL primary members mounting the same database (ha_gs_join)
Publishes its private data (ha_gs_send_message)
Spawns the Primary Accept Thread, which is used by slave members to detect a primary member's death
Monitors membership changes (skgxnpstat)
Primary Member GS Thread Logic:
Loops on a select() on the GS socket
Calls ha_gs_dispatch, which calls one of the global callback functions based on the response:
- sskgxn_gs_delayed_error_cb: To process asynchronous error notification
- sskgxn_gs_subscription_cb: To process changes in the subscribed group
- sskgxn_gs_approved_cb: To process any proposal that has been approved in the process group
- sskgxn_gs_announcement_cb
These callbacks in turn call a local callback function based on the current state of SKGXN.
Primary Member Accept Thread Logic:
Creates a UNIX domain socket and loops indefinitely:
Waits on accept()
If a new slave connects, handshakes member information, then adds the connection to the array of slave connections
Checks all slave connections to see if they are alive; if any slave has died, removes its connection from the array
Returns to wait in accept()
Slave Member Logic (Main and Read Threads):
Connects to the primary member socket and handshakes member information
Subscribes to the primary member's process group (ha_gs_subscribe)
Spawns the Slave Read Thread, which blocks on read() on the socket
If read() returns and the error is not EINTR, then exits with an error


Installing RAC on AIX

For information on installing RAC on AIX:
Refer to the WebIV note Step-By-Step Installation of RAC on AIX (199457.1)
Refer to the RAC-Pack public folder:
http://files.oraclecorp.com/content/AllPublic/Workspaces/RAC%20Pack-Public/Technical%20Papers/CookBook%20AIX%20V2_2.pdf


Installing RAC on AIX

Identifying the domain to Group Services:
If PSSP, set:
HA_SYSPAR_NAME=`/usr/lpp/ssp/bin/spget_syspar -n`
If HACMP, set:
HA_DOMAIN_NAME=`/usr/sbin/cluster/utilities/cldomain`
PGSD_SUBSYS=grpsvcs
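The two variants can be wrapped in one helper that emits the assignments for the requested cluster type. The wrapper itself is hypothetical; the paths and variable names are the ones shown above.

```shell
# Emit the Group Services identity settings for a given cluster type.
# On a cluster node: eval "$(gs_env pssp)"  or  eval "$(gs_env hacmp)"
gs_env() {
  case $1 in
    pssp)
      echo 'HA_SYSPAR_NAME=`/usr/lpp/ssp/bin/spget_syspar -n`; export HA_SYSPAR_NAME' ;;
    hacmp)
      echo 'HA_DOMAIN_NAME=`/usr/sbin/cluster/utilities/cldomain`; export HA_DOMAIN_NAME'
      echo 'PGSD_SUBSYS=grpsvcs; export PGSD_SUBSYS' ;;
    *)
      echo "usage: gs_env pssp|hacmp" >&2; return 1 ;;
  esac
}
```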


Debugging on AIX

More named dump routines are available.
Tracing can be done by:
Setting SKGXNTRCFLGS to any nonzero number
Turning skgxn tracing on
Trace macros available:
sskgxn_log(ctx, "", )
sskgxn_trace(ctx, mask, "", )
where ctx is the skgxn context pointer, mask=1, and sskgxn tracing is turned on


Debugging on AIX
There are many more dump routines in addition to the standard X$TRACE/KST and
DIAG. Refer to the source code for a list.


Summary

In this lesson, you should have learned about the platform-specific details of RAC on AIX.


References

Product availability, support, and required IBM patches (incomplete) Web site:
http://ibmhome.us.oracle.com/
Oracle functionality specific to AIX:
http://mercury.us.oracle.com/merc/owa/MCGUI.display_object?objectID=3531
External documentation and notes on AIX, SP, PSSP, and HACMP:
http://www.rs6000.ibm.com/resource/aix_resource/Pubs/


References
Contacts
Oracle on AIX-related: Vijay.Sridharan@oracle.com
System-related issues: File ES1 Ticket
SP Sys Admin: David.Ong@oracle.com
HACMP Sys Admin: John.Tomicich@oracle.com
IBM-specific queries: Dennis Massanari: massanar@us.ibm.com


Other Platforms


Objectives

After completing this lesson, you should be able to do the following:
Outline the distinguishing features of RAC on the Windows, Solaris, and OpenVMS platforms
List the specific software components for these platforms


Objectives
The platforms covered in this lesson are:
Windows
Solaris
OpenVMS


RAC Architecture: Solaris

Cluster is limited to a maximum of four nodes.


RAC Architecture: Solaris


Solaris has a clusterwide file system (GFS). This is not supported by RAC.


RAC Architecture: Windows

Cluster is limited to a maximum of four nodes.

17-451

Copyright 2003, Oracle. All rights reserved.

RAC Architecture: Windows


Microsoft Cluster Server (MSCS) is not required for a RAC implementation. It is required for RAC Guard or RAC high availability.


RAC Architecture: OpenVMS

Native clusterwide file system
Disk cluster configurations can be:
Hardware-shared
Host-shared


Port-Specific Code

The VMS IPC uses the TCP reference implementation.


Installing RAC

For information about installing RAC, refer to:
WebIV note Step-By-Step Installation of RAC on Solaris (175465.1)
WebIV note Step-By-Step Installation of RAC on Windows NT or 2000 (178882.1)
WebIV note Step-By-Step Installation of RAC on OpenVMS (180012.1)


Summary

In this lesson, you should have learned how to:
Outline the distinguishing features of RAC on the Windows, Solaris, and OpenVMS platforms
List the specific software components for these platforms


[Section divider diagram: the RAC software stack on each instance (SQL layer, buffer cache, CGS, GES/GCS, node monitor) over the cluster manager, with Section IV, Debug, highlighted.]


V$ and X$ Views and Events


Objectives

This lesson provides a reference of useful dictionary views and tables.


V$ and GV$ Views

V$ views are instance-specific.
GV$ views retrieve the V$ content from all instance members by using the Parallel Query subsystem.
PARALLEL_MAX_SERVERS must be large enough on all instances.


V$ and GV$ Views


Note: For the purpose of brevity, all views are shown as V$<name>, and it is assumed
that there is also a corresponding GV$<name> (except where noted otherwise).


List of Views
See documentation for column descriptions.
V$ACTIVE_INSTANCES
V$BH
V$CACHE
V$CACHE_LOCK/_TRANSFER
V$CR_BLOCK_SERVER
V$ENQUEUE_LOCK/_STAT
V$FALSE_PING
V$FILE_CACHE_TRANSFER
V$GC_ELEMENT
V$GC_ELEMENTS_WITH_COLLISIONS
V$GCSHVMASTER_INFO
V$GCSPFMASTER_INFO

V$GES_BLOCKING_ENQUEUE
V$GES_CONVERT_LOCAL
V$GES_CONVERT_REMOTE
V$GES_ENQUEUE/_RESOURCE
V$HVMASTER_INFO
V$INSTANCE
V$LIBRARYCACHE
V$LOCK
V$LOCK_ELEMENT/_ACTIVITY
(V$PQ_SESSTAT, V$PX_*)
V$RESOURCE_LIMIT
V$ROWCACHE_PARENT


List of Views
The slide lists the views that are documented in the manuals. Views marked with are
created with the script CATCLUST.SQL. The V$GES_* views are synonyms for
V$DLM_* views and are also created with the script CATCLUST.SQL. Other internal
views are listed in V$FIXED_TABLE and expanded in X$KQFVI/X$KQFVT. Additional
views are:
V$DLM_ALL_LOCKS: Shows every DLM lock in the instance (PCM or not)
V$DLM_CONVERT_LOCAL: See V$GES_CONVERT_LOCAL
V$DLM_CONVERT_REMOTE: See V$GES_CONVERT_REMOTE
V$DLM_LOCKS: Blocked or blocking locks; a subset of V$DLM_ALL_LOCKS
V$DLM_MISC
V$DLM_RESS: See V$GES_RESOURCE
V$DLM_TRAFFIC_CONTROLLER
V$PING
V$FILE_PING
V$TEMP_PING

For columns and meanings, use WebIV folder Server.Internals.General.V$Views.



Old and New Views

Old View -> New View (bigger/better)
V$LOCK_ELEMENT -> V$GC_ELEMENT
V$DLM_CONVERT_LOCAL -> V$GES_CONVERT_LOCAL
V$DLM_CONVERT_REMOTE -> V$GES_CONVERT_REMOTE


Old and New Views


The naming changes from DLM to GRD, non-PCM to GES, and PCM to GCS are
partially reflected in the newer views. The newer views have the proper newer names
and may also have more columns. The older views remain available for backward
compatibility.


V$ Views for Lock Information

V$DLM_ALL_LOCKS: All locks in the DLM


V$DLM_CONVERT_LOCAL: Statistics on local lock
conversions
V$DLM_CONVERT_REMOTE: Statistics on remote
lock conversions
V$DLM_LOCKS: All blocking or blocked locks
V$DLM_MISC: DLM statistics
V$DLM_RESS: All DLM resources
V$RESOURCE_LIMIT: SGA resources


V$ Views for Lock Information


V$DLM_LOCKS is useful in diagnosing RAC hangs because the output is similar to that dumped by lkdebug -O.
V$DLM_RESS has one record for every DLM resource.
V$RESOURCE_LIMIT is useful for determining whether DLM LM_% resources have
been set correctly, by looking at INITIAL, CURRENT, and MAXIMUM.
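A check along those lines could be pasted into SQL*Plus. The helper below just prints the statement; the LIKE filter is an assumption based on the LM_% resource names mentioned above and should be adjusted to the names on your release.

```shell
# Print a V$RESOURCE_LIMIT check ready to paste into SQL*Plus.
# The LIKE 'LM%' filter is an assumption; adjust to your release's names.
resource_limit_sql() {
  cat <<'SQL'
SELECT resource_name, initial_allocation, current_utilization, max_utilization
  FROM v$resource_limit
 WHERE UPPER(resource_name) LIKE 'LM%';
SQL
}
resource_limit_sql
```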


X$ Tables

x$bh
x$kccfe
x$kcfio
x$kclcrst
x$kglst
x$kjbr
x$kjdrhv
x$kjdrpcmhv
x$kjdrpcmpf
x$kjicvt
x$kjirft

x$kqrfp
x$ksimsi
x$ksqeq
x$ksqrs
x$ksqst
x$ksurlmt
x$ksuse
x$ksuxsinst
x$kvit
x$le
x$quiesce


X$ Tables
The X$ tables listed in the slide are the ones used by the V$ views on the previous slide.
WebIV note 208093.1 shows a good relation between V$ views and X$ tables.
WebIV note 22241.1 gives a reasonably complete listing of X$ tables.
An additional useful RAC X$ table is x$kjbrfx.


Events

10704, level 10: ksq (Kernel Service enQueue)
10706, level 10: ksi (Kernel Service Instance Locks)
10254, level 1: Trace Cross-Instance Calls


Events
Triggering events for the DLM:
29700: Enable lock convert statistics
29712-29713: Lock open, convert, cancel, close operations
29714: DLM state object
29715: Reconfiguration
29716: Post wait and AST
29717: GRD or DLM freeze/unfreeze
29718: CGS or DLM CM interface
29720: GES or DLM SCN service
29722: GES or DLM process death


KST and X$TRACE


Objectives

After completing this lesson, you should be able to do the following:
Explain how KST gathers information for X$TRACE
Explain the DIAG architecture



KST: X$TRACE

KST is kernel service tracing.


KST: X$TRACE
Background
The Kernel Service Tracing (KST) facility was an existing component in the VOS layer
that was used by a few components for limited tracing. In Oracle9i, this mechanism has
been reworked to provide simpler yet more powerful interfaces for recording the execution
history of interesting components. This reworked mechanism also provides extensible
interfaces that allow the clients to customize instrumentation to satisfy their tracing needs.
KST output can be examined in the X$TRACE table.


KST Concepts

Kernel service tracing focuses on the execution history of a component.
The SGA provides an in-memory circular buffer to each process:
Each buffer is associated with a unique ID matching the Oracle PID.
Time-ordering of traces is guaranteed.
Trace buffers are released upon process exit.
Trace buffers with the same Oracle PID are reused.


KST Concepts
The KST facility provides a mechanism to log the execution history of a component with
minimum performance impact. This is done by providing an in-memory trace buffer to
each Oracle process, because tracing with an in-memory buffer has less performance
impact than logging traces on disk.
Each Oracle process (whether foreground or background) is assigned its own trace buffer
that is allocated from the SGA. The buffer is accessible by other Oracle processes if any
process dies unexpectedly, increasing the availability of trace information for later
diagnosis.
Circular buffers are used to minimize the memory usage for tracing purposes by removing
stale data. However, users must specify a large enough buffer so that wrapping does not
cause data loss. Note that the faster a process generates tracing data, the larger the buffer
size that must be specified.


KST Concepts (continued)


When a process is created (ksucrp) during instance startup time, a trace buffer is
assigned to this process. Each trace buffer is associated with a unique ID that matches
the Oracle process ID and is never shared among processes. The unique ID guarantees
the trace isolation among processes and the time ordering of tracing within a process.
When a process exits, its trace buffer is still kept in SGA, retaining trace information in
case it is needed for diagnosis of any problem that may occur later. The retention also
reduces the overhead of repeated memory allocation and deallocation of trace buffers, if
processes are created and exited frequently.
Any unused trace buffer is reassigned to a new process whose Oracle PID matches the
assigned buffer ID. If no such buffer exists, it is allocated from SGA.


KST Concepts

Multilevel, event-based tracing:
Supports up to 256 levels
1000 event IDs (10000-10999) available for the RDBMS
256 opcodes to further categorize the traces within an event ID
Always-on minimal tracing
Support for optional trace archiving


KST Concepts (continued)


The KST facility uses event-based tracing with event IDs ranging from 10000 to 10999.
To further control the extent or detail of tracing with the same event ID, 256 levels (0 to
255) can be used.
In addition, KST supports opcode filtering in each trace of the same event ID. This adds a
second dimension to tracing, so that a single event ID is used for a component and each
functionality of the component is categorized by different opcodes. Furthermore, a level
can be used to control details of the tracing that was logged by the facility.
One of the features in the KST facility is the support of always-on minimal tracing.
Trace instrumentation with level 0 is always-on tracing, and all the level-0 traces are
always logged when KST is enabled through the initialization parameter
trace_enabled. Note that the event ID is not required to enable the always-on
feature. However, it can be disabled through the command ALTER TRACING DISABLE
<event-spec> that disables tracing for the specified event at all levels.
The KST facility also provides optional trace archiving to users so that traces in memory
buffer are logged to files during run time when the buffer wraps around. This increases the
amount of data that is available for diagnosis if the size of the allocated buffer is not large
enough to cover tracing for a longer period of execution. This feature is not recommended
for production systems. However, it is very useful for diagnosing problems during
development.
DSI408: Real Application Clusters Internals I-471

Circular Buffer

X$TRACE

SGA
P1
Trace Buffer Process 1
Pn
Trace Buffer Process n

19-472

Copyright 2003, Oracle. All rights reserved.

Circular Buffer
All trace buffers reside in the SGA, and each buffer is assigned to a single Oracle process.
During run time, trace data from each process is logged to its own buffer. Users can query
the contents of the trace buffers and the status of tracing behavior through fixed table
views (X$ tables).
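As a sketch, the per-process buffers can be inspected at run time with a query such as the
following. The column names are taken from the X$TRACE attributes listed later in this
lesson; the PID value is hypothetical.

```sql
-- Trace records logged by Oracle process 13 (hypothetical PID), in logging order
SELECT time, seq#, event, op, data
  FROM x$trace
 WHERE pid = 13
 ORDER BY seq#;
```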

DSI408: Real Application Clusters Internals I-472

Data Structure kstrc

Fixed-size, fixed-format trace records


Metadata section
Trace data section; maximum of 48 bytes

19-473

Trace records populated with KSTRC0, KSTRC1, ..., KSTRC6, and KSTRCX
Formatting callback registered with kstdfcb

Copyright 2003, Oracle. All rights reserved.

Data Structure kstrc


The KST mechanism uses the data structure kstrc of 64 bytes to record a trace. Each
trace record has a fixed format for the header or metadata and trace data.
The metadata contains the time stamp, the sequence number (unique for an instance), the
Oracle process ID, the user session ID, the event ID, and the opcode.
The trace data has a maximum size of 48 bytes. KST supports two types of data for tracing:
Up to six ubig_ora numbers, handled by the KSTRC0 through KSTRC6 macros. The
numeric suffix defines the number of ubig_ora trace values in the argument list.
One specially defined data structure of 48 bytes that fills up the data section of the
record, handled by the KSTRCX macro.
All the macros require the event_id, level, and opcode in addition to the data.
The formatting callback is registered with the kstdfcb function, tying a specific
event_id to a formatting routine. The callback is used only when data is examined in
X$TRACE or written to file. If no callback exists, all trace data of an event_id is output
as six ubig_oras in hexadecimal. A formatting callback should therefore be defined for
any special trace data structure. The formatting code avoids pointer dereferencing, because
invalid pointers would produce illegal values or could crash the trace formatting.
DSI408: Real Application Clusters Internals I-473

Trace Control Interfaces

You can control tracing characteristics with:


Initialization parameters
trace_enabled
Underscore parameters

SQL statements
ALTER TRACING
ALTER SYSTEM SET

19-474

Copyright 2003, Oracle. All rights reserved.

Trace Control Interfaces


Users can specify the controls either through the initialization parameters during instance
startup or SQL statements during run time.
During instance startup, tracing behavior can be configured through initialization
parameters. Only the trace_enabled parameter is visible to customers, enabling or
disabling the tracing mechanism.
Use the ALTER TRACING statement or ALTER SYSTEM SET statement to change the
value of initialization parameters whose scope is dynamic for altering the tracing with
SQL.

DSI408: Real Application Clusters Internals I-474

KST Initialization Parameters


<event-string> = <event-spec>:<level>:<proc-spec>
<event-spec>   = <event>|<event>,<event-spec>
<event>        = ALL|<event-id>|<event-id>-<event-id>
<proc-spec>    = <proc>|<proc>,<proc-spec>
<proc>         = ALL|BGS|FGS|<pid>|<pid>-<pid>|<procname>
<level>        = 0-255

trace_enabled          = {TRUE|FALSE}
_trace_archive         = {TRUE|FALSE}
_trace_events          = <event-string>
_trace_processes       = {<proc-spec>|ALL}
_trace_buffers         = <proc-spec>:<size>
_trace_flush_processes = {<proc-spec>|ALL}
_trace_file_size       = {<integer>|64K}
_trace_options         = {text|binary},{multiple|single}

19-475

Copyright 2003, Oracle. All rights reserved.

KST Initialization Parameters


Initialization parameters that control KST behavior are used during instance startup.
trace_enabled: Turn on/off KST tracing mechanism
_trace_archive: Turn on/off KST trace archiving
_trace_events: Events, level, and processes to be traced
_trace_processes: Which process tracing is enabled
_trace_buffers: Buffer size on per-process basis (default 256:ALL)
_trace_flush_processes: Processes with trace archiving enabled
_trace_file_size: Maximum size for archive/flush trace file
_trace_options: Output in binary or text format, and per-process (multiple) or
per-instance (single) file mode (default: text, multiple)
Note: _trace_events can be specified multiple times within the same block in the
init.ora file. If the parameter is specified again outside the block, its setting is
overwritten by the last entry.
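As a sketch of the grammar above, an init.ora fragment might enable KST as follows.
The event range, level, process specification, and buffer size are illustrative values only,
and the exact quoting conventions may vary by platform and release.

```
trace_enabled          = TRUE
_trace_archive         = FALSE
_trace_events          = "10425-10435:4:BGS"
_trace_buffers         = "ALL:256"
_trace_options         = "text, multiple"
```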

DSI408: Real Application Clusters Internals I-475

KST Initialization Parameters


Parameter                Class     Scope

trace_enabled            Dynamic   Global
_trace_archive           Dynamic   Global
_trace_events            Dynamic   Local
_trace_processes         Static    Local
_trace_buffers           Static    Local
_trace_flush_processes   Dynamic   Local
_trace_file_size         Static    Local
_trace_options           Static    Global

19-476

Copyright 2003, Oracle. All rights reserved.

KST Initialization Parameters (continued)


The class of a parameter defines whether the parameter is static or dynamic. The scope of
a parameter defines the coverage of a parameter in RAC instances. For global parameters,
all instances have the same value of the parameter.
Some parameters can be modified by the ALTER TRACING statement, although they have
the class static.

DSI408: Real Application Clusters Internals I-476

KST Trace Control Interfaces

Use SQL to modify tracing behavior at run time:


ALTER TRACING [ON|OFF]
[ENABLE <event-string>|DISABLE <event-spec>]
[FLUSH <proc-spec>]

19-477

Copyright 2003, Oracle. All rights reserved.

KST Trace Control Interfaces


SQL statements provide users the means to modify the tracing behavior of KST at run time.
ALTER TRACING ON: Enables tracing at run time (trace_enabled must be set to
TRUE for this to take effect)
ALTER TRACING OFF: Disables tracing at run time (regardless of the value of
trace_enabled)
ALTER TRACING ENABLE <event-string>: Enables trace events at run time
ALTER TRACING DISABLE <event-spec>: Disables trace events at run time. This
also disables level-0 tracing for the specified event.
ALTER TRACING FLUSH <proc-spec>: Flushes traces to file immediately. Note
that in the current release (Oracle9i, release 1), flushing is performed in a delayed
mode if multiple-file mode is used.
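A few examples of these statements, using the <event-string> and <proc-spec>
grammar from the earlier slide. The event IDs and levels are illustrative, and the quoting
shown is a sketch rather than authoritative syntax.

```sql
ALTER TRACING ON;
ALTER TRACING ENABLE "10425-10435:4:ALL";  -- DLM events at level 4, all processes
ALTER TRACING DISABLE "10425-10435";       -- also disables level-0 tracing for these events
ALTER TRACING FLUSH "ALL";                 -- flush all trace buffers to file
```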

DSI408: Real Application Clusters Internals I-477

KST Trace Control Interfaces

ALTER SYSTEM SET


trace_enabled
_trace_archive
_trace_flush_processes
_trace_events

Copyright 2003, Oracle. All rights reserved.

KST Trace Control Interfaces (continued)


The SQL command ALTER SYSTEM SET can also be used to alter the trace parameters
that are marked dynamic.
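For example, a dynamic trace parameter might be changed at run time as follows. Note that
underscore parameters are normally enclosed in double quotation marks in ALTER SYSTEM SET;
the event string here is illustrative.

```sql
ALTER SYSTEM SET "_trace_events" = '10401:8:ALL';
ALTER SYSTEM SET trace_enabled = TRUE;
```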

DSI408: Real Application Clusters Internals I-478

KST Fixed Table Views

Dynamic views for tracing characteristics and trace buffers

X$TRACE_EVENTS
EVENT, TRCLEVEL, STATUS, PROCS

X$TRACE
EVENT, OP, TIME, SEQ#, SID, PID, DATA

19-479

Copyright 2003, Oracle. All rights reserved.

KST Fixed Table Views


There are two fixed table views that are related to the KST mechanism. They are used for
online monitoring of tracing characteristics and viewing the contents of the trace buffers in
the SGA.
Attributes for X$TRACE_EVENTS (trace characteristics): EVENT, TRCLEVEL,
STATUS, PROCS
Attributes for X$TRACE (trace buffers in the SGA): EVENT, OP, TIME, SEQ#,
SID, PID, DATA
Note that X$TRACE shows the current trace data in all trace buffers and can be used as an
online tool to view traces and spot problems during run time.
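The views might be queried as follows. This is a sketch: the meaning of the STATUS column
values is an assumption and should be verified against the release in use.

```sql
-- Which events are currently enabled, and for which processes?
-- (non-zero STATUS assumed to mean "enabled")
SELECT event, trclevel, procs
  FROM x$trace_events
 WHERE status != 0;

-- Recent DLM traces (events 10425-10435) across all trace buffers
SELECT time, seq#, pid, event, op, data
  FROM x$trace
 WHERE event BETWEEN 10425 AND 10435
 ORDER BY time;
```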

DSI408: Real Application Clusters Internals I-479

KST Trace Output

Output trace data to file as required
  User request
  Process state dump
  Crash dump in RAC instances
Output in binary or text format
Circular trace files with .trw as extension

Copyright 2003, Oracle. All rights reserved.

KST Trace Output


The trace data that is recorded in memory buffers can be output to files for future reference.
There are three situations in which traces are output to files:
Users can request trace flushing from memory to files either through the initialization
parameter _trace_archive or the SQL statement ALTER TRACING FLUSH.
Trace data is dumped to files along with process state dump when an exception
occurs for any fatal process.
Trace data is dumped to files across all RAC instances when one of the instances
crashes.
The trace data can be written to a file in either binary or text format. If binary format is
used, the data is written in hexadecimal format to the files. If text format is used, all data is
output either as six ubig_oras or in a user-defined format if a formatting callback is
specified for the event ID. Binary format is preferred if performance is a concern during
dumping.
All KST trace files have .trw as their extension to distinguish them from regular process
trace files (which have .trc as their extension). Also, these trace files are circular (similar
to the memory buffers to limit the file size).

DSI408: Real Application Clusters Internals I-480

KST Trace Output

Trace files can be on either a per-process or a per-instance basis.
Name for per-process trace file:
<SID>_<proc_name>_<pid>.trw
Name for per-instance trace file:
trace_<SID>.trw
DIAG writes traces to the per-instance file; otherwise each process outputs its own
traces to files.

19-481

Copyright 2003, Oracle. All rights reserved.

KST Trace Output (continued)


KST trace files are on either a per-process or a per-instance basis. For per-process files,
each process has its own trace output file and the process itself writes its traces from
memory to file. For a per-instance file, there is only a single trace file used for trace output
for all processes of the instance, and DIAG performs trace writing for all processes. In
case of process death, traces of the dead process are dumped to the trace file along with the
process state dump.
Naming convention for per-process trace files:
<SID>_<proc_name>_<pid>.trw

Example: db_lmon_1010.trw for LMON with SID=db


Naming convention for per-instance trace files:
trace_<SID>.trw

Example: trace_db.trw for SID=db


These trace files are created in the directory defined by the initialization parameter
background_dump_dest.

DSI408: Real Application Clusters Internals I-481

KST Trace Output

Trace data can be output in two modes:
  Archiving
  Flushing
KST uses an initialization parameter to enable archiving.
Trace buffers are flushed to the file system with the ALTER TRACING FLUSH statement.

Copyright 2003, Oracle. All rights reserved.

KST Trace Output (continued)


There are two modes of outputting trace data to files:
Archiving mode: Set through _trace_archive. Archiving remains active until it is
turned off.
Flushing mode: Performed when users issue an ALTER TRACING FLUSH statement.
In archiving mode, traces are written to files whenever the number of unarchived traces in
the buffer is half the size of the buffer or the buffer wraps around. However, flushing
occurs only when users issue the SQL statement. Note that flushing is performed in a
delayed mode in Oracle9i, release 1.

DSI408: Real Application Clusters Internals I-482

KST Current Instrumentation

Oracle9i components with trace instrumentation:


DLM layer
IPC layer
Space management layer
Shared servers (MTS)
PQ module
Transaction layer

Level-0 (always-on) tracing is enabled as default.

Copyright 2003, Oracle. All rights reserved.

KST Current Instrumentation


In version 9.0.1, trace instrumentation was done in several kernel components by using the
KST tracing facility.
Event numbers used by various components:
DLM: 10425 to 10435
IPC: 10401
Space management: 10907
Shared servers (MTS): 10249
PQ: 10371
Transaction layer: 10810 to 10812
For RAC production systems, KST tracing is enabled for all events with level 0 as the
default behavior.
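Given these event ranges, the tracing status of the instrumented components can be checked
with a query such as the following (event IDs from the list above; column names as described
earlier in this lesson):

```sql
SELECT event, trclevel, status, procs
  FROM x$trace_events
 WHERE event IN (10401, 10907, 10249, 10371)
    OR event BETWEEN 10425 AND 10435
    OR event BETWEEN 10810 AND 10812;
```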

DSI408: Real Application Clusters Internals I-483

KST Performance


Tracing affects the overall performance.


Customers agree on a 5% to 10% trade-off in
overall performance.
Global tracking uses 58 extra cycles for disabled
events versus 176 extra cycles for enabled events.
About 3% overhead (or less) for enabling tracing
at level 6 or lower.
Overhead is a function of instrumentation.

Copyright 2003, Oracle. All rights reserved.

KST Performance
Tracing definitely affects the overall performance of a system, regardless of any tracing
mechanism or design. The question is: How much performance degradation are users
willing to sacrifice in exchange for enhancing the diagnosability of the system when a
problem occurs?
In general, most customers are willing to accept about 5% to 10% overhead as the trade-off
between diagnosability and system performance.
In version 9.0.1, CPU instruction cycles used by KST tracing were measured. Regardless
of whether a trace event is enabled or not, some extra cycles are used after global tracing is
enabled (trace_enabled is TRUE) because certain cycles are required to perform the
event checking.
When global tracing is enabled, 58 extra cycles are used for event checking of a disabled
event and 176 extra cycles are used for an enabled event.
An average of less than 3% overhead was found when the regression test for RAC was run
with all events enabled at level 6 or less. Note that only a few components use KST
tracing in Oracle9i, release 1. Tracing overhead increases as instrumentation is done in
more RDBMS components.
Note that tracing overhead is a function of instrumentation. The performance may vary in
different releases.
DSI408: Real Application Clusters Internals I-484

KST: Examples

Sample instrumentation
  Sample usage for KSTRC[0-6] in kju.c
  Sample usage for KSTRCX in kjdd.c
  Sample format callback in kji.c (kjdgtfmt)
Sample trace file

Copyright 2003, Oracle. All rights reserved.

KST: Examples
Following are code examples of KSTRC[0-6], KSTRCX, a formatting callback, and
kstdfcb registration. Note that formatting callbacks should be registered in the notifier
function of the component.

DSI408: Real Application Clusters Internals I-485

KST: Examples (continued)


KSTRC[0-6]
/* kjuef - End function, a.k.a. Convert completion funct */
void kjuef(cookie, endstat)
  kjuvoidp cookie;
  kjustat  endstat;
{
  kjatsst *stat;
  text     rbuf[64];
  kjuresn *rn;

  if (cookie)
  {
    stat = (kjatsst *)cookie;
    rn = &stat->resname_kjatsst;
    KJDGTRACEBYTYPE(rn, (ub4)8, KJDGTT_AST, 0, 0,
        ("[AST][kjuef][%s][ast fired]\n",
         (char*)kjqfrn(rbuf, rn)));
    KSTRC3(KJDGTT_LKEVT, KJDGTT_ASTFIRED, 8, KJURN_ID1(rn),
           KJURN_ID2(rn), rn->nam_kjurn[2]);
    stat->ast_fired_kjatsst = TRUE;
    stat->cookie_kjatsst    = cookie;
    stat->endstat_kjatsst   = endstat;
  }
  return;
}

kstdfcb
void kjinfy(nfytype, ctx)
  ub4    nfytype;
  dvoid *ctx;
{
  ...
  else if (nfytype == KSCNOPCR)
  {
    ...
    /* Register KST trace format callback */
    kstdfcb(KJDGTT_LKEVT, (KSTFPTR)kjdgtfmt);
    /* Register KST trace format callback for kjdd layer */
    kstdfcb(KJDGTT_DD, (KSTFPTR)kjddtfmt);
    /* Register KST trace format callback for IPC layer */
    kstdfcb(KJDGTT_IPC, (KSTFPTR)kjdgfmtipc);
    /* Register KST trace format callback for TRFC layer */
    kstdfcb(KJDGTT_TRFC, (KSTFPTR)kjdgfmttrfc);
  }
  ...
}

DSI408: Real Application Clusters Internals I-486

KST: Examples (continued)


KSTRCX
/*
** Validate the deadlock by traversing the clusterwide deadlock graph
*/
STATICF word
kjddvald(bp)
  kjddb *bp;
{
  kjl *lockp;
  kjr *resp;
  ub4  lkver;
  word level = ksepec(OER(KJDGTT_DD));
  kjsolk *sghead = &(kjiudb->dd_stat_kjga.sgh_kjddstat);
  kjsolk *pqhead = &(kjiudb->dd_stat_kjga.prq_kjddstat);
  /* node originating the deadlock search */
  ub2 origin = bp->req_kjddb.dd_master_node_kjxmddi;
  /* node responsible for printing the graph to the trace file */
  ub2 prnode = KJGA_FDTONODE(0); /* the lowest node */
  boolean  dd_invalid = FALSE;
  kjftnid  lk_node = kjiudb->node_id_kjga;
  kjddsg  *pp = KJSOSTRUC(kjsolfs(sghead), kjddsg, link_kjddsg);
  kjddsg  *pp2; /* to check for duplicate locks in the wait for graph */
  kjsolk  *qp;
  boolean  dd_victim = FALSE;
  kjddtrc  trcctx;

  /* Prepare the KST trace record */
  CLRSTRUCT(trcctx);
  trcctx.ddtyp_kjddtrc = ((kjiudb->dd_stat_kjga.txs_kjddstat) ? 1:0);
  KJDEF_SETQUAD(trcctx.time_kjddtrc, kjiudb->dd_stat_kjga.t_kjddstat);
  trcctx.snode_kjddtrc = kjiudb->node_id_kjga;
  /* Log a trace record */
  KSTRCX(KJDGTT_DD, KJDD_DDFND, 5, (void *)&trcctx);
  ...
}

DSI408: Real Application Clusters Internals I-487

KST: Examples (continued)


Formatting Callback
/*
** NAME
**   kjdgtfmt - LK Trace format callback
**
** DESCRIPTION
**   A format callback function for KST trace data
*/
void kjdgtfmt(action, op, data, buf, len)
  uword  action;
  ub1    op;
  dvoid *data;
  char  *buf;
  ub4    len;
{
  ubig_ora *darray = (ubig_ora *)data;
  switch(op)
  {
  case KJDGTT_ASTFIRED:
  {
    text    buf1[64];
    kjuresn rn;
    KJDG_SET_RESN(&rn, darray[0], darray[1], darray[2]);
    (void) sprintf(buf, "kjuef: %s", (char*)kjqfrn(buf1, &rn));
    break;
  }
  case KJDGTT_SYNCCVT:
  {
    text    buf1[64];
    kjuresn rn;
    KJDG_SET_RESN(&rn, darray[0], darray[1], darray[2]);
    (void) sprintf(buf, "kjuscv: %s[lockp " KPPTPTRFMT "][level %d]",
                   (char*)kjqfrn(buf1, &rn), KPPTPTRWRP(darray[3]),
                   (word)darray[4]);
    break;
  }
  ...
}

DSI408: Real Application Clusters Internals I-488

KST Sample Trace File

1
1020304 1 2048 384 32 1
Oracle9i Enterprise Edition Release 9.0.1.0.0 - Production
With the Partitioning and Real Application Clusters options
JServer Release 9.0.2.0.0 - Beta
ORACLE_HOME = /ade/ilam_rdbms_lrg/oracle
System name:    SunOS
Node name:      dlsun1932
Release:        5.6
Version:        Generic_105181-14
Machine:        sun4u
Instance name: lrg
Oracle process number: 13
Unix process pid: 20723, image: oracle@dlsun1932 (TNS V1-V3)
8392EACE:0000000E 5 0 10280  1 0x00000005
83BEBE73:0000000F 5 0 10401 28 KSXPUNMAP: client 1
83BEBE97:00000010 5 0 10401 27 KSXPMAP: client 1 base 0x80048000 size 0x37b8000
83BED062:00000011 5 4 10429  7 MB SO Al: Allocated MBSO 82b5eac4
83BED107:00000012 5 4 10427 10 Init ctx: Initialize ksxp for 1 ports
83BED2C2:00000013 5 4 10401 14 KSXPTIDCRE: tid(1,1,0x83bed2b6)
83D32C1A:00000042 5 4 10429  2 AllocBuf: buf 824bf624, pool 800084b0, size 2080, out(i) 1, out(s) 0
83D32C30:00000043 5 4 10429  2 AllocBuf: buf 824bfe44, pool 800084b0, size 2080, out(i) 2, out(s) 0
83D32C32:00000044 5 4 10429  2 AllocBuf: buf 824c0664, pool 800084b0, size 2080, out(i) 3, out(s) 0
83D32C33:00000045 5 4 10429  2 AllocBuf: buf 824c0e84, pool 800084b0, size 2080, out(i) 4, out(s) 0
83D32C34:00000046 5 4 10429  2 AllocBuf: buf 824c16a4, pool 800084b0, size 2080, out(i) 5, out(s) 0
83D32C46:00000047 5 4 10429  2 AllocBuf: buf 824c1ec4, pool 800084b0, size 2080, out(i) 6, out(s) 0
83D32C47:00000048 5 4 10429  2 AllocBuf: buf 824c26e4, pool 800084b0, size 2080, out(i) 7, out(s) 0
Copyright 2003, Oracle. All rights reserved.

KST Sample Trace File


The very first line of the trace file contains the metadata about trace information:
  Binary or text, indicated by 0 or 1, respectively
  File magic number (4 bytes)
  Version number of trace file (4 bytes)
  File block size (4 bytes)
  Data record size (4 bytes)
  Wrapping (4 bytes)
This is followed by the general information about the tracing process and the machine in
the standard trace file header.
The actual trace data is in the following format:
time stamp, sequence #, process id, level, event #, opcode, data

DSI408: Real Application Clusters Internals I-489

KST Demonstration

Trace control manipulation

19-490

Copyright 2003, Oracle. All rights reserved.

KST Demonstration
Demonstration of the user interfaces for modifying the tracing behavior of the KST
mechanism:
  Initialization parameters
  ALTER TRACING
  ALTER SYSTEM SET
  X$TRACE and X$TRACE_EVENTS
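A demonstration script along these lines might look as follows. This is a sketch that
combines the interfaces covered in this lesson; the event ID, level, and quoting are
illustrative.

```sql
-- Requires trace_enabled = TRUE at instance startup
ALTER TRACING ON;
ALTER TRACING ENABLE "10425:6:ALL";

-- Observe the tracing characteristics and the trace buffers
SELECT event, trclevel, status, procs FROM x$trace_events WHERE event = 10425;
SELECT time, seq#, pid, op, data FROM x$trace WHERE event = 10425;

-- Flush to .trw files, then disable
ALTER TRACING FLUSH "ALL";
ALTER TRACING DISABLE "10425";
ALTER TRACING OFF;
```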

DSI408: Real Application Clusters Internals I-490

DIAG Daemon

[Diagram: two RAC instances. In each instance, processes log traces to per-process trace
buffers in the SGA, and a DIAG daemon reads those buffers. The DIAG daemons of the two
instances communicate with each other.]

Copyright 2003, Oracle. All rights reserved.

DIAG Daemon
The diagram in the slide shows the architecture of the DIAG daemon in a RAC
environment.
Note that there is a difference between DIAGs in RAC and those in a single instance,
although both processes have the same name:
DIAG in a single instance is responsible for trace archiving and flushing only.
DIAG in a RAC instance provides other diagnosability services, in addition to trace
archiving and flushing.

DSI408: Real Application Clusters Internals I-491

DIAG Daemon: Features

DIAG Daemon:
Is an integrated service for all the diagnosability
needs of an instance
Provides a scalable framework for RAC
diagnosability
Works independently from an instance
Relies only on services provided by underlying OS
Is a lightweight daemon process, one per instance

19-492

Copyright 2003, Oracle. All rights reserved.

DIAG Daemon: Features


The design goal of the DIAG process is to be an integrated service for all the
diagnosability needs of a RAC instance. Although several debugging and diagnostic tools
existed in versions before Oracle9i, they did not provide a single interface for a cluster
environment and were not cluster-ready, which made diagnosis across multiple instances
difficult.
The DIAG process is designed to meet the following requirements:
The framework scales in cluster environments in which the number of nodes or
instances vary, accommodates variation, and works seamlessly without interrupting
any service provided.
The framework does not interfere with or affect the normal operation of the system.
In any condition, the framework should not adversely affect the performance of a
system regardless of the state of the system. Therefore, the DIAG daemon does not
use any service or resource from the RDBMS kernel. Optimally, the DIAG process
uses the services that are provided by the underlying OS.
DIAG is a lightweight daemon process that does not affect overall system performance,
although it is integrated with the RDBMS kernel for startup and shutdown and for its need
to access the SGA for trace buffers.

DSI408: Real Application Clusters Internals I-492

DIAG Daemon: Features

DIAG Daemon:
Is highly available and is tolerant of common
failures
Monitors the health of a local RAC instance
Coordinates the collection of diagnosability data
from all the nodes in a RAC server
Services clusterized ORADEBUG
Provides an extensible interface for future projects

19-493

Copyright 2003, Oracle. All rights reserved.

DIAG Daemon: Features (continued)


The DIAG process is resilient to failures, as its goal is to diagnose errors, problems, or
failures that have occurred in the system. DIAG does not share any resource with other
Oracle processes and has no dependency on the RDBMS kernel (except the VOS layer for
bare OS services), which minimizes the possibility of a crash due to other processes. PMON
restarts a new DIAG process to continue its service if the DIAG process dies.
Another feature of the DIAG daemon is to monitor the health of the local RAC instance.
On failure of an essential process, DIAG can capture the system state and other useful
information for later diagnosis, and notify DIAG on the other instances to capture similar
information. This provides a snapshot view of the entire cluster environment. In addition,
the DIAG process serves as the base framework to execute clusterized oradebug
commands in RAC instances.
Improvements and new diagnostic projects will be adopted into the DIAG infrastructure in
future versions. An example of a planned extension is hang management for a
high-availability (HA) configuration; DIAG will be responsible for monitoring the
liveliness of operations of the local RAC instance and performing any necessary recovery
if an operational hang is detected.

DSI408: Real Application Clusters Internals I-493

DIAG Daemon: Design

DIAG process group:


Is analogous to Cluster Group Service (CGS)
group
Provides peer-to-peer communication among
DIAGs
Identifies a master DIAG for synchronization and
coordination
Reconfigures for membership change
Rolls back partial operations when reconfiguring

19-494

Copyright 2003, Oracle. All rights reserved.

DIAG Daemon: Design


The DIAG process group is analogous to the CGS group but is independently registered
with the same cluster node monitor for cluster services. The DIAG process group provides
an abstraction of group services to the registered DIAG processes on different nodes.
These services include communication, synchronization, and coordination among
members on different instances.
There is a single DIAG process group for each database cluster, and only the DIAG
process can register as a member in each instance. In the group, the node with the lowest
node ID (as defined by the node monitor) is elected to be the master of the group. Master
DIAG is responsible for all synchronization and coordination among members. For
example, all multicast messages are first sent to the master DIAG, which then forwards
them to all destination nodes to guarantee global message ordering.
Reconfiguration occurs in the DIAG process group when there is a membership change. If
any member joins or leaves the process group, then all existing members synchronize their
local membership information. This synchronization is also coordinated by the master
DIAG.
In the case of a DIAG group reconfiguration, all ongoing tasks are aborted and rolled back
to a previous consistent state. All tasks are then resubmitted as a new request.
DSI408: Real Application Clusters Internals I-494

DIAG Daemon: Design

Orthogonal to instance:
Does not use latches or locks
Does not use shared resources from the database
kernel
Does not affect the instance and is not affected by
the instance
Does not share the communication channel with
other processes

19-495

Copyright 2003, Oracle. All rights reserved.

DIAG Daemon: Design (continued)


Orthogonality is another key feature in the design of the DIAG daemon. All services
provided by DIAG do not interfere with or allow interference from any operations
performed by other Oracle processes. This creates a protected or isolated domain in the
DIAG process for diagnosability.
To be orthogonal to the instance, the DIAG process does not use any shared resource from
the RDBMS kernel, such as a latch or lock. Also, the implementation does not have any
dependency on RDBMS components, except the VOS layer, which provides an abstraction of
basic OS functionality: the fundamental building block for the DIAG process.
To prevent any interference with the RDBMS kernel, the DIAG daemon creates its own
communication model to isolate itself from potential issues in the shared model of
communication used by other Oracle processes. The DIAG process of each instance owns
its own IPC port for messaging and has a different implementation of message protocol
instead of sharing the common IPC channel provided by CGS. This design provides an
alternative communication channel in case of a problem occurring in CGS because of
database operations.

DSI408: Real Application Clusters Internals I-495

DIAG Daemon: Design

Communication model:
Based on the IPC service from the OSD layer
Owns unique IPC port and message protocol
Supports multicast messaging
Supports memory-mapped copy for large data
transfer

19-496

Copyright 2003, Oracle. All rights reserved.

DIAG Daemon: Design (continued)


Characteristics of the communication model in the DIAG daemon are:
It is based on the preliminary IPC service from the OSD layer to eliminate any
potential problem or contention with the RDBMS kernel.
It has separate communication channels (IPC port and memory-mapped region
privately defined by the DIAG process, instead of those used by the cache fusion
layer) based on the OSD IPC service.
The DIAG process has its own message protocol (flow control and message
semantics) on multicasting and memory-mapped copying.

DSI408: Real Application Clusters Internals I-496

DIAG Daemon: Design

Master DIAG:
Coordinates message ordering
Coordinates DIAG group reconfiguration
Synchronizes all DIAG group communications

19-497

Copyright 2003, Oracle. All rights reserved.

DIAG Daemon: Design (continued)


The master DIAG is located at the node with the lowest node ID defined in the node
monitor of clusterware. Its responsibilities include task synchronization, guarantee of
message ordering for multicasting among DIAGs at different nodes, and performing group
reconfiguration in case membership changes in the DIAG process group.
If the master DIAG leaves or dies, the DIAG process with the next-lowest node ID
becomes the new master. Here it is assumed that the node monitor provides a consistent
view of membership in the DIAG process group among all nodes.
All group-related communications are synchronized through the master DIAG. For
example, a multicast message must be first sent to the master DIAG, which then forwards
the message to the destination DIAGs. When DIAGs receive the message and finish
processing the message, they send an acknowledgment to the message sender. When the
originating DIAG receives acknowledgments from all receivers, it then sends a complete
message to the master DIAG so that the next multicast message can be forwarded from the
master DIAG. Through this protocol, the message ordering can be guaranteed and
synchronization can be achieved. Also, memory-mapped copying can happen only after a
DIAG receives a multicast message and before it sends the acknowledgment back to the
message sender. This is required because no semantics of synchronization (overhead for
this infrequent operation) are enforced for memory-mapped copying among the DIAG
processes.
DSI408: Real Application Clusters Internals I-497

DIAG Daemon: Startup and Shutdown

Instance startup brings up DIAG.
  Second process (after PMON) to start
Instance shutdown terminates DIAG.
Failure resilience
  Restarted by PMON in case of failure

19-498

Copyright 2003, Oracle. All rights reserved.

DIAG Daemon: Startup and Shutdown


Although DIAG works independently from the instance, it is integrated with the RDBMS
kernel so that it can access the SGA for diagnosability purposes. DIAG is the second
process to be brought up during an instance startup. Being the second process to start up, it
can ensure that diagnosability service is available as soon as possible for any potential
startup problem.
DIAG terminates gracefully during normal shutdown of a RAC instance. Ordering is not
important during normal shutdown.
The DIAG process is resilient to failure. Upon discovery of its death, PMON starts a new
DIAG process, enhancing the availability of the diagnosability framework in a RAC
database. Note that DIAG is a nonfatal process for the instance so that its termination, for
any reason, does not affect any operation of the instance.

DSI408: Real Application Clusters Internals I-498

DIAG Daemon: Crash Dumping

Performs a crash dump (clusterwide) by DIAGs upon detecting the death of an essential
Oracle process (FG or BG)
Survives RAC instance crashes:
  Penultimate process to terminate
  Five seconds (adjustable) allowed to dump traces
Flushes KST data to files on demand in RAC

Copyright 2003, Oracle. All rights reserved.

DIAG Daemon: Crash Dumping


Crash dump is one of the most important features of the DIAG daemon. DIAG dumps
KST traces to file and notifies the remote DIAGs after it discovers the death of an essential
Oracle process in the local instance. During the instance cleanup procedure, it is the
penultimate process to be terminated because it needs to perform trace flushing to the file
system. By default, the terminating process, usually PMON, gives five seconds to DIAG
for dumping.
The allowed time to dump traces on shutdown is controlled by the
_ksu_diag_kill_time parameter.
DIAG flushes KST trace data to files on demand with the ALTER TRACING FLUSH
statement. DIAG performs the flushing when per-instance (single) file mode is used.

DSI408: Real Application Clusters Internals I-499

DIAG Daemon: Crash Dumping

Coordinates the dumping of trace buffers on all nodes:
  Notifies peer DIAGs to dump traces
  Instance freeze is not required; the interest is the execution history captured in
  buffers within a time interval that includes the crash moment.
cdmp_<timestamp> is the directory for dumping traces during a crash.

Copyright 2003, Oracle. All rights reserved.

DIAG Daemon: Crash Dumping (continued)


During an instance crash, DIAG sends out a dump message to peer DIAGs in the cluster
and then dumps traces to file.
When a DIAG process receives a dump message, it dumps the local trace data to the file
system so that a snapshot of the entire cluster can be obtained for diagnosis later.
Instance freezing is not required to obtain the snapshot of traces across all instances. The
reason is that all traces with execution history required for diagnosis are already stored in
the memory buffer and are dumped to the file after the DIAG process receives the crash
notification. Traces for the moment of crash are likely to be in the history.
A dump directory named cdmp_<timestamp> is created in the
background_dump_dest location, and all trace dump files are placed in this directory.
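To locate the clusterwide crash dump after the fact, you first need the background_dump_dest setting of each instance; the cdmp_<timestamp> directories are created beneath it. A minimal sketch:

```sql
-- Illustrative only: find where cdmp_<timestamp> directories are created,
-- then list them at the operating system level on each node.
SELECT value
FROM   v$parameter
WHERE  name = 'background_dump_dest';
```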


Summary

In this lesson, you should have learned about:
- KST and X$TRACE
- DIAG architecture

ORADEBUG and Other Debugging Tools

Objectives

After completing this lesson, you should be able to use ORADEBUG for flash freeze,
tracing, and hang analysis.

ORADEBUG

- ORADEBUG is RAC-aware.
- Commands can execute in one or several instances:
  - SETINST to list instances to affect
  - -G or -R to debug in parallel

SQL> ORADEBUG SETINST "ALL"
SQL> ORADEBUG -G "1 2" LKDEBUG -A LOCK

ORADEBUG
You can use the -G or -R options to execute ORADEBUG across instances.
-G means the debugging data and results are written to the trace file of the
executing DIAG daemon at each participating instance.
-R means the same data is returned to the initiating DIAG daemon, which then
writes it to its trace file.
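A hypothetical session illustrating the difference; SYSTEMSTATE is used here only as an example of a named dump, and the instance list is illustrative:

```sql
SQL> ORADEBUG SETINST "ALL"                 -- target every instance
SQL> ORADEBUG -G DEF DUMP SYSTEMSTATE 10    -- output in each instance's DIAG trace file
SQL> ORADEBUG -R DEF DUMP SYSTEMSTATE 10    -- output returned to the local DIAG trace file
```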


ORADEBUG: List of Commands

SQL> ORADEBUG HELP
SETMYPID                                    Debug current process
SETOSPID      <ospid>                       Set OS pid of process to debug
SETORAPID     <orapid> ['force']            Set Oracle pid of process to debug
DUMP          <dump_name> <lvl> [addr]      Invoke named dump
DUMPSGA       [bytes]                       Dump fixed SGA
DUMPLIST                                    Print a list of available dumps
EVENT         <text>                        Set trace event in process
SESSION_EVENT <text>                        Set trace event in session
DUMPVAR       <p|s|uga> <name> [lev]        Print/dump fixed PGA/SGA/UGA variable
SETVAR        <p|s|uga> <name> <value>      Modify a fixed PGA/SGA/UGA variable
PEEK          <addr> <len> [level]          Print/Dump memory
POKE          <addr> <len> <value>          Modify memory
WAKEUP        <orapid>                      Wake up Oracle process
SUSPEND                                     Suspend execution
RESUME                                      Resume execution
FLUSH                                       Flush pending writes to trace file
CLOSE_TRACE                                 Close trace file
TRACEFILE_NAME                              Get name of trace file
LKDEBUG                                     Invoke global enqueue service debug
NSDBX                                       Invoke CGS name-service debug
-G            <Inst-List|def|all>           Parallel oradebug commands prefix
-R            <Inst-List|def|all>           Parallel oradebug prefix (return output)
SETINST       <instance# .. | all>          Set instance list in double quotes
SGATOFILE     <SGA dump dir>                Dump SGA to file; dirname in "-quotes
DMPCOWSGA     <SGA dump dir>                Dump&map SGA as COW; dir in "-quotes
MAPCOWSGA     <SGA dump dir>                Map SGA as COW; dirname in "-quotes
HANGANALYZE   [level]                       Analyze system hang
FFBEGIN                                     Flash Freeze the Instance
FFDEREGISTER                                FF deregister instance from cluster
FFTERMINST                                  Call exit and terminate instance
FFRESUMEINST                                Resume the flash frozen instance
FFSTATUS                                    Flash freeze status of instance
SKDSTTPCS     <ifname> <ofname>             Helps translate PCs to names
WATCH         <address> <len> <self|exist|all|target>  Watch a region of memory
DELETE        <local|global|target> watchpoint <id>    Delete a watchpoint
SHOW          <local|global|target> watchpoints        Show watchpoints
CORE                                        Dump core without crashing process
IPC                                         Dump ipc information
UNLIMIT                                     Unlimit the size of the trace file
PROCSTAT                                    Dump process statistics
CALL          <func> [arg1] ... [argn]      Invoke function with arguments


Flash Freeze

Use ORADEBUG commands to stop the activity in instances in order to examine
SGA content.
- ffbegin: Freezes an instance
- ffderegister: Deregisters an instance from the cluster
- ffterminst: Exits and terminates the instance
- ffresumeinst: Resumes normal running on a frozen instance
- ffstatus: Checks the status of the instance (frozen or not)

Flash Freeze
Flash freeze permits the freezing of an entire instance. This permits taking any of the
normal dumps via ORADEBUG while the instance state is not changing underneath. Other
instances may time out or hang as a result of freezing one instance. Output for flash
freeze commands (including ffstatus) is written to the alert log. When ffbegin is
issued, each process notification is put in the alert log, as is the response from each
process. Likewise, messages appear in the alert log for ffresumeinst.
Use the SETINST command to specify which instances to freeze; the default is the local
instance only.
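A hypothetical freeze/dump/resume session using the commands above; the instance list is illustrative, and the per-process freeze and resume messages appear in the alert log of each participating instance:

```sql
SQL> ORADEBUG SETINST "1 2"    -- default would be the local instance only
SQL> ORADEBUG FFBEGIN          -- freeze the listed instances
SQL> ORADEBUG FFSTATUS         -- confirm the frozen state (see alert log)
SQL> ORADEBUG DUMPSGA          -- take any normal dumps while frozen
SQL> ORADEBUG FFRESUMEINST     -- resume normal running
```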


LKDEBUG

Global Enqueue Service debugger (lock debug):
- Invoked with ORADEBUG LKDEBUG <items>
- ORADEBUG LKDEBUG HELP lists the available commands

LKDEBUG
Output is written to the trace file (except for the help list).
SQL> oradebug lkdebug help
Usage: lkdebug [options]
  -l [r|p] <enqueue pointer>    Enqueue Object
  -r <resource pointer>         Resource Object
  -b <gcs shadow pointer>       GCS shadow Object
  -p <process id>               client pid
  -P <process pointer>          Process Object
  -O <i1> <i2> <types>          Oracle Format resname
  -a <res/lock/proc/pres>       all <res/lock/proc/pres> pointers
  -a <res> [<type>]             all <res> pointers by an optional type
  -a convlock                   all converting enqueue (pointers)
  -a convres                    all res ptr with converting enqueues
  -a name                       list all resource names
  -a hashcount                  list all resource hash bucket counts
  -t                            Traffic controller info
  -s                            summary of all enqueue types
  -k                            GES SGA summary info
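An illustrative sequence built from the options in the help listing above; all output lands in the current trace file, whose path TRACEFILE_NAME reports:

```sql
SQL> ORADEBUG SETMYPID
SQL> ORADEBUG LKDEBUG -s           -- summary of all enqueue types
SQL> ORADEBUG LKDEBUG -a convres   -- resources with converting enqueues
SQL> ORADEBUG TRACEFILE_NAME       -- locate the trace file to read
```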


NSDBX

CGS Name Service debugger:
- Invoked with ORADEBUG NSDBX <items>
- ORADEBUG NSDBX HELP lists the available commands

NSDBX
Output is written to the trace file (except for the help command).
SQL> oradebug nsdbx help
Usage: nsdbx [options]
  -h                                           Help
  -p <owner> <namespace> <key> <val> <nowait>  Publish a name-entry
  -d <owner> <namespace> <key> <nowait>        Delete a name-entry
  -q <namespace> <key>                         Query a namespace
  -an <namespace>                              Print all entries in namespace
  -ae                                          Print all entries
  -as                                          Print all namespaces
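An illustrative sequence built from the options in the help listing above; output goes to the current trace file:

```sql
SQL> ORADEBUG SETMYPID
SQL> ORADEBUG NSDBX -as    -- print all namespaces
SQL> ORADEBUG NSDBX -ae    -- print all name-entries
```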


HANGANALYZE

- Attempts to search through the state objects and dump a hang tree for a hung
  instance or cluster.
- Invoked with ORADEBUG HANGANALYZE <level>
  - For a first pass, use level 3.
- Use SETINST to perform the analysis across multiple instances.

HANGANALYZE
This is similar in intent to what is performed manually through system states.
The level is between 1 and 10. Level 3 is good for a first pass.

Level  Description
1,2    Only HANGANALYZE output, no process dump at all
3      Level 2 + Dump only processes thought to be in a hang (IN_HANG state)
4      Level 3 + Dump leaf nodes (blockers) in wait chains (LEAF, LEAF_NW, IGN_DMP state)
5      Level 4 + Dump all processes involved in wait chains (NLEAF state)
10     Dump all processes (IGN state)

Remember to use SETINST to make it a clusterwide hang analysis.
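A hypothetical clusterwide first pass, combining SETINST with the parallel -G prefix described earlier:

```sql
SQL> ORADEBUG SETINST "ALL"
SQL> ORADEBUG -G DEF HANGANALYZE 3
-- Escalate to a higher level (for example, 5) only if the level-3 hang
-- tree is inconclusive; each instance's DIAG writes its own trace file.
```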


Summary

In this lesson, you should have learned about the following ORADEBUG commands:
- FLASHFREEZE
- HANGANALYZE
- LKDEBUG
- NSDBX


References

ORADEBUG usage notes in WEBIV:
- 149691.1 FlashFreeze
- 175006.1 HANGANALYZE
- 70032.1 ORADEBUG on Windows
- 154670.1 Debug Events for 9iRAC GES and GCS

References
See Note 178683.1 Tracing GSD, SRVCTL, GSDCTL, and SVRCONFIG for details
about tracing on the RAC utilities.

