Professional Documents
Culture Documents
Internals
Electronic Presentation
D16333GC10
Production 1.0
April 2003
D37990
Authors
Xuan Cong-Bui
John P. McHugh
Michael Müller
Publisher
Glenn Austin
Use, duplication or disclosure by the Government is subject to restrictions for commercial computer
software and shall be deemed to be Restricted Rights software under Federal law, as set forth in
subparagraph (c)(1)(ii) of DFARS 252.227-7013, Rights in Technical Data and Computer Software
(October 1988).
This material or any portion of it may not be copied in any form or by any means without the express
prior written permission of the Education Products group of Oracle Corporation. Any other copying is
a violation of copyright law and may result in civil and/or criminal penalties.
If this documentation is delivered to a U.S. Government Agency not within the Department of
Defense, then it is delivered with Restricted Rights, as defined in FAR 52.227-14, Rights in Data-General, including Alternate III (June 1987).
The information in this document is subject to change without notice. If you find any problems in the
documentation, please report them in writing to Worldwide Education Services, Oracle Corporation,
500 Oracle Parkway, Box SB-6, Redwood Shores, CA 94065. Oracle Corporation does not warrant
that this document is error-free.
Oracle and all references to Oracle Products are trademarks or registered trademarks of Oracle
Corporation.
All other products or company names are used for identification purposes only, and may be
trademarks of their respective owners.
Technical Contributors and Reviewers
Michael Cebulla
Lex de Haan
Bill Kehoe
Frank Kobylanski
Roderick Manalac
Sundar Matpadi
Sri Subramaniam
Harald van Breederode
Jim Womack
Contents
Preface
I
Section I: Introduction
1
Introduction to RAC
Objectives 1-2
Why Use Parallel Processing? 1-3
Scaleup and Speedup 1-5
Scalability Considerations 1-7
RAC Costs: Synchronization 1-9
RAC Costs: Global Resource Directory 1-10
RAC Costs: Cache Coherency 1-12
RAC Terminology 1-14
Terminology Translations 1-16
Programmer Terminology 1-18
History 1-19
History Overview 1-20
Internalizing Components 1-21
Oracle7 1-22
Oracle8 1-23
Oracle8i 1-24
Oracle9i 1-25
Summary 1-26
Course Overview
Prerequisites
I-2
Prerequisites
The prerequisites ensure that the course is useful to you, instead of being too hard, and that
the instructor need not cover basic material.
You must have your TAO account ready for examining source code.
Course Overview
Introduction
Architecture
Platforms
Debug
I-3
Course Overview
This course contains four sections. It is scheduled to take four days but does not require
one day per section. Most of the time is spent on the Architecture section.
Introduction
The Introduction section provides a summary of the public RAC architecture and its
accurate terminology. An overview of architecture changes between versions is also given.
Architecture
The Architecture section covers the theory of operation of RAC. The RAC code stack is
examined from the bottom up. There are many references to the source code.
Platforms
The Platforms section covers the differences and architectural details of RAC
implementation on different platforms. Installation issues and known gotchas are
included.
Practical Exercises
I-5
Practical Exercises
The cluster hardware is shared between students and other classes; this prevents practices
that involve node shutdown or breaking the interconnect.
[Figure: the RAC layer stack — SQL Layer, Buffer Cache, GES/GCS, CGS, Node Monitor, and Cluster Manager]
Introduction to RAC
Objectives
[Slides 1-10 to 1-13: scaleup and speedup diagrams — with added hardware, up to 200% or 300% of the task completes in the same time (scaleup); alternatively, each of two nodes handles 50% of the task, completing it in half the time (speedup)]
Scalability Considerations
1-15
Scalability Considerations
It is important to remember that if any of these six areas is not scalable, no matter how
scalable the other areas are, parallel cluster processing may not be successful.
Hardware scalability: High bandwidth and low latency offer the maximum scalability.
A high amount of remote I/O may prevent system scalability, because remote I/O is
much slower than local I/O.
The bandwidth of the communication interface is the total size of messages that can be
sent per second. The latency of the communication interface is the time required to place
a message on the interconnect; the lower the latency, the more messages can be put on
the interconnect per unit of time.
Operating system: Nodes with multiple CPUs and methods of synchronization in the
OS can determine how well the system scales. Symmetric multiprocessing can
process multiple requests to resources concurrently.
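The bandwidth and latency relationship above reduces to simple arithmetic. The following sketch uses invented figures (the 50-microsecond latency and 128-byte message size are hypothetical, not from the course):

```python
def max_messages_per_second(latency_seconds):
    """Lower latency means more messages can be placed on the
    interconnect per unit of time (idealized: one message at a time)."""
    return 1.0 / latency_seconds

def bandwidth_bytes_per_second(message_size_bytes, messages_per_second):
    """Bandwidth here is the total size of messages sent per second."""
    return message_size_bytes * messages_per_second

rate = max_messages_per_second(50e-6)         # hypothetical 50-microsecond interconnect
print(rate)                                   # about 20000 messages per second
print(bandwidth_bytes_per_second(128, rate))  # about 2.56 MB per second
```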
1-17
Levels of Synchronization
Row-Level (Database)
Maximize concurrency
SCN coherency
[Slides 1-18 and 1-19: diagram — foreground processes fg1 and fg2 in one instance update row1 and row2 in database blocks 100 and 101]
Enqueues are local locks that serialize access to various resources. This
wait event indicates a wait for a lock that is held by another session (or
sessions) in a mode incompatible with the requested mode. See
<Note:29787.1> (about V$LOCK) for details of which lock modes are
compatible with which. Enqueues are usually represented in the format
"TYPE-ID1-ID2", where:
"TYPE" is a 2-character text string
"ID1" is a 4-byte hexadecimal number
"ID2" is a 4-byte hexadecimal number
[Slides 1-20 and 1-21: diagrams — first, one instance whose buffer cache (BCache) holds blocks 100 and 101 while fg1 and fg2 update row1 and row2; then two instances, each with its own buffer cache, updating row1 and row2 against the same database blocks]
Global resources: inter-instance synchronization mechanisms that provide cache coherency for Real Application Clusters. The term can refer to both Global Cache Service (GCS) resources and Global Enqueue Service (GES) resources.
We need a cache.
Serialization: sequencing operations guarantees consistency of data.
Evolutions of Oracle minimize the set of tasks that are serialized.
The time to complete a sequence of operations depends on the slowest element: the disks.
[Slide 1-22: diagram — several foreground processes (fg) serialize access to database blocks]
Given a set of tasks [T1, T2, ..., Tn] that arrive at times [t1 < t2 < ... < tn], suppose
that the system has enough processing units to allow the maximum potential level of
parallelism for these tasks. You can approach the problem of running all the tasks in
minimal time (maximum throughput) in at least two ways:
1) Execute the tasks sequentially, as they arrive; the last to arrive waits until the
previous ones have terminated. This does not use the potential parallelism of your
machine. Good: easy to implement. Bad: performance.
2) Implement a LOCK/WAIT infrastructure and allow all the tasks to run freely until they
are blocked by some other task(s). The effective degree of parallelism will be maximal.
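The second approach can be sketched in miniature (a toy illustration in Python, not Oracle code): the tasks run concurrently, and a lock serializes only the small critical section where they would otherwise conflict.

```python
import threading

counter = 0
lock = threading.Lock()  # the LOCK/WAIT infrastructure

def task(increments):
    global counter
    for _ in range(increments):
        with lock:       # tasks run freely and block only here, on conflict
            counter += 1

threads = [threading.Thread(target=task, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 4000: all tasks completed correctly despite running in parallel
```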
DSI408: Real Application Clusters Internals I-22
Coherency
[Slide 1-23: diagram — two instances hold resource 1,0x100 in shared (S) mode; foreground processes select row1 and row2 from block 100 with start SCNs 1010 and 900; the block on disk is at SCN 800]
[Slides 1-24 and 1-25: diagram — an instance holding locks that cover database blocks 100 through 104; (*) starting with 9i, the fixed locking mode was removed]
GC_FILES_TO_LOCKS = 1=100:2=0:3=1000:4-5=0EACH
GC_FILES_TO_LOCKS = {file_list=lock_count[!blocks][EACH][:...]}
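The clause syntax can be made concrete with a small parser. This is an illustrative sketch only, not the server's actual parsing code, and it handles just the forms shown in the example above:

```python
import re

def parse_gc_files_to_locks(value):
    """Parse a GC_FILES_TO_LOCKS-style string such as
    "1=100:2=0:3=1000:4-5=0EACH" into a per-file mapping."""
    result = {}
    for clause in value.split(":"):
        files, _, spec = clause.partition("=")
        m = re.fullmatch(r"(\d+)(?:!(\d+))?(EACH)?", spec)
        locks, blocks, each = int(m.group(1)), m.group(2), bool(m.group(3))
        lo, _, hi = files.partition("-")          # file_list may be a range
        for f in range(int(lo), int(hi or lo) + 1):
            result[f] = {"locks": locks,
                         "blocks": int(blocks) if blocks else None,
                         "each": each}
    return result

print(parse_gc_files_to_locks("1=100:2=0:3=1000:4-5=0EACH"))
```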
DM Database Mount
PF Password File
DX Distributed Recovery
PR Process Startup
FS File Set
RT Redo Thread
IN Instance Number
IR Instance Recovery
SM SMON
IS Instance State
SN Sequence Number
MM Mount Definition
MR Media Recovery
TT Temporary Table
TA Transaction Recovery
TX Transaction
False Pinging
[Slide 1-26: diagram — in the Global Cache (iDLM), lock element LE:23 covers dirty buffers dba:10, dba:103, and dba:105 in the updating instance's buffer cache, over database blocks 100 through 104]
When another instance needs access to dba:100, the owning instance must ping (write out) all the
dirty blocks that are covered by the same lock element (LE).
[Slide 1-27: diagram — lock elements LE:100 and LE:105 now cover dirty buffers dba:101, dba:103, and dba:105 in the updating instance's buffer cache, over database blocks 100 through 104]
break on GC_ELEMENT_NAME
select inst_id,GC_ELEMENT_NAME,CLASS,MODE_HELD
from gv$gc_element where GC_ELEMENT_NAME>20970000
order by GC_ELEMENT_NAME;
[Sample output: rows for GC_ELEMENT_NAME 20971522, 20971523, 20971913, 20971914, 20976209, and 20976210, with INST_ID, CLASS, and MODE_HELD columns]
Scalability
Scaleup
Scaleup is the capability to provide continued increases in
throughput in the presence of limited increases in processing
capability while keeping time constant:
Scaleup = (volume parallel) / (volume original)
Speedup
Speedup is the capability to provide continued increases in speed in
the presence of limited increases in processing capability, while
keeping the task constant:
Speedup = (time original) / (time parallel)
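The two formulas can be exercised with invented figures (the numbers below are hypothetical, chosen only to show the arithmetic):

```python
def scaleup(volume_parallel, volume_original):
    """Scaleup = (volume parallel) / (volume original), at constant time."""
    return volume_parallel / volume_original

def speedup(time_original, time_parallel):
    """Speedup = (time original) / (time parallel), at constant task."""
    return time_original / time_parallel

# Two nodes process 180 units in the time one node processed 100 units:
print(scaleup(180, 100))  # 1.8
# The same task finishes in 55 time units instead of 100:
print(speedup(100, 55))   # about 1.82
```

A perfectly scalable two-node cluster would score 2.0 on both measures; real systems fall short because of the synchronization costs described earlier in this lesson.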
[Slides 1-28 to 1-32: diagram — nodes 1, 2, and 3 run instances A, B, and C; each instance has its own SGA and GES/GCS layer]
RAC Terminology
1-33
Cache coherency
Resources and locks
Global and local
GCS and GES, or PCM and non-PCM
GRM or DLM
Node, instance, cluster, and process
RAC Terminology
Cache coherency means that the contents of the caches in different nodes are in a well-defined
state with respect to each other. Cache coherency identifies the most up-to-date
copy of a resource, which is also called the master copy. In case of node failure, no vital
information is lost (such as committed transaction state), and atomicity is maintained. This
requires additional logging or copying of data but is not part of the locking system.
A resource is an identifiable entity; that is, it has a name or reference. The entity referred
to is usually a memory region, a disk file, or an abstract entity; the name of the resource is
the resource. A resource can be owned or locked in various states, such as exclusive or
shared.
By definition, any shared resource is lockable. If it is not shared, there is no access
conflict. If it is shared, access conflicts must be resolved, typically with a lock. The terms
lock and resource, although they refer to entirely separate objects, are therefore
(unfortunately) used interchangeably.
A global resource is one that is visible and used throughout the cluster. A local resource
is used by only one instance. It may still have locks to control access by the multiple
processes of the instance, but there is no access to it from outside the instance.
Terminology Translations
1-35
Terminology Translations
RAC = OPS. OPS is the older term. See the History slide (#19) in this lesson.
Row Cache = Dictionary Cache. Row Cache is the older term. It is the SGA area to cache
database dictionary information. It is a global resource.
Distributed Lock Manager (DLM) = Global Resource Manager (GRM). DLM is the older
term; GRM has slightly more functionality. The terms are used for any locking system that
can handle several processes, typically (but not necessarily) on several nodes.
DLM = IDLM = UDLM. DLM is a very general term, but it also refers to the external
operating system-supplied DLM used by Oracle7. IDLM refers to the Integrated
DLM introduced in Oracle8. UDLM is the Universal DLM, that is, the reference
implementation of a DLM made on the Solaris platform. It is often called by its code
reference skgxn-v2.
Some of the RAC processes have retained their old names but are described with a
different purpose:
LMON: Global Enqueue Service Monitor, previously Lock Monitor
LMD: Global Enqueue Service Daemon, previously Lock Monitor Daemon
LMS: Global Cache Service Processes, previously Lock Manager Services
Programmer Terminology
1-37
Programmer Terminology
Inside the code, comments often refer to the programmer's point of view.
Client and User are used interchangeably, and refer to the calling code.
Client code can register interest in a service by giving a pointer to a data structure that is to
be updated or a routine that is to be called, when the service has completed the required
action.
History
1-38
History
Oracle Parallel Server (OPS) historically had a bad reputation; it was not scalable. Most
applications ran slower on an OPS system than on a single instance. There was a need to
carefully determine which instance performed DML on which tables or (more accurately)
on which blocks. With RAC this need has been eliminated, resulting in true scalability.
Although RAC borrows much code from OPS, the official policy is not to mention that
RAC is an evolved version of OPS. Oracle does not want the bad reputation of OPS to
adversely affect the reputation of RAC in the market. Internally (in the code), the OPS
heritage in RAC is evident.
History Overview
1-39
History Overview
Some components have undergone changes in scope and name. The system that ensures
that access to a block is coherent is the Global Cache Manager in Oracle9i. In Oracle8i and
Oracle8, this was the Integrated Distributed Lock Manager. Earlier it was an external
operating system-supplied service that the Oracle processes called. The Cluster Group
Service of Oracle9i and Oracle8i was the Group Membership Services module in Oracle8
and (before that) part of the external Distributed Lock Manager.
Although there have been many changes to the architecture in the instance, the database
structure has changed only marginally. Separate redo threads and undo spaces are still
used.
Internalizing Components
[Slide 1-40: diagram — Oracle7: the RDBMS sits over a DLM API with simulated callbacks and enqueue translation, with no local state in the instance; the DLM, CM, and operating system are external. Oracle8: the RDBMS sits over the IDLM, with callbacks and enqueues and local state in SGA memory; only the CM and operating system remain external.]
Internalizing Components
The development of RAC has internalized more operating system components for each
version. As an example, the diagram on the slide shows the internalization of the
Distributed Lock Manager (DLM) in the development of Oracle7 to Oracle8. Instead of
calling the external operating system whenever any lock status needed checking by the
DLM API module, the IDLM module in the Oracle server only needs to examine its SGA.
The RDBMS routines did not in principle need to reflect the change.
The earlier versions had the DLM external, which limited the functionality (lowest
common denominator effect) that the Oracle server could rely on, and the need to pass
data to external services. Data transfer used pipes or network communication to the
external processes; control for I/O completion used Asynchronous Trap (AST)
mechanisms, polling mechanisms, or blocked waits. Internal communication inside the
Oracle server, even between the various background processes, can use the common
SGA memory area that includes latches and enqueues.
This is merely illustrative and is not an accurate summary of the changes made.
The Oracle8 to Oracle9i development similarly internalized the GMS interface (that is, the
Node Monitor (NM) functionality), relying on only the Cluster Manager (CM) interface
routines.
Oracle7
1-41
Oracle7
OPS in Oracle7 consisted of the database structural changes for cluster operation (as in all
versions) and the addition of the LCK process that communicated with the external DLM.
The instances not only coordinated global cache coherency through the DLM but also used
the DLM as the communication channel for registering into the OPS cluster.
The method for sending the SCN or other messages was platform specific.
External DLM
The external DLM usage had the following characteristics:
It had to be running before any instance started.
Resources and locks had to be adequately configured.
Death of the DLM on a node implied death of all its clients on the node.
OPS/DLM diagnostics had to have port-specific lock dumps.
Internode parallel query code had to be port specific.
Oracle8
1-42
Oracle8
The internal DLM meant that resource allocation was inside the Oracle server. Diagnostic
lock dumps no longer needed to be port specific. The Oracle server, version 8 (and later),
started communicating with the cluster services of the operating system. The interface
consisted of the GMS that was an Oracle-specified API. The GMS functionality included:
Supplying each instance with the current set of registered members, clusterwide
Notifying other members when a member joins or leaves
Automatically deregistering dead processes/instances from their groups
Interfacing with the node monitor for cluster events
Oracle8i
1-43
Oracle8i
The Cache Fusion Stage 1 satisfied some types of block requests across the cluster
communication paths (rather than via disk) and made use of the messaging services.
The Oracle8 GMS was split into OSD and Oracle kernel components. Node monitor
OSD skgxn is extended from monitoring a single client per node to arbitrarily named
process groups. The rest of the GMS functionality is moved into Oracle as CGS. A
distributed name service is added to CGS.
LMON executes most of the CGS functionality:
Joins the skgxn process group representing the instances of the specified group
Connects to other members and performs synchronization to ensure that all of them
have the same view of group membership
Oracle9i
1-44
Oracle9i
The remainder of this course is based on Oracle9i.
Summary
1-45
Objectives
2-47
[Slides 2-48 to 2-50: diagrams — a cluster of nodes, each running an instance (SGA, processes) over a shared cluster disk/file system; within a node, the instance's background processes DBW0, PMON, DIAG, LMS, LMD, LCK, and LMON sit above the CM; the kernel modules involved are kql/kqr (row cache), kqlm, ksi, kcl, GCS (kjb), GES (kju), CGS (kjxg), NM (skgxn.v2), and IPC (interprocess communication)]
[Slides 2-52 to 2-56: diagrams — the Oracle kernel layer stack (OCI, UPI, OPI, KK, KX, K2, NPI, KZ, KQ, RPI, KA, KD, KT, KC, KS, KJ, KG, S); the DLM (GRD) components GCS, GES, CGS/IMR, DRM/FR, IPC (KSXP/SKGXP), and NM (SKGXN); and the client stack — PQ, kcl, ksq, and ksi clients over the DLM and CGS, over KSXP/SKGXP]
Platform-Specific RAC
[Slide 2-57: diagram — higher layers (SQL, transaction, data), cache layer KC*, service layer KS*, and operating system routines]
Platform-Specific RAC
Many RAC problems are platform specific. The Operating System Dependency (OSD)
layer therefore must be examined for the platform concerned. The subdirectory is called
sosd or osds.
This cannot be examined in TAO with cscope; you need the vobs access.
OSD code is partially available at
/export/home/ssupport/920/rdbms/src/server/osds.
SKGXP
[Slide 2-58: diagram — the SKGXP module has three alternative implementations (UDP, TCP, and HMP) over the OS routines]
skgxp.h: generic interface
skgxp.c: reference implementation
sskgxpu.c: UDP implementation, port-specific
sskgxph.c: HMP implementation, port-specific (HP-UX)
Summary
2-60
References
WebIV
Check folder Server.HA.RAC
2-61
Cluster Layer
Cluster Monitor
Objectives
3-63
[Slide 3-64: diagram — caches (ksi/ksq/kcl) over the GRD, CGS, and NM, with IPC to other nodes (not shown), over the CM]
Generic CM Functionality:
Distributed Architecture
3-65
Generic CM Functionality:
Cluster State
3-66
State change
Cluster Incarnation Number
Cluster Membership List
IDLM Membership List
Generic CM Functionality:
Node Failure Detection
3-67
[Slide 3-68: diagram — within a node, the instance's NM layer sits above the CM]
Oracle-Supplied CM
3-69
Oracle-Supplied CM
The Oracle-supplied CM is covered in the Linux platform lesson later in this course.
In the Oracle-supplied CM the integration to the RAC cluster is somewhat closer, blurring
the distinction.
Summary
3-70
Objectives
4-73
[Slide 4-74: diagram — caches (ksi/ksq/kcl) over the GRD, CGS/GMS, and NM, with IPC to other nodes (not shown), over the CM]
4-75
4-76
Group Membership
A process can register with a group on behalf of an instance that includes multiple
processes. It is important that, when the member deregisters from the group, the other
instance processes do not access the shared cluster resources (such as shared disk) after the
remaining group members have been informed of the deregistration. Otherwise, the
deregistered instance may overwrite changes that are made by the surviving instances.
To protect against this situation, the processes of an instance can share the membership of
the process that is registered with the group. These processes register as slave members,
specifying the member ID of the member that registered as a normal (primary) member.
The deregistration of the primary member must not be propagated to the group's other
primary members until all the associated slave members have also deregistered.
NM Groups
4-77
NM Groups
On registration, a process provides:
Private member data that can be retrieved only by other members and that consists of
IPC port ID and other bootstrap information
Public member data that can be retrieved by any skgxn client and that consists of
node name and other information for administrative tools to use
Primary members should ensure that all slaves are terminated on deregistering from the
group. Failure to do so is a bug or malfunction. LMON is the primary member of an
instance.
Slave members are all I/O-capable clients.
NM Internals
4-78
NM Internals
Source Notes
The basic concept is Process Groups. This implementation relies on the UNIX Distributed
Lock Manager (UDLM) architecture, the same as the first version of the DLM that was
external.
skgxnpstat receives group membership changes. The client must call it to get group
changes (passive). The interface itself does not call back; the caller must check the state bit
in the context.
The process must call skgxnpstat to receive any state event changes. An example is
skgxpwait in IPC to receive an event such as I/O completion.
These routines are normally part of a daemon loop (LMON).
Node Membership
4-79
Node Membership
The bitmap is stored globally in the cluster by using the UDLM as a global repository to
store global information, and uses the global notification mechanism of the UDLM. The
global repository stores the bitmap.
The UDLM reserves a storage space for each resource in a Resource Value Block (RVB).
That space is limited to 16 bytes. Multiple resources and RVBs can be used for large
clusters. These are stored in persistent resources. Persistent resources survive crashes and
are recoverable. They are stored in the UDLM space (struct kjurvb; see kjuser.h
for more information).
Cluster Layer - CM
[Slide 4-81: diagram — the LMON/NM processes of instance 2 through instance n, with LMD0, exchange membership information (steps 1 through 3)]
NM Membership Death
4-83
NM Membership Death
Given a bitmap composed of eight nodes, all of which are up, skgxnpstatus is called.
This call also calls skgxn_neighbor to determine the right-side neighbor and
skgxn_test_member_alive to determine its status rather than scanning the entire
bitmap. This avoids all nodes calling skgxnpstatus to read the entire bitmap. This is a
protected read. When invalidating the bitmap, it is a lock:write.
Note: Reconfiguration may not happen simultaneously in all nodes. This is why the CGS
layer above must do the synchronization.
[Slides 4-84 to 4-89: diagrams — in instances A and B, LMON registers with the CM (step 2) and the CM notifies LMON of events (step 3), with LMD0 alongside; the CGS/GMS and NM layers of the two instances communicate via the CM]
Configuration Control
4-90
Configuration Control
In the CGS, the most important data value is the incarnation value and synchronization.
Valid Members
4-91
Valid Members
The CGS checks whether members in the database group are valid. It ensures that all
members are operating on the same configuration.
All members vote, detailing which incarnation they are voting on and a bitmap of
membership as they perceive it to be.
The member that tallies the votes waits for all members of the last incarnation to register
that they have received the reconfiguration.
Instance Membership Reconfiguration (IMR)
This is a component part of the CGS layer. Source is in kjxgr.h and kjxgr.c.
Membership Validation
[Slide 4-94: diagram — the LMON and CKPT processes of instances A, B, and C; each CKPT heartbeats to the control file, and the CGS (IMR) layers of the LMON processes exchange messages]
Membership Validation
The CKPT process updates the control file every three seconds, an operation known as the
heartbeat. CKPT writes into a single block that is unique for each instance; thus no
coordination between instances is required. This block or record is called the checkpoint
progress record and is handled specially. The CREATE DATABASE MAXINSTANCES
parameter controls the number of these block records. The heartbeat also occurs in single
instance mode.
LMON sends messages to the other LMON processes. If the send fails or no message is
received within the timeout, then reconfiguration is triggered. The LMON message send
failure detection is controlled by _cgs_send_timeout. The default value is 300
seconds.
Control file update failure is controlled by _controlfile_enqueue_timeout. The
default value is 900 seconds.
Reducing these values could cause false failure detection under heavy load. Using values
that are too large could cause hang-like conditions, where a bad instance member remains
undetected.
Note: Although the description is of a process doing a particular job, the code is part of the
CGS layer.
Membership Invalidation
4-95
Membership Invalidation
IMR-initiated eviction of a member is not performed if a group membership change occurs
before the eviction can be executed.
Deciding the Membership
All members attempt to obtain a lock on a control file record (the Result Record) for
updating. The instance that obtains the lock tallies the votes from all members.
The group membership must conform to the decided membership before allowing the
GCS/GES reconfiguration to proceed; a skgxn reconfiguration with the correct
membership must be observed.
Vendor Clusterware
Vendor clusterware may also perform node evictions in the event of a cluster split-brain.
IMR detects a possible split-brain and waits for the vendor clusterware to resolve the split-brain. If the vendor clusterware does not resolve the split-brain within
_IMR_SPLITBRAIN_RES_WAIT (default value of 600 milliseconds), then the IMR
proceeds with evictions.
Communications error
Initiated by IMR
Caused by communications error to either LMON or
GES/GCS
4-97
4-98
Reconfiguration Steps
Step 1:
a. Complete pending broadcast with RCFG status.
b. Freeze name service activity.
c. Freeze the lock database
Step 2:
a. Determine valid membership, Instance Membership
Recovery.
b. Synchronize incarnation.
c. Increment incarnation number.
Step 3:
Verify instance name uniqueness.
4-99
Reconfiguration Steps
LMON trace file excerpt
*** 2002-08-23 17:26:01.262
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 1 0.
*** 2002-08-23 17:26:01.266
Name Service frozen
kjxgmcs: Setting state to 1 1.
*** 2002-08-23 17:26:01.367
Obtained RR update lock for sequence 1, RR seq 1
*** 2002-08-23 17:26:01.370
Voting results, upd 0, seq 2, bitmap: 0 1
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 2 2.
Performed the unique instance identification check
kjxgmps: proposing substate 3
Reconfiguration Steps
Step 4:
Delete nonlocal name service entries.
Step 5:
a. Republish local name entries.
b. Resubmit pending requests.
Step 6:
a. Publish the LMD processes' IPC port IDs in the name
service.
b. Unfreeze name service.
Step 7:
Return reconfiguration RCFG event to GES/GCS.
4-100
[Slide 4-101: diagram — the LMON and CKPT processes of instances A, B, and C; each CKPT writes to the control file, and the CGS (IMR) layers exchange messages]
The CFVRR is stored in the same block as the heartbeat in the control file checkpoint
progress record (see kjxgr.c/h).
Alert log in Instance C
Errors in file
/export/oracle/app/admin/rac/bdump/rac2_lmon_10911.trc:
Instance C is evicted. Its bit does not show up in the other members' list of valid
members, so it must leave the cluster.
ORA-29740: evicted by member 0, group incarnation 3
LMON: terminating instance due to error 29740
The instance that obtained the RR lock tallies the vote result from all nodes and
updates the CFVRR.
*** 2002-08-19 15:26:46.592
Obtained RR update lock for sequence 2, RR seq 2
*** 2002-08-19 15:27:29.198
kjxgfipccb: msg 0x80000001002babe8, mbo
0x80000001002babe0, type 22, ack 0, ref 0, stat 3
kjxgfipccb: Send timed out, stat 3 inst 1, type 22, tkt
(32144,0)
:
:
*** 2002-08-19 15:28:27.526
kjxgrrecp2: Waiting for split-brain resolution, upd 0,
seq 3
*** 2002-08-19 15:28:28.127
Voting results, upd 0, seq 3, bitmap: 0
Code References
4-103
Summary
4-104
Objectives
5-107
[Slide 5-108: diagram — caches (ksi/ksq/kcl) over the GRD, CGS, and NM, with IPC to other nodes (not shown), over the CM]
5-108
[Slide 5-109: diagram — message flow (steps 1 and 4) among instance R, instance H, and instance M]
Asynchronous Traps
5-110
Asynchronous Traps
When a process requests a lock on a resource, the GES sends a blocking AST to
notify the processes that currently own locks on that resource in incompatible modes.
Upon notification, owners of the locks can relinquish them to permit access to the
requestor.
When a lock is obtained, an acquisition AST is sent to tell the requester that it now owns
the lock.
To determine whether a blocking AST has been sent by a requestor or whether an
acquisition AST has been sent by the blocker (or owner of the lock), query the fixed
view GV$LOCK_ELEMENT or X$LE and check which bits are set. Examples for
incompatible modes are shared and exclusive modes.
An acquisition AST acts like a wakeup call.
5-111
Message Buffers
5-112
Message Buffers
Any sender or receiver allocates a message structure (or message buffer) before sending
or receiving a message.
KJCCMSG_T_BATCH is mostly used in reconfiguration or in remastering, or after
delivering a buffer in cache fusion.
There are three pools of messages:
REGULAR: With initial #buffers = processes*2 + 2*10 + 10 + 20
BATCH: With initial #buffer = processes*2 + 2*10 + 10 + 20
RESERVE: With initial #buffer = min(2*processes, 1000)
If the REGULAR pool is exhausted, then more allocations are done from the shared
pool.
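The initial pool sizes reduce to simple arithmetic. The sketch below applies the formulas quoted above (illustrative only; the 150-process figure is hypothetical):

```python
def message_pool_sizes(processes):
    """Initial message-buffer counts for the three pools, using the
    formulas quoted in the text."""
    regular = processes * 2 + 2 * 10 + 10 + 20
    batch = processes * 2 + 2 * 10 + 10 + 20
    reserve = min(2 * processes, 1000)   # RESERVE is capped at 1000
    return {"REGULAR": regular, "BATCH": batch, "RESERVE": reserve}

print(message_pool_sizes(150))
# {'REGULAR': 350, 'BATCH': 350, 'RESERVE': 300}
```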
[Slide 5-113: diagram — message buffer queues: MsgPool, FreeMsgQueue, OutstandingQueue, PendingSendQueue, and SendQueue, with direct and indirect send paths and send-done and release transitions]
Messaging Deadlocks
5-114
Messaging Deadlocks
Messaging can cause deadlocks to appear. If you are waiting to send a message to
acquire a lock and there is another process waiting on the lock that you hold, then you
will not be checking on BASTs and so will not see that you are blocking someone. If
many writers are trying to send messages and no one is reading messages to free up
message buffer space, there can be a deadlock.
Like the interface, messaging protocol is port specific. The message is typically less
than 128 bytes, so the interconnect must be low latency. In addition, the number of
messages can be high. It typically depends on the number of locks or resources.
Basically the more locks or resources, the higher the traffic. In Oracle8, the number of
message buffers depended on the number of resources; in Oracle7, the number depended
on the number of locks.
5-115
TRFC Tickets
5-116
TRFC Tickets
You use flow control to ensure that the remote receivers (LMD or LMS) have just the
right number of messages to process. New requests from senders wait outside after
releasing the send latch, in case receivers run out of network buffer space. Tickets are
used to determine the network buffer space available.
Clients that want to send first get the required number of tickets from the ticket pool and
then send. The used tickets are released back to the pool by the receivers (LMS or LMD)
according to the remote receiver report of how many messages the remote receiver has
seen. Message sequence numbers of sending nodes and remote nodes are attached to
every message that is sent.
The maximum number of available tickets is a function of the network send buffer size. If no ticket is available, the sender must buffer the message, and LMD or LMS sends it when a ticket becomes available. A node relies on messages coming back from the remote node to release tickets for reuse. In most cases this works, because most client requests eventually result in an ACK or ND.
TRFC Flow
Node 1
Node 2
Tickets are sent back
to requestor side by
attaching the number
of ACK tickets in the
message header.
Msg.
Msg.
Queued messages
waiting for tickets
LMD
LMS
No more
tickets
LMD
LMS
Msg.
sender
Tickets available
Msg.
Tickets depleted,
NULL_REQ message
5-118
TRFC Flow
At the beginning, the number of available tickets is 500. One sent message consumes
one ticket. Each node maintains several counters for each communication partner.
AvailBuf: Number of buffers that are available to receive new messages (buffers attributed to the KSXP interface)
RecMsg: Number of messages received whose type is different from TEST, NULL-REQ, and NULL-ACK
AvailMsg: Number of messages received (all types)
The pseudocode is:
if AvailBuf >= AvailMsg (there are sufficient buffers)
then AckTickets = AvailMsg
else if RecMsg == AvailMsg (no NULL-REQUEST yet)
then AckTickets = AvailBuf
else if AvailMsg - RecMsg > AvailBuf (too many NULL-REQUESTs)
then AckTickets = 0
else AckTickets = AvailBuf - (AvailMsg - RecMsg)
5-120
IPC
5-123
IPC
Because IPC was more synchronous in the releases before Oracle9i, the OPS systems
were more prone to hanging in this component. IPQ used its own interface (SKGXF).
IPQ client
Cache client
DLM client
CGS client
KSXP
SKGXP
5-124
IPC Code
The SKGXP module is the OSD module. The source that is available on tao includes
the reference implementation. This has extensive comments in skgxp.h.
Reference Implementation
For internal QA
Simple code for easy portability
Interface example
Uses standard protocols for communication
TCP/IP
UDP
5-125
Reference Implementation
There are several reference implementations because there are several standard
protocols that can be used. These are available for the various ports.
Hardware vendors use the reference implementation as a starting point and replace the
protocol with their own optimized high-speed interconnect software by using their
hardware. This makes it very platform dependent.
[Diagram: the ksl wait facility (kslwat/ksldwat) dispatches to a port-specific wait routine per facility: Default: skgpwait; IO: skgfrwat (odm_io); Net: ksnwait/nsevwait; IPC: ksxpwait/skgxpwait.]
5-126
KSXP Tracing
Event 10401
Bit flags
5-127
KSXP Tracing
For more information on event 10401, refer to ksxp.c. KST tracing is covered in a later
module.
522683FB:000182BD
6
5 10401 39
KSXPQRCVB: ctx 2ec5a84 client 2 krqh 301c1bc srqh 301c218
buffer 2faca80
5-128
SKGXP Interface
5-129
SKGXP Interface
The Port Connection is for asynchronous use: the client code submits a number of
requests to the interface and attempts to overlap the completion of these requests with
useful computation. This overlap of communication with computation hides the
latency costs of remote communication.
Ports represent communication endpoints. Connections are used to cache information
regarding communication endpoints. Request handlers represent outstanding requests to
the interface (primarily outstanding message receives and sends).
Synchronization is provided by skgxpwait. Synchronization is integrated with the
standard VOS layer post/wait mechanism, allowing Oracle processes to block while
waiting for outstanding network IPC or for a post from another process in the local instance.
The buffer cache uses the memory-mapped interface for cache fusion and parallel query
clients.
Regions are large areas of memory (such as the SGA). Clients that want to receive data
into their region prepare buffers in the region to receive data via the prepare call. The
output of the prepare call is a buffer ID or BID. BIDs are copy-by-value structures that
are transferred to remote instances via the lock manager. The BIDs are then used to
transfer data directly to the prepared buffer of the requesting process in the remote
instance.
DSI408: Real Application Clusters Internals I-129
5-130
SKGXP Tracing
Event 10402
Bit flags in level
5-131
KSXP_OSDTR_ERROR  0x01
KSXP_OSDTR_META   0x02
KSXP_OSDTR_SEND   0x04
KSXP_OSDTR_RCV    0x08
KSXP_OSDTR_WAIT   0x10
KSXP_OSDTR_MCPY   0x20
KSXP_OSDTR_MUP    0x40
SKGXP Tracing
The levels for the event have changed considerably in Oracle9i Release 2. Examine
source skgxp.h, ksxp.c for details in older versions. In Oracle9i Release 1 (and
earlier), it was:
0x00040000  trace meta functions
0x00080000  trace send
0x00100000  trace receive
0x00200000  trace wait
0x00400000  trace cancel
0x00800000  trace post
0x02000000  trace unusual or error conditions
0x04000000  trace remote memory copies
0x08000000  trace buffer update notifications
5-132
5-133
Code References
5-134
Summary
5-135
Objectives
6-137
[Diagram: SCN coordination between this instance and the other nodes (not shown), with the CM.]
6-138
6-139
Basics of SCN
SCN Wrap
6-140
SCN Base
Basics of SCN
Much can be said about the SCN and the nature of causality.
The essentials are:
The SCN must always increase and may skip a number of values.
The SCN must be kept in sync between multiple instances.
- In RAC: Between all instances mounting the database
- In distributed databases: All instances that are involved in a distributed
transaction (that is, when using database links)
- Synchronizing means using the highest known SCN. Otherwise it conflicts with
the requirement to increase.
Dependencies (causality) between changes must be maintained (for example, in
multiple changes to the same block by different transactions).
For more information, refer to Note 33015.1.
There is some distinction between the Current SCN that is used for a commit and the
Snapshot SCN that is used for a Consistent Read (CR) operation. The Snapshot SCN is the
highest SCN seen or used by the instance.
SCN Latching
CAS Primitive
None
32-bit CAS primitives
64-bit CAS primitives
6-142
Latch-Free Access
Reads
Reads and writes
Reads and writes
SCN Latching
If the operation to update or increment the SCN cannot be performed as an atomic or
single CPU instruction, you must latch or lock the SCN data structure so that the other
processes do not see an invalid SCN.
Latchless CAS operations are controlled by the following initialization parameters:
_disable_latch_free_SCN_writes_via_32cas
The default is False (that is, enabled by default).
_disable_latch_free_SCN_writes_via_64cas
The default is True (that is, disabled by default, even if it is supported on the platform).
Lamport Implementation
6-143
Lamport Implementation
Earlier, Oracle OPS offered a choice of SCN propagation schemes, some of them using platform-specific hardware protocols. The Lamport scheme was the reference implementation.
Lamport SCN
6-144
Lamport SCN
The Lamport SCN propagation assumes that there is a constant exchange of messages. If
an instance does many commits on blocks where it has cached all data, the SCN will not
change at the other nodes, as there are no messages sent. This is solved with a periodic
SCN update.
The SC global resource or lock is used to communicate the SCN for the periodic update.
Its value field contains the current SCN, and the instance holding the exclusive lock can
update the field. You can think of the SC lock as a dummy lock that is used if the SCN
has not been propagated recently through other lock or message activity.
For more information, refer to kjm.c.
Source References
The message sending routines in kjc.c will insert the current SCN into every message at
scn_kjctmsg. Messages that are received by LMD (9.0) or LMS (9.2) compare and
update the local SCN if the local SCN is lower.
The SCN is shown in message dump/traces.
6-145
[Diagram: Lamport SCN propagation between two instances. On Instance 1, Tx1 starts at SCN 701 and commits at 702; Tx2 commits at 707 and Tx8 at 708. SCN sync messages carry 702 and then 707 to Instance 2, where Tx3 and Tx7 run, keeping the SCNs of both instances in step.]
Copyright 2003, Oracle. All rights reserved.
max_commit_propagation_delay
6-146
max_commit_propagation_delay
With Lamport SCN, every instance maintains locally generated SCNs. When it generates a
new SCN, an instance need not synchronize it for up to the
max_commit_propagation_delay amount of time. Instances can increase their
locally generated SCN based on global SCNs.
max_commit_propagation_delay < 1 second
Each time LGWR writes to the redo log (that is, with every commit):
- LGWR sends a message to the SCN resource (SC, 0, 0) master to update SCN.
- LGWR sends a message to every active instance to update SCN.
1 second < max_commit_propagation_delay < 7 seconds
Each time LGWR writes to the redo log, it also sends a message to the SCN
Resource Master to update the SCN.
If a Snapshot SCN is required by an instance and more than the
max_commit_propagation_delay time has elapsed since the last
synchronization event, then the process sends a message to the SCN resource master
to update the SCN.
7 seconds < max_commit_propagation_delay
Every three seconds, the LCK process sends a message to the SCN resource master
to update the SCN.
DSI408: Real Application Clusters Internals I-146
[Diagram: on Instance A, a foreground process sends a message whose header carries the SCN in scn_kjctmsg; the receiving LMS compares clk_val_kjxreqh with the local SCN and raises the local SCN if it is lower.]
Periodic Synchronization
[Diagram: periodic synchronization between Node 1 and Node 2. LCK0 sends a message through LMD0, and the reply (2: Simple ACK) includes the SCN.]
6-148
Periodic Synchronization
The LCK0 timeout event, kcsmto, checks whether it is time for an SCN update.
6-149
Code References
6-150
Summary
6-151
Objectives
7-153
[Diagram: architecture - instance caches, ksi/ksq/kcl, GRD/GCS/GES, CGS, NM, IPC, and CM; other nodes not shown.]
7-154
DLM History
7-155
DLM History
The Oracle DLM grew out of development performed primarily on the SP2 and HP
DLMs for Oracle7, which were used where the vendors did not provide a DLM.
In Oracle 7, version 3, Digital, Sequent, NCR, and Pyramid used their own DLMs. They
were all different, as were the debugging tools and the output. The particular
functionality that was supported in each case also varied, which made it difficult for
Oracle to implement certain functions on some platforms at certain releases. Group-based locking is an example.
In Oracle7 DLMs, pipes facilitated the communication between the DLM daemons and
the client processes. In Oracle8, clients of the DLM have direct access to the DLM
structures in the SGA. This permits optimization of the communication path by allowing
clients to modify the structures directly and by waiting only on an LMD process to send
messages to remote nodes where remote operations must be performed. Therefore, local
lock operations can be considerably faster.
The DLM has been continuously improved with more views, better deadlock detection,
and changed message paths to eliminate needless context switches. The Cache Fusion
improvements are more of a change in how the client buffer handling routines use the
DLM.
DSI408: Real Application Clusters Internals I-155
7-156
7-157
Resources
A resource is just a name. Each resource can have a list of locks that are currently
granted to users. This list is called the Grant Q. Similarly there is a Convert Q, which is
a queue of locks that are waiting to be converted. In addition, a resource has a 16-byte
lock value block (LVB) that contains a small amount of data. The LVB is used in some
resources. For example, the PS resource for parallel query slaves uses it to pass the
kxfpqd structure to the other nodes.
The two resource types have different data structures.
Grant Q
Convert Q
Lock value block
7-158
lockp
PID
GID/DID
Locks
If the lock-before-use rule has not been followed by the Oracle programmer, then that
is a bug. It may not show up as system or data corruption for some time.
The DLM lock modes and the Oracle locking modes are not identical. The locking
matrix for the DLM is covered in later slides. The lock matrix depends on the type of
lock.
Locks are placed on a resource. When a process has a lock on the grant queue of the
resource, it is said to own the resource. Imprecise usage also talks of owning the
lock.
The example in the slide shows a lock on the Grant Q of the resource. The lock may be
either process- or group-owned. If it is process-owned, the PID field shows which
process holds the lock. In the case of group-owned locks, the GID field has a group
number, and the DID field has the Transaction ID (TxID) of the client transaction.
7-159
Grant Q
lockp
Convert Q
PID
Procp
GID/DID
PID
[0x10000f8][0x1],[BL]
7-160
[0x10000f8][0x1],[BL]
Grant Q (local)
Convert Q (local)
Shadow node
Persistent Resources
The shadow resource exists on any other node that has an interest in a resource, that is,
any node on which a lock is open against that resource.
A persistent resource is kept, in a dubious state, in the DLM after the closure of all
locks on it, when a process exited abnormally while holding a lock in PW or EX mode.
Recovery Domain (rdomain)
A recovery domain is the mechanism by which persistent resources can be recovered.
Each persistent resource is linked to a recovery domain. There is one such domain per
database.
[Diagram: resource [0x10000f8][0x1],[BL] on the master node with its Grant Q and Convert Q; the shadow node keeps a copy of the resource, with copy locks standing in for the locks on the owner node.]
7-162
7-163
DLM Structures
GCS:
Resource table: kjbr
Lock table: kjbl
7-164
DLM Structures
The separation of GES and GCS resource handling is new to Oracle9i. The earlier
versions had more common structures and code paths.
There are differences in these structures between versions 9.0.1 and 9.2
kjr (partial)
kjurvb   valblk_kjr;           /* the value of the lock */
kjurn    resname_kjr;          /* the resource name */
kjsolk   grant_q_kjr;          /* list of granted resources */
kjsolk   convert_q_kjr;        /* list of resources being converted */
kjsolk   req_q_kjr;            /* list of open reqs when master_node unknown */
kjsolk   scan_q_kjr;           /* for the DLMD to perform move_scan_cvt etc. */
ub2      grant_count_kjr[6];   /* count of # of locks at each level */
ub1      granted_bits_kjr;
ub1      entry_kjr;            /* dir, master, local */
kjuvlst  valstate_kjr;         /* state of valblk */
ub2      master_node_kjr;      /* ID of the node mastering the resource */
kjsolk   hash_q_kjr;           /* hash list : hp */
kjsolkl  *hp_kjr;
ub1      options_kjr;          /* same as open option */
ub1      remaster_kjr;
kjulevel next_cvt_kjr;         /* global next cvt. mode */

kjbr (partial; the field names were lost in extraction, only the comments survive):
/* tab latch */
/* parent of group locks */
/* number of lock freelist */
/* FreeList latch */
/* 68 bytes on sun4u */
/* hash list : hp */
/* the resource name */
/* lmd scan q of grantable resources */
/* list of granted resources */
/* list of resources being converted */
/* scn(base) known to be on disk */
/* scn(wrap) known to be on disk */
/* scn(wrap) requested for write */
/* scn(base) requested for write */
/* lock elected to send block */
/* version# of above lock */
/* version# of lock below */
/* lock elected to write block */
/* 'n', 's', 'x' && one of 'l' or 'g' */
/* ignorewip, free etc. */
/* refuse ping counter */
/* resource operation history */
/* split transaction ID */
7-167
[Diagram: lock state changes. A new lock request lands on the GRANT QUEUE if compatible. A compatible conversion is done in place on the grant queue; an incompatible conversion moves the lock to the CONVERT QUEUE, and when the conversion is granted the lock returns to the GRANT QUEUE.]
Lock Changes
Locks are placed on the resource grant or convert queue. If the lock mode changes, then
it is moved between the queues.
If several locks exist on the grant queue, then they must be compatible. Locks of the
same mode are not necessarily compatible with another of the same mode. The
compatibility matrix of the various locks differs between GES and GCS locks.
Compatible in-place conversions are typically downgrades, converting to a lesser mode.
Some exceptions exist and are covered later.
A lock can leave the convert queue under any of the following conditions:
Process requests the lock termination (that is, removes the lock).
Process cancels the conversion; the lock is moved back to the grant queue in its
previous mode.
The requested mode is compatible with the most restrictive lock in the grant queue
and with all the previous modes of the convert queue, and the lock is in the head of
the convert queue. Convert requests are processed first in, first out (FIFO).
7-168
[Diagrams: worked examples of locks moving between the grant queue and the convert queue as modes change. For example, A:CR, B:CR, and C:CR coexist on the grant queue; C requesting an incompatible CR-to-EX conversion moves to the convert queue while A drops to NL; in a second sequence C:CW and a B CR-to-PR conversion wait behind A's CR-to-EX conversion. Legend: NL = Null, CR = Concurrent Read, CW = Concurrent Write, PR = Protected Read, EX = Exclusive Write.]
DLM Functions
7-170
DLM Functions
Interprocess communication is critical to the DLM because it is distributed. Being
distributed permits the DLM to share the load of mastering (administering) resources.
The result of this is that you may lock a resource on one node but actually have to
communicate with the LMD processes on another node entirely. Fault tolerance requires
that no vital information about locked resources is lost irrespective of how many DLM
instances fail.
The durability of the database (that is, being able to recover blocks that are lost in an
aborted instance's buffer cache) is not a DLM function, but global cache handling of
blocks still uses the same log-before-write rule to ensure durability.
DLM Functionality in
Global Enqueue Service Daemon (LMD0)
7-171
DLM Functionality in
Global Enqueue Service Monitor (LMON)
7-173
DLM Functionality in
Global Cache Service Process (LMS)
7-174
DLM Functionality in
Other Processes
DIAG process:
Provides low-overhead in-memory tracing and
logging
Manages and maintains the diagnosability across
multiple instances
Helps execute ORADEBUG on all nodes of the RAC
cluster
All processes:
Process PING for BUFFER-CACHE
Process the deferred queue and the CR log-flush queue
Adjust local SCN (Lamport) when receiving DLM
messages
7-175
7-176
7-177
7-178
7-179
7-180
hash_node_kjga
maps logical to
physical node.
hash_node_kjga[0]
always contains one
live node.
This array is updated
in a three-step
reconfiguration.
[Diagrams: hash_node_kjga maps logical node slots to physical nodes, skipping dead nodes (for example, N1 through N5 with N2 dead). res_hashed_val_kjga maps the hash value of a resource name to a slot in pcm_hv_kjga, which names the mastering node. Resource hash values are divided among the live nodes by weight (for example, weight 6331 each for node 0 and node 1, 12662 in total). During reconfiguration: 1: each node sends its hash_node_kjga[0]; 3: the master node ID is sent; 4: the master sends the rebuilt hash tables to all nodes.]
7-189
DLM Functions
7-190
DLM Functions
kjual is called when the Oracle shadow process is started.
kjpsod is called before the Oracle shadow process leaves.
The other functions are used to manage only non-PCM resources and locks.
OS process PID
Process node number
Process flags (such as DEAD, RMOT, LOCL)
List of process-created DLM locks
Queue of pending AST for the process
Various statistics on lock conversion activity
7-191
flg_kjp;                        /* process flag */
KJP_DEAD      0x0001            /* process is dead, pending cleanup */
KJP_LMON      0x0002            /* process is the DLM-MON */
KJP_DLMD      0x0004            /* process is DLMD */
KJP_RMOT      0x0008            /* remote process */
KJP_LOCL      0x0010            /* local process */
KJP_IOPENDING 0x0020            /* has i/o pending, don't remove */
KJP_IID       0x0040            /* 'important' process: death => inst termn */
KJP_DLMS      0x0080            /* process is LMS */
KJP_DIAG      0x0100            /* process is DIAG */
KJP_RMRDR     0x0200            /* p. is reading a PT/HV struct, critical sec */

lock_q_kjp;                     /* list of locks created by this process */
ast_q_kjp;                      /* ast queue */
pid_kjp;                        /* OS pid of process */
node_kjp;                       /* ID of the node the process belongs to */
orapnum_kjp;                    /* oracle process number */
*oraproc_kjp;                   /* oracle process structure address */
loc_lck_cvt_tm_kjp[KJST_CONVTYPE];  /* cumulative time of local converts */
loc_lck_cvt_ct_kjp[KJST_CONVTYPE];  /* cumulative number of local converts */
rem_lck_cvt_tm_kjp[KJST_CONVTYPE];  /* cumulative time of remote converts */
rem_lck_cvt_ct_kjp[KJST_CONVTYPE];  /* cumulative number of remote converts */
kjual Flow
[Diagram: 1: Allocate and initialize a process structure P1 for Pid-1 in the procs table; 2: Update ges_procs in v$resource_limit. LMD0, LMON, locks, and resources shown for context.]
7-193
kjpsod Flow
[Diagram: 1: Flag procp with KJP_DEAD; 2: Clear pending ASTs and put the structures on the freelist; 3: Update ges_procs in v$resource_limit.]
7-194
7-195
[Diagram: process P1 on node 1 and processes P2 and P3 on node 2 contend for an enqueue mastered on instance 2.]
Instance 2
RESOURCE_NAME
ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------ ------------ ---------- ----------- --------[0x6dfd][0x0],[TM]
0
1
1 KJUSERNL
GRANT_LEV REQUEST_ TX_ID0 TX_ID1
PID OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ------ ------------- ---------KJUSERPR KJUSERPR
0
0 13354
0
0
7-196
Instance 2
RESOURCE_NAME      ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------ ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]            0          1           1 KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1    PID OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ------ ------------ ----------
KJUSERPR  KJUSERPR      0      0  13354            0          0
7-197
Instance 2
RESOURCE_NAME      ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------ ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]            0          1           1 KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1    PID OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ------ ------------ ----------
KJUSERPR  KJUSERPR      0      0  13354            0          0
7-198
Instance 2
RESOURCE_NAME      ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------ ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]            0          1           1 KJUSERNL

GRANT_LEV REQUEST_ TX_ID0 TX_ID1    PID OPENDEADLOCK OWNER_NODE
--------- -------- ------ ------ ------ ------------ ----------
KJUSEREX  KJUSEREX      0      0  13354            0          0
7-199
Instance 2
RESOURCE_NAME      ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------ ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]            1          1           1 KJUSERNL

GRANT_LEV
---------
KJUSEREX
KJUSERNL
7-200
Instance 2
RESOURCE_NAME      ON_CONVERT_Q ON_GRANT_Q MASTER_NODE NEXT_CVT_
------------------ ------------ ---------- ----------- ---------
[0x6dfd][0x0],[TM]            1          1           1 KJUSERNL

GRANT_LEV
---------
KJUSERNL
KJUSERPR
7-201
ktaiam
ksqgtl
Get an enqueue
ksqcmi
ksipget
kjusuc
7-202
[Diagrams: opening a lock. Local open (instance 1): 1: Allocate the lock from the process's lock_q_kjp (the lockp, procp_kjl, and resp_kjl pointers are set); 2: Set KJL_OPENING; 3: Allocate the resource; 4: Compute the master. Remote master (instance 2): 5: Send KJX_OPEN_CONVERT_DIR_REQ to the master's LMD0, which allocates the lock and resource, converts the lock, and sends back a KJX_CONV_AST_IND; the requestor loops until the AST arrives and completes. When the master is local, the lock is placed on the resource with KJL_OPENING and KJL_CONVERTING set and the open completes locally.]
7-205
ktaidm
ksqrcl
Release an enqueue
ksqcmi
ksiprls
kjuscl
7-206
[Diagram: releasing a lock on instance 1. 1: Set KJL_CLOSING; 2: Remove the lock from the resource queues; 3: Free the lock structure; 4: Complete.]
7-207
ktagetg0
ksqcnv
Convert an enqueue
ksqcmi
ksipcon
kjuscv
7-208
[Diagrams: converting a lock. Locally: 1: Set KJL_CONVERTING; 2: Re-queue the lock onto the convert queue; 3: Put it on the deadlock queue if applicable. With a remote master: the requestor sends KJX_CONVERT_REQ to the master's LMD0 and loops until the KJX_CONV_AST_IND arrives when the conversion is granted, then completes. A blocked conversion causes the master to send a BAST (KJX_CONV_AST_IND) to the holding process, which sets KJL_CLOSING or downgrades and releases via KJX_CONVERT_REQ; the master then grants the waiting conversion and sends the AST so the requestor completes.]
7-213
Code References
7-214
Summary
7-215
7-216
Enqueues/Non-PCM
Objectives
8-219
[Diagram: architecture - instance caches, ksi/ksq, GRD (GES), CGS, NM, IPC, and CM; other nodes not shown. Example global enqueue types: CF Control Files, CI Cross-Instance Call, DM Mount Lock, LB Library Cache Lock, IR Instance Recovery.]
8-220
Enqueue Types
8-223
Enqueue Types
Refer to WebIV Note 1020008.6 for a lock decoding script. The standard supplied
CATBLOCK script creates the view DBA_LOCK and DBA_LOCK_INTERNAL. These DBA
views do not expand the RAC-only enqueues.
User-mode enqueues are created and used by applications; they are simple named
resources without any relation to server data structures.
Enqueue Structure
Owners
Waiters
Converters
Lock structures:
ksqlk
(showing modes)
S -> X
SX
8-224
Enqueue Structure
When access is required by a session, a lock structure ksqlk is obtained and a request is
made to gain access to the resource at a specific level (mode). The lock structure is placed
on one of the three linked lists (called the owner, waiter, and converter lists) that hang off
of the resource.
Examining Enqueues
8-225
Examining Enqueues
In V$LOCK, the mode held (LMODE) and request (REQUEST) columns determine if the
enqueue is an owner, waiter, or converter:
Held     Request  Enqueue is
Nonzero  Zero     Owner
Nonzero  Nonzero  Converter
Zero     Nonzero  Waiter
For V$ENQUEUE_STAT, the average time waited in milliseconds is
CUM_WAIT_TIME / TOTAL_WAIT#.
Convert
ksqcnl
Release
ksqrcl
ksq
Local
Enqueue processing
ksqcmi
ksipget
ksipcon
ksi
kjusuc
kjuscv
kju
Global
DLM
8-226
KSQ
KGL
KQR
Misc.
Clients
KQLM
KSI
KJU
8-227
Lock Modes
8-228
Local  Value  GCS   Granted (Owner)  Other Grants
NULL   0      NULL  No Access        Anything
CR     1      SS    Read             Read or Write
CW     2      SX    Read or Write    Read or Write
PR     3      S     Read             Read
PW     4      SSX   Read or Write    Read
EX     5      X     Read or Write    No Access
Lock Modes
These are the GES lock modes. The naming differences between the DLM and the kernel
lock mode names result from historical reasons.
For GCS locks, only the NULL, Share, and Exclusive locks are used.
Lock Compatibility
8-229
          NL:NL  CR:SS  CW:SX  PR:S  PW:SSX  EX:X
NL:NL     Yes    Yes    Yes    Yes   Yes     Yes
CR:SS     Yes    Yes    Yes    Yes   Yes     No
CW:SX     Yes    Yes    Yes    No    No      No
PR:S      Yes    Yes    No     Yes   No      No
PW:SSX    Yes    Yes    No     No    No      No
EX:X      Yes    No     No     No    No      No
Lock Compatibility
Compatible locks can exist on the grant queue at the same time. The locks on the request
queue are incompatible with the locks on the grant queue and are incompatible with other
locks on the convert queue.
Note that although a PR or S mode is more restrictive, it is not compatible with the lesser
mode CW. This prohibits simple downgrading of the lock mode from PR to CW.
A special case exists for the PR and CW combination. A PR lock on the convert queue can
be compatible with the most restrictive mode lock on the grant queue (for example,
another PR lock) and still not be compatible with a less restrictive lock (the CW lock) on
the grant queue.
The GCS lock modes are underlined.
Deadlock Detection:
The Classic Deadlock
Time   Process 1                          Process 2
 |     Locks resource R1 in mode X: OK
 |                                        Locks resource R2 in mode X: OK
 |     Requests resource R2 in mode X:
 |     Waits
 |                                        Requests resource R1 in mode X:
 v     Waits                              Waits: Deadlock
8-230
Deadlock Detection:
The Classic Deadlock
[Diagram: the classic deadlock as a wait-for graph across nodes N1 and N2. P1 holds L2 on R1m and waits with L4 on R2s; P2 holds L3 on R2m and waits with L5 on R1s. Legend: Nx = node x in a cluster, Px = process x on a node, Lx = lock x, Rym = resource y (master), Rys = resource y (shadow).]
8-231
Deadlock Detection:
A More General Example
[Diagram: a more general deadlock wait-for graph spanning four nodes N1 through N4, with locks L1 through L8 held by processes P1 through P3 on master and shadow copies of resources R1 through R4, plus a distributed resource R13. Legend as before: Px = process x, Lx = lock x, Rym = resource y (master), Rys = resource y (shadow).]
8-233
8-234
/users/t920r/admin/t920r/bdump/t920r_1_lmd0_24675.trc
Oracle9i Enterprise Edition Release 9.2.0.1.0 Production
With the Partitioning, Real Application Clusters, OL
JServer Release 9.2.0.1.0 - Production
Deadlock Flow
[Diagram: each node's LMD0 holds a lock on the DI-0-0 resource (normally NL; converted to EX to begin deadlock detection) and maintains a deadlock queue of locks L1, L2, L3.]
Deadlock Flow
When an enqueue lock enters the convert queue and can be part of a deadlock (that is, it
is of type TM, TX, or UL), the lock information is also put on the deadlock queue. At
this time a time-to-deadlock-detection value, time_to_dd (in seconds), is computed
for the lock as (number of active nodes) / 2 + _lm_dd_interval, and a timestamp of
now + time_to_dd is stored with it.
LMD0 checks the deadlock queue every five seconds and starts a deadlock search if the
deadlock queue is not empty and the lock at its head has been queued for more than
time_to_dd. Otherwise, LMD0 moves the lock at the head of the deadlock queue to
the tail and returns to normal activity.
If deadlock detection starts on node 1, then LMD0 converts its lock on DI,0,0 from
NULL to EXCLUSIVE; in the whole cluster, only one node is allowed to start DD at a time.
Deadlock Flow
[Diagrams: node 1's LMD0 converts its DI-0-0 lock from NL to EX and builds the deadlock graph (resource R1 blocked by locks X11, X12, X13; R132 by X1). The graph is shipped to node 2's LMD0 in a KJX_DEADLOCK_IND message, and node 2 extends the graph with its local lock information.]
8-241
8-244
Code References
8-245
Summary
8-246
Blocks/PCM Locks
Objectives
9-249
Other
nodes
Instance
Caches
kcb/kcl
GRD(GCS)
CGS
I
P
C
NM
CM
9-250
Holder
9-251
Requestor
9-252
9-253
Oracle8i CR Server
The holder of a data block, on receiving a consistent read (CR) request, uses the undo
data (the blocks of which were locally resident in the cache) to construct the block.
Light Work Rule and Fairness Counter
If creating the consistent read version of the block involves too much work (such as
reading blocks from disk), then the holder sends the current block to the requestor, and
the requestor completes the CR fabrication. The holder maintains a fairness counter of
CR requests. After the fairness threshold is reached, the holder downgrades its lock mode.
Requesting instance:
Foreground process prepares the buffer.
Sends the message to the master and waits
Gets CR buffer or a lock to read from disk
Master:
Checks the lock mode
Forwards the request to the holder if X mode held
Grants shared lock to the requestor on other modes
Holder:
Sends CR buffer
9-254
9-255
9-256
9-257
Lock Modes
9-258
X
S
N
+
+
+
+
+
Lock Modes
A lock mode describes the access rights to the resource.
The compatibility matrix is clusterwide. For example, if a resource has an S lock on one
instance, then there cannot be an X lock for that resource anywhere else in the cluster.
Lock Roles
9-259
Lock Roles
A lock role describes how the resource is to be handled. The treatment differs if the
block resides in only one cache.
Past Image
Is an indication
0: It is absent.
1: It is present.
9-260
9-261
9-262
Block Classes
9-263
Block Classes
Class  Description
1      DATA
2      SORT. These are never protected by PCM locks, because they are private to one instance.
3      SAVE UNDO BLOCK, used for TBS management
4      SEGMENT HEADER
5      SAVE UNDO SEGMENT HEADER, used for TBS management
6      FREE-LIST
7      EXTENT MAP, used for unlimited extents
8      BITMAP BLOCK for locally managed tablespaces
9      BITMAP INDEX BLOCK for locally managed tablespaces
>=11   If odd, an UNDO HEADER: the class is (RBS_number*2) + 11, used for the transaction table.
       If even, an UNDO BLOCK: the class is (RBS_number*2) + 12, used for undo blocks.
9-264
Lock Elements
The lock elements (LE) are also known as BL type enqueues.
Allocation of New LE
9-265
Allocation of a New LE
The block that is to be covered by the LE has an absolute file ID (AFN) and a block
number (BNO).
Note: Cache fusion applies only to blocks other than UNDO.
The default value of _kcl_undo_grouping is 32.
The default value of _kcl_undo_locks is 128. This represents the number of locks
per UNDO segment.
Hash Chain of LE
Every active releasable LE is in one hash chain.
9-266
[Diagram: an array of hash chain heads, each heading a chain of LEs.]
Hash Chain of LE
The number of hash chain heads or buckets (NBH) is the nearest prime lower than
_db_block_buffers.
The hash algorithm for LE is ID1 modulus NBH.
Block to LE Mapping
9-267
[Flowchart: block to LE mapping. 1: Search the hash chain for an LE with the same id1, id2; if found, done. 2: Otherwise take an LE from the free list, initialize it with id1, id2, and link it into the hash chain. 3: If no LE is on the free list, post LMS to free some LEs and wait 20 ms on "global cache freelist wait", then retry.]
Block to LE Mapping
When LEs need to be freed, you must post the LMS process that is associated with the
<id1, id2> LE. The statistic global cache freelist waits is incremented.
9-268
[Diagram: latch-protected queues of LEs served by LMS - the down-convert queue (LE with a BAST), the lazy-close queue (processed when the WRITE is done), the deferred-ping queue (processed on timeout), and the long-flush queue (waiting for a log flush).]
LMSn Free of LE
The LMSn flow for freeing an LE is approximately:
1. Get the latch of the queue and choose an LE from the associated lazy-close queue.
2. If no buffer is linked to the LE, go through the code path of BAST management.
3. If a buffer is linked to the LE, compute rdba and tsn from the LE and get the hash list of (rdba, tsn).
4. Try to get the hash latch in shared, no-wait mode. If that fails, free the queue latch and get the latch in shared, wait mode.
The slides 9-270 to 9-282 walk through a sequence of lock operations on one block; D is the master. Only the message labels and resulting lock states are reproduced here:
Initial state: the block is on disk at SCN 1008; no instance holds a lock.
Read by C: 1:LReq(S,C); 2:Grant SL0; 3:Read; 4:Notify. C holds SL0 on 1008.
Read by B: 1:LReq(S,B); 2:Ping(S,B); 3:Send(SL,SL0); 4:Assume(SL0). B and C hold SL0 on 1008.
Update by B: 1:LReq(X,B); 2:Ping(X,B); 3:Send(X,Close); 4:Assume(SL0). C keeps a CR copy of 1008; B holds XL0 on 1009.
Update by A: 2:Ping(X,A); 3:Send(XG,NG1); 4:Assume(XG0,NG1,1009). B drops to NG1 with PI 1009; A holds XG0 on 1013.
Read by C: 1:LReq(S,C); 2:Ping(S,C); 3:Send(SG,SG1); 4:Assume(SG0,SG1,1013). A drops to SG1; C holds SG0 on 1013; B still holds NG1 on 1009.
Read by B: 1:LReq(S,B); 2:Ping(S,B); 3:Send(SG,SG1); 4:Assume(SG1,SG1,1013). B moves from NG1 to SG1 on 1013.
Write request: 1:Req W(); 2:ReqW; 5:W Notify; 6:Flush PI. (The slide shows XG0/XL0 on 1013.)
Write completes, then a read by B: 3:Write (1013 goes to disk); 4:Notify; 1:Req(S,B); 2:Ping(S,B); 4:Assume(SL0,SL0). The role returns to local.
Read by B with local role: 1:LReq(S,B); 2:Ping(S,B); 3:Send(SL,SL); 4:Assume(SL0,SL0) on 1013.
Read by B after another update: 1:LReq(S,B); 2:Ping(S,B); 3:Send(SG,SG); 4:Assume(SG0,SG1,1015). The holder drops from XL0 to SG1 on 1015.
CR request: 1.1:CRreq; 1.2:NoCRavailable; 2:Make CR; 3:Create CR; 4:Send CR image. The holder (XL0 on 1025) fabricates and ships a CR image (1013/1022).
Views
X$BH: see WebIV note 33568.1.
Views (continued)
V$LOCK_ELEMENT
lock_element_addr: raw address of the lock element covering a buffer
indx: lock element number
class: block class (1 = data/index, 2 = sort, and so on)
lock_element_name:
flags: status of the lock element (1 = fusion lock, 2 = no buffer on LE, 4 = has deferred ping, 8 = LE waiting for log flush, 16 = LE is being evicted, 32 = LE has been deactivated, 64 = LE is fixed)
mode_held: lock mode held (0 = null, 3 = S, 5 = X)
block_count: number of blocks covered by the PCM lock
releasing: release flags; nonzero if the PCM lock is being downgraded
acquiring: acquiring flags; nonzero if the PCM lock is being upgraded
invalid: nonzero if the PCM lock is invalid; always 0 in V$LOCK_ELEMENT
Release Flags

Flag        Value  Description
KCLLEBP     01     Process has sent a request to the DLM.
KCLLEAP     02     Acquisition pending; the lock operation has been started.
KCLLERECON  04     CR request aborted because of reconfiguration.
KCLLEINVAL  08     CR request could not be started because of RECONFIG.
KCLLECOMM   10     CR request failed because of a timeout.
KCLLENRN    20     No recovery needed.
KCLLESUSP   40     PI is suspect.
KCLLEHIGH   80     Our PI is the highest (can be made current).

Acquire Flags

Flag        Value  Description
KCLLEBA     01     BAST has been delivered.
KCLLESHR    02     Downgrade to SHARE mode.
KCLLECLS    04     About to be closed.
KCLLESCP    08     Scan completed.
KCLLERP     10     Release processing; enables down-convert.
KCLLEDCL    20     On the down-convert list.
KCLLEDCS    40     Down-convert has been started.
KCLLEREAL   80     Real BAST has arrived during a fake BAST.
KCLLEDFR    100    BAST has been deferred once.

More detail is in kcl0.h.
DSI408: Real Application Clusters Internals I-284
Views (continued)
V$BH
file#: datafile number
block#: block number
class#: class of the block
status: status of the block (free = not in use, xcur = exclusive, scur = shared current, cr = consistent read, read = reading from disk, mrec = media recovery mode, irec = instance recovery mode)
xnc: number of PCM lock conversions
lock_element_addr: raw lock element address
lock_element_name:
lock_element_class:
dirty: (Y) block modified
temp: (Y) temporary block
ping: (Y) block pinged
stale: (Y) block is stale
direct: (Y) direct block
new: (Y) new block
objd: object number
ts#: tablespace number
The column STATE of X$BH can contain the following values:
0 = FREE
1 = EXLCUR
2 = SHRCUR
3 = CR
4 = READING
5 = MRECOVERY
6 = IRECOVERY
7 = WRITING
8 = PI
Parameters
_LM_LMS: the default value is min(#CPU/4, 10); 0 if cluster_database is false.
GC_FILES_TO_LOCKS: same as in Oracle8i, but setting this disables Cache Fusion for the specified files.
Summary
Cache Fusion 1
CR Server
Objectives
The architecture stack on each instance: caches (kcb/kcl), GRD (GCS), CGS, Node Monitor (NM), and Cluster Manager (CM), with the IPC layer connecting to the other nodes.
Getting a CR Buffer
Any and all queries start by getting a CR buffer version of the block.
ktrget:
Initializes a buffer cache CR scan request
Calls kcbgtcr for the best resident buffer from which to start building the CR buffer
Calls ktrgcm to build the CR buffer by applying undo
Returns the CR buffer to the requestor
kcbgtcr:
Scans the hash bucket for the DBA for buffers that may be used to build a CR buffer
If successful, returns the best candidate (chosen by ktrexf, the examination function)
If not successful, calls kcbget
kcbget:
Retries the scan just tried by kcbgtcr
If a buffer is found, it is returned now
If not, and the buffer is being read in or there is a current-mode buffer, waits until it is available and then rescans
If these fail, no locally cached buffer can be used
CR request flow between the requesting instance and the owner instance (UNDO is applied to the current version to build the CR copy):
1. The requestor (FG) sends a CR request over the interconnect to the holder's LMS.
2. If there is no conflicting mode, the lock is granted.
3. AST for the conversion.
4. On the requestor: read, since the LOCK is granted. On the holder: LMS builds the CR block and stops when it is completed or I/O is required.
5. The holder asks LGWR to flush the REDO.
6, 7. LGWR writes to the log.
8, 9. LMS sends the CR buffer to the requestor.
CR Requests
Fairness
Statistics
global cache gets
global cache get time
global cache converts
global cache convert time
global cache cr blocks received
global cache cr block receive time
global cache current blocks received
global cache current block receive time
global cache cr blocks served
global cache cr block build time
global cache cr block flush time
global cache cr block send time
global cache current blocks served
global cache current block pin time
global cache current block flush time
global cache current block send time
global cache freelist waits
global cache defers
global cache convert timeouts
global cache blocks lost
global cache claim blocks lost
global cache blocks corrupt
global cache prepare failures
global cache skip prepare failures
Wait Events
global
global
global
global
global
global
global
global
global
global
global
global
global
global
global
global
global
global
global
global
10-301
cache
cache
cache
cache
cache
cache
cache
cache
cache
cache
cache
cache
cache
cache
cache
cache
cache
cache
cache
cache
open s
open x
null to s
null to x
s to x
cr request
cr disk request
busy
freelist wait
bg acks
pending ast
retry prepare
cancel wait
cr cancel wait
pred cancel wait
domain validation
assume wait
recovery free wait
recovery quiesce wait
claim wait
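Accumulated waits on these events can be read from V$SYSTEM_EVENT, for example:

```
SELECT event, total_waits, time_waited
FROM   v$system_event
WHERE  event LIKE 'global cache%'
ORDER BY event;
```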
CR Requestor-Side Algorithm
ktrget, approximately:
1. Increment "consistent gets" and call kcbgtcr to get the best buffer.
2. Call ktrgcm to apply UNDO (if any) to produce a good CR buffer.
kcbgtcr, approximately:
1. Compute and follow the hash bucket; for each buffer, call ktrexf to find the best buffer.
2. If the best buffer is found in the local cache, return it; otherwise, call kcbzib to get the buffer.
CR Requestor-Side Algorithm
The following statistics are incremented by ktrgcm:
cleanouts and rollbacks - consistent read gets: incremented if UNDO is applied to the buffer and a CLEANOUT is performed.
rollbacks only - consistent read gets: incremented if UNDO is applied to the buffer and no CLEANOUT is performed.
cleanouts only - consistent read gets: incremented if no UNDO is applied and a CLEANOUT is performed.
no work - consistent read gets: incremented if no UNDO is applied and no CLEANOUT is performed.
When UNDO is applied to produce a CR buffer, other UNDO blocks may have to be read. When a CLEANOUT is performed, the TX transaction table must be read.
CR Requestor-Side Algorithm
kcbzib, for a CR request:
1. If bit KCBBHFCR is set, or the LE mode is >= the requested mode, done.
2. Otherwise, if the database is mounted shared, call kclgclk, asking to convert the LE in SHARED mode with the KCLCVCR option.
CR Requestor-Side Algorithm
kclgclk/kclcls, approximately:
1. Find or locate the LE.
2. If the LE is in transition, wait 1 second on "global cache busy" and retry.
3. If there is no LE to convert or to open, done.
4. If the DLM requested mode > the LE held mode, call kclscrs to start the CR request, then call kclwcrs to wait for the CR to complete.
5. Allocate a lock, link the buffer to the LE, and set bit 0x1 of LE->acquiring.
CR Requestor-Side Algorithm
kclscrs, approximately: while some LE is left:
1. Take the LE and set up a CR request.
2. If the LE lock is not opened yet, call kjbcropen and set bit 2 of LE->acquiring.
3. Otherwise, if the LE lock mode is NULL, call kjbpredread and set bit 2 of LE->acquiring.
CR Requestor-Side Algorithm
kclwcrs, approximately: for each CR request that has not been examined:
1. If the request is already completed, move on to the next CR request.
2. If the request type is "open" or "convert" and a buffer was received, increment "global cache current blocks received" and "global cache current block receive time", and set the CR request status to "completed".
3. If the request type is "predread" and a buffer was received, increment "global cache cr blocks received" and "global cache cr block receive time", and set the CR request status to "completed".
4. If bit 2 of LE->acquiring is cleared, the AST has fired and the lock is granted S; set the request status to "completed".
5. While some request is not completed, wait 1 second on "global cache CR request" and get the message.
This description of kclwcrs is simplified; the code path for error management is not shown.
Grant path (no CR fabrication needed):
1. The FG locates the LE.
2. The FG sets bit 0x2 of LE->acquiring.
3. The FG submits the CR request along with the lock request, with its (ip,port) information.
4. The FG waits on "global cache CR request".
5. The LMS of the master node notifies that the LOCK is granted.
6. Bit 0x2 of the LE is unset with the AST callback provided by the FG.
7. The FG is posted.
CR-served path:
1. The FG locates the LE and sets bit 0x2 of LE->acquiring.
2. The FG submits the CR request along with the lock request, with (ip,port) information.
3. The FG waits on "global cache CR request".
4. The holder's LMS builds the CR buffer.
5. The CR buffer is delivered with the (ip,port) information.
CR Server-Side Algorithm
The serving side, approximately:
1. If the request is for a CURRENT block, increment REQCUR, and call kcbgtcr with kclexf as the examination function to retain only the CURRENT block. Otherwise increment REQCR and REQ{DATA|UNDO|TX} (for example, REQDATA), and call ktrget to fabricate the CR buffer.
2. On an error from kcbgtcr or ktrget: if the error is KCBOERLWRx, increment LIGHTx; otherwise increment ERROR and send the ERROR to the requestor (RESFAIL++).
3. If the buffer state is CR: FLUSH LOG, SEND BACK BUFFER, FAIRNESS MANAGEMENT.
CR Server-Side Algorithm
X$KCLCRST.LIGHTn is incremented if the light work rule fires while the CR block is being built, for one of the following reasons:
A buffer is found with the same AFN and BLOCKNUM, but the object ID in the buffer differs from the object ID submitted by the requestor (the object was DROPPED or TRUNCATED after the consistent read started and before it ended).
A wait for WRITE COMPLETE
A wait because the buffer is in READING state
The buffer is suspended and a free buffer is needed
A wait on free buffer wait
A block read from disk into the buffer cache
A wait for space for redo
A wait for an ITL
X$KCLCRST.LIGHT1 is incremented if a block is found with the "modification started" bit set; in this case the process sleeps for a few seconds, and when it wakes up, the same process is still modifying the block.
X$KCLCRST.LIGHT2 is incremented if a buffer is in instance RECOVERY state.
This description of kclgcr is simplified.
CR Server-Side Algorithm
kclgcr, FLUSH LOG phase:
1. If the REDO is already on disk, done.
2. Otherwise, increment X$KCLCRST.FLUSH.
3. If there is room in the logflush queue, add a new element to the queue and increment X$KCLCRST.FLUSHQ.
4. If there is no room, increment X$KCLCRST.FLUSHF, call kcrfisd, and wait on "log file sync" (but only once).
Note: There are no more than 255 * processes elements in the logflush queue.
CR Server-Side Algorithm
Send-back and fairness management, approximately:
1. Increment LE.FAIRNESS_COUNTER.
2. If the request was queued in the LOG FLUSH phase, stop here (END 1).
3. Otherwise, send the CR buffer to the requestor and update the statistics.
4. If the LE held mode is EXCLUSIVE, LE.FAIRNESS_COUNTER >= fairness_threshold, and the requested block is not UNDO or an UNDO header, increment X$KCLCRST.FAIRDC and downgrade the LE to SHARE mode (END 2). Otherwise, done (END 3).
kclgcr (continued)
Send-back buffer and fairness management:
At END 1 the buffer is not sent; this is done in LOGFLUSH queue processing.
The following statistics are updated after the CR buffer is sent to the requestor:
global cache cr block build time: time spent in ktrget or kcbgtcr
global cache cr block log flush time: time spent in the LOG FLUSH phase
global cache cr block send time: time spent sending the CR block
Note: LE.FAIRNESS_COUNTER is reset at each buffer modification.
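The fairness rule can be modeled as a small sketch. The field names and the threshold value are descriptive stand-ins for this illustration, not Oracle's.

```python
FAIRNESS_THRESHOLD = 4  # assumed threshold value, for illustration only

def serve_cr_copy(le):
    """Serve one CR copy from a holder and apply the fairness rule."""
    le["fairness_counter"] += 1
    if le["held_mode"] == "X" and le["fairness_counter"] >= FAIRNESS_THRESHOLD:
        le["held_mode"] = "S"        # downgrade the LE to SHARE mode
        le["fairness_counter"] = 0
        return "downgraded"
    return "served"

def modify_buffer(le):
    """A buffer modification resets the fairness counter."""
    le["fairness_counter"] = 0
```

A holder that keeps serving CR copies without modifying the block eventually downgrades itself, so subsequent readers can be granted the block directly.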
CR Server-Side Algorithm
kclqchk (LOGFLUSH queue processing), approximately: for each element on the LOGFLUSH queue:
1. Call kcrfisd to check whether the REDO is on disk.
2. If it is not and the caller asks for a wait, wait and recheck; if the caller does not ask for a wait, stop.
3. When the REDO is on disk, send the CR buffer to the requestor and dequeue the element.
After the CR buffer is sent to the requestor, the following statistics are updated:
global cache cr block build time: time spent in ktrget or kcbgtcr
global cache cr block log flush time: time spent in the LOG FLUSH phase
global cache cr block send time: time spent sending the CR block
Summary
Cache Fusion 2
Objectives
The same architecture stack applies: instance caches (kcb/kcl), GRD (GCS), CGS, Node Monitor, and Cluster Manager, with the IPC layer connecting to the other nodes.
Structure relationships: X$LE externalizes the lock elements (LE_ADDR), and X$KJBR externalizes the PCM DLM resource (the kjbr structure). X$KJBL.KJBLLOCKP - 0x60 points back to the LE, and X$KJBL.KJBLRESP matches X$KJBR.KJBRRESP.
Three instances; one block is selected and updated. Instance 2 is the master of the block resource. The sequence of operations:
1. SELECT on I3
2. SELECT on I2
3. UPDATE on I2
4. UPDATE on I1
5. SELECT on I3
6. Write on I1
7. SELECT on I3
Initial State
On all three instances (Instance 1, Instance 2 (the master), Instance 3):
X$BH: no rows selected
X$LE: no rows selected
X$KJBR: no rows selected
X$KJBL: no rows selected
Initially, nothing has been read into cache or locked, so the queries do not return any rows.
In displaying X$KJBR.KJBRNAME on subsequent slides, the column has been truncated to fit. It has the same value as X$KJBL.KJBLNAME in these examples.
Step 1: Instance 3 Performs SELECT
Messages: 1 CRREQ(S) from Instance 3 to the master (Instance 2); 2 Grant(SL0); 3 Read; 4 Notify.
On Instance 2 (the master):
Before: X$BH, X$LE, X$KJBR, X$KJBL: no rows selected
After:
X$BH: no rows selected
X$LE: no rows selected
X$KJBR: KJBRRESP=22FE343C KJBRGRANT=KJUSERPR KJBRNCVL=KJUSERNL KJBRROLE=0 KJBRNAME=[0x200000a] KJBRMASTER=1 KJBRGRANTQ=22884D40 KJBRCVTQ=00 KJBRWRITER=00
X$KJBL: KJBLNAME=[0x200000a][0x0],[BL] KJBLROLE=0 KJBLGRANT=KJUSERPR KJBLREQUE=KJUSERNL KJBLLOCKST=GRANTED KJBLRESP=22FE343C
On Instance 3:
Before: X$BH, X$LE, X$KJBR, X$KJBL: no rows selected
After:
X$BH: STATE=2 MODE_HELD=0 LE_ADDR=24FF9030 CLASS=1 DBARFIL=8 DBABLK=10 CR_SCN_BAS=0 CR_SCN_WRP=0
X$LE: NAME=33554442 LE_CLASS=0 LE_RLS=0 LE_ACQ=0 LE_MODE=3 LE_WRITE=0 LE_LOCAL=1
X$KJBR: no rows selected
X$KJBL: KJBLNAME=[0x200000a][0x0],[BL] KJBLROLE=0 KJBLGRANT=KJUSERPR KJBLREQUE=KJUSERNL KJBLLOCKST=GRANTED KJBLRESP=00
Step 2: Instance 2 Performs SELECT
Messages: 1 CRREQ(S); 2 Grant(SL0); 3 Read; 4 Notify (all local to Instance 2, the master).
On Instance 2 (the master):
Before:
X$BH: no rows selected
X$LE: no rows selected
X$KJBR: KJBRRESP=22FE343C KJBRGRANT=KJUSERPR KJBRNCVL=KJUSERNL KJBRROLE=0 KJBRNAME=[0x200000a] KJBRMASTER=1 KJBRGRANTQ=22884D40 KJBRCVTQ=00 KJBRWRITER=00
X$KJBL: KJBLNAME=[0x200000a][0x0],[BL] KJBLROLE=0 KJBLGRANT=KJUSERPR KJBLREQUE=KJUSERNL KJBLLOCKST=GRANTED KJBLRESP=22FE343C
After:
X$BH: STATE=2 MODE_HELD=0 LE_ADDR=253F3A10 CLASS=1 DBARFIL=8 DBABLK=10 CR_SCN_BAS=0 CR_SCN_WRP=0
X$LE: NAME=33554442 LE_CLASS=0 LE_RLS=0 LE_ACQ=0 LE_MODE=3 LE_WRITE=0 LE_LOCAL=1
X$KJBR: KJBRRESP=22FE343C KJBRGRANT=KJUSERPR KJBRNCVL=KJUSERNL KJBRROLE=0 KJBRNAME=[0x200000a] KJBRMASTER=1 KJBRGRANTQ=22884D40 KJBRCVTQ=00 KJBRWRITER=00
X$KJBL: two rows, both KJBLNAME=[0x200000a][0x0],[BL] KJBLROLE=0 KJBLGRANT=KJUSERPR KJBLREQUE=KJUSERNL KJBLLOCKST=GRANTED KJBLRESP=22FE343C
Step 3: Instance 2 Performs UPDATE
Messages: 1 LREQ(X) to the master; 2 PING(X,Node2) to Instance 3; 3 Instance 3 makes its buffer CR; 4 Instance 3 sends the buffer to the requestor; 5 ASSUME(XL0,close) to the master.
On Instance 2 (master and requestor):
Before:
X$BH: no rows selected
X$LE: no rows selected
X$KJBR: KJBRRESP=22FD8B24 KJBRGRANT=KJUSERPR KJBRNCVL=KJUSERNL KJBRROLE=0 KJBRNAME=[0x200000a] KJBRMASTER=1 KJBRGRANTQ=22882980 KJBRCVTQ=00 KJBRWRITER=00
X$KJBL: KJBLNAME=[0x200000a][0x0],[BL] KJBLROLE=0 KJBLGRANT=KJUSERPR KJBLREQUE=KJUSERNL KJBLLOCKST=GRANTED KJBLRESP=22FD8B24
After:
X$BH: STATE=1 MODE_HELD=0 LE_ADDR=253ECED0 CLASS=1 DBARFIL=8 DBABLK=10 CR_SCN_BAS=0 CR_SCN_WRP=0
X$LE: NAME=33554442 LE_CLASS=0 LE_RLS=0 LE_ACQ=0 LE_MODE=5 LE_WRITE=0 LE_LOCAL=1
X$KJBR: KJBRRESP=22FD8B24 KJBRGRANT=KJUSEREX KJBRNCVL=KJUSERNL KJBRROLE=0 KJBRNAME=[0x200000a] KJBRMASTER=1 KJBRGRANTQ=253ECF30 KJBRCVTQ=00 KJBRWRITER=00
X$KJBL: KJBLNAME=[0x200000a][0x0],[BL] KJBLROLE=0 KJBLGRANT=KJUSEREX KJBLREQUE=KJUSERNL KJBLLOCKST=GRANTED KJBLRESP=22FD8B24
On Instance 3:
Before:
X$LE: NAME=33554442 LE_CLASS=0 LE_RLS=0 LE_ACQ=0 LE_MODE=3 LE_WRITE=0 LE_LOCAL=1
X$KJBR: no rows selected
X$KJBL: KJBLNAME=[0x200000a][0x0],[BL] KJBLROLE=0 KJBLGRANT=KJUSERPR KJBLREQUE=KJUSERNL KJBLLOCKST=GRANTED KJBLRESP=00
After:
X$BH: STATE=3 MODE_HELD=0 LE_ADDR=00 CLASS=1 DBARFIL=8 DBABLK=10 CR_SCN_BAS=1423681 CR_SCN_WRP=0
X$LE: no rows selected
X$KJBR: no rows selected
X$KJBL: no rows selected
Step 4: Instance 1 Performs UPDATE
Messages: 1 LREQ(X) to the master; 2 PING(X) to Instance 2; 3 Set the lock to NG1; 4 The buffer goes from X CURRENT to PI; 5 Send the block; 6 ASSUME(XG0, NG1).
On Instance 2 (master and previous holder):
Before:
X$LE: NAME=33554442 LE_CLASS=0 LE_RLS=0 LE_ACQ=0 LE_MODE=5 LE_WRITE=0 LE_LOCAL=1
X$KJBR: KJBRRESP=22FD8B24 KJBRGRANT=KJUSEREX KJBRNCVL=KJUSERNL KJBRROLE=0 KJBRNAME=[0x200000a] KJBRMASTER=1 KJBRGRANTQ=253ECF30 KJBRCVTQ=00 KJBRWRITER=00
X$KJBL: KJBLNAME=[0x200000a][0x0],[BL] KJBLROLE=0 KJBLGRANT=KJUSEREX KJBLREQUE=KJUSERNL KJBLLOCKST=GRANTED KJBLRESP=22FD8B24
After:
X$BH: STATE=8 MODE_HELD=0 LE_ADDR=253ECED0 CLASS=1 DBARFIL=8 DBABLK=10 CR_SCN_BAS=1423699 CR_SCN_WRP=0
X$LE: NAME=33554442 LE_CLASS=0 LE_RLS=0 LE_ACQ=0 LE_MODE=0 LE_WRITE=0 LE_LOCAL=0
X$KJBR: KJBRRESP=22FD8B24 KJBRGRANT=KJUSEREX KJBRNCVL=KJUSERNL KJBRROLE=8 KJBRNAME=[0x200000a] KJBRMASTER=1 KJBRGRANTQ=253ECF30 KJBRCVTQ=00 KJBRWRITER=00
X$KJBL: two rows:
KJBLNAME=[0x200000a][0x0],[BL] KJBLROLE=24 KJBLGRANT=KJUSERNL KJBLREQUE=KJUSERNL KJBLLOCKST=GRANTED KJBLRESP=22FD8B24
KJBLNAME=[0x200000a][0x0],[BL] KJBLROLE=8 KJBLGRANT=KJUSEREX KJBLREQUE=KJUSERNL KJBLLOCKST=GRANTED KJBLRESP=22FD8B24
On Instance 1:
Before: X$BH, X$LE, X$KJBR, X$KJBL: no rows selected
After:
X$BH: STATE=1 MODE_HELD=0 LE_ADDR=253F8A80 CLASS=1 DBARFIL=8 DBABLK=10 CR_SCN_BAS=0 CR_SCN_WRP=0
X$LE: NAME=33554442 LE_CLASS=0 LE_RLS=0 LE_ACQ=0 LE_MODE=5 LE_WRITE=0 LE_LOCAL=0
X$KJBR: no rows selected
X$KJBL: KJBLNAME=[0x200000a][0x0],[BL] KJBLROLE=8 KJBLGRANT=KJUSEREX KJBLREQUE=KJUSERNL KJBLLOCKST=GRANTED KJBLRESP=00
Step 5: Instance 3 Performs SELECT
Messages: 1 CRREQ(S); 2 Instance 1 builds the CR buffer; 3 Instance 1 sends the CR buffer to Instance 3.
On Instance 3:
Before: X$BH holds the CR copy from step 3; X$LE, X$KJBR, X$KJBL: no rows selected
After:
X$BH: two rows:
STATE=3 MODE_HELD=0 LE_ADDR=00 CLASS=1 DBARFIL=8 DBABLK=10 CR_SCN_BAS=1423681 CR_SCN_WRP=0
STATE=3 MODE_HELD=0 LE_ADDR=00 CLASS=1 DBARFIL=8 DBABLK=10 CR_SCN_BAS=1423821 CR_SCN_WRP=0
X$LE: no rows selected
X$KJBR: no rows selected
X$KJBL: no rows selected
Step 6: Instance 1 Performs WRITE
Messages: 1 REQW from Instance 1 to the master (Instance 2); 2 REQW; 3 WRITE; 4 NOTIFY; 5 WNOTIFY; 6 Set the role to Local on the LE and the DLM lock; 7 Make the PI buffer a CR buffer.
On Instance 2 (master, PI holder):
Before:
X$LE: NAME=33554442 LE_CLASS=0 LE_RLS=0 LE_ACQ=0 LE_MODE=0 LE_WRITE=0 LE_LOCAL=0
X$KJBR: KJBRRESP=22FD8B24 KJBRGRANT=KJUSEREX KJBRNCVL=KJUSERNL KJBRROLE=8 KJBRNAME=[0x200000a] KJBRMASTER=1 KJBRGRANTQ=253ECF30 KJBRCVTQ=00 KJBRWRITER=00
X$KJBL: two rows:
KJBLNAME=[0x200000a][0x0],[BL] KJBLROLE=24 KJBLGRANT=KJUSERNL KJBLREQUE=KJUSERNL KJBLLOCKST=GRANTED KJBLRESP=22FD8B24
KJBLNAME=[0x200000a][0x0],[BL] KJBLROLE=8 KJBLGRANT=KJUSEREX KJBLREQUE=KJUSERNL KJBLLOCKST=GRANTED KJBLRESP=22FD8B24
After:
X$BH: STATE=3 MODE_HELD=0 LE_ADDR=00 CLASS=1 DBARFIL=8 DBABLK=10 CR_SCN_BAS=1423699 CR_SCN_WRP=0
X$LE: no rows selected
X$KJBR: KJBRRESP=22FD8B24 KJBRGRANT=KJUSEREX KJBRNCVL=KJUSERNL KJBRROLE=0 KJBRNAME=[0x200000a] KJBRMASTER=1 KJBRGRANTQ=253ECF30 KJBRCVTQ=00 KJBRWRITER=00
X$KJBL: two rows:
KJBLNAME=[0x200000a][0x0],[BL] KJBLROLE=0 KJBLGRANT=KJUSERNL KJBLREQUE=KJUSERNL KJBLLOCKST=GRANTED KJBLRESP=22FD8B24
KJBLNAME=[0x200000a][0x0],[BL] KJBLROLE=0 KJBLGRANT=KJUSEREX KJBLREQUE=KJUSERNL KJBLLOCKST=GRANTED KJBLRESP=22FD8B24
On Instance 1 (writer):
Before:
X$LE: NAME=33554442 LE_CLASS=0 LE_RLS=0 LE_ACQ=0 LE_MODE=5 LE_WRITE=0 LE_LOCAL=0
X$KJBR: no rows selected
X$KJBL: KJBLNAME=[0x200000a][0x0],[BL] KJBLROLE=8 KJBLGRANT=KJUSEREX KJBLREQUE=KJUSERNL KJBLLOCKST=GRANTED KJBLRESP=00
After:
X$BH: STATE=1 MODE_HELD=0 LE_ADDR=253F8A80 CLASS=1 DBARFIL=8 DBABLK=10 CR_SCN_BAS=0 CR_SCN_WRP=0
X$LE: NAME=33554442 LE_CLASS=0 LE_RLS=0 LE_ACQ=0 LE_MODE=5 LE_WRITE=0 LE_LOCAL=0
X$KJBR: no rows selected
X$KJBL: KJBLNAME=[0x200000a][0x0],[BL] KJBLROLE=0 KJBLGRANT=KJUSEREX KJBLREQUE=KJUSERNL KJBLLOCKST=GRANTED KJBLRESP=00
X$KJBL
Contains every PCM lock, local or remote; if remote, the associated resource is mastered by this instance.
Callback routine in kjblftc.
X$KJBR
Contains the PCM resources mastered by the local instance.
X$KJBL

Column        Type          Notes
KJBLLOCKP     RAW(4)        PCM lock address
KJBLGRANT     VARCHAR2(9)   lock grant mode
KJBLREQUEST   VARCHAR2(9)   lock request mode if the lock is in CONVERTING state
KJBLROLE      NUMBER        0x18 if G1, 0x8 if G0, 0x0 if local
(state bits)                0x00 grant NULL; 0x01 grant S; 0x02 grant X; 0x04 lock has been opened at the master; 0x08 global role, otherwise local; 0x10 has one or more PIs; 0x20 request CR; 0x40 request S; 0x80 request X
KJBLRESP      RAW(4)        mastered on the local instance: resource address; mastered by another instance: 0
KJBLNAME      VARCHAR2(30)  resource name: [id1(hex)][id2(hex)],[BL]
KJBLNAME2     VARCHAR2(30)  resource name: id1(decimal),id2(decimal),BL
KJBLQUEUE     NUMBER        0 if on the grant queue, 8 if on the convert queue
KJBLLOCKST    VARCHAR2(64)  lock state: GRANTED, OPENING, CONVERTING
KJBLWRITING   NUMBER        4 if asking for write
KJBLREQWRIT   NUMBER        2 if requesting write
KJBLOWNER     NUMBER        owner instance of this lock
KJBLMASTER    NUMBER        master instance of the resource
KJBLBLOCKED   NUMBER        different from 0 if CONVERTING
KJBLBLOCKER   NUMBER        nonzero if there is a lock L1 at the head of the convert queue and the grant mode of this lock conflicts with the L1 request mode; 0 if the associated resource is not mastered by this instance
X$KJBR

Column      Type          Notes
KJBRRESP    RAW(4)        PCM resource address
KJBRGRANT   VARCHAR2(9)   resource held mode
KJBRNCVL    VARCHAR2(9)   request mode of the lock at the head of the convert queue (KJUSERNL if nonexistent)
KJBRROLE    NUMBER        mode and role combined bitwise: 0x00 if NULL; 0x01 if S; 0x02 if X; 0x08 if G0 (global role, no PI); 0x18 if G1 (global role, one or more PIs)
KJBRNAME    VARCHAR2(30)  resource name, format [id1][id2],[BL]
KJBRMASTER  NUMBER        master instance (always the local instance)
KJBRGRANTQ  RAW(4)        lock address at the head of the grant queue
KJBRCVTQ    RAW(4)        lock address at the head of the convert queue
KJBRWRITER  RAW(4)        lock address elected for WRITE
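Because KJBLRESP matches KJBRRESP for resources mastered locally, the lock and resource views can be joined; for example:

```
SELECT l.kjblname, l.kjblgrant, l.kjblrole,
       r.kjbrgrant, r.kjbrrole
FROM   x$kjbl l, x$kjbr r
WHERE  l.kjblresp = r.kjbrresp;
```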
Summary
Objectives
Lock roles involved: L0, G0, G1; G1, G2
SMON Process
BWR Dump
The dump in the slide is from a redo log file dump done with:
SQL> ALTER SYSTEM DUMP LOGFILE 'filename';
Recovery Set
The first read of a block's change vector in the redo stream sets the first-dirty and last-dirty SCN values in the recovery set. Subsequent reads from the redo stream for the same block update the last-dirty SCN value in the recovery set.
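The first-dirty/last-dirty maintenance can be sketched as follows; this is a toy model of the first-pass log read, not Oracle code.

```python
def build_recovery_set(redo_stream):
    """redo_stream: iterable of (block_id, scn) change vectors in order.

    The first change vector seen for a block sets both the first-dirty
    and last-dirty SCNs; later vectors for the same block only advance
    the last-dirty SCN."""
    recovery_set = {}
    for block_id, scn in redo_stream:
        entry = recovery_set.get(block_id)
        if entry is None:
            recovery_set[block_id] = {"first_dirty": scn, "last_dirty": scn}
        else:
            entry["last_dirty"] = scn
    return recovery_set
```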
The recovery action for a block depends on the lock states (columns: lock open on recovering instance; locks open on other instances; lock granted on recovery buffer; recovery buffer content; recovery action):
Case 1:
- Lock open on recovering instance: no lock or NL0
- Locks open on other instances: (X, S) Local 0
- Lock granted on recovery buffer: no lock (dont care)
- Recovery buffer content: no recovery buffer needed
- Recovery action: no recovery; remove the entry from the recovery set
Case 2:
- Lock open on recovering instance: (X, S) Global (0, 1)
- Locks open on other instances: dont care
- Lock granted on recovery buffer: Share (X, S) Global lock; increment the PI count in the lock state, use a zero SCN tag
- Recovery buffer content: initiate a write of the current block (see note 1)
- Recovery action: no recovery; release the recovery buffer and decrement the PI count when the block write completes
Case 3:
- Lock open on recovering instance: a) an (X, S) Global; b) all (N) Globals
- Lock granted on recovery buffer: a) Share NG lock, increment the PI count; b) XG1
- Recovery buffer content: get contents from the highest PI, based on SCN tags; if NG2, toss the higher PI (see note 2)
- Recovery action: apply redo changes; write out the recovery buffer when complete
The corresponding cases in terms of the recovery lock, recovery buffer contents, and recovery process action:
Case 1:
- Locks open: no locks open, or all NL0
- Recovery lock: XL0
- Recovery process action: apply redo changes; write out the recovery buffer when complete
Case 2:
- Locks open: (X, S) Local0
- Recovery lock: no lock
- Recovery buffer contents: no recovery buffer needed
- Recovery process action: no recovery; remove the block entry from the recovery set
Case 3:
- Recovery buffer contents: initiate a write of the current block; the recovery buffer is used for write notification only (no content)
- Recovery process action: no recovery; write completion will release the recovery buffer and lock as usual
Case 4:
- Recovery lock: XG0
- Recovery process action: apply redo changes; write out the recovery buffer when complete
IR of Nonfusion Blocks
If there are no surviving locks, the block must be read from disk and compared with the last-dirty version in the block entry to determine whether recovery is necessary.
During IR lock acquisition, an X lock is acquired on the block and it is read from disk. If the on-disk version is more recent than the last-dirty version, the block is removed from the recovery set.
The recovering process action depends on the lock state: granted, or no lock.
Memory Contingencies
12-369
Memory Contingencies
The recovery set (hash table and block entries) is stored in the PGA of the recovering
process. There must be enough virtual memory to construct the recovery set in PGA to
complete the first pass.
There must be at least one buffer per thread being recovered in the buffer cache for the
first- and second-pass log reads.
LEs correspond to recovery buffers. If a recovery block is not in the cache, then there is no
lock storage associated with it.
Code References
A more detailed list, indicating calling depth:
ktm.c
kcv.c
kct.c
kcra.c
kcrp.c
kcb.c
kcl.c
Summary
The architecture diagram shows the stack on each node: SQL Layer, Buffer Cache, CGS, GES/GCS, Node Monitor, and Cluster Manager, with the platform (P) and common (C) IPC layers connecting the nodes. Section III covers the platforms.
Linux Platform
Objectives
Hardware
Intel-based hardware
Externally shared SCSI or Fibre Channel disks
Interconnected via NICs
Software
OS versions supported:
RedHat 7.1 (9.0.1 and 9.2)
SuSE 7.2 and SuSE SLES7 (9.0.1 and 9.2)
13-379
raw Command
Usage: raw /dev/raw<N> /dev/<blockdev>
On RedHat, /dev/rawctl is the raw I/O control device (the raw binary is in /usr/sbin/raw).
On SuSE, /dev/raw is the raw I/O control device (the raw binary is in /usr/local/bin/raw).
In the slide example, sda3 means the third partition of the first SCSI disk.
Note: You can store the commands in /etc/rc.d/boot.local; they are executed immediately after booting. Or, store the commands in a file and execute that file from boot.local. For example, rawsetup is a file with all the commands for configuring the raw devices, and /etc/rc.d/boot.local contains the line:
. /etc/init.d/rawsetup
After creating the raw partitions, you must set correct permissions on /dev/raw*.
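A rawsetup file along these lines could look like the following boot-script fragment. This is a sketch only: the partition names and the raw-device numbering are made up for illustration, and the owner/group must match your installation.

```
#!/bin/sh
# rawsetup: bind raw devices to block devices at boot (example devices)
raw /dev/raw1 /dev/sda3
raw /dev/raw2 /dev/sda4
# give the oracle software owner access to the raw devices
chown oracle:dba /dev/raw*
chmod 660 /dev/raw*
```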
Extended Storage
LVM
The LVM hides the details of where data is stored: on what hardware, as well as where on that hardware. Volume groups and logical volumes can be managed while they are in use by the system. For example, you can increase the size of a logical volume while it is mounted; you do not have to unmount it.
Cluster File Systems
Linux does not have its own cluster file system. Various third-party suppliers (such as PolyServe) provide a CFS. Oracle supplies its own CFS; this is the only supported option.
13-381
OCMS
OCMS Components
Version Note
The Linux OCMS is ported from the Windows NT/2000 version.
The Oracle 9.0.x and 8.1.x architecture used an Oracle-written watchdog daemon, running as a user-space process, to monitor for system hangs.
Oracle9i releases 9.2.0.1 and earlier use the Linux-supplied softdog module to reset the node in case of hangs.
Oracle9i release 9.2.0.2 uses a new Oracle-written loadable kernel module, hangcheck-timer, that runs in kernel space. The NM and CM functionality is combined into the oracm background process (no more nm.log).
The older watchdog (Oracle9i release 1 and earlier) could be starved for CPU by heavy load and high kernel activity, causing many unnecessary node resets (false evictions).
The watchdog stack: the Cluster Manager maintains instance-level cluster information and the Node Monitor maintains node-level cluster information; both rely on the watchdog service. The watchdog daemon and watchdog service run in user mode; the watchdog timer runs in kernel mode.
Watchdog Daemon
13-385
Watchdog Daemon
The important kernel configuration parameter for the watchdog daemon is
config_watchdog_nowayout.
After you create /dev/watchdog by using mknod, the watchdog is armed: subsequently
opening the file and then failing to write to it for longer than one minute results in the
machine being rebooted.
The watchdog timer can be stopped if the process managing it closes the
/dev/watchdog file, provided that the parameter
config_watchdog_nowayout is set to N. If config_watchdog_nowayout is set
to Y, the watchdog cannot be stopped after it has been started. On Red Hat it is N by
default; on SuSE it is Y by default.
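The timer behavior just described can be modeled in a few lines. This is a toy sketch in Python with invented names, not kernel code; it only mirrors the open/write/close semantics and the nowayout switch:

```python
# Toy model of the softdog semantics described above (not kernel code).
# "nowayout" mirrors config_watchdog_nowayout: when True, closing the
# device does not stop the timer, so a pending reset cannot be averted.

class SoftdogModel:
    TIMEOUT = 60  # seconds without a write before the node is "rebooted"

    def __init__(self, nowayout=False):
        self.nowayout = nowayout
        self.last_write = None
        self.stopped = False

    def open_device(self, now):
        self.last_write = now          # the timer is armed on open

    def write(self, now):
        if not self.stopped:
            self.last_write = now      # each write pats the watchdog

    def close_device(self):
        if not self.nowayout:
            self.stopped = True        # nowayout=N: close stops the timer

    def fires(self, now):
        if self.stopped or self.last_write is None:
            return False
        return now - self.last_write > self.TIMEOUT

wd = SoftdogModel(nowayout=False)
wd.open_device(0); wd.write(30); wd.close_device()
print(wd.fires(120))   # False: closing the device stopped the timer

wd2 = SoftdogModel(nowayout=True)
wd2.open_device(0); wd2.close_device()
print(wd2.fires(120))  # True: with nowayout set, the reset still happens
```

The second case illustrates why a nowayout=Y default (as on SuSE) makes an accidental open of /dev/watchdog risky: once opened, the reboot can no longer be averted by closing the file.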
Hangcheck-timer
13-386
Hangcheck Module
$ cd $ORACLE_HOME/oracm/admin
$ grep KernelModuleName cmcfg.ora
KernelModuleName=hangcheck-timer
13-387
Hangcheck Module
The hangcheck module is available in release 9.2.0.2 and later.
This module is not required for the CM operation, but its use is highly recommended.
This module monitors the Linux kernel for long operating system hangs that could
affect the reliability of a RAC node and cause corruption of a RAC database. When such
a hang occurs, this module sends a signal to reset the node.
Node resets are triggered from within the Linux kernel, making them much less affected
by the system load.
The CM on a RAC node can be easily stopped and reconfigured, because its operation is
completely independent of the kernel module.
The features that are provided by the hangcheck-timer module closely resemble the
features found in the implementation of the CM for RAC on the Windows platform, on
which the CM on Linux was based.
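The module's decision rule can be sketched as follows. The parameter names echo the module's hangcheck_tick and hangcheck_margin options, but the function and the values used here are only an illustration of the idea, not the actual kernel code:

```python
# Sketch of the hangcheck-timer idea: a kernel timer is scheduled every
# `tick` seconds; if it actually fires more than `tick + margin` seconds
# after the previous run, the kernel was hung for too long and the node
# must be reset. Values are illustrative only.

def must_reset(previous_fire, actual_fire, tick=30, margin=180):
    delay = actual_fire - previous_fire
    return delay > tick + margin

print(must_reset(0, 35))   # False: firing 5 s late is within the margin
print(must_reset(0, 300))  # True: a 270 s stall exceeds tick + margin
```

Because the check compares timer fire times inside the kernel, it detects hangs that a user-space daemon starved of CPU could never report, which is the improvement over the older watchdog design.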
13-388
Cluster Manager
13-389
rdbms/src/generic/osds/skgxpu.c
rdbms/src/generic/osds/sskgxpu.c
libcmdll.so - rdbms/src/port/cm/dll/
Has one-to-one mapping for skgxn functionality
13-390
Cluster Manager
13-391
13-392
13-393
*:*
*:*
13-396
Starting CM
13-397
Starting WDD
Starting WDD:
watchdogd -g dba
13-398
Starting WDD
WDD is used only in Oracle9i before release 9.2.0.2.
Options to the watchdogd command are:
-l: If 0, then no resources are registered for monitoring. This can be used while
debugging system configuration problems.
-t <number>: default 1000 ms (range: 0 ms to 3000 ms). This is the time
interval at which the WDD checks the heartbeat messages from its clients.
The default log file is $ORACLE_HOME/oracm/log/wdd.log.
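The -t interval drives a loop of heartbeat checks against the registered clients. A hypothetical sketch of such a check (the function, names, and the three-missed-beats rule are invented for illustration; only the 1000 ms default comes from the text above):

```python
# Hypothetical sketch of a WDD-style heartbeat check: every `interval_ms`
# the daemon verifies that each registered client has sent a heartbeat
# recently. The staleness rule used here is invented for illustration.

def stale_clients(last_heartbeat_ms, now_ms, interval_ms=1000, missed_allowed=3):
    limit = interval_ms * missed_allowed
    return sorted(c for c, t in last_heartbeat_ms.items() if now_ms - t > limit)

beats = {"oracm": 9500, "oranm": 5000}
print(stale_clients(beats, now_ms=10000))  # ['oranm']: missed too many beats
```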
Starting NM
13-399
Start Options in NM
nmcfg.ora parameters:
Starting CM
13-400
Debugging
13-401
Debugging
sskgxp_dmpsspt: dumps the port structure.
sskgxp_dmpsspid
Summary
13-402
References
www.sistina.com/lvm
linux.oracle.com
13-403
HP-UX Platform
Objectives
14-405
14-406
HP-UX Architecture
For more information on HP-UX hardware variations, refer to
http://docs.hp.com/hpux/onlinedocs/B6257-90031/B625790031_top.html.
14-407
14-408
14-409
SKGXP: Lowfat
14-410
SKGXP: Lowfat
The HP Cluster Interconnect (CLIC) protocol is proprietary and is part of the HyperFabric
cluster system.
14-411
Cluster commands:
cmhaltcl: Stop the cluster.
cmrunnode: Join the node to the cluster.
cmhaltnode: Remove the node from the cluster.
cmviewcl: View the status of the cluster.
cmruncl: Bring up the cluster.
14-412
Debugging on HP-UX
14-413
Summary
14-414
Tru64 Platform
Objectives
15-417
15-418
15-419
15-420
Cluster alias
Distributed Lock Manager (DLM)
Expanded process IDs
15-421
15-422
IPC: SKGXP
15-423
IPC: SKGXP
The cluster_interconnects initialization parameter defines which interface is
used.
When set to an IP address, the parameter uses that address and thus disables
processing in the sskgxp module.
When unset, the parameter uses the first available ics0 or mc0 interface (in that
order). ics0 is the name of the memory channel for Tru64, version 5.1 and later.
cluster_interconnects is ignored if the default RDG implementation is used.
skgxp is stored in libskgxpu.a (which contains the modules skgxpu.o and
sskgxpu.o) and is copied over to libskgxp9.a if the UDP implementation is selected.
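The selection rule above can be condensed into a small sketch (a toy function, not Oracle code):

```python
# Sketch of the interconnect selection described above: an explicitly
# set cluster_interconnects value wins and disables probing; otherwise
# the first available ics0 or mc0 interface is used, in that order.

def pick_interconnect(cluster_interconnects, available_interfaces):
    if cluster_interconnects:              # explicit setting wins
        return cluster_interconnects
    for name in ("ics0", "mc0"):           # Tru64 5.1+: ics0 preferred
        if name in available_interfaces:
            return name
    return None

print(pick_interconnect(None, {"mc0", "tu0"}))         # mc0
print(pick_interconnect("10.0.0.1", {"ics0", "mc0"}))  # 10.0.0.1
```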
SKGXPM: RDG
15-424
SKGXPM: RDG
The RDG IPC is one of the most widely tested and proven IPC versions. It is used in the
SAP benchmark for Oracle, release 9.2.
The RDG IPC uses the rdg* kernel calls to create or initialize endpoints. Typical calls
are RdgInit, RdgNodeLookup, RdgEpCreate, RdgEpDestroy,
RdgShutdown, RdgIoCancel, and RdgEpLookup.
The RDG IPC uses the cfg_subsys_query call to find the RDG subsystem
information. Link commands should include -lrdg -lcfg.
The RDG subsystem kernel parameters must be set as follows:
max_objs = 5120
msg_size = 32768
max_async_req = 512
rdg_max_auto_msg_wires = 0
rdg_auto_msg_wires = 0
Use sysconfig -q rdg to verify these values (RDG version: RDG
V39.24b_BL17_BCGM623Z3).
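A hypothetical checker for these settings, which parses sysconfig -q rdg-style "name = value" output. The parsing function is an assumption for illustration; only the required values come from the list above:

```python
# Hypothetical verifier for the RDG kernel parameters listed above.
# It parses "name = value" lines, as printed by `sysconfig -q rdg`,
# and reports any parameter that does not have the required value.

REQUIRED_RDG = {
    "max_objs": 5120,
    "msg_size": 32768,
    "max_async_req": 512,
    "rdg_max_auto_msg_wires": 0,
    "rdg_auto_msg_wires": 0,
}

def misconfigured(sysconfig_output):
    actual = {}
    for line in sysconfig_output.splitlines():
        if "=" in line:
            name, _, value = line.partition("=")
            actual[name.strip()] = int(value)
    # a parameter is flagged if missing or set to the wrong value
    return sorted(p for p, want in REQUIRED_RDG.items() if actual.get(p) != want)

mock = ("rdg:\nmax_objs = 5120\nmsg_size = 32768\nmax_async_req = 512\n"
        "rdg_max_auto_msg_wires = 4\nrdg_auto_msg_wires = 0")
print(misconfigured(mock))  # ['rdg_max_auto_msg_wires']
```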
15-426
Debugging on Tru64
15-427
Debugging on Tru64
The value TRUE for SKGXNTRCFLG must be uppercase.
ladebug
cfsstat
volprint
15-428
odump -Dl / ldd: For information about shared libraries linked with the
executable, section headers and so on
/usr/local/bin/trace: To trace and log the executable
/usr/local/bin/truss: Same as /usr/local/bin/trace but with better
trace output
Summary
15-430
AIX Platform
Objectives
16-433
16-434
AIX SP Clusters
16-435
AIX SP Clusters
The term Parallel System Support Programs (PSSP) is also used for SP clusters.
16-436
16-437
SRC commands:
16-438
startsrc -s <sname>
stopsrc -s <sname>
lssrc -ls <sname>
lssrc -a
16-439
[Slide: AIX RAC stack. On each node, instance processes such as LMON and LCK0 call the SKGP, SKGXP, SKGXN, and SKGFR OSD layers, which map onto HAGS Group Services, EM, AIO, VSD/CLV, KEXT, and NET in the cluster layer (CM) of the operating system. The nodes share clusterwide disks and the Net 1 and Net 2 interconnects.]
16-440
16-441
16-443
If you use HACMP, set:
HA_DOMAIN_NAME=`/usr/sbin/cluster/utilities/cldomain`
PGSD_SUBSYS=grpsvcs
16-444
Debugging on AIX
16-445
Debugging on AIX
There are many more dump routines in addition to the standard X$TRACE/KST and
DIAG. Refer to the source code for a list.
Summary
16-446
References
16-447
References
Contacts
Oracle-on-AIX-related: Vijay.Sridharan@oracle.com
System-related issues: File an ES1 ticket
SP Sys Admin: David.Ong@oracle.com
HACMP Sys Admin: John.Tomicich@oracle.com
IBM-specific queries: Dennis Massanari: massanar@us.ibm.com
Other Platforms
Objectives
17-449
Objectives
The platforms covered in this lesson are:
Windows
Solaris
OpenVMS
17-450
17-451
17-452
Port-Specific Code
17-453
Installing RAC
17-454
Summary
17-455
[Slide: course architecture map. The stack on each node comprises the SQL Layer, Buffer Cache, CGS, GES/GCS, Node Monitor, and Cluster Manager, annotated with the course sections (II, IV, Debug) that cover each layer.]
V$ and X$ Views
and Events
Objectives
18-459
18-460
List of Views
See documentation for column descriptions.
V$ACTIVE_INSTANCES
V$BH
V$CACHE
V$CACHE_LOCK/_TRANSFER
V$CR_BLOCK_SERVER
V$ENQUEUE_LOCK/_STAT
V$FALSE_PING
V$FILE_CACHE_TRANSFER
V$GC_ELEMENT
V$GC_ELEMENTS_WITH_COLLISIONS
V$GCSHVMASTER_INFO
V$GCSPFMASTER_INFO
18-461
V$GES_BLOCKING_ENQUEUE
V$GES_CONVERT_LOCAL
V$GES_CONVERT_REMOTE
V$GES_ENQUEUE/_RESOURCE
V$HVMASTER_INFO
V$INSTANCE
V$LIBRARYCACHE
V$LOCK
V$LOCK_ELEMENT/_ACTIVITY
(V$PQ_SESSTAT, V$PX_*)
V$RESOURCE_LIMIT
V$ROWCACHE_PARENT
List of Views
The slide lists the views that are documented in the manuals. Views marked in the slide
are created with the script CATCLUST.SQL. The V$GES_* views are synonyms for the
V$DLM_* views and are also created with the script CATCLUST.SQL. Other internal
views are listed in V$FIXED_TABLE and expanded in X$KQFVI/X$KQFVT. Additional
views are:
V$DLM_ALL_LOCKS: Shows every DLM lock in the instance (PCM or not)
V$DLM_CONVERT_LOCAL: See V$GES_CONVERT_LOCAL
V$DLM_CONVERT_REMOTE: See V$GES_CONVERT_REMOTE
V$DLM_LOCKS: Blocked or blocking locks; a subset of V$DLM_ALL_LOCKS
V$DLM_MISC
V$DLM_RESS: See V$GES_RESOURCE
V$DLM_TRAFFIC_CONTROLLER
V$PING
V$FILE_PING
V$TEMP_PING
18-462
Old View                  New View
V$LOCK_ELEMENT            V$GC_ELEMENT
V$DLM_CONVERT_LOCAL       V$GES_CONVERT_LOCAL
V$DLM_CONVERT_REMOTE      V$GES_CONVERT_REMOTE
18-463
X$ Tables
x$bh
x$kccfe
x$kcfio
x$kclcrst
x$kglst
x$kjbr
x$kjdrhv
x$kjdrpcmhv
x$kjdrpcmpf
x$kjicvt
x$kjirft
18-464
x$kqrfp
x$ksimsi
x$ksqeq
x$ksqrs
x$ksqst
x$ksurlmt
x$ksuse
x$ksuxsinst
x$kvit
x$le
x$quiesce
X$ Tables
The X$ tables listed in the slide are the ones used by the V$ views on the previous slide.
WebIV note 208093.1 gives a good mapping between V$ views and X$ tables.
WebIV note 22241.1 gives a reasonably complete listing of X$ tables.
An additional useful RAC X$ table is x$kjbrfx.
Events
10254, level 1
18-465
Events
Triggering events for DLM:
29700 Enable lock convert statistics
29712-29713 Lock open, convert, cancel, and close operations
29714 DLM state object
29715 Reconfiguration
29716 Post wait and AST
29717 GRD or DLM freeze/unfreeze
29718 CGS or DLM CM interface
29720 GES or DLM SCN service
29722 GES or DLM process death
Objectives
19-467
KST: X$TRACE
19-468
KST: X$TRACE
Background
The Kernel Service Tracing (KST) facility was an existing component in the VOS layer
that was used by a few components for limited tracing. In Oracle9i, this mechanism has
been reworked to provide simpler yet more powerful interfaces for recording the execution
history of interesting components. This reworked mechanism also provides extensible
interfaces that allow the clients to customize instrumentation to satisfy their tracing needs.
KST output can be examined in the X$TRACE table.
KST Concepts
19-469
KST Concepts
The KST facility provides a mechanism to log the execution history of a component with
minimum performance impact. This is done by providing an in-memory trace buffer to
each Oracle process, because tracing with an in-memory buffer has less performance
impact than logging traces on disk.
Each Oracle process (whether foreground or background) is assigned its own trace buffer
that is allocated from the SGA. The buffer is accessible by other Oracle processes if any
process dies unexpectedly, increasing the availability of trace information for later
diagnosis.
Circular buffers are used to minimize the memory usage for tracing purposes by removing
stale data. However, users must specify a large enough buffer so that wrapping does not
cause data loss. Note that the faster a process generates tracing data, the larger the buffer
size that must be specified.
KST Concepts
19-471
Circular Buffer
[Slide: each process (P1 through Pn) has its own trace buffer in the SGA; the buffers are queried through X$TRACE.]
19-472
Circular Buffer
All trace buffers reside in the SGA, and each buffer is assigned to a single Oracle process.
During run time, trace data from each process is logged to its own buffer. Users can query
the content of the trace buffers and the status of tracing through fixed-table views
(X$ tables).
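The wrap-around behavior can be illustrated with a minimal model (invented names; this is a concept sketch, not the KST implementation):

```python
# Minimal model of a per-process circular trace buffer: fixed capacity,
# and once full, each new record overwrites the oldest one. This shows
# why an undersized buffer silently loses the oldest trace data.

class TraceBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []
        self.seq = 0

    def trace(self, event, data):
        self.seq += 1
        self.records.append((self.seq, event, data))
        if len(self.records) > self.capacity:
            self.records.pop(0)            # wrap: drop the stale record

buf = TraceBuffer(capacity=3)
for i in range(5):
    buf.trace(10401, "msg %d" % i)
print([seq for seq, _, _ in buf.records])  # [3, 4, 5]: records 1-2 lost
```

The faster a process traces relative to the buffer's capacity, the sooner its earliest records are overwritten, which is why the text recommends sizing the buffer to the process's tracing rate.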
19-473
SQL statements
ALTER TRACING
ALTER SYSTEM SET
19-474
19-475
Parameter                  Class    Scope
trace_enabled              Dynamic  Global
_trace_archive             Dynamic  Global
_trace_events              Dynamic  Local
_trace_processes           Static   Local
_trace_buffers             Static   Local
_trace_file_size           Dynamic  Local
_trace_options             Static   Local
_trace_flush_processes     Static   Global
19-476
19-477
19-478
trace_enabled
_trace_archive
_trace_flush_processes
_trace_events
X$TRACE
EVENT, OP, TIME, SEQ#, SID, PID, DATA
19-479
19-480
19-481
19-482
19-483
DLM layer
IPC layer
Space management layer
Shared servers (MTS)
PQ module
Transaction layer
KST Performance
19-484
KST Performance
Tracing affects the overall performance of a system, regardless of the tracing mechanism
or design. The question is: how much performance degradation are users willing to accept
in exchange for better diagnosability of the system when a problem occurs?
In general, most customers are willing to accept about 5% to 10% as the trade-off between
diagnosability and system performance.
In version 9.0.1, CPU instruction cycles used by KST tracing were measured. Regardless
of whether a trace event is enabled or not, some extra cycles are used after global tracing is
enabled (trace_enabled is TRUE) because certain cycles are required to perform the
event checking.
When global tracing is enabled, 58 extra cycles are used for event checking of a disabled
event and 176 extra cycles are used for an enabled event.
An average of less than 3% overhead was found when the regression test for RAC was run
with all events enabled at level 6 or less. Note that only a few components use KST
tracing in Oracle9i, release 1. Tracing overhead increases as instrumentation is done in
more RDBMS components.
Note that tracing overhead is a function of instrumentation. The performance may vary in
different releases.
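Using the cycle counts quoted above, the overhead of a workload can be estimated with simple arithmetic (the call counts below are made up for illustration):

```python
# Back-of-the-envelope estimate using the measured 9.0.1 figures above:
# with global tracing on, a check of a disabled event costs 58 cycles
# and an enabled event costs 176 cycles.

DISABLED_CHECK_CYCLES = 58
ENABLED_EVENT_CYCLES = 176

def tracing_overhead(disabled_checks, enabled_events):
    return (disabled_checks * DISABLED_CHECK_CYCLES
            + enabled_events * ENABLED_EVENT_CYCLES)

# Hypothetical workload: one million disabled checks, 100,000 traces.
print(tracing_overhead(1_000_000, 100_000))  # 75600000 cycles
```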
DSI408: Real Application Clusters Internals I-484
KST: Examples
Sample instrumentation
Sample usage for KSTRC[0-6] in kju.c
Sample usage for KSTRCX in kjdd.c
Sample format callback in kji.c (kjdgtfmt)
19-485
KST: Examples
Following are code examples of KSTRC[0-6], KSTRCX, a formatting callback, and
kstdfcb. Note that the formatting callbacks should be registered in the notifier function
of the component.
kstdfcb
void kjinfy(nfytype, ctx)
  ub4    nfytype;
  dvoid *ctx;
{
  ...
  else if (nfytype == KSCNOPCR)
  {
    ...
    /* Register KST trace format callback */
    kstdfcb(KJDGTT_LKEVT, (KSTFPTR)kjdgtfmt);
    /* Register KST trace format callback for kjdd layer */
    kstdfcb(KJDGTT_DD, (KSTFPTR)kjddtfmt);
    /* Register KST trace format callback for IPC layer */
    kstdfcb(KJDGTT_IPC, (KSTFPTR)kjdgfmtipc);
    /* Register KST trace format callback for TRFC layer */
    kstdfcb(KJDGTT_TRFC, (KSTFPTR)kjdgfmttrfc);
  }
  ...
}
dd_invalid = FALSE;
lk_node = kjiudb->node_id_kjga;
*pp = KJSOSTRUC(kjsolfs(sghead), kjddsg, link_kjddsg);
*pp2; /* to check for duplicate locks in the wait for graph */
*qp;
dd_victim = FALSE;
trcctx;
19-489
Sample X$TRACE output (reconstructed from the slide):
5 0 10280  1 0x00000005
5 0 10401 28 KSXPUNMAP: client 1
5 0 10401 27 KSXPMAP: client 1 base 0x80048000 size 0x37b8000
5 4 10429  7 MB SO Al: Allocated MBSO 82b5eac4
5 4 10427 10 Init ctx: Initialize ksxp for 1 ports
5 4 10401 14 KSXPTIDCRE: tid(1,1,0x83bed2b6)
5 4 10429  2 AllocBuf: buf 824bf624, pool 800084b0, size 2080, out(i) 1, out(s) 0
5 4 10429  2 AllocBuf: buf 824bfe44, pool 800084b0, size 2080, out(i) 2, out(s) 0
5 4 10429  2 AllocBuf: buf 824c0664, pool 800084b0, size 2080, out(i) 3, out(s) 0
5 4 10429  2 AllocBuf: buf 824c0e84, pool 800084b0, size 2080, out(i) 4, out(s) 0
5 4 10429  2 AllocBuf: buf 824c16a4, pool 800084b0, size 2080, out(i) 5, out(s) 0
5 4 10429  2 AllocBuf: buf 824c1ec4, pool 800084b0, size 2080, out(i) 6, out(s) 0
5 4 10429  2 AllocBuf: buf 824c26e4, pool 800084b0, size 2080, out(i) 7, out(s) 0
KST Demonstration
19-490
KST Demonstration
Demonstration of the user interfaces for modifying the tracing behavior of the KST mechanism:
Initialization parameters
Alter tracing
Alter system set
X$TRACE and X$TRACE_EVENTS
DIAG Daemon
[Slide: two RAC instances, each with tracing processes writing to trace buffers in its SGA and a DIAG process; the DIAG processes of the instances communicate with each other.]
19-491
DIAG Daemon
The diagram in the slide shows the architecture of the DIAG daemon in a RAC
environment.
Note that there is a difference between DIAGs in RAC and those in a single instance,
although both processes have the same name:
DIAG in a single instance is responsible for trace archiving and flushing only.
DIAG in a RAC instance provides other diagnosability services, in addition to trace
archiving and flushing.
DIAG Daemon:
Is an integrated service for all the diagnosability
needs of an instance
Provides a scalable framework for RAC
diagnosability
Works independently from an instance
Relies only on services provided by underlying OS
Is a lightweight daemon process, one per instance
19-492
DIAG Daemon:
Is highly available and is tolerant of common
failures
Monitors the health of a local RAC instance
Coordinates the collection of diagnosability data
from all the nodes in a RAC server
Services clusterized ORADEBUG
Provides an extensible interface for future projects
19-493
19-494
Orthogonal to instance:
Does not use latches or locks
Does not use shared resources from the database
kernel
Does not affect the instance and is not affected by
the instance
Does not share the communication channel with
other processes
19-495
Communication model:
Based on the IPC service from the OSD layer
Owns unique IPC port and message protocol
Supports multicast messaging
Supports memory-mapped copy for large data
transfer
19-496
Master DIAG:
Coordinates message ordering
Coordinates DIAG group reconfiguration
Synchronizes all DIAG group communications
19-497
19-498
19-499
19-500
Summary
19-501
DIAG architecture
ORADEBUG
and Other Debugging Tools
Objectives
20-503
ORADEBUG
ORADEBUG is RAC-aware.
20-504
ORADEBUG
You can use the options -G or -R to execute ORADEBUG across instances.
-G means that the debugging data and results are written to the trace file of the
executing DIAG daemon at each participating instance.
-R means that the same data is returned to the initiating DIAG daemon, which then
writes it to its own trace file.
Flash Freeze
20-506
Flash Freeze
Flash freeze permits the freezing of an entire instance so that any of the normal
ORADEBUG dumps can be taken while the system state is not changing. Other instances
may time out or hang as a result of freezing one instance. Output from the flash freeze
commands (including ffstatus) is written to the alert log. When ffbegin is issued,
each process notification is put in the alert log, as is the response from each process.
Likewise, messages appear in the alert log for ffresumeinst.
Use the SETINST command to specify which instances to freeze; the default is the local
instance only.
LKDEBUG
20-507
LKDEBUG
Output is to the trace file (except for the help list).
SQL> oradebug lkdebug help
Usage:lkdebug [options]
   -l [r|p] <enqueue pointer>   Enqueue Object
   -r <resource pointer>        Resource Object
   -b <gcs shadow pointer>      GCS shadow Object
   -p <process id>              client pid
   -P <process pointer>         Process Object
   -O <i1> <i2> <types>         Oracle Format resname
   -a <res/lock/proc/pres>      all <res/lock/proc/pres> pointers
   -a <res> [<type>]            all <res> pointers by an optional type
   -a convlock                  all converting enqueue (pointers)
   -a convres                   all res ptr with converting enqueues
   -a name                      list all resource names
   -a hashcount                 list all resource hash bucket counts
   -t                           Traffic controller info
   -s                           summary of all enqueue types
   -k                           GES SGA summary info
NSDBX
20-508
NSDBX
Output is to the trace file (except for the help command).
SQL> oradebug nsdbx help
Usage:nsdbx [options]
   -h                                           Help
   -p <owner> <namespace> <key> <val> <nowait>  Publish a name-entry
   -d <owner> <namespace> <key> <nowait>        Delete a name-entry
   -q <namespace> <key>                         Query a namespace
   -an <namespace>                              Print all entries in namespace
   -ae                                          Print all entries
   -as                                          Print all namespaces
HANGANALYZE
20-509
HANGANALYZE
This is similar in intent to what is performed manually through system states.
The level is between 1 and 10. Level 3 is good for a first pass.
Lev.  Description
1,2   Only HANGANALYZE output, no process dump at all
3     Level 2 + Dump only processes thought to be in a hang (IN_HANG state)
4     Level 3 + Dump leaf nodes (blockers) in wait chains (LEAF, LEAF_NW, IGN_DMP state)
5     Level 4 + Dump all processes involved in wait chains (NLEAF state)
10    Dump all processes (IGN state)
Remember to use SETINST to make it a clusterwide hang analysis.
Summary
20-510
References
20-511
References
See Note 178683.1, Tracing GSD, SRVCTL, GSDCTL, and SVRCONFIG, for details
about tracing the RAC utilities.