
Parallel Concurrent Processing

Mike Swing TruTek mswing@trutek.com RMOUG 2009


1

Conclusions
You don't need RAC to use Parallel Concurrent Processing (PCP)!
If you have PCP enabled, secondary nodes must be defined during the upgrade to R12.
Tuning TCP, SQL*Net and PMON parameters can minimize PCP failover time.
Implement failover-sensitive work shifts.

Concurrent Processing Server


Allows scheduling of batch jobs, or Requests in Oracle terms.
Processes concurrent programs as a Request.
Requests can be grouped together into Request Sets.
Different types of concurrent managers handle different types of requests.
A concurrent program can be assigned to a responsibility, and that responsibility can be assigned to users, giving them permission to run the concurrent program.
Concurrent managers may have limits on the concurrent programs that can be run, and the times that they can be started.
Requests have priorities, a status, and log and out files in the above directory.
3

Definitions
CP => Concurrent Processing
DCD => Dead Connection Detection
ICM => Internal Concurrent Manager
IM => Internal Monitor
CRM => Conflict Resolution Manager
PCP => Parallel Concurrent Processing
PMON => Process Monitor for ICM
4

Concurrent Request

Phase and Status of Concurrent Requests


Phase      Status      Description - Action
Pending    Normal      The request is waiting to be picked up by the next available manager.
Pending    Standby     Waiting for CRM to resolve conflict. CRM could be slow or an incompatible program is running.
Running    Normal      The request is running normally.
Completed  Normal      The request has finished successfully.
Completed  Error       The request has finished with an error. Check logs.
Completed  Warning     The request has finished with a warning. Check the logs.
Inactive   No Manager  Request won't run without a manager. Specialization rules aren't configured properly.
6

PCP Failover
[Diagram: DB node RH8 with the database and database listener; nodes RH7, RH8 and RH9, each running PCP with a SQL*Net client; sqlnet.ora.]

TCP_KEEPALIVE takes 240 seconds before issuing DCD

Concurrent Managers

Concurrent Managers
Manager Type                 Service Instance                      Program
Internal Concurrent Manager  Internal Manager                      FNDLIBR
Conflict Resolution Manager  Conflict Resolution Manager           FNDCRM
Internal Monitor             Internal Monitor: Node                FNDIMON
                             Service Manager: Node                 FNDSM
Concurrent Manager           Standard Manager                      FNDLIBR
Concurrent Manager           Inventory Manager                     INVLIBR
Concurrent Manager           Session History Cleanup               FNDLIBR
Concurrent Manager           PA Streamline Manager                 PALIBR
Transaction Manager          CRP Inquiry Manager                   CYQLIB
Transaction Manager          FastFormula Transaction Manager       FFTM
Transaction Manager          PO Document Approval Manager          POXCON
Transaction Manager          Transaction Manager                   FNDTMTST
                             Scheduler/Prerelease Manager          FNDSVC
                             OAM Generic Collection Service: Node  FNDSVC
9

Concurrent Processing
1. The Concurrent Processing server communicates with the database using Oracle SQL*Net.
2. The concurrent program log or output file from a request is passed back as a report to the Report Review Agent.
3. The Report Review Agent passes a file containing the entire report to the Forms server.
4. The Forms Services component passes the report back to the user's browser one page at a time. Profile options can be used to control the size of the files and pages passed, to suit report volume and available network capacity.

[Diagram: web browser (HTML interface, JInitiator Java interface), Web Server, Forms Server, Reports Server, SQL*Net, ICM (FNDLIBR), Service Manager (FNDSM), Internal Monitor (FNDIMON), Standard Manager (FNDLIBR), FNDCRM, Report Review Agent (.rdx), Requests, Log, Out.]

10

Internal Concurrent Manager


The Internal Concurrent Manager (ICM) starts, sets the number of active processes for, monitors, and terminates all other concurrent processes through requests made to the Service Manager, including restarting any failed processes. The ICM also starts, stops, and restarts the Service Manager for each node. The ICM will perform process migration during an instance or node failure. The ICM will be active on a single node. This is also true in a PCP environment, where the ICM will be active on at least one node at all times.
11

Internal Concurrent Manager


The ICM really does not have any scheduling responsibilities. It has NOTHING to do with scheduling requests, or deciding which manager will run a particular request. The function of the ICM is to run 'queue control' requests; requests to startup or shutdown other managers. The ICM is responsible for startup and shutdown of the whole concurrent processing facility, and it monitors the other managers periodically, and restarts them if they should go down. It can also take over the Conflict Resolution manager's job, and resolve incompatibilities. If the ICM itself should go down, requests will continue to run normally, except for 'queue control' requests. Restart the ICM with 'startmgr'; no need to kill the other managers first.
12

Internal Concurrent Manager

13

Service Manager
FNDSM process - Communicates with the Internal Concurrent Manager, Concurrent Manager, and non-Manager Service processes. The Service Manager (SM) spawns and terminates manager and service processes (these could be Forms or Apache Listeners, Metrics or Reports Servers, and any other process controlled through Generic Service Management). When the ICM terminates, the SM that resides on the same node as the ICM will also terminate. The SM is chained to the ICM. After termination, the SM will only reinitialize when there is a function it needs to perform (starting or stopping a process), so there may be periods of time when the SM is not active, and this is normal.

14

Service Manager
All processes initialized by the SM inherit the same environment as the SM. The SM's environment is set by the APPSORA.env file and the gsmstart.sh script. The apps_<sid> listener must be active on each CP node to support the SM connection to the local instance. There should be a Service Manager active on each node where a Concurrent or non-Manager service process will reside.
15

FNDSM Failure
FNDSM failover as noted in the concurrent manager log:
Could not contact Service Manager FNDSM_RH8_VIS. The TNS alias could not be located, the listener process on RH8 could not be contacted, or the listener failed to spawn the Service Manager process.
Found dead process: spid=(962754), cpid=(2259578), Service Instance=(1045)
CONC-SM TNS FAIL Call to PingProcess failed for WFMAILER
CONC-SM TNS FAIL Call to StopProcess failed for WFMAILER
CONC-SM TNS FAIL Call to PingProcess failed for FNDCPGSC
16

FNDSM Failover
Found dead process: spid=(716870), cpid=(2259580), Service Instance=(2009)
Found dead process: spid=(1442020), cpid=(2259579), Service Instance=(2010)
Starting WFMGSMD Concurrent Manager : 15-AUG-2008 13:28:56
Starting WFMGSMDB Concurrent Manager : 15-AUG-2008 13:28:56
Starting WFALSNRSVCB Concurrent Manager : 15-AUG-2008 13:28:57
Starting STANDARD Concurrent Manager : 15-AUG-2008 13:30:31
Starting Internal Concurrent Manager Concurrent Manager : 15-AUG-2008 13:30:32
17

Internal Monitor
(FNDIMON process) - Communicates with the Internal Concurrent Manager. This manager/service is used to implement Parallel Concurrent Processing. You do not need to run this manager/service unless you are using Parallel Concurrent Processing. The Internal Monitor (IM) monitors the Internal Concurrent Manager, and restarts any failed ICM on the local node. It monitors whether the ICM is still running, and if the ICM crashes, it will restart it on another node. During a node failure in a PCP environment, the IM will restart the ICM on a surviving node (multiple ICMs may be started on multiple nodes, but only the first ICM started will eventually remain active; all others will gracefully terminate). There should be an Internal Monitor defined on each node where the ICM may migrate.
18

Standard Manager
(FNDLIBR process) - Communicates with the Service Manager and any client application process. The Standard Manager is a worker process that initiates, and executes client requests on behalf of Applications batch, and OLTP clients.

19

Standard Manager

20

Standard Manager - OAM

The Standard Manager is active on RH9, even though no primary node is defined

Since no secondary node is defined, the Standard Manager will not fail over.
Failover Processes in the Work Shifts definition is the number of processes (3) that will run when the Standard Manager fails over to the secondary node.

21

Transaction Manager
A Transaction Manager communicates with the Service Manager, and any user process initiated on behalf of Forms, or a Standard Manager request.
A Transaction Manager:
Supports synchronous processing of requests from a client program.
Gets a request from a client program to run a server-side program synchronously.
Returns a status/results to the client program.
At runtime, it starts a number of these managers as defined.
Doesn't poll the concurrent request table for new requests.
Only 1 transaction manager is needed per database, not 1 per instance.
22

Transaction Managers

Some of the Transaction Managers in R12

23

Configuring Transaction Managers for RAC


R11i Transaction Managers use DBMS_PIPE
  This does not work across RAC instances
  RAC users must perform additional configuration
  Requires complicated configuration or additional hardware

R12 Transaction Managers use AQ
  Works across RAC instances
  Simplifies configuration
  Reduces complexity
  Profile Option can switch between mechanisms
  DBMS_PIPE can be used for non-RAC users if performance becomes an issue
24

Configuring Transaction Managers for RAC


Edit $ORACLE_HOME/dbs/<context_name>_ifile.ora and add these parameters:
_lm_global_posts=TRUE
_immediate_commit_propagation=TRUE

Change the profile option 'Concurrent: TM Transport Type' to 'QUEUE', and verify that the transaction manager works across the RAC instance.
ATG RUP3 (4334965) or higher provides an option to use AQs in place of Pipes.
Profile Concurrent:TM Transport Type: set to QUEUE.
Pipes are more efficient but require a Transaction Manager to be running on each DB instance.
Navigate to the Concurrent > Manager > Define screen, and set up the primary and secondary node names for transaction managers.
25

Configuring Transaction Managers for RAC


Transaction Managers allow a client to make a request for a program to be run on the server immediately. The client then waits for the program to complete and can receive program results from the server. As the client and server are two separate database sessions, the communication between them has been handled using the DBMS_PIPE package. Unfortunately, the DBMS_PIPE package does not extend to communications between sessions on different RAC instances. On an Applications instance using RAC, the client and server are very likely to be on different instances, causing transactions to time out for long periods or fail completely. The current workaround is to manually set up Transaction Managers to connect to all RAC instances, which not only takes up additional resources but may also require additional middle-tier hardware or a complicated configuration that is difficult to maintain.

26

R12 Transaction Managers


In R12, the Transaction Managers use the AQ mechanism, so they work on RAC when connected to either instance. This greatly simplifies the configuration and reduces the complexity for RAC administrators. A Profile Option has been introduced to allow users to switch between the two transports, DBMS_PIPE or AQ.

27

Concurrent:PCP Instance Check


Concurrent processing provides database instance-sensitive failover capabilities. When an instance is down, all managers connecting to it switch to a secondary middle-tier node. However, if you prefer to handle instance failover separately from such middle-tier failover (for example, using the TNS connection-time failover mechanism instead), use the profile option Concurrent:PCP Instance Check. When this profile option is set to OFF, Parallel Concurrent Processing will not provide database instance failover support; however, it will continue to provide middle-tier node failover support when a node goes down.
28

Conflict Resolution Manager


Concurrent managers read requests to start concurrent programs. The Conflict Resolution Manager checks concurrent program definitions for incompatibility rules. If a program is identified as Run Alone, then the Conflict Resolution Manager prevents the concurrent managers from starting other programs in the same conflict domain. When a program lists other programs as being incompatible with it, the Conflict Resolution Manager prevents the program from starting until any incompatible programs in the same domain have completed running. To enable/disable the Conflict Resolution Manager, use the system profile option 'Concurrent: Use ICM'. Setting this to 'No' (the default) allows the CRM to be started. Setting it to 'Yes' causes the CRM to be shut down, and the Internal Manager (ICM) will take over the conflict resolution duties. If the CRM will not start (it is started automatically by the ICM), check this profile option.
29

Conflict Resolution Manager


Use the system profile option 'Concurrent: Use ICM'. 'No' allows the CRM to be started.
Setting it to 'Yes' causes the CRM to shut down. The Internal Manager (ICM) will take over the conflict resolution duties. Using the ICM to resolve conflicts is not recommended: the CRM's sole purpose is to resolve conflicts, while the ICM has other functions to perform as well.
30

Generic Service Management


An E-Business Suite system depends on a variety of services, such as Forms Listeners, HTTP Servers, Concurrent Managers, and Workflow Mailers. These services are composed of one or more processes. In the past, many of these processes had to be individually started and monitored by system administrators. Management of these processes is complicated, since these services can be distributed across multiple host machines. The introduction of Generic Service Management in Release 11i helped simplify the management of these processes by providing a fault tolerant service framework and a central management console built into Oracle Applications Manager. Service Management is an extension of Concurrent Processing, and provides a framework for managing processes on multiple host machines. With Service Management, virtually any application tier service can be integrated into this framework. Patch 2221688 introduces GSM.
31

GSM

32

Generic Services

33

GSM and Multiple Nodes


GSM enables users to manage Applications services across multiple middle-tier nodes. This includes services on Web/Forms nodes that previously have had no concurrent processing footprint. Users configuring GSM in a multiple-node system should be sure to have followed the instructions for Parallel Concurrent Processing. This includes setting the environment variable APPLDCP=ON and assigning a primary node for all defined managers and services (if not already defined.)
34

Seeded GSM Services


When configuring GSM, the following GSM Services are seeded automatically:
Forms Listener
Metrics Server
Metrics Client
Reports Server
Apache Listener

Linux users should not activate the Reports Server under GSM.
35

Starting GSM
Apps Listener: listener.ora gsmstart.sh exec FNDSM

36

adcmctl.sh
adcmctl.sh calls:
startmgr.sh
batchmgr.sh
CONCSUB
FNDSVCRG

37

FNDSVCRG Service Controller Utility


FNDSVCRG is an executable introduced as a part of the Seeded GSM Services. It provides improved coordination between the GSM monitoring of these services and their command-line control scripts. The $FND_TOP/bin/FNDSVCRG executable is called from the adcmctl.sh control script before and after the script starts or stops the service. FNDSVCRG connects to the database using JDBC and validates the configuration of the Seeded GSM Service.
38

Verify GSM
To verify GSM is working, start the concurrent managers. Once GSM is enabled, the ICM uses Service Managers to start all concurrent managers and activated services. If the ICM is successfully starting the managers, then GSM has been configured properly. If managers and/or services fail to start, errors should appear in the ICM log file.
39

Service Manager Log


Each Service Manager maintains its own log file named FNDSMxxxx.mgr, located in the same directory as concurrent manager log files. If you cannot locate the Service Manager log file, it is likely that the Service Managers are not starting properly and there is a configuration issue that needs troubleshooting.
40

Kill FNDSM

Test: kill services and see if GSM restarts them.

applvis 9007 1 0 11:53 ? 00:00:00 FNDSM
applvis 9159 9155 0 11:55 ? 00:00:00 FNDLIBR
applvis 9161 5683 0 11:55 pts/3 00:00:00 grep FND
[applvis@rh9 scripts]$ kill -9 9007
[applvis@rh9 scripts]$ ps -ef |grep FND
applvis 9159 9155 0 11:55 ? 00:00:00 FNDLIBR
applvis 9169 1 0 11:55 ? 00:00:00 FNDSM
applvis 9249 5683 0 11:57 pts/3 00:00:00 grep FND

Kill FNDCRM
[applvis@rh9 scripts]$ ps -ef |grep FNDCRM
applvis 8886 1 0 11:52 ? 00:00:00 FNDCRM APPS/ZGA13053E1E1B7BA773417089054DA88F194EAC0D687728CC2551870E6B78C4B439 EADB287342795115A88DBC85788CCB4 FND FNDCRM N 10 c LOCK Y RH9 1302318
[applvis@rh9 scripts]$ kill -9 8886
[applvis@rh9 scripts]$ ps -ef |grep FNDCRM
applvis 9457 9392 0 12:09 ? 00:00:00 FNDCRM APPS/ZG26430816FA3570354BC57DE47FF105D145F8DE226EFE58CE04B416633DCB90126 7BFECFA7585114F7090060EFE1147BE FND FNDCRM N 10 c LOCK Y RH9 1302343

Both of these services were started before I could enter the grep command to find the corresponding process.
41

11i - Defining PCP Details

In Release 11i, the Secondary Node doesn't need to be filled in for failover to occur.

42

R12 PCP Details

In Release 12, failover won't occur if there is no Secondary Node defined.

43

R12 PCP Setup


The only manager set up to fail over is the Standard Manager

44

R12 Manager Failover

45

PCP Failover
[Diagram: DB node RH8 with the database and database listener; nodes RH7, RH8 and RH9, each running PCP with a SQL*Net client; sqlnet.ora.]

TCP_KEEPALIVE takes 240 seconds before issuing DCD

46

Parallel Concurrent Processing


Parallel concurrent processing allows distribution of concurrent managers across multiple nodes. The benefits are better performance, availability and scalability (load balancing). Parallel Concurrent Processing (PCP) is activated along with Generic Service Management (GSM); it cannot be activated independently of GSM. With parallel concurrent processing implemented with GSM, the Internal Concurrent Manager (ICM) tries to assign valid nodes for concurrent managers and other service instances.

47

Parallel Concurrent Processing


There should be only one ICM and CRM, at any given time, although the ICM and CRM could be configured to run on several of the nodes. Concurrent Managers migrate to the surviving node when one of the concurrent nodes goes down.

48

Parallel Concurrent Processing


[Diagram: a web browser (HTML interface, JInitiator Java interface) connected to the Web Server, Forms Server and Reports Server. Two concurrent processing nodes are drawn, each with an Internal Monitor (FNDIMON), ICM (FNDLIBR), Standard Manager (FNDLIBR), Service Manager (FNDSM), FNDCRM, Report Review Agent (.rdx), SQL*Net, Requests, Logs and Out files, connected to the database.]

What's wrong with this picture?


49

APPLDCP Profile Option


Starting with Release 11.5.10, FND.H, the APPLDCP environment variable is ignored. R12 GSM requires the value of APPLDCP to be set to ON. The value is hard-coded in afpcsq.lpc version 115.35, thereby ignoring the value of APPLDCP. As per ATG Development:
As of file "afpcsq.lpc" version 115.35 or higher, APPLDCP is internally hard-coded to "ON" when the Generic Service Management (GSM) is enabled--"keeping in mind, use of the GSM is required". In short, at "afpcsq.lpc" version 115.35 or higher with the GSM enabled, the setting of the APPLDCP environment variable is ignored--this is the "default behavior on all R12 releases." NOTE: As per ARU, "Patch 11i.FND.H" (3262159) and "Oracle Applications Release 11.5.10" (3140000) contains "afpcsq.lpc" version 115.37.

From Note: 753678.1

50

PCP Failover Mechanisms


TCP keepalive
PMON - ICM Process Monitor
Dead Connection Detection
Connection Failure Recovery
R12 10g Timeout Parameters (untested)
  sqlnet.inbound_connect_timeout (server)
  sqlnet.send_timeout (client and/or server)
  sqlnet.recv_timeout (client and/or server)
51

11i PCP Failure


TCP Failure
ICM Lock is released, FNDIMON pings ICM node; if ping fails, check PMON
PMON detects a dead process, crashed ICM
reviver.sh
DCD
52

R12 PCP Failure


TCP Failure
PMON detects a dead process
ICM Shutdown
  Look for error messages ORA-3113, ORA-3114 or ORA-1041
reviver.sh
DCD
53

Reviver
[Flowchart with two lanes, ICM and Reviver. ICM lane: Start, Receive Shutdown?, Lost DB Connection?, Spawn Reviver, Starts to Shutdown, Exit. Reviver lane: Attempt to Get DB Connection, Sleep (when the connection fails), Kill Previous DB Session, ICM Started?, Start ICM, Exit.]

From the CM log file:
The ICM has lost its database connection and is shutting down.
Spawning reviver process to restart the ICM when the database becomes available again.
Spawned reviver process 10910.

54

reviver.log
The ICM has lost its database connection and is shutting down. Spawning reviver process to restart the ICM when the database becomes available again. Spawned reviver process 10910.

55

TCP
TCP/IP is a connection-oriented protocol; TCP implements packet timeout and retransmission in an effort to guarantee the safe and sequenced order of data packets. If a timely acknowledgement is not received in response to the probe packet, the TCP/IP stack will retransmit the packet some number of times before timing out. After TCP/IP gives up, SQL*Net receives notification that the probe failed.
56

TCP Keepalive
At this time, client-side SQL*Net connections do not enable keepalive for TCP connections by default. However, it is possible to enable this by adding the ENABLE=BROKEN parameter to the SQL*Net connect string, for example in the TNS alias definition. **WARNING** Keepalive intervals are typically set to 2 hours or more (i.e., it can take more than 2 hours to notice a dead server even if keepalive is enabled). To make keepalive useful for PCP and TAF, the keepalive interval needs to be reduced to a smaller value (such as 2 minutes). If there are a lot of idle connections on your network, then reducing the keepalive interval can increase network traffic significantly.
57

ENABLE=BROKEN
Sample TNS alias to enable keepalive (notice the ENABLE=BROKEN clause):
VIS_BALANCE =
  (DESCRIPTION =
    (ENABLE=BROKEN)
    (ADDRESS_LIST =
      (LOAD_BALANCE = ON)
      (FAILOVER = ON)
      (ADDRESS = (PROTOCOL = TCP)(HOST = rh8)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = rh6)(PORT = 1521))))
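Hand-edited TNS aliases are easy to break with a dropped parenthesis. A minimal sketch of a sanity check, assuming the descriptor is available as a string; the function names and the SAMPLE text are illustrative only and are not part of any Oracle tool:

```python
def parens_balanced(descriptor):
    """Return True if every '(' has a matching ')' in the right order."""
    depth = 0
    for ch in descriptor:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0


def has_enable_broken(descriptor):
    """Return True if the descriptor carries the (ENABLE=BROKEN) clause."""
    compact = "".join(descriptor.split()).upper()
    return "(ENABLE=BROKEN)" in compact


# Illustrative descriptor using the host names from the slide.
SAMPLE = """
(DESCRIPTION =
  (ENABLE=BROKEN)
  (ADDRESS_LIST =
    (LOAD_BALANCE = ON)
    (FAILOVER = ON)
    (ADDRESS = (PROTOCOL = TCP)(HOST = rh8)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = rh6)(PORT = 1521))))
"""

if __name__ == "__main__":
    print(parens_balanced(SAMPLE), has_enable_broken(SAMPLE))
```

This only checks syntax-level balance and the presence of the clause; it says nothing about whether the listener hosts are reachable.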

58

TCP Keepalive
**WARNING** Keepalive intervals are typically set to 2 hours or more (i.e., it can take more than 2 hours to notice a dead server even if keepalive is enabled). To make keepalive useful for TAF, the keepalive interval would need to be reduced to a smaller value (such as 2 minutes). Note: 249213.1
59

TCP KeepAlive Parameters for Linux


tcp_keepalive_time: the time between the last data packet sent and the first keepalive probe
tcp_keepalive_intvl: the time between keepalive probes
tcp_keepalive_probes: the number of probes to be sent before declaring the connection dead

Default Settings
tcp_keepalive_time = 7200 seconds
tcp_keepalive_intvl = 75
tcp_keepalive_probes = 9

A total of 7875 seconds, or 2 hours 11 minutes and 15 seconds.
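The total above is the idle time before the first probe plus one interval per unanswered probe. A minimal sketch of that arithmetic; the function names are hypothetical, and the formula follows the slide's own sum rather than any kernel source:

```python
def keepalive_detection_seconds(keepalive_time, keepalive_intvl, keepalive_probes):
    """Seconds before TCP keepalive declares an idle connection dead.

    Idle time before the first probe, plus one interval for each
    unanswered probe (the same sum used on the slide).
    """
    return keepalive_time + keepalive_intvl * keepalive_probes


def as_hms(seconds):
    """Split a number of seconds into (hours, minutes, seconds)."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return hours, minutes, secs


if __name__ == "__main__":
    total = keepalive_detection_seconds(7200, 75, 9)
    print(total, as_hms(total))  # 7875 (2, 11, 15)
    # Reduced values used in the tests in this deck: 200 / 20 / 2
    print(keepalive_detection_seconds(200, 20, 2))  # 240
```

With the reduced values the result is 240 seconds, which matches the "TCP_KEEPALIVE takes 240 seconds" caption on the PCP Failover slide.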


60

TCP Keepalive
Initial Settings
tcp_keepalive_time = 200 secs
tcp_keepalive_intvl = 20
tcp_keepalive_probes = 2

After 200 seconds of no response, TCP sends the first of 2 probes, 20 seconds apart. TCP notifies SQL*Net of the failure, and SQL*Net removes the offending connection.
61

TCP Retries
tcp_retries1 (default: 3)
The number of times TCP will attempt to retransmit a packet on an established connection normally, without the extra effort of getting the network layers involved.
tcp_retries2 (default: 15)
The maximum number of times a TCP packet is retransmitted in established state before giving up.
tcp_syn_retries (default: 5)
The maximum number of times initial SYNs for an active TCP connection attempt will be retransmitted. The default value of 5 corresponds to approximately 180 seconds.
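The roughly 180-second figure for tcp_syn_retries can be approximated with exponential backoff. This is a sketch under stated assumptions: an initial retransmission timeout of 3 seconds that doubles after every attempt (newer Linux kernels start at 1 second, so the result depends on the kernel), and a hypothetical function name:

```python
def syn_connect_timeout(syn_retries, initial_rto=3.0):
    """Approximate seconds before a connect() attempt gives up.

    Assumes one initial SYN plus `syn_retries` retransmissions, with the
    wait doubling after every attempt. initial_rto=3.0 is an assumption.
    """
    return sum(initial_rto * 2 ** attempt for attempt in range(syn_retries + 1))


if __name__ == "__main__":
    print(syn_connect_timeout(5))  # 189.0, close to the quoted 180 seconds
    print(syn_connect_timeout(2))  # 21.0
```

Under the same assumptions, lowering tcp_syn_retries from 5 to 2 cuts the connection-attempt timeout from about three minutes to about 21 seconds.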
62

TCP Retries
Now let's consider changing the following TCP parameters from their default values:
tcp_retries1 = 2
tcp_retries2 = 2
tcp_syn_retries = 2

In this example, the time to initialize the PCP failover was an average of 8 seconds after changing these TCP parameters.
63

Disconnect TCP Connection from RH9


From the ICM log:
The Internal Concurrent Manager has encountered an error. Review concurrent manager log file for more detailed information. : 12-JAN-2009 15:22:55
Shutting down Internal Concurrent Manager : 12-JAN-2009 15:22:55
12-JAN-2009 15:22:55 The ICM has lost its database connection and is shutting down.
Spawning reviver process to restart the ICM when the database becomes available again.
Spawned reviver process 1541.
The VIS_0112@VIS internal concurrent manager has terminated with status 1 - giving up.
Found dead process: spid=(17963), cpid=(1302176), ORA pid=(26), manager=(0/1)
64

PMON & fnd_concurrent_queues

PMON updates the work_start column in the fnd_concurrent_queues table every 4 PMON cycles.

fdpsrp() (running_processes correction): ICM cannot obtain exclusive lock on FND_CONCURRENT_QUEUES
Oracle error code returned: 1
This message is information and does not indicate a problem with CP functionality.
remote call function (FNDIMON) 15-AUG-2008 10:06:02 - Function to call: PingProcess
65

PMON ICM Lock 11i


If the ICM lock is not available, FNDIMON will now ping the node of the ICM. If the ping succeeds, we conclude that the ICM is fine. What???? If the ping fails, we further check whether it has been over quesiz PMON cycles since the ICM updated the work_start column in fnd_concurrent_queues. If it has been more than four PMON cycles, we conclude that the ICM is dead.
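The ping and work_start checks described above can be written as a small decision function. This is only a sketch of the logic as stated on the slide, not FNDIMON's actual code; the function and parameter names are hypothetical:

```python
def icm_considered_dead(ping_succeeded, pmon_cycles_since_work_start,
                        threshold_cycles=4):
    """Return True if the described logic would conclude the ICM is dead.

    ping_succeeded: result of pinging the ICM's node.
    pmon_cycles_since_work_start: PMON cycles since the ICM last updated
        work_start in fnd_concurrent_queues.
    threshold_cycles: the slide uses four PMON cycles.
    """
    if ping_succeeded:
        # A reachable node is taken to mean a healthy ICM, even though the
        # ICM process itself could be down. This is the slide's objection.
        return False
    return pmon_cycles_since_work_start > threshold_cycles


if __name__ == "__main__":
    # Node reachable, ICM stale for 100 cycles: still treated as fine.
    print(icm_considered_dead(True, 100))
    # Node unreachable and work_start stale for 5 cycles: treated as dead.
    print(icm_considered_dead(False, 5))
```

The first call shows why the deck questions this logic: a crashed ICM on a node that still answers pings is never declared dead by this check.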
66

PMON found dead process


On RH9 the PMON found a dead process. The PMON takes about 1 second to run, then sleeps for 2 minutes:
Process monitor session started : 18-JAN-2009 21:46:05
Found dead process: spid=(16977), cpid=(1321475), Service Instance=(36543)
Process monitor session ended : 18-JAN-2009 21:46:06
The Internal Concurrent Manager has encountered an error. Review concurrent manager log file for more detailed information. : 18-JAN-2009 22:02:01
67

PMON node RH9 is down


From the ICM log:
Process monitor session started : 12-JAN-2009 15:18:27
Internal Concurrent Manager found node RH9 to be down. Adding it to the list of unavailable nodes.
CONC-SM TNS FAIL Call to PingProcess failed for XDPCTRLS
68

PMON
Process monitor session started : 18-JAN-2009 22:38:57
CONC-SM TNS FAIL Call to PingProcess failed for OAMGCS
18-JAN-2009 22:38:58 - Node:(RH7), Service Manager:(FNDSM_RH7_VIS) currently unreachable by TNS
Found dead process: spid=(11234), cpid=(1321563), ORA pid=(167), manager=(0/4)

Process monitor session ended : 18-JAN-2009 22:38:58


69

PMON
Shutting down Internal Concurrent Manager : 18-JAN-2009 22:02:01
18-JAN-2009 22:02:01 The ICM has lost its database connection and is shutting down.
Spawning reviver process to restart the ICM when the database becomes available again.
Spawned reviver process 10910.

70

PMON runs every 2 minutes


Process monitor session ended : 18-JAN-2009 21:49:05
Process monitor session started : 18-JAN-2009 21:51:05
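The two-minute interval can be checked directly from the log timestamps. A minimal sketch, assuming the timestamps use the day-month-year layout shown in the log and an English locale for the month abbreviation; the helper name is hypothetical:

```python
from datetime import datetime

# Timestamp layout used in the concurrent manager log lines above.
LOG_TIME_FORMAT = "%d-%b-%Y %H:%M:%S"


def seconds_between(ended, started):
    """Seconds from one PMON session ending to the next one starting."""
    end = datetime.strptime(ended, LOG_TIME_FORMAT)
    start = datetime.strptime(started, LOG_TIME_FORMAT)
    return (start - end).total_seconds()


if __name__ == "__main__":
    gap = seconds_between("18-JAN-2009 21:49:05", "18-JAN-2009 21:51:05")
    print(gap)  # 120.0
```

A 120-second gap is what a PMON sleep of 30 seconds and 4 PMON cycles would give, if the interval is taken to be sleep time multiplied by cycles; that relationship is an assumption here, not something the log states.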

71

Edit ICM Runtime Parameters

72

Edit PMON Parameters

73

Edit PMON Parameters

ICM parameters are read from batchmgr.sh when adcmctl.sh runs. Changing these parameters here does not change batchmgr.sh!

74

$FND_TOP/bin/batchmgr.sh
Make sure the PMON changes are made in the $FND_TOP/bin/batchmgr.sh file.

# FILENAME
#   batchmgr
# DESCRIPTION
#   fire up Internal Concurrent Manager process
# USAGE
#   batchmgr arg1=val1 arg2=val2 ...
#
#   Parameters may be sent via the environment.
#
# ARGUMENTS                                    DEFAULT
#   [appmgr|sysmgr]=username/password
#   [sleep=sleep_seconds]                      15
#   [mgrname=manager_name]                     icm
#   [logfile=log_filename]                     $FND_TOP/$APPLLOG/$mgrname.mgr
#   [restart=N|mim minutes between restarts]   N
#   [mailto="user1 user2..."]                  current user
#   [PRINTER=printer_name]
#   [pmon=iterations]                          4
#   [quesiz=pmon_iterations]                   1
#   [diag=Y|N]                                 N

75

Reviver
[Flowchart with two lanes, ICM and Reviver. ICM lane: Start, Receive Shutdown?, Lost DB Connection?, Spawn Reviver, Starts to Shutdown, Exit. Reviver lane: Attempt to Get DB Connection, Sleep (when the connection fails), Kill Previous DB Session, ICM Started?, Start ICM, Exit.]

From the CM log file:
The ICM has lost its database connection and is shutting down.
Spawning reviver process to restart the ICM when the database becomes available again.
Spawned reviver process 10910.

76

reviver.log
reviver.sh starting up...
[ Mon Jan 12 20:02:15 MST 2009 ] - Read APPS username/password.
[ Mon Jan 12 20:02:45 MST 2009 ] - Attempting database connection...
[ Mon Jan 12 20:02:45 MST 2009 ] - Successful database connection.
[ Mon Jan 12 20:02:45 MST 2009 ] - Killing previous ICM session...
1 row updated.
Commit complete.
[ Mon Jan 12 20:02:45 MST 2009 ] - Looking for a running ICM process...
[ Mon Jan 12 20:02:45 MST 2009 ] - ICM now running, reviver.sh complete.

77

reviver.sh
reviver.sh code summary
Sleep 30
Test_connection
Kill_old_icm
  Get session
  Alter system kill session
Check_running_icm
  Fnd_conc.ecm_alive
start_icm
  startmgr.sh
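The summary above reads as a retry loop: sleep, test the connection, clean up the old session, then start the ICM only if it is not already running. A minimal Python sketch of that control flow, not the actual reviver.sh; every callable passed in is a hypothetical stand-in for the corresponding step:

```python
import time


def revive_icm(can_connect, kill_old_session, icm_running, start_icm,
               retry_seconds=30, sleep=time.sleep):
    """Sketch of the reviver control flow described in the summary.

    can_connect, kill_old_session, icm_running and start_icm are
    caller-supplied callables standing in for Test_connection,
    Kill_old_icm, Check_running_icm and start_icm.
    """
    while True:
        sleep(retry_seconds)   # "Sleep 30"
        if can_connect():      # database is reachable again
            break
    kill_old_session()         # clear the previous ICM database session
    if icm_running():
        return "already running"
    start_icm()                # the real script calls startmgr.sh here
    return "started"


if __name__ == "__main__":
    answers = iter([False, True])  # first attempt fails, second succeeds
    outcome = revive_icm(
        can_connect=lambda: next(answers),
        kill_old_session=lambda: print("killing previous ICM session"),
        icm_running=lambda: False,
        start_icm=lambda: print("starting ICM"),
        sleep=lambda seconds: None,  # skip real waiting in the demo
    )
    print(outcome)
```

Passing the steps in as callables keeps the sketch free of any database or shell calls, so the retry logic can be exercised on its own.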
78

Dead Connection Detection


Dead Connection Detection (DCD) is a feature of SQL*Net 2.1 and later, including Oracle Net8. DCD detects when a partner in a SQL*Net V2 client/server or server/server connection has terminated unexpectedly, and releases the resources associated with it.

79

Implement DCD
Implement by adding SQLNET.EXPIRE_TIME = 1 (minutes) to the sqlnet.ora file.
If the connection is idle for the time interval specified in minutes by the SQLNET.EXPIRE_TIME parameter, the server-side process sends a small 10-byte packet to the client. The packet is sent using TCP/IP.
80

DCD ICM Lock


ICM and IM can use the DCD functionality of the Network (TCP sqlnet). ICM is a client process connected to a DCD enabled DB dedicated server process. ICM holds the named PL/SQL Lock, the ICM lock. IM is continuously trying to check whether it can get the same named PL/SQL Lock.
81

DCD ICM Lock


As soon as the ICM lock is released by the DB / DCD, FNDIMON pings the ICM node, and the IM deduces that the ICM has crashed.
If the ping succeeds, we conclude that the ICM is fine.
Obviously, the ICM can be down even if TCP is working; this is bad logic.

If the ping fails, FNDIMON determines whether it has been over four PMON cycles since the ICM updated the work_start column in fnd_concurrent_queues. If it has been more than four PMON cycles, FNDIMON concludes the ICM is dead.

DCD comes into the picture here, after the ICM has crashed and the DB needs to identify that the ICM is gone. The DB needs to clean up the dedicated server process resources corresponding to the ICM client process.
82

FNDIMON has the ICM Lock


Check if the ICM updated the work_start column in fnd_concurrent_queues.

Be aware that if a TCP failure is not detected, failover will not occur. The following excerpt from a concurrent manager log shows this:
fdpsrp() (running_processes correction): ICM cannot obtain exclusive lock on FND_CONCURRENT_QUEUES
Oracle error code returned: 1
This message is information and does not indicate a problem with CP functionality.
remote call function (FNDIMON) 15-AUG-2008 10:06:02 - Function to call: PingProcess

The PingProcess continues until the CP processes resume, or a TCP failure is detected, and failover is begun.

83

11i PCP Failure


TCP Failure
ICM Lock is released, FNDIMON pings ICM node; if ping fails, check PMON
PMON detects a dead process, crashed ICM
reviver.sh
DCD
84

R12 PCP Failure


TCP Failure
PMON detects a dead process
ICM Shutdown
Look for error messages ORA-3113, ORA-3114 or ORA-1041
reviver.sh
DCD
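Searching a manager log for these three error codes is easy to script. The Python sketch below is an illustration only: the sample log text is invented, and the pattern assumes the codes may appear zero-padded (for example ORA-03113) as well as in the short form used on the slide.

```python
import re

# Matches ORA-3113, ORA-3114 and ORA-1041, with or without zero padding.
ORA_PATTERN = re.compile(r"ORA-0*(?:3113|3114|1041)\b")

def find_connection_errors(log_text):
    """Return (line_number, matched_code) pairs for the three error codes."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        for match in ORA_PATTERN.finditer(line):
            hits.append((number, match.group(0)))
    return hits

# Hypothetical log text, for illustration only.
sample_log = """Process monitor session started
ORA-03113: end-of-file on communication channel
Routine returned ORA-1041
ORA-31130 is unrelated and must not match
"""

print(find_connection_errors(sample_log))
# [(2, 'ORA-03113'), (3, 'ORA-1041')]
```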
85

Test PCP Failover Parameters


Tests to explore the effect of the DCD, PMON and TCP failover methods. Variables: sqlnet.expire_time, PMON sleep time and number of cycles, and the following TCP parameters:
tcp_keepalive_time, tcp_keepalive_intvl, tcp_keepalive_probes
tcp_retries1 (default: 3, new value: 2)
tcp_retries2 (default: 15, new value: 2)
tcp_syn_retries (default: 5, new value: 2)
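On Linux, an idle connection to a dead peer is dropped after roughly tcp_keepalive_time + tcp_keepalive_intvl x tcp_keepalive_probes seconds. The Python sketch below applies that arithmetic to the keepalive values used in these tests (200/20/2 and 1000/60/10), assuming they are in seconds. It covers idle connections only; a connection with unacknowledged data in flight is governed by tcp_retries2 instead.

```python
def keepalive_detection_seconds(keepalive_time, keepalive_intvl, keepalive_probes):
    """Approximate time for TCP keepalive to declare an idle peer dead."""
    return keepalive_time + keepalive_intvl * keepalive_probes

print(keepalive_detection_seconds(200, 20, 2))     # 240
print(keepalive_detection_seconds(1000, 60, 10))   # 1600
print(keepalive_detection_seconds(7200, 75, 9))    # 7875 (common Linux defaults)
```

With the defaults, a dead peer on an idle connection goes unnoticed for over two hours, which is why the keepalive values are lowered in the tests.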
86

Failover Test Results


KA = TCP keepalive; -- = value not available

Failover / Failback    Expire_time  PMON sleep  PMON cycles  KA time  KA intvl  KA probes  tcp_retries1  tcp_retries2  tcp_syn_retries
241 secs / --          1 minute     30 secs     4            200      20        2          3             15            5
250 secs / 50 secs     5 minutes    30 secs     4            200      20        2          3             15            5
262 secs / 100 secs    10 minutes   30 secs     --           200      20        --         --            15            --
300 secs / 75 secs     1 minute     15 secs     --           200      20        --         --            15            --
285 secs / 35 min      10 minutes   30 secs     4            1000     60        10         3             15            5
8 secs / 105 secs      1 minute     30 secs     4            1000     60        10         2             2             2
10 secs / 42 secs      1 minute     30 secs     4            200      20        2          2             2             2
7 secs / 40 secs       10 minutes   30 secs     4            200      20        2          2             2             2
6 secs / 34 secs       1 minute     15 secs     2            200      20        2          2             2             2

87

All Services are UP

88

Concurrent Managers

Processes: Actual = 1 and Target = 1, the manager is running.
Processes: Actual = 0 and Target = 1, the manager is not running.
89

Actual Processes = 0

Example of Actual Processes = 0: here the CRM is not running.

90

PCP Setup

PCP setup. This screen is continued on the next slide.


91

Primary and Secondary Nodes


Any concurrent programs not assigned to the Standard Manager will not fail over.
The CRM, ICM and Standard Manager will fail over.

92

TCP Failure

TCP was disconnected at 2:57:25. Ten seconds after the TCP connection was pulled, OAM reported the status above: it took 10 seconds for OAM to register a failure of services on RH9.
93

CRM is DOWN

If any of the subordinate services fails, the failure rolls up to the Dashboard.

94

CRM Failure

CRM has failed, Actual Processes = 0

95

PCP Failover from RH9 to RH7

Adding Node:(RH9), to unavailable list
Found dead process: spid=(9696), cpid=(1321449), ORA pid=(80), manager=(0/0)
Found dead process: spid=(9784), cpid=(1321458), ORA pid=(114), manager=(0/0)
Found dead process: spid=(9783), cpid=(1321457), ORA pid=(104), manager=(0/0)
Found running request 4413565 attached to dead manager process.
Attempting to restart request.
Internal Concurrent Manager found node RH9 to be down. Adding it to the list of unavailable nodes.
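When a failover produces many of these entries, pulling the process IDs out of the log helps with cross-checking against the OS and the database. The Python sketch below assumes the exact "Found dead process" wording shown in this log; the function name is illustrative and the pattern has not been checked against other releases.

```python
import re

DEAD_PROCESS = re.compile(
    r"Found dead process: spid=\((\d+)\), cpid=\((\d+)\), ORA pid=\((\d+)\)"
)

def dead_processes(log_text):
    """Return one dict per 'Found dead process' entry in the log text."""
    return [
        {"spid": int(m.group(1)), "cpid": int(m.group(2)), "ora_pid": int(m.group(3))}
        for m in DEAD_PROCESS.finditer(log_text)
    ]

log_text = (
    "Adding Node:(RH9), to unavailable list "
    "Found dead process: spid=(9696), cpid=(1321449), ORA pid=(80), manager=(0/0) "
    "Found dead process: spid=(9784), cpid=(1321458), ORA pid=(114), manager=(0/0) "
    "Found dead process: spid=(9783), cpid=(1321457), ORA pid=(104), manager=(0/0)"
)

for entry in dead_processes(log_text):
    print(entry)
```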

96

GSM tries to restart the services


TCP and TNS are unavailable:
Starting STANDARD Concurrent Manager : 18-JAN-2009 21:43:42
CONC-SM TNS FAIL
Routine AFPEIM encountered an error while starting concurrent manager STANDARD with library /d01/oracle/VIS/apps/apps_st/appl/fnd/12.0.0/bin/FNDLIBR.
Check that your system has enough resources to start a concurrent manager process. Contac : 18-JAN-2009 21:43:42
Starting STANDARD Concurrent Manager : 18-JAN-2009 21:43:42
CONC-SM TNS FAIL
Routine AFPEIM encountered an error while starting concurrent manager STANDARD with library /d01/oracle/VIS/apps/apps_st/appl/fnd/12.0.0/bin/FNDLIBR.
Check that your system has enough resources to start a concurrent manager process. Contac : 18-JAN-2009 21:43:42
Starting STANDARD Concurrent Manager : 18-JAN-2009 21:43:42
CONC-SM TNS FAIL
Routine AFPEIM encountered an error while starting concurrent manager STANDARD with library /d01/oracle/VIS/apps/apps_st/appl/fnd/12.0.0/bin/FNDLIBR.

97

ICM and CRM are DOWN

98

RH9 is DOWN

Not really down, just not on the network

99

PCP is DOWN

This is momentary as GSM figures out what to do

100

Failover to Secondary Node

The ICM and CRM failed over to RH7 in about 1 minute and 30 seconds

101

Failover from RH9 to RH7


Starting Internal Concurrent Manager Concurrent Manager : 18-JAN-2009 21:51:23
: Started ICM on Target RH7.
Process monitor session ended : 18-JAN-2009 21:52:53
: Migration of ICM has completed.
Shutting down Internal Concurrent Manager : 18-JAN-2009 21:53:23
The VIS_0118@VIS internal concurrent manager has terminated successfully - exiting.
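The elapsed times can be read straight from the three timestamps in this log. A small Python sketch of the arithmetic follows; the helper name is illustrative, and the month table avoids depending on the system locale.

```python
from datetime import datetime

MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}

def parse_log_time(text):
    """Parse a timestamp such as '18-JAN-2009 21:51:23'."""
    date_part, time_part = text.split()
    day, month, year = date_part.split("-")
    hour, minute, second = (int(part) for part in time_part.split(":"))
    return datetime(int(year), MONTHS[month.upper()], int(day), hour, minute, second)

started = parse_log_time("18-JAN-2009 21:51:23")    # ICM starting on RH7
migrated = parse_log_time("18-JAN-2009 21:52:53")   # process monitor session ended
shutdown = parse_log_time("18-JAN-2009 21:53:23")   # shutting down the ICM

print((migrated - started).total_seconds())   # 90.0
print((shutdown - started).total_seconds())   # 120.0
```

The 90 seconds between start and completed migration matches the "about 1 minute and 30 seconds" reported for the ICM and CRM failover.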
102

ICM Failover to RH7


Starting Internal Concurrent Manager Concurrent Manager : 18-JAN-2009 21:51:23
: Started ICM on Target RH7.
Process monitor session ended : 18-JAN-2009 21:52:53
: Migration of ICM has completed.
Shutting down Internal Concurrent Manager : 18-JAN-2009 21:53:23
The VIS_0118@VIS internal concurrent manager has terminated successfully - exiting.
103

RH9 not available

104

Request Failover

105

Standard Manager Failover Configuration

Note that the Inventory Manager, MRP Manager and OAM Metrics Collection Manager are not set up to fail over.
106

Managers with a Secondary Node

Note that the Inventory Manager, MRP Manager and OAM Metrics Collection Manager are not set up to fail over.
107

Failback

Failback: TCP was reconnected at 31:40. The host, RH9, became available in OAM about 2 minutes later.
108

RH9 available

109

ICM Failback

110

Concurrent Manager Log


Starting Internal Concurrent Manager Concurrent Manager : 18-JAN-2009 22:53:33
: Started ICM on Target RH9.
Process monitor session ended : 18-JAN-2009 22:55:03
: Migration of ICM has completed.
Shutting down Internal Concurrent Manager : 18-JAN-2009 22:55:33
The VIS_0118@VIS internal concurrent manager has terminated successfully - exiting.
111

112

Failback Complete

Total Failback Time 3 minutes and 45 seconds


113

Standard Manager before Failover

The Standard Manager has 3 Actual and Target processes.

114

Standard Manager is DOWN

115

Standard Manager has 2 Processes on Failover

After 3 minutes and 30 seconds the Standard Manager started on RH7


116

Shutdown of CP

117

Concurrent Processing Load Balancing


Two types of load balancing:
Load balancing with both nodes running (no failover)
Load balancing during failover

118

PCP Load Balancing


Parallel Concurrent Processing provides these benefits:
Failover in case of node failure
Maintained throughput, keeping the business running during node failures

When a node fails, the processes that were running on the failed node are restarted on secondary nodes. However, a resource-intensive node may overload its secondary node when it fails over.
119

PCP Load Balancing


If too many processes are already running on the secondary node when the primary node fails over, the secondary node may not have the capacity to process the requests from the additional concurrent managers. R12 introduces Failover Sensitive Workshifts. This enhancement lets the System Administrator configure how many processes fail over for each workshift. With this added control, System Administrators get the benefits of PCP failover without risking performance problems from overloaded resources.
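The effect of a per-workshift failover limit can be shown with simple counting. In the Python sketch below every number is invented for illustration, and the function is not part of Oracle Applications. It only compares how many manager processes land on the surviving node with and without a reduced failover count.

```python
def secondary_node_processes(own_processes, migrating):
    """Process totals on the surviving node after a failover.

    migrating: list of (normal_processes, failover_processes) pairs, one per
    manager that moves over from the failed node.
    Returns (total_without_limit, total_with_failover_limit).
    """
    without_limit = own_processes + sum(normal for normal, _ in migrating)
    with_limit = own_processes + sum(failover for _, failover in migrating)
    return without_limit, with_limit

# Hypothetical example: 10 local processes, two managers migrating.
print(secondary_node_processes(10, [(8, 4), (3, 1)]))  # (21, 15)
```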

120

R12 Failover Sensitive Workshifts

121

Failover Sensitive Workshifts

122

Failover Sensitive Workshifts

Conversely, if a failover occurs from node 1 to node 2, we may want to reduce the number of failover processes; however, this doesn't work. The failover processes setting takes effect only if the node fails.
123

Failover Processes

The PO Document Approval Manager and the Standard Manager will reduce their number of processes when RH7 fails. When RH9 fails, the number of failover processes for managers that run on RH7 is not reduced.

124

Failover Sensitive Workshifts


It's clear: to run an R11i or R12 system during a failover, there are two choices:
Run the servers at 35% or less utilization
Reduce the number of processes that are allowed during failover
For most businesses the second option is the most practical.
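A rough calculation shows why both choices leave the surviving server with headroom. The Python sketch below assumes two nodes of equal capacity and that load simply adds up when one fails. Both are simplifications, and the 60% figures and the 0.5 share are invented for illustration.

```python
def survivor_utilization(survivor_pct, failed_pct, failover_share=1.0):
    """Utilization of the surviving node after it absorbs work from the failed one.

    failover_share is the fraction of the failed node's load allowed to move
    (1.0 = everything, 0.5 = half the processes).
    """
    return survivor_pct + failed_pct * failover_share

print(survivor_utilization(35, 35))        # 70.0  -> choice 1: still under capacity
print(survivor_utilization(60, 60))        # 120.0 -> overloaded if everything moves
print(survivor_utilization(60, 60, 0.5))   # 90.0  -> choice 2: fewer failover processes
```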
125

References
249213.1 - Performance problems with Failover when TCP Network goes down
364171.1 - TAF Session Hangs, Select Fails To Complete W/ Loss Of NIC: Tune TCP Keepalive
211362.1 - Process Monitor Session Cycle Repeats Too Frequently
291201.1 - How To Remove a Dead Connection to the Target Database
362135.1 - Configuring Oracle Applications Release 11i with Oracle10g Release 2 Real Application Clusters and Automatic Storage Management
Optimizing the E-Business Suite with Real Application Clusters (RAC) - Ahmed Alomari
240818.1 - Concurrent Processing: Transaction Manager Setup and Configuration Requirement in an 11i RAC Environment
R12 ATG - Concurrent Processing Functional Overview - Aaron Weisberg
210062.1 - Generic Service Management (GSM) in Oracle Applications 11i
271090.1 - Parallel Concurrent Processing Failover/Failback Expectations
241370.1 - Concurrent Manager Setup and Configuration Requirements in an 11i RAC Environment
602899.1 - Some More Facts On How to Activate Parallel Concurrent Processing

126
