No. TC0296
Nb of pages : 11
Date : 15-03-2002
TEMPORARY
PERMANENT
SUBJECT : COLLECTING INFORMATION IN CASE OF CPU CRASHES
This technical communication cancels and replaces the technical communication TC0057. It gives tips and tricks about CPU problems such as reboots, a stopped telephone application or the impossibility of connecting to the system.
OmniPCX 4400
COLLECTING INFORMATION IN CASE OF CPU CRASHES
CONTENTS
1. INTRODUCTION
2. TIPS AND TRICKS ABOUT CPU PROBLEMS
2.1. Who has initialised the reboot of the system?
2.1.1. Automatic shutdown
2.2. The telephone application stops
2.3. One CPU reboots continuously
2.3.1. Problem with the IO2 configuration
2.3.2. Problem with the IO2 board
2.3.3. Problem with OPS files
2.4. Database corruption
2.5. Checking the installation of the software version
2.6. Checking the V24 ports
2.7. Capturing the system information
3. HARDWARE INVESTIGATIONS
4.
4.9.
4.10. Configuration of the system
4.11. Type of CPU
4.12. References of the CPU
4.13. Detection of a memory corruption
4.14. Crashdump file
1.
INTRODUCTION
This technical communication gives tips and tricks about CPU problems such as reboots, a stopped telephone application or the impossibility of connecting to the system. If no solution is found, hardware checks must be carried out. If the solution is still not found, the Business Partner must open a Service Request of the observation sheet type. The observation sheet must contain the information listed at the end of this document. Many different log files are stored on the system, and each reset of the system replaces some of the previous log files with new ones. The log files relating to a specific problem must therefore be collected as soon as possible, before the information is lost.
2.
TIPS AND TRICKS ABOUT CPU PROBLEMS
2.1.
Who has initialised the reboot of the system?
2.1.1. Automatic shutdown
Open the text files /DHS3dyn/incid/incpbm.1, /DHS3dyn/incid/incpbm.2 and /DHS3dyn/incid/incpbm.3.
A line containing "mailsys asks 'shutdown &'" means that the software application launched the reboot itself. Nobody started the shutdown manually.
0005 Fri Jan 25 15:42:30 2002 mailsys asks 'echo `ps|wc -l` processes running&'
0006 Fri Jan 25 15:42:30 2002 mailsys asks 'shutdown &' at 25/01/02 15:42:30 ; S
0007 Fri Jan 25 15:42:31 2002 mailsys asks 'echo `ps|wc -l` processes running&'
In this case, the system's incident list also contains a gravity 0 incident that confirms the reboot. The reboot was started because the system detected a failure. The incident list may also show the reason for that failure.
Note
When analysing the incidents of the system, check that no filter is applied and that all the incidents are displayed. Even incidents with a lower priority can indicate the reason for the reboot.
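You can script the search for an automatic shutdown once the three incpbm files have been copied to a PC. The following Python sketch is only an illustration and is not an Alcatel tool. It assumes one record per line, in the format of the excerpt above. The function names, and the local folder passed to scan_incpbm_copies, are hypothetical.

```python
import re
from pathlib import Path

# Assumption: one incpbm record per line, e.g.
#   0006 Fri Jan 25 15:42:30 2002 mailsys asks 'shutdown &' at 25/01/02 15:42:30 ; S
SHUTDOWN_RE = re.compile(r"^(\d{4}) (.+?) mailsys asks 'shutdown &'")


def find_automatic_shutdowns(text: str) -> list[tuple[str, str]]:
    """Return (record number, timestamp) for each shutdown requested by mailsys."""
    hits = []
    for line in text.splitlines():
        match = SHUTDOWN_RE.match(line.strip())
        if match:
            hits.append((match.group(1), match.group(2)))
    return hits


def scan_incpbm_copies(folder: str) -> dict[str, list[tuple[str, str]]]:
    """Scan local copies of incpbm.1 to incpbm.3 kept in a (hypothetical) PC folder."""
    results = {}
    for name in ("incpbm.1", "incpbm.2", "incpbm.3"):
        path = Path(folder) / name
        if path.is_file():
            results[name] = find_automatic_shutdowns(path.read_text(errors="replace"))
    return results
```

An empty result only means that no line matched this pattern. It does not prove that the reboot was manual, so still check the gravity 0 incident in the incident list.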
2.2.
The telephone application stops
The telephone application has stopped but the system does not restart. There is no V24 or IP access to the system, and the screen on the console port remains blank. In this case, try to generate a crashdump manually. The crashdump is a copy of the complete memory of the system, and the crashdump file must be added to the observation sheet.
On the console port only, proceed as follows:
- start a "Capture Text" on the Windows terminal,
- to enter the kernel debugger, press Ctrl G then Ctrl O; the following message appears:
  ** Entry in the debugger requested,
  ** Press <Return> within 30 seconds to enter into the kernel debugger
- press the Return key immediately; the following message appears:
  --- Enter Debugger
  kdbg>
- to start the crashdump, enter:
  kdbg> ED
  This takes a few seconds.
- to reboot the system, enter:
  kdbg> ZZ
- add the "Capture Text" to the observation sheet together with the crashdump file; see 4.14 for how to extract the crashdump file.
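Some terminal programs need the raw codes of the control keys. Ctrl G and Ctrl O correspond to ASCII 0x07 and 0x0F, which is the letter code with only its five low bits kept. The Python sketch below computes these bytes and checks a saved "Capture Text" for the debugger banner. It does not communicate with the serial port, and the helper names are hypothetical.

```python
def ctrl(key: str) -> bytes:
    """Byte sent by a terminal for Ctrl+<key>: the ASCII letter code with only its 5 low bits."""
    return bytes([ord(key.upper()) & 0x1F])


# Ctrl G then Ctrl O: request to enter the kernel debugger (console port only).
DEBUGGER_REQUEST = ctrl("G") + ctrl("O")  # b"\x07\x0f"


def debugger_prompt_reached(capture: str) -> bool:
    """True if a saved Capture Text shows the debugger banner and the kdbg> prompt."""
    return "--- Enter Debugger" in capture and "kdbg>" in capture
```

If the capture contains only the "** Entry in the debugger requested" lines, Return was probably not pressed within the 30 seconds.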
2.3.
One CPU reboots continuously
2.3.1. Problem with the IO2 configuration
If there are IO2/IO2N boards, check that the management of these boards is correct and identical on the Main and Stand-By CPUs. After the CPU reboots, stop the telephone application from being launched. From software version C1.712 onwards, the following commands let you consult and modify the management of the IO2 boards even when the telephone application is stopped:
login : mtcl
a4400> RUNMAO
a4400> mgr
2.3.2. Problem with the IO2 board
When an IO2 board is present and managed, it has the switching role and takes the place of the IO1 embedded in the CPU. When the reboots cannot be explained, it is worth replacing it with an IO2N board as a test, if possible. Because the software is different, the system also reacts differently, which can provide information about the initial fault. Attach the result of this test to the observation sheet.
Note
The IO2N is only supported from certain software releases onwards. Please refer to the technical communication TC0192 - Installation procedure of IO2N boards. When an IO2 or IO2N board is installed with a CPU, the same type of board must be installed with the duplicated CPU (if any).
2.3.3. Problem with OPS files
Normally, the same OPS files are installed on both CPUs. If this is not the case and the PARA_MAO 1 field of the file hardware.mao differs, a permanent CPU reboot can occur with incident 2076 or 2070:
2076 = Size of a TEL's or remanent's region is not the same on main and stand-by CPU
or, for Releases 1.4/2.x:
2070 = Swapping mode is not the same on main and stand-by CPU
This may occur when new OPS files are installed on the Stand-By CPU and the size of the remanent data has changed. In this case, the telephone application must be completely stopped so that the new OPS files can be installed on both CPUs.
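A text search is enough to check quickly whether incidents 2070 or 2076 appear in a saved incident listing, for example the output of incvisu (see 4.3). This Python sketch assumes that the incident number appears in the listing as a standalone four-digit token. The exact layout of the incvisu output is not described here, so adapt the pattern if necessary. The names are hypothetical.

```python
import re

# Meanings as given in 2.3.3.
OPS_MISMATCH_INCIDENTS = {
    "2076": "Size of a TEL's or remanent's region is not the same on main and stand-by CPU",
    "2070": "Swapping mode is not the same on main and stand-by CPU (Releases 1.4/2.x)",
}

# Assumption: the incident number is a standalone token (not part of a longer number).
INCIDENT_RE = re.compile(r"\b(2070|2076)\b")


def ops_mismatch_hits(listing: str) -> list[tuple[int, str]]:
    """Return (line number, incident number) for each 2070/2076 found in the listing."""
    hits = []
    for lineno, line in enumerate(listing.splitlines(), start=1):
        for match in INCIDENT_RE.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits
```

A hit does not replace the comparison of the OPS files on both CPUs. It only indicates that this comparison is worth doing.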
2.4.
Database corruption
In case of database corruption, incidents are listed in the incidents file. You can also check the database with the following commands:
a4400> cd /DHS3data/mao
a4400> checkinitrem
If a database corruption is suspected, restore a save of the database. In recent versions, the database is by default saved automatically every day on the hard disk of the CPU. To restore a save of the database, use the following commands:
login : swinst
Option 4 : Select Save & Restore operations
Option 4 : Select Restore operations
Option 2 : Select Restore from cpu disk
Then choose the save file to restore among those present on the disk.
2.5.
Checking the installation of the software version
The integrity test checks that the software version has been correctly installed on the CPU. Use the following commands:
login : swinst
Option 8 : Select the menu Software identity display
Option 6 : Application software validity checking
Select the partition to be checked and check the result of the integrity test. If there is a problem (a message such as "Checksum incorrect" is displayed), the software must be loaded again.
2.6.
Checking the V24 ports
In case of a CPU3:
- if a modem or TA is connected to a V24 port with a login, it must not be configured to:
  - reply to commands with Hayes codes and use local echo (modem),
  - display a welcome menu (TA),
- check whether a client application dialogs with a V24 port of the CPU.
A loop can occur if the V24 port of the CPU is managed with a login, and the system could then reboot.
Trick
Use the following command to check the activity of the V24 ports:
a4400> sar -y 1 <nb>
(where <nb> = number of scans of the ports; example: 20)
The parameter 1 means one scan every second. This command gives the number of characters per second received and sent on the serial ports.
(410)xa004010> sar -y 1 20
Chorus xa004010 MiX V.3.2r4.1.5 r4.1.5 COMP-386 01/25/102
15:25:18 rawch/s canch/s outch/s rcvin/s xmtin/s mdmin/s
15:25:19       0       0     124       0     124       0
15:25:20       0       0      58       0      58       0
15:25:21       0       0      58       0      58       0
15:25:22       0       0      58       0      58       0
etc. up to 20 lines
Check the outch/s field: it indicates the number of characters per second sent on all the V24 ports.
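After a long capture (for example sar -y 1 300), you can extract the outch/s column on the PC. The Python sketch below assumes the layout of the example above: a header line containing outch/s, then one line per sample that starts with an HH:MM:SS time followed by integer counters. Lines that do not match this layout are ignored, such as the banner or an "Average" summary if one is printed. The function names are hypothetical.

```python
def outch_per_second(sar_output: str) -> list[int]:
    """Return the outch/s value of each sample in a captured 'sar -y' output."""
    header = None
    values = []
    for line in sar_output.splitlines():
        fields = line.split()
        if "outch/s" in fields:
            header = fields  # e.g. "15:25:18 rawch/s canch/s outch/s ..."
            continue
        # Sample lines: HH:MM:SS time followed by one integer per column.
        if (header is not None and len(fields) == len(header)
                and ":" in fields[0] and all(f.isdigit() for f in fields[1:])):
            values.append(int(fields[header.index("outch/s")]))
    return values


def average(values: list[int]) -> float:
    """Average of the samples (0.0 if there are none)."""
    return sum(values) / len(values) if values else 0.0
```

A high and steady outch/s while nobody is using the serial ports may point to the kind of V24 loop described above.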
2.7.
Capturing the system information
In case of frequent CPU problems with no explanation, start a continuous "Capture Text" on the Windows terminal of the PC connected to the console port. The system messages are not stored and are output only on this port, so this capture is the only record of them. Add this trace to the observation sheet.
3.
HARDWARE INVESTIGATIONS
When the software checks have been done and the problem is not fixed, the hardware items must be investigated, for example:
- CPU board: swap the CPU with another board,
- power supply,
- grounding connection,
- fault on the ET board (located in the bottom left-hand cabinets),
- power rectifier output too low,
- back panel,
- external environment, etc.
4.
Each time a CPU crash occurs and you cannot find the reason or provide a solution after the hardware has been checked, prepare an observation sheet with all the information described below.
4.1.
The observation sheet must describe the problem in detail and answer the following questions:
- Is the telephone application still running?
- Did the system reset?
- Did the system reset by itself?
- Did the system recover from the problem by itself?
- Did somebody do something manually to reboot the system?
- What is displayed on the UA sets during the problem?
- Is there a tone?
- Can you connect to the system during the problem?
- Is it a new installation?
- If this is a new problem on an old system, what has been modified on the installation?
- How often does the problem occur? When did it happen (date and time)?
- Does the OPS configuration conform to the customer requirements (traffic, virtual sets, etc.)?
4.2.
The observation sheet must list the investigations that were conducted on site: board swapping, item replacement, checks, etc. This information prevents the TS from asking for tests that were already done on site.
4.3.
The log files of the incidents are stored on the system as follows:
[Timeline, oldest to newest: reboot 2, reboot 1, last reboot, NOW]
Add the results of the following commands to the observation sheet:
a4400> incvisu
a4400> incvisu -1
a4400> incvisu -2
Check that all the incidents are displayed and that no incident is filtered.
4.4.
The log files of the exceptions are stored on the system as follows:
[Timeline, oldest to newest: reboot 2, reboot 1, last reboot, NOW]
Add the results of the following commands to the observation sheet:
a4400> excvisu
a4400> excvisu -1
a4400> excvisu -2
4.5.
The log files of the black boxes are stored in the directory /tmpd as follows:
[Timeline, oldest to newest: reboot 3, reboot 2, reboot 1, last reboot, NOW]
Add the results of the following commands to the observation sheet:
a4400> readbbox
a4400> readbbox -1
a4400> readbbox -2
a4400> readbbox -3
4.6.
The log files of the telephone application are stored in the directory /tmpd as follows:
[Timeline, oldest to newest: reboot 3, reboot 2, reboot 1, last reboot, NOW]
Add a copy of the following four text files to the observation sheet:
/tmpd/DHS3-INIT.log
/tmpd/DHS3-INIT.olog
/tmpd/DHS3-INIT.log2
/tmpd/DHS3-INIT.log3
4.7.
The log files of Chorus are stored in the directory /etc as follows:
[Timeline, oldest to newest: boot.log3, boot.log2, boot.log1, boot.log (most recent)]
Add the result of the following command to the observation sheet:
a4400> traceboot v
4.8.
The log files of the system are stored in the directory /DHS3dyn/incid as follows:
[Timeline, oldest to newest: incpbm.3, incpbm.2, incpbm.1, incpbm (most recent)]
Add a copy of the following three text files to the observation sheet:
/DHS3dyn/incid/incpbm.1
/DHS3dyn/incid/incpbm.2
/DHS3dyn/incid/incpbm.3
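Before sending the observation sheet, it can help to check that a copy of every file named in 4.6 and 4.8 has actually been collected. This Python sketch assumes that the copies were stored with their original file names in a single local folder on the PC. The folder and the function name are hypothetical.

```python
from pathlib import Path

# Files named in sections 4.6 and 4.8 (paths on the CPU).
REQUIRED_LOG_FILES = [
    "/tmpd/DHS3-INIT.log",
    "/tmpd/DHS3-INIT.olog",
    "/tmpd/DHS3-INIT.log2",
    "/tmpd/DHS3-INIT.log3",
    "/DHS3dyn/incid/incpbm.1",
    "/DHS3dyn/incid/incpbm.2",
    "/DHS3dyn/incid/incpbm.3",
]


def missing_copies(collected_dir: str) -> list[str]:
    """Return the required files that have no copy (same file name) in collected_dir."""
    folder = Path(collected_dir)
    return [p for p in REQUIRED_LOG_FILES if not (folder / Path(p).name).is_file()]
```

Matching on the file name alone works here because the seven names are all different.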
4.9.
For each shelf of each node, add the result of the following command to the observation sheet:
a4400> config x
(x = shelf number)
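You can gather the commands of sections 4.3 to 4.9 into one list and paste them at the a4400> prompt while a "Capture Text" is running. This Python sketch only builds and prints the command lines. You must supply the shelf numbers, because they are not known here, and run the list on each node. The function name is hypothetical.

```python
def collection_commands(shelves: list[int]) -> list[str]:
    """Commands whose results go into the observation sheet (sections 4.3 to 4.9)."""
    commands = [
        "incvisu", "incvisu -1", "incvisu -2",                    # 4.3 incidents
        "excvisu", "excvisu -1", "excvisu -2",                    # 4.4 exceptions
        "readbbox", "readbbox -1", "readbbox -2", "readbbox -3",  # 4.5 black boxes
        "traceboot v",                                            # 4.7 Chorus boot logs
    ]
    # 4.9: one "config x" per shelf of the node (shelf numbers supplied by the user).
    commands += [f"config {shelf}" for shelf in shelves]
    return commands


if __name__ == "__main__":
    for command in collection_commands([0, 1]):  # example shelf numbers only
        print("a4400> " + command)
```

The text files of 4.6 and 4.8 are not covered by this list, because they are copied rather than produced by a command.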