3PAR CTO Site Trouble Rev7

3PAR CTO site Trouble-shooting guide rev 7

Purpose:
The purpose of this document is to provide guidelines to the 3PAR 1st level test technicians for troubleshooting failures that occur at the WW CTO sites. It is not intended to cover all troubleshooting levels. Products covered are 3PAR F/T/V-Class products. The document is structured by reported failure mode and lists probable causes and corrective actions for each failure mode.

Troubleshooting approach:
1. Ensure the unit remains in troubleshooting mode while performing troubleshooting steps.
2. Use troubleshooting notes to capture all actions performed & results observed.
3. Log repair actions via the GUIDE interface.

Failure modes:
*************************************************************************************
Failure mode: Timed out waiting for Whack

Failure mode details:
ERROR Timed out waiting for Whack

Failure mode explanation:
A terminal server connection to the Node has been established & the test is waiting for a prompt (whack) from the Node, which confirms the Node is up & ready to communicate with the test system. If the Node does not provide a whack prompt within 2 minutes, the test will fail. Either the serial connection from the terminal server to the Node is not working or the Node itself is not working.

Probable causes & corrective actions:
1. Check the test serial cables are plugged into all Nodes in the configuration.
2. Check the test serial cables are fully seated in each Node serial port.
3. Check the correct test serial cables are connected to the Nodes.
4. Check the Node is powered on.
5. If the error persists, contact the test lead to troubleshoot further.
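
The wait described in the explanation above can be pictured as a polling loop with a deadline. The following is only an illustrative sketch, not the actual test code: the function name, the read_chunk callback and the prompt string "Whack>" are assumptions made for this example; only the 2-minute limit comes from this guide.

```python
import time
from typing import Callable


def wait_for_prompt(
    read_chunk: Callable[[], str],
    prompt: str,
    timeout_s: float = 120.0,          # 2-minute limit stated in this guide
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    poll_s: float = 1.0,
) -> bool:
    """Return True if `prompt` appears in the serial output before the deadline.

    `read_chunk` is a hypothetical callback that returns whatever text has
    arrived on the terminal server connection since the last call.
    """
    deadline = clock() + timeout_s
    buffer = ""
    while clock() < deadline:
        buffer += read_chunk()
        if prompt in buffer:           # prompt may arrive split across chunks
            return True
        sleep(poll_s)
    return False                       # corresponds to "Timed out waiting for Whack"
```

A loop of this kind cannot tell a dead serial path from a dead Node: both simply produce no prompt, which is why the corrective actions check cabling first and Node power second.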
*************************************************************************************
Failure mode: Failed to telnet to TSERV

Failure mode details:
FAILURE MESSAGE 1 Failed to telnet to TSERV101104167
or
Failure02 Failed 01bsystemsetup for Failed to telnet to TSERV101104175 5001

Failure mode explanation:
The test fails to establish a terminal server connection to the Node. Only one terminal connection can be made to each Node. A typical cause is that someone has manually made a terminal connection. In the December 2011 test code release, an improvement was made that clears any existing terminal server connections before starting the test.

Probable causes & corrective actions:
1. Do not troubleshoot; contact the test lead to troubleshoot.
*************************************************************************************
Failure mode: LESB error

Failure mode details:
Checking for LESBs
lesbcheck started at Fri Nov 11 002327 CST 2011
Retrieving port LESB stats for following ports 201 202 203
ERROR Port 203 LESB errors
or
Checking for LESBs
lesbcheck started at Wed Nov 23 08:59:36 SGT 2011
Retrieving port LESB stats for following ports: 0:2:1 0:2:2 0:2:4
ERROR: Port 0:2:4 - LESB errors
ID ALPA ----Port_WWN---- LinkFail LossSync LossSig PrimSeq InvWor

Failure mode explanation:
LESB stands for Link Error Status Block. During the test, the Fibre Channel links are checked for errors. If a link error is detected, the test fails at the LESB checking step and reports the specific path where the error occurred.
Example: "ERROR: Port 0:2:4 - LESB errors" refers to Node 0, PCI slot 2, Port 4.
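
As an aid to reading the error line, the Node:slot:port decoding in the example above can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and the pattern assumes the punctuated log format shown in the second sample ("ERROR: Port 0:2:4 - LESB errors").

```python
import re
from typing import List, NamedTuple


class PortLocation(NamedTuple):
    node: int
    slot: int
    port: int


def parse_port(port_id: str) -> PortLocation:
    """Split a 'node:slot:port' identifier such as '0:2:4' into its parts."""
    parts = port_id.strip().split(":")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a node:slot:port identifier: {port_id!r}")
    node, slot, port = (int(p) for p in parts)
    return PortLocation(node, slot, port)


def describe(port_id: str) -> str:
    """Render a port identifier the way this guide words it."""
    loc = parse_port(port_id)
    return f"Node {loc.node} PCI slot {loc.slot} Port {loc.port}"


# Assumed pattern, based only on the punctuated sample in this guide.
_LESB_ERROR = re.compile(r"ERROR:\s*Port\s+(\d+:\d+:\d+)\s+-\s+LESB errors")


def find_failed_ports(log_text: str) -> List[str]:
    """Return every port identifier reported with LESB errors in the log text."""
    return _LESB_ERROR.findall(log_text)
```

For the sample log, find_failed_ports returns the identifier 0:2:4, and describe turns it into the cable and port location to clean or reseat.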

Probable causes & corrective actions:
1. If the cables & ports were not cleaned with compressed air before starting the test, clean the cable & port where the error has occurred.
2. Check the FC cable is seated correctly.
3. If the problem persists, replace the FC cable.
4. If the problem persists, contact the test lead to troubleshoot further.
*************************************************************************************
Failure mode: Disk event error

Failure mode explanation:
When running the stress test, a drive can post an error. The array operating system detects this error & logs it in the event log. The CCT test monitors the event log every 30 minutes & fails the test on valid drive disk event errors. If the test fails for a single disk event error, do not restart the test; the drive needs to be replaced. The specific drive errors are highlighted in yellow.

Failure mode details:
Severity Informational
Type Disk event
Message pd 159 port b0 on 164 cmdstat0x00
or
Time 20111215 114203 SGT
Severity Degraded
Type Disk event
Message pd 84 port a0 on 222 cmdstat0x1d
or
Severity Debug
Type Disk event
Message pd 55 port b0 on 301 cmdstat0x04
or
Failure #2: Failure during stress test
Message : pd 22 port a0 on 2:0:1: cmdstat:0x04 (TE_CRC -- CRC error), scsistat:0x02 (Check condition), snskey:0x03 (Medium error), asc/ascq:0x11/0x1 (Read retries exhausted), toterr:289, deverr:5
Located PD 22 at physical location Cab0 Cage0 Bay5
Severity Debug
or
Severity Informational
Type Disk event
Message pd 16 port a0 on 061 cmdstat0x00 TEPASS Success scsistat0x02 Check condition snskey0x01 Recovered error ascascq0x150x0 Random positioning error info0x10c0c306 cmdspec0x0 snsspec0x40080 host0x4 abort0 CDB280006C3C00000020000 Read10 blk0x6c3c000 blkcnt 0x200 frucd0x0 LUN0 LUNWWN0000000000000000 toterr151 deverr5
or
Type : Disk event
Message : pd 40 port a0 on 2:0:1: cmdstat:0x01 (TE_FAIL -- Generic failure code), scsistat:0x02 (Check condition), snskey:0x0b (Aborted command), asc/ascq:0x47/0x0 (Scsi parity error), info:0x97ef4b09, cmd_spec:0x0, sns_spec:0x600000, host:0x4, abort:0, CDB:2A00094BEE0000020000 (Write10), blk:0x94bee00, blkcnt 0x200, fru_cd:0x3, LUN:0, LUN_WWN:0000000000000000, toterr:29, deverr:1
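
When recording troubleshooting notes, it can help to pull the PD number, position and status codes out of a disk event message. The sketch below is an illustration only, with a hypothetical function name; it assumes the fully punctuated message form seen in the "pd 22" and "pd 40" samples above and returns None for samples whose punctuation has been lost.

```python
import re
from typing import Dict, Optional, Tuple

# Assumed layout: "pd <n> port <p> on <node:slot:port>: key:value (description), ..."
_HEADER = re.compile(r"pd\s+(\d+)\s+port\s+(\w+)\s+on\s+(\d+:\d+:\d+):\s*")
_FIELD = re.compile(r"([A-Za-z_/]+):([^\s,]+)(?:\s+\(([^)]*)\))?")


def parse_disk_event(message: str) -> Optional[dict]:
    """Parse a punctuated disk event message into PD, port, position and fields."""
    header = _HEADER.match(message.strip())
    if header is None:
        return None  # not in the punctuated form this sketch understands
    rest = message.strip()[header.end():]
    fields: Dict[str, Tuple[str, Optional[str]]] = {}
    for key, value, description in _FIELD.findall(rest):
        # findall yields '' for a missing optional group; normalise to None
        fields[key] = (value, description or None)
    return {
        "pd": int(header.group(1)),
        "port": header.group(2),
        "position": header.group(3),
        "fields": fields,
    }
```

For the "pd 22" sample this yields PD 22 at position 2:0:1 with cmdstat 0x04 (TE_CRC -- CRC error), which is the information needed to identify the drive to replace.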

Probable causes & corrective actions:
1. The disk drive is faulty; replace the disk drive.
2. If the problem persists at the same location (same PD), contact the test lead to troubleshoot further.
*************************************************************************************
Failure mode: Slow PD or Slow drive - reported on a single drive

Failure mode explanation:
The test script or product OS code monitors for slow performing drives. If a slow drive is detected, the drive will be failed by the test. With newer OS versions, 3PAR InForm OS 2.3.1.284 (MU2) P17 and 3.1.1 GA, monitoring of slow drives is performed by the InForm OS code and an event is posted. Prior to these OS versions, the slow drive check is performed by the manufacturing test script (histpd).

Failure mode details:
Slow disk PD 63 63 163 53442502 1136 0 15 0 0 0 16 0
or
Slow PD 12 222 2900 881 761 540 226 33 26 141 189 821 50
or
Disk Event (with newer OS versions, see above)
Severity : Degraded
Type : Notification
Message : Marking slow disk 211 failed

Probable causes & corrective actions:
1. The disk drive is faulty; replace the disk drive.
2. If the problem persists at the same location (same PD), contact the test lead to troubleshoot further.
*************************************************************************************
Failure mode: Eagle memory or control cache DIMM error

Failure mode details:
Eagle memory cerr
Message posted by node 0MemComm error reg 0x00000081 PRECE MC0INTMEM ADDR 0x000000017db1f3c0 DIMM 000
Failure2 STAGE01anodesetup Failed
or
Time 20111112 230610 CST
Severity Major
Type Control Cache DIMM CECC Monitoring
Message Node 2 DIMM2 Correctable ECC limit exceeded

Failure mode explanation:
A memory error occurs while the controller memory is being checked.

Probable causes & corrective actions:
1. Node memory issue; replace the Node.
2. If the problem persists, contact the test lead to troubleshoot further.
*************************************************************************************
Failure mode: Code 37 event error

Failure mode details:
Bios eeprom log events
Message Node 0 log Code 37 GEvent Triggered Subcode 0x80002002 0

Failure mode explanation:
Controller fatal error 37 was detected.

Probable causes & corrective actions:
1. Node issue; replace the Node.
*************************************************************************************
Repair procedures:

Node replacement:
If a Node is replaced, reseated or moved to a different slot location, always check that the Node connectors have not dislodged (come loose) after the Node removal.

HBA replacement:
Ensure HBA cards are NOT hot-swapped or hot-reseated. Doing so damages the HBAs and causes latent defects in the field. The proper procedure is:
1. Perform a clean shutdownsys of the nodes.
2. Power off all nodes using the power switch on the nodes.
This procedure also applies when moving or reseating nodes.
