
Exalogic Elastic Cloud Compute Node System Board Replacement

Connectivity to the rack will depend on the customer's access requirements. The following
procedure is to be used with the latest EIS Checklist and the Exalogic Owner's Guide (Section 3.4),
and assumes a laptop attached to the Cisco management switch. If no port is available in a full
rack, temporarily disconnect a port used for one of the PDUs and export the backup files to
/tmp on another compute node in the rack using scp. If the customer does not allow login access to
the host ILOM, then they will need to run the commands given below.

Remember that, if connecting to the ILOM via serial cable, the baud rate is 9600 for replacement
boards. This is corrected during the post-install procedure to the Exalogic default, which is
115200 for installed boards.

Note: If you disconnect one of the ILOM ports for temporary use, the entire compute node will
be inaccessible from the management network.

Reference links:
http://eis.us.oracle.com/checklists/pdf/Exalogic-X5.pdf

http://docs.oracle.com/cd/E18476_01/index.htm

You may also need the 3.x firmware versions under the Replacement Procedures section:
http://www.oracle.com/technetwork/documentation/sys-mgmt-networking-190072.html#ilom

PLEASE NOTE: If passwords have been changed from factory defaults, please consult with the
customer SA. If you need ‘Service’ or ‘Escalation’ passwords on the ILOM and are unable to
generate these yourself per the instructions below, please open an Oracle Service Request ticket
to have one generated. Have the output from show /SYS on the ILOM ready together with the
‘version’ information.

ILOM Firmware
Prior to going to site it is a good idea to find out what ILOM firmware is running on the
customer's rack systems. You may well need to take this software with you to upgrade the
firmware to the appropriate level. See this tech note:

Exalogic Patch Set Updates (PSU) Master Note (Doc ID 1314535.1)

Extract the necessary ILOM bits and apply manually.

Version 4 December 10, 2013 Barry Wright

Section 1. Pre-Install Procedures


1. Backup ILOM Settings.

Assuming the ILOM is not the reason for the replacement of the system motherboard, then take a
current backup of the ILOM SP configuration using a browser under ILOM Configuration Tab.
(Different versions of ILOM may have this in different locations on the BUI)

This can also be done from the ILOM CLI as follows:-

-> cd /SP/config
-> set passphrase=welcome1
-> set dump_uri=scp://root:password@laptop_IP/var/tmp/SP.config

The FSE/customer should also collect the output of show /SP/network and show /SP, since you will
need to manually set the IP address after the MB is replaced.
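
The CLI backup in step 1 can be scripted when several ILOMs need the same treatment. The following is a minimal Python sketch under stated assumptions, not an Oracle tool: it only formats the three ILOM commands shown above as strings and does not connect to anything. The function name build_backup_commands and its parameters are hypothetical; root, password, laptop_IP and welcome1 are the placeholders used in this procedure.

```python
def build_backup_commands(user, password, host, path, passphrase):
    """Return the ILOM CLI commands for an SP configuration backup.

    Hypothetical helper: it only formats the commands from this procedure;
    it does not log in to the ILOM or run anything.
    """
    dump_uri = f"scp://{user}:{password}@{host}{path}"
    return [
        "cd /SP/config",
        f"set passphrase={passphrase}",
        f"set dump_uri={dump_uri}",
    ]


if __name__ == "__main__":
    # Placeholders taken from the procedure text, not real credentials.
    for command in build_backup_commands(
        "root", "password", "laptop_IP", "/var/tmp/SP.config", "welcome1"
    ):
        print("->", command)
```

Printing the returned list reproduces the three commands above, which can then be pasted into the ILOM session or handed to the customer to run.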

2. Obtain the correct Serial Numbers required


(a) Make a note of the System Serial Number from the front label of the server.
(b) Make a note of the Rack Master Serial Number from the front label of the rack (left-side
vertical wall, half way up the rack).

3. Determine if the compute node is a physical or virtual installation.


(a) Ask SA
(b) Solaris: # cat /etc/release
Linux: # cat /etc/redhat-release
OVS: # cat /etc/redhat-release or /etc/enterprise-release (Oracle VM server release)
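
Where the release files from step 3(b) have been collected from several nodes, the check can be sketched in Python. This is an illustration only: the function names are hypothetical, and the matching rule (the text "Oracle VM server" marks OVS, any other /etc/redhat-release means Linux, and /etc/release alone means Solaris) is an assumption derived from the commands above. Confirm the result with the SA.

```python
def classify_install(release_files):
    """Classify a compute node from the contents of its release files.

    release_files maps a file path to its text, or to None if the file is
    absent. The matching rules are assumptions for illustration.
    """
    ovs_marker = "oracle vm server"
    for path in ("/etc/enterprise-release", "/etc/redhat-release"):
        text = release_files.get(path)
        if text and ovs_marker in text.lower():
            return "OVS"
    if release_files.get("/etc/redhat-release"):
        return "Linux"
    if release_files.get("/etc/release"):
        return "Solaris"
    return "unknown"


def read_release_files(paths=("/etc/release", "/etc/redhat-release",
                              "/etc/enterprise-release")):
    """Read the release files present on the local host (missing -> None)."""
    found = {}
    for path in paths:
        try:
            with open(path, encoding="utf-8", errors="replace") as handle:
                found[path] = handle.read()
        except OSError:
            found[path] = None
    return found


if __name__ == "__main__":
    print(classify_install(read_release_files()))
```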

4. If the system is not already down due to the problem that is causing the motherboard to be
replaced, the system administrator should prepare the system for service by performing any
application-related functions required to shut down the compute node. This might include, but is
not limited to, performing a system backup, failing over applications or services, and finally a
system shutdown. If this is an OVS Virtual installation, please confirm that the SA has put the
compute node in ‘Maintenance’ mode from OVMM/EMOC.

Maintenance Mode:
Please consult the site SA about these steps, as they should run them.
If the customer needs assistance with these procedures, they should engage
EEST for assistance through the SR.

The general process to follow when an identity-changing repair is needed is as follows:

a) Identify any vServers running on the Node to be serviced and shut them down.

b) Remove server from Ops Center


1) remove (delete) operating system asset
2) remove server asset

c) Remove server from server pool and from OVMM (interacting with OVMM)
1) remove server from server pool
2) unexpose OVS repository
3) remove server from OVMM

Details of OVMM/EMOC tasks:


Please consult the site SA about these steps, as they should run them.
If the customer needs assistance with these procedures, they should engage
EEST for assistance through the SR.
Taking the compute node out of EMOC and OVMM
Reference:
Making an OVS Node Unavailable for vServer Placement in Exalogic Virtual Environments
using EMOC (Doc ID 1551724.1)

Note:

The above procedure works if the node is up. If the node is dead, it may come up with no
/OVS mounted and may not join the cluster.

Workaround:

From OVM Manager

1. Edit the server pool and remove the node already added.
2. Go to the events view in OVMM for the unassigned server and acknowledge all events (green)
3. Rediscover the server.
4. Edit the server pool and add it once more (may cause the node to reboot)
5. When the node is back up , add the server to the pool again.

Note: A new system board will have a new UUID which will change the compute node's identity.

The affected node should already have no VMs running and be in ‘Maintenance’ mode.

If the node hosts Exalogic Control (EC) VMs, then before setting ‘Maintenance’ mode, stop each EC
VM and start it (using xm create ../vm.cfg) on an alternative node. Take care with the order in
which the EC VMs are stopped and started. It is important that all Exalogic Control VMs are
running before the failed compute node is put in Maintenance Mode.

Check doc:
How To Stop and Start the Entire Exalogic Control Stack In An Exalogic EECS v2.0.6.0.0 and
later Virtual releases (Doc ID 1594223.1)

EMOC tasks

Example here assumes Compute Node 2 has had a failure. See Doc ID 1551724.1 for removal of
node from EMOC.

Login with ‘root’ user:

Identify any vServers running on the Node to be serviced and shut them down.

You may get an error as shown here; this error can be ignored.

Next: remove the server asset (Compute Node 2).

Check that the node to be serviced (Compute Node 2 in this case) disappears from the left panel.

OVMM tasks:

Note: log in with admin user to the OVMM BUI

Remove the compute node from the server pool (Compute Node 2):

Select ‘Server pool’ and click the ‘Edit Server pool’ action. When the “Edit Server Pool” dialog
pops up, select the ‘Server’ tab, move the node (Compute Node 2) to the left pane, and click OK.
A new job starts; wait for it to complete.

Once the job completes, the removed compute node should not be listed in the pool (the cnode is
now in Unassigned Servers).

The node should not be listed in the server pool now:

Checking that the node is not presented in the repository:

Click the OVMM BUI ‘Repositories’ tab, select / in the left repositories pane, then click the two
green arrows. Next select ‘Servers’ and check that the cnode is not present on the right.

Checking admin servers:

In the ‘Storage’ tab, select the Generic Network File System. This will bring up the Add/Remove
Admin Server list; ensure the cnode is not listed there. If it is, select << to remove it from
the list.

In the ‘Servers and VMs’ tab, select ‘Unassigned Servers’; the cnode should now be listed there.

Select the node and click the red cross to delete it. A new job starts to delete the node; wait
for the job to complete.

The Unassigned Servers list is now empty.

Now you are ready to replace/repair the faulty system board.

Section 2. Physical Replacement


1. Replace the motherboard as per Service Manual, migrating existing CPUs, DIMMs, PCI Cards
and risers.

NOTE:- Pull power cords before opening the top cover to avoid a SP degraded condition.

Reference links for Service Manuals:

– Exalogic Elastic Cloud X2-2 / Compute Node Sun Server X4170 M2
(http://docs.oracle.com/cd/E19762-01/index.html)
– Exalogic Elastic Cloud X3-2 / Compute Node Sun Server X3-2
(http://docs.oracle.com/cd/E22368_01/index.html)
– Exalogic Elastic Cloud X4-2 / Compute Node Sun Server X4-2
(http://docs.oracle.com/cd/E36975_01/index.html)
– Exalogic Elastic Cloud X5-2 / Compute Node Sun Server X5-2
(http://docs.oracle.com/cd/E41059_01/html/E48320/index.html)

2. Carefully follow the port numbers on the cables when re-attaching so they are not reversed. It is
easiest to plug cables in while the server is in the fully extended maintenance position.

3. Do not power up the system yet, only the ILOM.

Section 3. Post-Replacement Procedures


1. After the motherboard has been replaced, the new motherboard has different MAC addresses.
For this reason the management network will not work through the eth0 interface.

Solution:
Update the MAC address in the ifcfg-eth interface files as indicated in the document: Network
Interfaces Not Operational After Motherboard Replacement on Exalogic Nodes (Doc ID 1569243.1)
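
The edit to an ifcfg-eth file can be sketched in Python. Doc ID 1569243.1 remains the authoritative procedure; this example only assumes that the file carries the MAC address on an HWADDR= line, as standard Linux ifcfg files do, and the function name replace_hwaddr is hypothetical. It works on text passed in, so nothing on the node is modified until the result is reviewed and written back by hand.

```python
import re


def replace_hwaddr(ifcfg_text, new_mac):
    """Return ifcfg text with the HWADDR value replaced by new_mac.

    Assumes the MAC address is stored on an HWADDR= line. If no such line
    is present, the text is returned unchanged.
    """
    pattern = re.compile(r"^(HWADDR=).*$", flags=re.MULTILINE)
    # A function is used as the replacement so new_mac is inserted literally.
    return pattern.sub(lambda match: match.group(1) + new_mac, ifcfg_text)


if __name__ == "__main__":
    # Illustrative file content and MAC address, not taken from a real node.
    old_text = "DEVICE=eth0\nHWADDR=00:21:28:AA:BB:CC\nONBOOT=yes\n"
    print(replace_hwaddr(old_text, "00:21:28:A5:BE:20"))
```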

2. Update the Serial Number on the new motherboard to that of the server chassis. This is
REQUIRED in order for ASR to continue to work on the unit, and is REQUIRED for all servers
that are part of Exalogic racks that may have a future Service Request, whether ASR is configured
now or not.

Exalogic Elastic Cloud X2-2 and X3-2 machines with x4170m2 or X3-2 / X4-2 / X5-2
Compute Nodes

These platforms use the Top Level Indicator (TLI) feature in ILOM to perform the motherboard
serial number update automatically.

For more information on TLI and restricted shell, please refer to the following two MOS notes
for these systems:

TLI MOS Note 1280913
Restricted Shell MOS Note 1302296

NOTE:- The serial numbers of each server can be found at the front on the left hand side.

(a) Login as “root” with password “changeme”.

(b) Check with “show /SYS”

-> show /SYS


...
Properties:
type = Host System
ipmi_name = /SYS
product_name = SUN FIRE X4170 M2 SERVER
product_part_number = 602-4980-01
product_serial_number = 1039FMM0E6
product_manufacturer = SUN MICROSYSTEMS
product_swoRDFish_id = urn:uuid:3158f092-cb2e-11de-9e65-
080020a9ed93
fault_state = OK
clear_fault_action = (none)
power_state = On

If the replacement has the correct product serial number, then skip to the ILOM/BIOS re-flash
step of the post-replacement procedures. If the replacement does not have the product serial
number populated correctly, then continue:

(c) Where there is at least one container which still contains valid TLI information, a
service mode command copypsnc can be used to update the product serial number.

Log in as root and create an escalation-mode user with the service role:

-> cd /SP/users
-> create sunny role=aucros (will ask for password)

(d) Gather “version”, “show /SYS” and “show /SP/clock” outputs needed for
generating the service mode password:

-> version
SP firmware 3.0.9.27.a
SP firmware build number: 58740
SP firmware date: Tue Sep 14 15:48:24 EDT 2010
SP filesystem version: 0.1.23

-> show /SYS


...
Properties:
type = Host System
ipmi_name = /SYS
product_name = SUN FIRE X4170 M2 SERVER
product_part_number = 602-4980-01
product_serial_number = 00000000
product_manufacturer = SUN MICROSYSTEMS
product_swoRDFish_id = urn:uuid:3158f092-cb2e-11de-9e65-
080020a9ed93
fault_state = OK
clear_fault_action = (none)
power_state = On

-> show /SP/clock


...
Properties:
datetime = Mon Oct 24 17:12:43 2011
timezone = EDT (America/New_York)
uptime = 56 days, 08:02:02
usentpserver = enabled
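
The three outputs gathered in step (d) can be reduced to the values needed for password generation with a short Python sketch. This is an illustration under stated assumptions: the function name password_tool_inputs is hypothetical, and the patterns are based only on the sample outputs shown above, so other ILOM versions may format these lines differently.

```python
import re


def password_tool_inputs(version_out, show_sys_out, clock_out):
    """Extract firmware version, serial number and SP datetime.

    Returns a dict; a value is None when the field was not found.
    """
    def find(pattern, text):
        match = re.search(pattern, text, flags=re.MULTILINE)
        return match.group(1).strip() if match else None

    return {
        # "SP firmware 3.0.9.27.a" matches; the build number and date lines
        # do not, because a digit must follow "SP firmware ".
        "version": find(r"^\s*SP firmware (\d\S*)", version_out),
        "serial": find(r"^\s*product_serial_number\s*=\s*(\S+)", show_sys_out),
        "datetime": find(r"^\s*datetime\s*=\s*(.+)$", clock_out),
    }
```

The extracted serial can also be compared with the chassis label noted in the pre-install steps to confirm whether the serial number update is needed at all.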

(e) Generate a service mode password using “http://ilompass.us.oracle.com/”. Login is
via Sun unix login name and LDAP password (Oracle Single-Sign-On in the future).
Example output of the tool is:

BRAND : sun
MODE : service
VERSION : 3.0.9.27.a
SERIAL : 00000000
UTC DATE : 10/24/2011 17:12

POP DOLL PHI TOW BRAN TAUT FEND PAW SKI SCAR BURG CEIL MINT DRAB
KAHN FIR MAGI LEAF LIMB EM LAWS BRAE DEAL BURN GOAL HEFT HEAR KEY
SEE A

If you are unable to get to ilompass.us.oracle.com,
try https://modepass.us.oracle.com/modepass/

(f) Log out of root, log back in as the 'sunny' user that you created, and enter Service
mode:

-> set SESSION mode=service


Password:*** **** *** *** **** **** **** *** *** **** **** ****
**** **** **** *** **** **** **** ** **** **** **** **** **** ****
**** *** *** *
Short form password is: ARMY ULAN HULL

Currently in service mode.

(g) Review the current PSNC containers with “showpsnc” command:

-> showpsnc
Primary: fruid:///SYS/PDB
Backup 1: fruid:///SYS/MB
Backup 2: fruid:///SYS/DBP

Element           | Primary                  | Backup 1                 | Backup 2
------------------+--------------------------+--------------------------+-------------------------
Container Status    Invalid                    Valid                      Valid
PPN                 602-4980-01                602-4980-01                602-4980-01
PSN                 00000000                   1039FMM0E6                 1039FMM0E6
Product Name        SUN FIRE X4170 M2 SERVER   SUN FIRE X4170 M2 SERVER   SUN FIRE X4170 M2 SERVER
WWN                 500605b00290a3e0           500605b00290a3e0           500605b00290a3e0
->

(h) Correct the invalid containers using the “copypsnc” command:

(i) Verify it is correct with “show /SYS” and “showpsnc”.
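
The showpsnc verification in step (i) can be sketched in Python. This is an illustration only: the function name invalid_containers is hypothetical, and it assumes the "Container Status" row lists one status word per container in the order Primary, Backup 1, Backup 2, as in the sample output above.

```python
def invalid_containers(showpsnc_output):
    """Return the names of PSNC containers whose status is not 'Valid'.

    Assumes one status word per container on the 'Container Status' row,
    in the order Primary, Backup 1, Backup 2.
    """
    names = ["Primary", "Backup 1", "Backup 2"]
    for line in showpsnc_output.splitlines():
        # Tolerate column separators if the output was captured with them.
        cleaned = line.replace("|", " ").strip()
        if cleaned.startswith("Container Status"):
            statuses = cleaned[len("Container Status"):].split()
            return [name for name, status in zip(names, statuses)
                    if status != "Valid"]
    return []
```

An empty list after running copypsnc suggests that every container now reports Valid. The PSN row should still be checked by eye against the chassis serial number.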

(j) Log out from the 'sunny' user, log back in as root, and remove the 'sunny' user:

-> delete /SP/users/sunny

2. Re-flash the ILOM/BIOS to the correct levels required for Exalogic Elastic Cloud.
(a) Log in to another compute node ILOM and check the version.
-> version
SP firmware 3.0.16.10.a
SP firmware build number: 68533
SP firmware date: Wed Oct 12 10:46:03 EDT 2011
SP filesystem version: 0.1.23

(b) If you do not have the correct firmware installed, and you know the correct
version, then it can be obtained from MOS patches or EIS DVD.

Sun EXALOGIC Current Product Patches & Firmware (Doc ID 1432636.1)
https://support.us.oracle.com/oip/faces/secure/km/DocumentDisplay.jspx?id=1432636.1

http://eis.central.sun.com/eisdvd/eisdvd.html
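
Deciding whether a re-flash is needed comes down to comparing the version output of the replaced node with that of a reference node in the same rack. A minimal Python sketch follows; the function names are hypothetical and the patterns assume the output format shown in step (a).

```python
import re


def firmware_id(version_output):
    """Return (SP firmware version, build number) from ILOM 'version' output.

    Either element is None if the corresponding line was not found.
    """
    version = re.search(r"^\s*SP firmware (\d\S*)", version_output,
                        flags=re.MULTILINE)
    build = re.search(r"^\s*SP firmware build number:\s*(\d+)", version_output,
                      flags=re.MULTILINE)
    return (version.group(1) if version else None,
            build.group(1) if build else None)


def needs_reflash(reference_output, replaced_output):
    """True if the replaced node's firmware differs from the reference node."""
    return firmware_id(reference_output) != firmware_id(replaced_output)
```

With the two sample outputs in this document (3.0.16.10.a build 68533 on the reference node, 3.0.9.27.a build 58740 on the new board), needs_reflash returns True.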

3. Restore the backed up SP configuration done during the pre-installation steps.

(c) Using a browser under Maintenance Tab or from ILOM cli:-

-> cd /SP/config
-> set passphrase=welcome1
-> set load_uri=scp://root:password@laptop_IP/var/tmp/SP.config

If an SP backup was not possible, check with the customer for network information and use another
ILOM within the rack for general settings. The primary Exalogic-specific settings are:
(a) Baud rate is 115200
(b) /SP system_identifier is set to the appropriate rack type string and Master
Rack Serial Number. This is critical for ASR deployments. The Master Rack Serial
Number can be obtained top left inside the cabinet or from show /SP on any other
ILOM. The string should be of the following format:
1. system_identifier = Oracle Exalogic X2-2 1052AK22D6

For Example:
-> show /SP
Properties:
check_physical_presence = true
hostname = elx22bur09cn01-ilom
reset_to_defaults = none
system_contact = (none)
system_description = SUN FIRE X4170 M2 SERVER, ILOM
v3.0.16.10.a, r68533
system_identifier = Oracle Exalogic X2-2 1052AK22D6
system_location = (none)

(c) /SP hostname is set up
(d) /SP/network settings
(e) /SP/alertmgmt rules that may have been previously set up by ASR or cell
configuration
(f) /SP/clock timezone, datetime, and /SP/clients/ntp NTP settings
(g) /SP/clients/dns Name service settings
(h) root account password.
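
Because the system_identifier string in item (b) is critical for ASR, it is worth checking its shape before and after setting it. The Python sketch below is an illustration only: the function names are hypothetical, and the pattern (the literal "Oracle Exalogic", a rack type such as X2-2, then an upper-case alphanumeric serial) is an assumption drawn from the single example above.

```python
import re

# Assumed shape, based on "Oracle Exalogic X2-2 1052AK22D6".
IDENTIFIER_RE = re.compile(r"^Oracle Exalogic X\d-\d [0-9A-Z]+$")


def build_system_identifier(rack_type, rack_serial):
    """Format the identifier from rack type (e.g. 'X2-2') and master serial."""
    return f"Oracle Exalogic {rack_type} {rack_serial}"


def is_valid_system_identifier(value):
    """True if value has the assumed 'Oracle Exalogic <type> <serial>' shape."""
    return IDENTIFIER_RE.match(value) is not None


if __name__ == "__main__":
    identifier = build_system_identifier("X2-2", "1052AK22D6")
    print(identifier, is_valid_system_identifier(identifier))
```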

If the root password has not been changed to the customer's password, you can
have the customer do this, or do it manually:

-> set /SP/users/root password=welcome1 (or customer's password)
Changing password for user /SP/users/root...
Enter new password again: ********
New password was successfully set for user /SP/users/root

Finally, check you can login to all interfaces and ILOM can be
accessed using a browser and ssh from another system on the
customer's management network.

4. Confirm the sideband management is restored on the ILOM.

Example:
-> show /SP/network

/SP/network
Targets:
interconnect
ipv6
test

Properties:
commitpending = (Cannot show property)
dhcp_server_ip = none
ipaddress = 10.152.223.171
ipdiscovery = static
ipgateway = 10.152.223.1
ipnetmask = 255.255.255.0
macaddress = 00:21:28:A5:BE:21
managementport = /SYS/MB/NET0 <--
outofbandmacaddress = 00:21:28:A5:BE:20 <--
pendingipaddress = 10.152.223.171
pendingipdiscovery = static
pendingipgateway = 10.152.223.1
pendingipnetmask = 255.255.255.0
pendingmanagementport = /SYS/MB/NET0 <--
sidebandmacaddress = 00:21:28:A5:BE:21 <--
state = enabled <--
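
The properties marked with arrows in the example can be checked mechanically. The Python sketch below is a hedged illustration: the function name sideband_problems is hypothetical, /SYS/MB/NET0 is taken from the example above, and treating macaddress equal to sidebandmacaddress as a sign that sideband is active is an assumption based on that example only.

```python
def sideband_problems(show_network_output, expected_port="/SYS/MB/NET0"):
    """Return the names of /SP/network properties that look wrong.

    An empty list means the sideband settings match the example in this
    procedure. The checks are assumptions drawn from that example.
    """
    props = {}
    for line in show_network_output.splitlines():
        name, sep, value = line.partition("=")
        if sep:
            # Drop the arrow annotations used in the example listing.
            value = value.replace("<--", "").replace("\u2190", "")
            props[name.strip()] = value.strip()

    problems = []
    if props.get("managementport") != expected_port:
        problems.append("managementport")
    if props.get("pendingmanagementport") != expected_port:
        problems.append("pendingmanagementport")
    if props.get("macaddress") != props.get("sidebandmacaddress"):
        problems.append("macaddress")
    if props.get("state") != "enabled":
        problems.append("state")
    return problems
```

Any name returned points at a setting to correct using the manual method referenced in the next step.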

(a) If the sideband management or any other network settings
are incorrect, please follow the manual method documented here:
http://docs.oracle.com/cd/E18476_01/doc.220/e18478/ilom.htm#CACCHBCB

(b) Reset the ILOM under the Maintenance Tab or from ILOM cli:
-> reset /SP

5. Configure the BIOS settings at POST.

(a) Power on the host server and start a console either from the ILOM over ssh or using the web
browser. Enter the BIOS during POST: press “F2” (ILOM java console) or “Esc-2” (ILOM serial) to
go into BIOS setup, and check the BIOS settings against the EIS checklist and ECU.

(b) For Solaris Physical ECU, there are no BIOS changes required.
For Linux Physical ECU, enable the C-states.
Make sure the CPU C-State is ENABLED. Follow the steps below:
• Log in to the compute node ILOM
• set /HOST boot_device=bios
• start /SYS
• start /SP/console
• Wait for menu
• Advanced tab
• CPU Configuration
• Scroll to bottom
• Confirm enabled

For ECU for OVS Virtual installation, there are specific BIOS settings that need to be set.

Log in to the compute node ILOM
• set /HOST boot_device=bios
• start /SYS
• start /SP/console
• Wait for menu
• Advanced tab

Make sure that the PCI Payload is set to 256. Follow the steps below:
• PCI Express Configuration
• Change Maximum Payload Size to '256'

Make sure that the CPU C-State is DISABLED. Follow the steps below:
• CPU Configuration
• scroll to bottom
• confirm DISABLED

Enable SRIOV if the nodes have been freshly imaged to OVS from OEL. If
SRIOV is not enabled, the IB network will not be enabled. To enable SRIOV do
the following:
• Select I/O Virtualization and enable it
• Save Changes and Exit
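
The BIOS expectations for the three installation types can be kept as a small lookup table and compared with what is seen on screen. This Python sketch is an illustration: the keys, setting names and the function bios_mismatches are hypothetical labels for the settings described above, not values read from the BIOS.

```python
# Expected settings per installation type, as described in this step.
# I/O Virtualization (SR-IOV) applies to nodes freshly imaged to OVS from OEL.
EXPECTED_BIOS = {
    "solaris-physical": {},
    "linux-physical": {"CPU C-State": "Enabled"},
    "ovs-virtual": {
        "Maximum Payload Size": "256",
        "CPU C-State": "Disabled",
        "I/O Virtualization": "Enabled",
    },
}


def bios_mismatches(install_type, observed):
    """Return {setting: (expected, observed)} for settings that differ.

    observed is a dict of values noted by hand from the BIOS setup screens.
    """
    expected = EXPECTED_BIOS[install_type]
    return {
        name: (value, observed.get(name))
        for name, value in expected.items()
        if observed.get(name) != value
    }
```

An empty result means the values noted from the BIOS screens match the expectations for that installation type.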

6. OVS/OVMM general procedure after power up:

a) Discover server in EMOC
(using the appropriate profiles in Plan management/profiles/discovery)
1) discover OS (ServerOS@… profile - add assets action)
2) discover ILOM (ServerILOM@… profile - add assets action)

b) Discover server in OVMM and add it to the server pool
1) discover server in OVMM
2) attach OVS repository
3) add server to server pool

6a. EMOC/OVMM software details:

Below are the detailed steps to follow to return a node to EMOC and OVMM.
Please consult the site SA about these steps, as they should run them.
If the customer needs assistance with these procedures, they should engage
EEST for assistance through the SR.

Once the node is up and running, it is good practice to check the admin network and all other
networks (ILOM, node eth-admin, and ipoib-xxx IPs): check that these can be pinged from other
nodes. If everything is OK, proceed with the next steps.
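
The ping check can be sketched in Python for running from another node. This is a minimal illustration, assuming a Linux host where "ping -c 1" sends a single echo request; the function names are hypothetical and the addresses in the usage example are placeholders to be replaced with the node's real ILOM, eth-admin and IPoIB addresses.

```python
import subprocess


def ping_once(address, runner=subprocess.run):
    """Return True if one echo request to address succeeds.

    Assumes the Linux ping syntax ('ping -c 1'). runner is injectable so the
    logic can be exercised without touching the network.
    """
    result = runner(["ping", "-c", "1", address],
                    stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def unreachable(addresses, runner=subprocess.run):
    """Return the addresses that did not answer a single ping."""
    return [address for address in addresses
            if not ping_once(address, runner)]


if __name__ == "__main__":
    # Placeholder names; substitute the replaced node's actual addresses.
    print(unreachable(["cnode-ilom", "cnode-eth-admin", "cnode-ipoib"]))
```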

EMOC Details:
Log into EMOC with root user

Discover the OS and then the node hardware in EMOC:

Discover OS (ServerOS@... profile - add assets action):
in Plan Management, Profiles and Policies, expand Discovery, select the cnode ServerOS@ profile,
and click 'add assets' --> Add now.

Wait until the job completes successfully.

Do the same with the ServerILOM@... profile (add assets action).

Wait until the job completes successfully.

Now that the EMOC part is completed, check in Assets that the cnode is listed OK in the left
panel alongside the other nodes.

Discover server in OVMM:

Log in to Oracle VM Manager using the admin user credentials and discover the new compute
node. Use the IP address of the IPoIB-ovm-mgmt partition.

In the ‘Servers and VMs’ tab, select Server Pools and click the ‘Discover’ server action.

Type the cnode IPoIB-ovm-mgmt IP.

The port used by default for server discovery in OVMM is 8899; it can be verified on the compute
node in the file "/etc/ovs-agent/agent.ini".

The default password for ovs-agent in OVMM is oracle. This is required during discovery.
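
If discovery fails, one quick check is whether the agent port answers at all. The Python sketch below is an illustration only: the function name agent_port_open is hypothetical, 8899 is the default quoted above, and a successful TCP connection shows only that something is listening on that port, not that the ovs-agent credentials are correct.

```python
import socket


def agent_port_open(host, port=8899, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder name; use the cnode IPoIB-ovm-mgmt address.
    print(agent_port_open("cnode-ipoib-ovm-mgmt"))
```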

Once the job completes, the cnode is in the unassigned list.

Edit the server pool to include the node:

Move the node to the right; it should now be listed in the server pool.

Refresh the repository in the Oracle VM Manager by doing the following:
1. Log in to the Oracle VM Manager console.
2. Select Repositories / in the right pane.
3. Click the Refresh Repositories icon. This is the icon with curved blue arrows.

Check whether the node is presented in the repositories (use the two green arrows action icon
and select Servers in the pop-up).

In the case shown in this screenshot, the node is already presented.

The node should also list /OVS/... repository mounted: #show mount

Check the admin server list: Storage -> File servers -> Generic Network Storage -> click the
add/remove admin server action icon and check the admin server list. If the cnode is not
included, move it to the right.

Node moved; next click OK:

Check that the cnode is configured OK, with the Utility, VM Server, and Take Ownership flags
selected, so that the node is ready to work in normal operation in the virtual environment.

To check this, select the cnode under the expanded server pool and click edit.

Now the customer should be ready to test whether the VMs run OK on the new cnode. (To test this,
the procedure to set vserver_placement.ignore_node=true can be applied to all other cnodes,
leaving only the node on which the VMs should be started.)

Check doc ref:
Making an OVS Node Unavailable for vServer Placement in Exalogic Virtual Environments
using EMOC (Doc ID 1551724.1)

Note: you may see errors such as this:

Stopping and restarting all EC vServers may be necessary.

It is good practice to leave EC VMs on the default nodes where they should be.

Note: If a vServer is not starting, please verify that OVMM has the following checked after
re-discovery of the asset.

Utility Server X
VM Server X

You can now hand the system back to the customer System Administrator to check that all services
are up. If this was an OVS Virtual install, they will also need to verify that the VMs are able
to come up properly. If the customer DBA requires assistance beyond this, then you should
direct them to call back the parent SR owner in EEST.
