Bachelor's Project
Sun servers open-source
software systems management
Ondej Jakubk
I would like to thank my family, my friends and my colleagues for their insight, sup-
port and wisdom. I am truly grateful for being surrounded by such brilliant people.
Declaration
I hereby declare that I have completed this project independently and that I have
listed all the literature and publications used.
I have no objection to the use of this work in compliance with §60 of Act
No. 121/2000 Sb. (the Copyright Act), and with the rights connected with the Copyright
Act, including its amendments.
In . . . . . . . . . . . . . . . . . . . . . . . on . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Abstrakt
Abstract
1 Introduction
7 Zenoss integration
7.1 Choosing an approach
7.2 Development environment
7.3 Important design decisions
7.3.1 Event classes
7.3.2 Per-trap mapping vs. default mapping
7.4 Development steps
7.4.1 Compiling MIBs
7.4.2 Creating Event classes
7.4.3 Creating Event mappings
7.4.4 Adding products
7.4.5 Final modifications
7.5 Testing
7.6 Future extension
8 Conclusion
A CD Contents
1 Introduction
Systems management has become a very important topic in almost every organisation
that depends on IT services. It encompasses the entire life cycle of IT infrastructure,
including, for example, tracking and documenting requirements, purchasing and renewing
equipment, license management, and fault and risk monitoring. While systems management
has always been present, in some way, in the IT departments of mid-size to big
enterprises, the approach to it was often defined in a company-specific way, with no
standardization.
However, many companies now span a number of countries or even continents.
For all but the biggest companies, it would be very inefficient to invest in the
development of a complete in-house solution for systems management. These companies
rely on third-party solutions that offer a cheaper, well-tested and supported alternative.
Decentralization of IT resources is a very important factor in the need for systems
management. It has become quite common to have more than one datacenter, often
in remote locations, possibly quite far apart from each other, so that in case of an
accident at or near one of them the operations of a company can continue relatively
uninterrupted (by accident we mean either a natural phenomenon, like flooding, storm
or fire, or an act of ill will, such as a terrorist attack). Because IT support may not
always be present on site, an advance warning of possible component failures is very
important. Some, albeit not all, system management software suites can even tie
individual systems, groups of systems or even components to a service, so when a
failure is imminent, one can see which services are in jeopardy.
Businesses of today rely on IT more than ever before. Even a minute-long outage
can cost thousands of dollars. Therefore, some companies (notably telecommunication
companies, banks, etc.) build systems with a certain level of redundancy, so that when
one system fails, another takes over in a reasonable amount of time and the
interruption is barely noticeable. System management is necessary in this case, as it
provides information about the nature of the failure and helps with selecting and
migrating to a different system.
Computing power (in the sense of CPU processing speeds, RAM and storage sizes,
etc.) keeps growing and its price is falling. However, the workload is so variable that
a processing node may not be loaded enough, so that the cost of its power consumption
is actually higher than the value of the work it performs.
This led to a rebirth of one area of the IT industry: virtualization. To a certain
degree, virtualization has been possible since 1967, at that time on the IBM CP-40.
However, the main reason back then was to enable various software to run unmodified
or simultaneously (computers were batch oriented and most software was not
designed for any level of multitasking). Now, the reasons for virtualization are
consolidation, reduction of power consumption and control of expenses.
The availability of relatively cheap but powerful commodity hardware has led to a
new architecture of IT: instead of renting a dedicated machine (although this is still
possible), one can rent virtual machines, running on a possibly very different set of
hardware. With a properly set up infrastructure (Fibre Channel or iSCSI disk arrays,
virtualization software supporting live migration, etc.), it is possible to achieve very
high availability and reliability.
However, cheaper systems are built from cheaper components that are more prone to
failure, so the need for proper monitoring is high. With proper software, the migration
of virtual machines in case of a hardware malfunction can be automated.
Power consumption monitoring is a very important part of systems management.
With power becoming more expensive, careful monitoring of power consumption
in relation to the tasks performed is required to manage the costs of one's IT operations
or to properly bill customers (the latter applies specifically to cloud computing
customers).
This bachelor's project focuses on one area of systems management: system
health monitoring. With the above in mind, we aim for a clear design that allows the
features described above to be implemented, or connected with existing features already
in place.
The objective is to design and implement a Zenoss extension (also known as a ZenPack)
that allows the user to discover, monitor and report the system health status of some
Oracle Sun servers. Zenoss was chosen because it is a very advanced integration platform,
with advanced features such as graphing, so that future extensions like recording
and analyzing power consumption trends can be implemented. The selection was made in
unpublished work by the author, available separately [1].
2 Systems management software
The following commercial products have been used by the author to manage Oracle
Sun servers:
CA Unicenter NSM
HP Operations Manager
IBM Director
IBM Tivoli Enterprise Console
IBM Tivoli NetCool OMNIbus
All of these products can do passive monitoring: they listen for events, received
either through SNMP traps, system logs or some other mechanism (like direct database
entry, command-line tool execution, etc.).
The Tivoli Enterprise Console, also known as TEC, is one of the oldest systems
management packages. It relies on the Tivoli Management Framework, which also
provides a way to install other extensions and patches. TEC itself has a rather simple
GUI written in Java, but the backend consists of many helper programs, usually written
in C. TEC is used for passive monitoring only: it waits for events, and those
events are processed by an internal engine (some of its parts are based on the Prolog
language). This software package, however, requires a preinstalled database system.
NetCool OMNIbus is similar to TEC, but it has a more modern GUI. Being a
product obtained through acquisition, it is not written in Java but in a compiled
language. It uses a totally different language for writing custom extensions and, as one
of only a few, it has its own database bundled.
Operations Manager, Director and Unicenter NSM are products of different companies,
but they have one common feature: they support active polling. Other than
that, they offer similar features and all can receive and process notifications from
Oracle Sun servers.
The following features are present in all integrations with these products:
Integrations that support polling can usually at least display the state of system LEDs;
some (CA Unicenter NSM) can display a hierarchy of sensors.
In the open-source market, there are currently the following major products:
Nagios
OpenNMS
Zabbix
Zenoss
Nagios is the oldest and most mature open-source product. It is very scalable and well
documented, but its web GUI lacks some modern features, which on the other hand means
it is very fast, albeit sometimes not very user friendly.
It is written mainly in C, which is another reason for its high speed. Monitoring data
is obtained by running checks, either built-in or user-supplied scripts called
plugins, whose exit code and (optionally) output are processed and evaluated by
Nagios.
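As a minimal sketch of the plugin mechanism, the following Python script follows the
conventional Nagios plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for
UNKNOWN) and produces one line of status text. The check itself (check_temperature),
its thresholds and the sample reading are hypothetical and only illustrate the convention.

```python
# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_temperature(value, warn=60.0, crit=75.0):
    """Return (exit_code, message) for a temperature reading in degrees Celsius.

    The thresholds are illustrative defaults, not values from any real server.
    """
    if value is None:
        return UNKNOWN, "TEMP UNKNOWN - no reading available"
    if value >= crit:
        code = CRITICAL
    elif value >= warn:
        code = WARNING
    else:
        code = OK
    return code, "TEMP %s - %.1f C" % (LABELS[code], value)


def main(reading=42.0):
    """Print the status line and return the exit code (42.0 is a made-up sample)."""
    code, message = check_temperature(reading)
    print(message)
    return code
```

When installed as a plugin, the script would finish with sys.exit(main()), so that Nagios
sees the exit code; the printed line is what appears next to the service in the GUI.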
Checks can be run either locally or remotely using a tool called NRPE (Nagios
Remote Plugin Executor). In addition to having Nagios run a check actively (see
subsection 4.2.1 on page 21), one can also feed data into Nagios asynchronously (see
subsection 4.2.2 on page 22). For more information, please see www.nagios.org or
[2].
OpenNMS is another network monitoring and management software package. While
Nagios achieves portability across different platforms by using C as its programming
language, OpenNMS is written in Java, which also makes it very portable. It requires
a database as its backing store. It provides a more modern GUI to the user; otherwise
its features are mostly comparable to the others.
From [3]:
Zabbix is an enterprise-class open source distributed monitoring solution.
Zabbix is software that monitors numerous parameters of a network and
the health and integrity of servers. Zabbix uses a flexible notification mechanism
that allows users to configure e-mail based alerts for virtually any event.
This allows a fast reaction to server problems. Zabbix offers excellent reporting
and data visualisation features based on the stored data. This makes Zabbix
ideal for capacity planning.
Zabbix supports both polling and trapping. All Zabbix reports and statistics,
as well as configuration parameters, are accessed through a web-based
front end. A web-based front end ensures that the status of your network and
the health of your servers can be assessed from any location. Properly configured,
Zabbix can play an important role in monitoring IT infrastructure.
This is equally true for small organisations with a few servers and for large
companies with a multitude of servers.
Zabbix is written in C and PHP and requires a database backing.
Finally, we look at Zenoss, which is our integration platform. The official
documentation [4] says:
Zenoss is today's premier open source IT management solution. Through integrated
monitoring, it enables you to manage the status and health of your
infrastructure through a single, Web-based console.
The power of Zenoss starts with its in-depth Inventory and Configuration
Management Database (CMDB). Zenoss creates this database by discovering
managed resources (servers, networks, and other devices) in your IT environment.
The resulting environment model provides a complete inventory of
your key systems, down to the level of resource components (interfaces, services,
and processes, and installed software.)
With the model built, you can use Zenoss' integrated availability and performance
monitoring features to monitor and report on all aspects of your IT
infrastructure. Zenoss also provides events and fault management features
that tie into the CMDB. These features help drive operational efficiency and
productivity by automating many of the notification, alerts, escalation, and
remediation tasks you perform each day.
Zenoss is written in Python and is based on the Zope application platform and, like most
of the previously mentioned software products, it requires a database, specifically MySQL.
servers
routers
racks
switches
can be monitored. Since SNMP can be implemented even on very small devices, it is
available even for devices such as air conditioning controllers.
Currently, SNMP exists in three versions (the year of standardization by the
Internet Engineering Task Force is given in parentheses):
Even though the latest version of SNMP brings very important new features, like
authentication and encryption, it is still not supported by some network management
software suites.
SNMP is a datagram protocol, and therefore there is a possibility of data being
lost en route. This is especially important when using passive monitoring: network
elements such as routers can cause UDP packets to be lost, and in the case of a fatal
error (by fatal error we mean an error that causes the monitored device to power off)
the notification may not be received at all, so that the error is discovered only through
some other malfunction (typically a network segment being down, or a service like a
database or web server being inaccessible).
When working with SNMP-based technologies, one can often come across the following
terms:
OID
varbind
table
scalar
index
As mentioned above (subsection 3.1.1 on page 12), there is a special format that
describes the data sent over SNMP. The format of a MIB is derived from ASN.1 (see
subsection 3.1.3.1 on page 14). Formally, it has been defined in [9]. Citation:
3.1.3.1 ASN.1
Abstract Syntax Notation One is one of many approaches to describing data structures.
What makes it stand out is that it not only allows the structure to be specified, but
also describes its encoding and decoding into various formats (ranging from binary
formats to XML).
ASN.1 is an international standard adopted by the International Telecommunication
Union (ITU) and by ISO/IEC. It has been standardized as [10-13]. Due to its versatility,
ASN.1 and its hierarchical data model are used in other application protocols as well,
including internet telephony (H.323) and directory services (LDAP).
Rather than being a single protocol specification, IPMI specifies a full set of physical
interfaces to a system controller, a communication protocol and a data representation. It
is specified in [14], a standard designed by a consortium of computer manufacturers led
by Intel. Citation from [14]:
The IPMI specifications define standardized, abstracted interfaces to the platform
management subsystem. IPMI includes the definition of interfaces for
extending platform management between board within the main chassis, and
between multiple chassis.
The term platform management is used to refer to the monitoring and
control functions that are built in to the platform hardware and primarily used
for the purpose of monitoring the health of the system hardware. This typically
includes monitoring elements such as system temperatures, voltages,
fans, power supplies, bus errors, system physical security, etc. It includes
System management has traditionally used a particularly simple approach: a serial
line or its alternatives, telnet or secure shell access to the system controller or
to the system itself.
The system controller on most server platforms offers a broad range of system
management possibilities. Besides power control and console control, it also gives the
system administrator the ability to display the status of sensors and to list system
events.
# ssh root@myhost
Password:
Waiting for daemons to initialize...
Daemons ready
/SYS
Properties:
product_name = SPARC-Enterprise-T5220
Although the output is optimized for human reading and not for programmatic analysis,
there are well-established tools that can parse this output (expect [17]) and feed
the resulting data to system management software.
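To illustrate what such parsing involves, here is a minimal Python sketch that extracts
"name = value" properties from text shaped like the listing above. The sample text is
embedded for the sake of the example; a real integration would capture it from an SSH or
serial session, and the function name parse_properties is our own.

```python
import re

# Sample text shaped like the system controller listing; embedded for illustration.
SAMPLE_OUTPUT = """\
/SYS
Properties:
product_name = SPARC-Enterprise-T5220
"""

# One property per line: a name, an equals sign, and a (possibly empty) value.
PROPERTY_RE = re.compile(r"^\s*(\w+)\s*=\s*(.*?)\s*$")


def parse_properties(text):
    """Collect 'name = value' lines into a dictionary, ignoring everything else."""
    properties = {}
    for line in text.splitlines():
        match = PROPERTY_RE.match(line)
        if match:
            properties[match.group(1)] = match.group(2)
    return properties
```

Lines without an equals sign (prompts, section titles) are simply skipped, which keeps the
parser tolerant of banner text printed before the actual data.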
This technique applies not only to system controllers, but also to BIOSes and even
operating system command-line utilities. There are a few Zenoss extensions (ZenPacks)
that use the technique of parsing text output to deliver information on processes, CPU
load, storage status and more.
# cat /proc/partitions
major minor #blocks name
8 0 312571224 sda
8 1 309917916 sda1
8 2 1 sda2
8 5 2650693 sda5
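The listing above can be turned into structured data with a few lines of Python. This
is only a sketch: the sample is copied from the listing, and the function name
parse_partitions and the dictionary keys are our own choice.

```python
SAMPLE = """\
major minor #blocks name
8 0 312571224 sda
8 1 309917916 sda1
8 2 1 sda2
8 5 2650693 sda5
"""


def parse_partitions(text):
    """Return a list of dicts, one per partition line; the header line is skipped."""
    partitions = []
    for line in text.splitlines():
        fields = line.split()
        # Data lines have exactly four fields and start with a numeric major number.
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        major, minor, blocks, name = fields
        partitions.append(
            {"major": int(major), "minor": int(minor), "blocks": int(blocks), "name": name}
        )
    return partitions
```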
In addition to the protocols listed above, there are some other protocols used for system
management. One of the most mature is the syslog protocol.
The Unix system log protocol is specified in [18]. It was designed with networking in
mind, so although it is generally used on the local host, it is possible to set up the
daemon to filter and forward messages to a network host, where further processing
can be done. Usually, traditional syslog will not record the originating host name, so
there needs to be a special daemon, or the system logging daemon needs a special
configuration.
Because it is a very old protocol, there is almost no security (besides facilities like
rejecting a host that is not in a list), and by generating a flood of messages it is
possible to overload the daemon or fill the space in the /var/log filesystem, which may
lead to unexpected failures.
Commercial products (especially those that contain or can be used with their own
agents on remote hosts) also use various RPC mechanisms. Among the most common
are the following:
A description of these protocols is beyond the scope of this project; for further
information please consult the references. In the case of proprietary software, details
about the usage of these protocols may not be fully known, so using them as a
communication protocol with custom software may be very challenging.
4 Approaches to system management
In this chapter we describe possible approaches to system management and compare
them in terms of protocol requirements, generated network traffic and reliability.
Possibly the simplest approach to system management (more specifically, system
health monitoring) is simply to wait until a device stops working, rendering some
service or services unusable. While it is possible to do so (indeed, the author has
observed such an approach in an educational institution), there is no warning in advance,
and therefore this approach is only feasible in environments where setting up monitoring
would be more expensive than repairing failed systems.
To be able to monitor any system, there must be a way to connect to it. In systems
management, we usually use one of the following four communication channels:
local only
in-band communication
out-of-band communication
side-band communication
This implies that the operating system on the monitored device has to support the
handling of management traffic (usually, this is accomplished by running a so-called
agent). It also means that management traffic occupies (at least partially) useful
bandwidth and that the agent uses some CPU cycles.
On the other hand, this type of communication poses no additional requirements
on the existing network infrastructure: no additional cabling is required and
no changes to network switches and routers need to be made. Especially when dealing
with many servers, the savings on network infrastructure may be significant.
One significant drawback of this approach is that without the operating system
running, management may not be possible (although servers with Wake-on-LAN
capability can at least be turned on remotely).
therefore it is very important to develop and enforce security guidelines with the same
strictness as the guidelines applying to operating system and network security.
In conclusion, the drawback of this approach is higher network infrastructure costs,
but for setups requiring additional features like storage redirection, this approach
is beneficial.
By active monitoring we mean a setup where the monitoring station (i.e. a box
running monitoring software) actively queries the managed (monitored) devices.
Certain protocols (like IPMI) support only this type of monitoring; others (like
SNMP) support both active and passive monitoring.
During active monitoring, the following data (although not all of it may be
available) is usually gathered and/or updated at regular time intervals:
Depending on the verbosity of the data obtained and on the time intervals, active
monitoring can cause significant network traffic (which may be unfavourable especially
when using in-band communication). However, the amount of data transmitted can be
regulated by selecting only a subset of the data (e.g. checking a system status and
reading an extended set of data only when the status changes).
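The idea of reading the extended data only on a status change can be sketched in Python
as follows. The two callables read_status and read_details are hypothetical stand-ins
(for example, an SNMP GET of a single value and a walk of a whole table); they are not
part of any particular monitoring package.

```python
def poll_device(read_status, read_details, last_status):
    """Poll one device: always read the cheap status, fetch details only on change.

    read_status and read_details are caller-supplied callables (hypothetical).
    Returns (status, details); details is None when the status did not change.
    """
    status = read_status()
    if status == last_status:
        return status, None
    return status, read_details()
```

The caller keeps the returned status and passes it back as last_status on the next poll,
so the expensive read happens once per change rather than once per interval.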
An advantage of active monitoring is reliability: even when using unreliable data
transfer (the UDP protocol used with SNMP), the monitoring station can usually
detect missing data and request it again.
Another huge advantage is the ability to gather statistically relevant data to be
stored and processed (like power consumption, network port traffic, etc.). Advanced
features of monitoring software can include graphing and reporting, which can in
turn be used to consolidate computing resources in a power-efficient way.
This type of monitoring is supported by most network devices, ranging
from servers to low-cost switches.
A huge advantage is that very little network traffic is generated, and this
method is also very CPU friendly (neither the agent or system controller nor the
monitoring station processes huge amounts of data).
This method may not be supported by all devices.
When both of the above-mentioned approaches are combined, possibly the most reliable
monitoring system can be built. However, not all monitoring packages allow these
two approaches to be combined.
The modus operandi is as follows:
1. The monitoring station reads all data using the active approach (i.e. the full
repository).
2. Monitored hosts issue notifications based on changes of their status.
3. The monitoring station updates its data either by:
a. using solely the data from the passive notification
b. refreshing all data from the appropriate monitored device
4. Once in a while, the monitoring station refreshes all data (in case a notification
was lost).
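The steps above can be sketched as a small Python class. This is an illustration under
stated assumptions rather than the design of any existing product: poll_all is a
hypothetical callable that returns all data of a device as a dictionary, and time stamps
are passed in explicitly to keep the sketch deterministic.

```python
class CombinedMonitor:
    """Keeps a per-device data cache fed by both polling and notifications."""

    def __init__(self, poll_all, refresh_interval=3600.0):
        # poll_all(device) is a hypothetical callable returning a dict of all data.
        self.poll_all = poll_all
        self.refresh_interval = refresh_interval
        self.data = {}
        self.last_refresh = {}

    def full_refresh(self, device, now):
        """Steps 1 and 4: read the full repository from the device."""
        self.data[device] = dict(self.poll_all(device))
        self.last_refresh[device] = now

    def on_notification(self, device, changes, now, repoll=False):
        """Step 3: apply the notification alone (a) or re-read the device (b)."""
        if repoll or device not in self.data:
            self.full_refresh(device, now)
        else:
            self.data[device].update(changes)

    def tick(self, now):
        """Step 4: refresh devices whose data is older than the interval."""
        for device, stamp in list(self.last_refresh.items()):
            if now - stamp >= self.refresh_interval:
                self.full_refresh(device, now)
```

Variant (a) is cheaper, while variant (b) protects against notifications that carry
incomplete data; the periodic tick covers notifications that never arrived.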
The selection for a particular setup will depend on the available software, the number
and type of devices, the current network infrastructure hierarchy, and the time and
budget allotted.
5 Sensors and components
Before we can get deeper into the actual data presented by Oracle Sun system
controllers and agents, we need to define and explain terms that are connected with a
server.
A component is any functional part of the server. A component may be in one of
the following states:
present
absent
functioning
about to malfunction
malfunctioning
unknown
A very closely related term is sensor. Sensors are usually connected with components,
although they may be connected with the whole system. There are fundamentally
two types of sensors:
The difference is that the values of virtual sensors are computed from physical sensors.
It should be noted that for some virtual sensors, the underlying physical sensors may
be hidden.
Physical sensors usually detect values being out of range or just true/false
conditions. Some types of physical sensors:
Among virtual sensors are those whose condition is based on the state of other sensors
(e.g. a power sensor measuring in watts will be calculated from the appropriate voltage
and current sensors) or on a condition detected by software. For example:
Some sensors (mostly physical) have thresholds set up. A threshold is a value
that the measured value must reach and cross for the sensor to change its state.
Usually, only sensors that measure continuous values (numeric sensors, the opposite
being discrete sensors) have defined thresholds:
non-critical
critical
non-recoverable
disabled
memory error detected
OK/fail
present/absent
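For numeric sensors, the threshold mechanism can be sketched in Python as follows. The
sketch handles upper thresholds only and assumes that reaching a threshold already counts
as crossing it; both the function name classify and the state name "ok" are our own.

```python
def classify(value, non_critical, critical, non_recoverable):
    """Map a numeric reading to a state using upper thresholds.

    Assumes non_critical <= critical <= non_recoverable; lower thresholds
    would be handled symmetrically.
    """
    if value >= non_recoverable:
        return "non-recoverable"
    if value >= critical:
        return "critical"
    if value >= non_critical:
        return "non-critical"
    return "ok"
```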
Both kinds of sensors have so-called assertions and deassertions, which are opposites
of each other. Assertion means that the sensor assumes some state (usually an
error state); deassertion means that the sensor leaves a state that was previously
asserted.
However, this may sometimes be tricky; let's see an example. We have a sensor
HDD0 (the names are usually longer, but for the sake of the example let's keep this one)
that has the following states:
Device Present
Device Absent
Hot Spare
Rebuild In Progress
and for all of them, both assertion and deassertion are enabled. In this particular
example, having the sensor in Device Present Assert means that the particular device
is present. Similarly, Device Absent Assert means that the device has been removed.
There is, however, one more approach: to have the device in Device Absent
Deassert and Device Present Deassert. Both mean the same thing as the ones in the
previous paragraph: the device has been inserted (is no longer absent) and the device
has been removed (is no longer present), respectively. Any integration dealing with
sensors must be aware of this and should preferably translate incoming notifications
into one common format and discard the less common and more confusing one.
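Such a translation into one common format can be sketched in Python. The table of
opposite states below covers only the HDD0 example; a real integration would need one
per sensor type, and the names OPPOSITES and normalize are hypothetical.

```python
# Pairs of mutually exclusive states; the pairing is specific to this example sensor.
OPPOSITES = {
    "Device Present": "Device Absent",
    "Device Absent": "Device Present",
}


def normalize(state, asserted):
    """Translate a (state, asserted) pair into an asserted-only form.

    A deassertion of a state with a known opposite becomes an assertion of
    that opposite; deassertions without a known opposite yield None.
    """
    if asserted:
        return state
    return OPPOSITES.get(state)
```

With this, both Device Absent Deassert and Device Present Assert end up as the single
state Device Present, so later processing has to deal with one form only.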
6 Management interfaces of Oracle Sun servers
Since this project focuses on systems management of Oracle Sun servers, we first
need to describe the management capabilities of these servers.
Oracle (and previously Sun) has a very broad portfolio of servers. However, for
this project, we will focus on the following hardware families:
The work will be done primarily on the latest available servers (i.e. not End-of-Life
ones). Although it may seem a waste of time to also target servers that are no longer in
production, it is the author's belief that these servers may still be present, especially
in educational institutions, where their performance is still sufficient and where an
open-source tool for monitoring will be more than beneficial.
All the servers mentioned above have a special, independent on-board computer that
controls power, monitors environmental and system characteristics (voltages, device
presence, fan speeds, etc.) and reports them using the methods described below. This
computer is called the system controller on SPARC servers and the service processor on
x86 servers.
On the Oracle Sun servers mentioned above, one may encounter the following
versions of system controllers:
ALOM is the oldest of these, and one can find it only on older SPARC servers
(there are two versions, ALOM and ALOM-CMT, the first being used on sun4u
platforms and the latter on servers with the UltraSPARC T1 processor; these
processors have the ability to run several threads in parallel, also called Chip
Multithreading, hence the abbreviation CMT).
ALOM has only a command-line interface and can send e-mail to the administrator
in the event of a malfunction; newer versions of ALOM-CMT also support the SNMP
protocol. There is no web GUI, though. ALOM is primarily out-of-band (using a serial
line or its own network port), but it can be configured from within Solaris using
the scadm(1M) command. Its features are pretty much standard:
power control
serial console redirection
logical domains (on CMT machines, [27])
environment monitoring
listing, disabling and enabling components
eLOM, on the other hand, can be found only on older x86 platforms. It offers a
command-line interface, an SNMP interface and a web interface. In addition to the
features listed for ALOM (except logical domains), eLOM has these additional features:
ILOM is the latest and actively developed system controller software. It can be
found on both SPARC and x86 servers and it offers everything that ALOM and eLOM offer
together.
# ssh root@alom-server
sc> showhost
Sun-Fire-T2000 System Firmware 6.7.6 2009/10/29 16:06
serial line
telnet (may be disabled for security reasons)
secure shell
internally over OS tool (e.g. scadm(1M))
6.3 SNMP
The SNMP interface is arguably the most used interface for system management. Both
eLOM and ILOM have supported SNMP since their very first versions; ALOM-CMT started
to support SNMP directly relatively late.
However, either due to the absence of an SNMP interface (ALOM-CMT prior to v1.4) or
due to a simple wish to monitor the system in-band, there are so-called agents. There
are currently two:
Monitoring Agent for Sun Fire and Netra Systems (MASF) [28]
# ssh root@elom-host
root@elom-host's password:
Version 2.91
Hostname: SUNSP0016365B97FB
IP address: 10.18.141.146
/SP/SystemInfo/ProductInfo
Targets:
Properties:
ProductManufacturer = Sun Microsystems
ProductProductName = Sun Fire X2200 M2
ProductPartlNumber = 1S39U9ZST61
ProductSerialNumber = 0624QC0029
AssetTag =
Target Commands:
show
MASF is available only on SPARC systems, but it supports both the ALOM (including the
CMT variant) and ILOM system controllers. On the other hand, the Hardware Management
Agent supports only x86 systems, and only those running specific versions
of ILOM.
All system controllers supporting SNMP and both agents can be configured to
accept incoming SNMP requests for data (useful when monitoring these systems
actively, also known as polling) and/or they can send SNMP traps or notifications
# ssh root@sparc-ilom
Password:
Waiting for daemons to initialize...
Daemons ready
Properties:
type = Host System
ipmi_name = /SYS
keyswitch_state = Normal
product_name = SPARC-Enterprise-T5220
product_part_number = 602-3821-08
product_serial_number = BEL07513TT
product_manufacturer = SUN MICROSYSTEMS
fault_state = OK
power_state = On
...
on their own (passive monitoring). However, the format of the data differs considerably
among the types of service controllers and agents. Its structure is important for further
work on the integration with Zenoss, so the data structure (described using MIBs)
will be discussed in the next section.
The format and purpose of a MIB have already been defined (see section 3.1.3 on page 13).
Oracle Sun systems (or more precisely, the system controllers and agents) implement some
of the following MIBs:
ENTITY-MIB
SUN-PLATFORM-MIB
SUN-ILOM-PET-MIB
SUN-HW-TRAP-MIB
SUN-HW-MONITORING-MIB
SUN-ASR-NOTIFICATION-MIB
In the following paragraphs, we will look into these MIBs in more detail.
ENTITY-MIB is the only MIB that has not been defined by Oracle (formerly Sun). It
is defined in an independent specification [30]. The purpose of the MIB is given as
follows ([30]):
In particular, it (this MIB) describes managed objects used for managing mul-
tiple logical and physical entities managed by a single SNMP agent.
ENTITY-MIB contains structures that (in terms of server management) describe
various components of the server, including details such as the count and type of
processors, the manufacturer of DIMM modules, etc.
SUN-PLATFORM-MIB is a MIB that extends ENTITY-MIB with details about
operational state, and it also contains tables that identify and list system sensors,
together with their thresholds and current values. This MIB also defines
some notifications that can be used to dynamically modify the model of the monitored
system and/or can be translated and displayed to the user. However, these traps do not
carry all the information (like the type of sensor issuing the warning), so additional
action is required to get it (typically, this is done using a regular expression
that looks for a certain pattern in sensor names). Using regular expressions is a
quick and functional way, but the author believes the correct approach is to poll the
agent or system controller for the correct sensor type, based on the OIDs present in the
received notifications. These two MIBs are supported in MASF (SPARC) and in all ILOMs
and eLOMs.
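The regular expression technique mentioned above can be sketched in Python as follows.
The patterns and the sensor names used with them are purely illustrative assumptions;
real sensor names differ between platforms and firmware versions, which is exactly why
the polling approach is considered more correct.

```python
import re

# Illustrative patterns only; they are not taken from any Sun or Oracle MIB.
SENSOR_PATTERNS = [
    (re.compile(r"/T_\w+$"), "temperature"),
    (re.compile(r"/V_\w+$"), "voltage"),
    (re.compile(r"TACH|FAN"), "fan"),
]


def guess_sensor_type(name):
    """Return the first sensor type whose pattern matches the sensor name."""
    for pattern, sensor_type in SENSOR_PATTERNS:
        if pattern.search(name):
            return sensor_type
    return "unknown"
```

A name that matches none of the patterns is reported as "unknown" rather than guessed,
so the integration can fall back to polling for the sensor type in that case.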
SUN-ILOM-PET-MIB is one of the MIBs that do not use the typical Sun (Oracle)
OID tree; instead it uses the tree wiredformgmt (Wired for Management). This
is an OID tree reserved by Intel [31] for so-called PETs (Platform Event Traps). These
largely correspond to IPMI events and often carry similar data. However, each generated
trap carries a computed specific type (a number that identifies the type of trap or
notification being sent). Most NMSes cannot deal with dynamic specific types;
they expect these numbers to be assigned statically and defined in a MIB, and that
is the purpose of this MIB. However, if there is another PET MIB by a different
vendor, the two will share the OID tree and the numbers will collide. Not only will
the names and descriptions of most or all notifications differ, but some may have
a totally different meaning.
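To illustrate what "computed" means here, the sketch below splits a specific type into its fields. It assumes the layout as the editor understands it from the PET format specification [31] (sensor type in bits 23..16, event type in bits 15..8, event offset in the low byte with bit 7 marking a deassertion); this layout should be verified against the specification before use, and the function name is hypothetical.

```python
def decode_pet_specific_trap(specific):
    """Split a PET specific-trap number into its component fields.

    The bit layout is an assumption based on the PET format
    specification and should be checked against it.
    """
    offset_byte = specific & 0xFF
    return {
        "sensor_type": (specific >> 16) & 0xFF,
        "event_type": (specific >> 8) & 0xFF,
        "event_offset": offset_byte & 0x0F,
        "deassertion": bool(offset_byte & 0x80),
    }
```

Because the number is derived from the event itself rather than assigned by the vendor, two vendors' PET MIBs necessarily name the same numbers, which is the source of the collisions mentioned above.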
SUN-HW-TRAP-MIB was designed relatively recently with a single purpose: to
eliminate the need for regular expression matching or polling the agent when a trap is
received. Hence, a direct display of these traps is preferred.
SUN-HW-MONITORING-MIB was designed to remove the dependency on ENTITY-MIB
and to provide some more information about the monitored system. It features data
such as the cumulative state, which is computed on the monitored host. The main
advantage of this approach is reduced network traffic: an NMS may poll only a few
values in the MIB and fetch the full tree only when something goes wrong. This MIB is
implemented only in the Hardware Management Agent.
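The traffic-saving idea can be sketched as below. The two callables snmp_get and snmp_walk are hypothetical wrappers around whatever SNMP library is in use, the value meaning "OK" (ok_value) is an assumption, and the subtree name passed to snmp_walk is a placeholder; only the scalar name sunHwMonCumulativeSensorAlarmStatus comes from the MIB.

```python
def poll_host(snmp_get, snmp_walk, ok_value=1):
    """Poll one cheap scalar; fetch the full sensor tree only on trouble.

    snmp_get(name) and snmp_walk(root) are caller-supplied callables
    (hypothetical); ok_value is an assumed "everything is fine" value.
    """
    status = snmp_get("sunHwMonCumulativeSensorAlarmStatus")
    if status == ok_value:
        # Healthy host: one request was enough.
        return {"status": status, "details": None}
    # Something is wrong: now the expensive full walk is justified.
    return {"status": status, "details": snmp_walk("sunHwMonitoringMIB")}
```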
SUN-ASR-NOTIFICATION-MIB is currently implemented by the ASR agent.
Description from [32]:
ASR is a secure, scalable, customer-installable software feature of warranty
and SunSpectrum support that provides auto-case generation when specific
hardware faults occur. ASR is designed to enable faster problem resolution by
eliminating the need to initiate contact with Sun for hardware failures, reducing
both the number of phone calls needed and overall phone time required.
ASR also simplifies support operations by utilizing electronic diagnostic data.
When a hardware error is detected, the ASR agent sends details
about the error, together with a unique identifier of the system, to Oracle, where the
data is filtered and entered as a Service Request on behalf of the customer. This saves
time and communication effort. In addition, ASR generates an SNMP notification to
inform the customer that a Service Request is being created on his behalf.
6.3.1.2 Notifications
It is not feasible to describe every single notification declared in all MIBs, as that
would make this document excessively long and also very quickly outdated. In this
section, we will describe the basic principles behind notifications in Oracle (Sun)
MIBs.
Please bear in mind that SNMP is UDP based, so individual traps may be lost.
Therefore each trap with lower severity (e.g. one suggesting that the system is getting
into a better condition) should automatically close all previous events with higher
severity, if they were sent for the same sensor.
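The clearing rule can be sketched in a few lines. The numeric severity scale (higher is worse, 0 meaning the sensor is OK again) and the function name are assumptions made for illustration; this is not how any particular NMS implements clearing.

```python
def apply_trap(open_events, device, sensor, severity):
    """Record a trap and close older, more severe events for the same sensor.

    open_events maps (device, sensor) to a list of open severities.
    Higher numbers mean more severe; 0 means the sensor is OK again.
    """
    key = (device, sensor)
    # Keep only events that are not more severe than the new trap.
    remaining = [s for s in open_events.get(key, []) if s <= severity]
    if severity > 0 and severity not in remaining:
        remaining.append(severity)
    open_events[key] = remaining
    return remaining
```

With this rule, a lost intermediate trap does no harm: if a critical event is followed directly by an "OK" trap, the critical event is still closed.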
SUN-ASR-NOTIFICATION-MIB has only five notifications:
sunAsrSrCreatedTrap
sunAsrSrCreationInProgressTrap
sunAsrSrUpdatedTrap
sunAsrSrDelayedTrap
sunAsrSrFailureTrap
With these notifications, an NMS can display appropriate messages when a service
request gets created, is being created, has been updated, is delayed or has failed,
respectively.
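As a minimal sketch, the five notifications can be translated into operator messages with a simple lookup table. The notification names are those listed above; the message wording and the function name are illustrative only.

```python
# Message texts are illustrative; only the trap names come from the MIB.
ASR_MESSAGES = {
    "sunAsrSrCreatedTrap": "Service request has been created",
    "sunAsrSrCreationInProgressTrap": "Service request is being created",
    "sunAsrSrUpdatedTrap": "Service request has been updated",
    "sunAsrSrDelayedTrap": "Service request is delayed",
    "sunAsrSrFailureTrap": "Service request creation has failed",
}


def asr_message(trap_name):
    """Return a readable message for an ASR notification name."""
    return ASR_MESSAGES.get(trap_name, "Unknown ASR notification: " + trap_name)
```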
entPhysicalTable
entLogicalTable
entLPMappingTable
entAliasMappingTable
entPhysicalContainsTable
sunHwMonInventoryTable
sunHwNumericVoltageSensorTable
sunHwDiscreteVoltageSensorTable
sunHwNumericCurrentSensorTable
sunHwDiscreteCurrentSensorTable
sunHwNumericPowerDeviceSensorTable
sunHwDiscretePowerDeviceSensorTable
sunHwNumericCoolingDeviceSensorTable
sunHwDiscreteCoolingDeviceSensorTable
sunHwNumericTemperatureSensorTable
sunHwDiscreteTemperatureSensorTable
sunHwNumericProcessorSensorTable
sunHwDiscreteProcessorSensorTable
sunHwNumericMemorySensorTable
sunHwDiscreteMemorySensorTable
sunHwNumericHardDriveSensorTable
sunHwDiscreteHardDriveSensorTable
sunHwNumericIOSensorTable
sunHwDiscreteIOSensorTable
sunHwNumericSlotOrConnectorSensorTable
sunHwDiscreteSlotOrConnectorSensorTable
sunHwNumericOtherSensorTable
sunHwDiscreteOtherSensorTable
sunHwMonIndicatorTable
sunHwMonTotalPowerConsumption
As one can see, this MIB is more fine-grained than ENTITY-MIB. In addition to these
tables, certain values of interest are also directly available as scalars, which radically
simplifies writing management extensions. There are quite a few scalars; only some
are listed below (for a full list and description see the MIB itself, which is well
commented):
sunHwMonProductName
sunHwMonProductType
sunHwMonCumulativeSensorAlarmStatus
sunHwMonIndicatorServiceName
sunHwMonIndicatorServiceCurrentStatus
6.4 IPMI
IPMI is supported only in eLOM and ILOM. Utilities that access system controllers
over IPMI (e.g. ipmitool(1M), [33]) can use two connection methods:
While the first is always available, KCS (Keyboard Controller Style) was not
available on SPARC systems until recently; this was caused by a missing driver, not
by a hardware defect [35].
All of the system controllers can send notifications using e-mail, and they can also
forward events to a system logging daemon running on a remote host. To the author's
knowledge, these interfaces are seldom used.
However, the web interface is used quite often. It offers a quick way to check
server status and server components, and also to upgrade firmware remotely without
having to run a TFTP server.
Since we have now described all management protocols, approaches and the interfaces
available on Oracle Sun servers, we can start designing and implementing the Zenoss
integration. The materials [36-41] were invaluable resources and provided all the
information needed for designing and implementing the integration.
Zenoss supports both the active and the passive approach. To be able to actively poll
system controllers or agents for data, it is necessary to develop plugins in Python that
extend Zenoss' object model. While the API is not overly complex and ENTITY-MIB
modelling is already present, it would be time consuming to implement the other MIB
(SUN-HW-MONITORING-MIB), and management capabilities would thus be limited
to system controllers with ILOM and eLOM and to SPARC hosts running MASF.
On the other hand, implementing trap handling is easier, and as a result of
implementing support for SUN-PLATFORM-MIB and SUN-HW-TRAP-MIB notifications
many more platforms will be supported:
Eventually, the desired functionality is that of the existing integration with IBM Tivoli
Enterprise Console [42] or IBM Tivoli Netcool/OMNIbus [43].
The development environment was a VirtualBox virtual machine running Debian
GNU/Linux 5.0 with the Zenoss 2.5.1 stack installed (recently updated to 2.5.2).
Development followed Jane Curry's guide [40]: the development tree was stored
outside of Zenoss and versioned in a Mercurial repository.
Zenoss organizes events into event classes. Certain classes already exist,
such as /Hw/Perf. Two approaches were possible:
While the first approach would let the integration fit seamlessly into the
existing environment (especially helpful when users already have some paging, e-mail
or other notifications set up), the second approach guarantees that there will be no
clashes with the existing setup (unless, of course, the user creates his own event
classes with the same names).
As this integration should not break anything in the end user's setup, it was
decided to create a completely separate namespace.
When Zenoss receives an event (in this case caused by receiving an SNMP notification),
it will try to process the event using the Event Class Key, which is usually the name of
the SNMP notification (provided the MIB is loaded and compiled). To do that, it
searches its database for Event Class Mappings, which play a role similar to
rules in other software.
When no mapping is found, it will look for a defaultmapping that may
process the generated event. Although it would be simpler to develop just one block of
code to process these events, there is a concern that running a larger block of code for
every single notification would make the application much slower. Hence, the decision
was made to create a mapping for every single SNMP notification.
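The lookup order described above can be pictured with the following conceptual sketch. It is not Zenoss code: the mappings are modelled as a plain dictionary keyed by Event Class Key, and the function name is hypothetical.

```python
def find_mapping(mappings, event_class_key):
    """Return the mapping for the key, falling back to 'defaultmapping'.

    mappings is a plain dict standing in for the Zenoss mapping database.
    Returns None when neither a specific nor a default mapping exists.
    """
    if event_class_key in mappings:
        return mappings[event_class_key]
    return mappings.get("defaultmapping")
```

With per-trap mappings, each notification runs only the small transform registered under its own key; a single defaultmapping would instead have to contain the logic for every notification and run it for each event.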
In this section we will describe the steps taken to develop this integration. There is
one step common to all subsequent steps: once it has been verified that the
described action was successful, the resulting objects are added to the ZenPack (called
This is arguably the simplest step. It involves copying the MIBs used to the location
where Zenoss expects them ($ZENHOME/share/mibs/site). The $ZENHOME
environment variable is set by default for the user zenoss.
Then, as user zenoss, one has to run the command
$ zenmib -v 10
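The copying step can also be scripted. The sketch below is written for a current Python 3 interpreter rather than for the Python bundled with Zenoss; the function name and the source_dir argument are hypothetical, and only the target path comes from the text above.

```python
import os
import shutil


def install_mibs(source_dir, zenhome=None):
    """Copy all files from source_dir into $ZENHOME/share/mibs/site."""
    zenhome = zenhome or os.environ["ZENHOME"]
    target = os.path.join(zenhome, "share", "mibs", "site")
    os.makedirs(target, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(source_dir)):
        path = os.path.join(source_dir, name)
        if os.path.isfile(path):
            shutil.copy(path, target)
            copied.append(name)
    return copied
```

After the files are in place, the zenmib command shown above still has to be run as user zenoss to compile them.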
Before creating mappings, it is necessary to have all the event classes to which we
want to map events. Based on the two MIBs used now, the following classes will
be created:
/Events/Oracle
/Events/Oracle/Voltage
/Events/Oracle/Temperature
/Events/Oracle/Electrical Current
/Events/Oracle/Fan Speed
/Events/Oracle/Other
/Events/Oracle/Power Supply
/Events/Oracle/Fan
/Events/Oracle/Processor
/Events/Oracle/Memory
/Events/Oracle/Hard Drive
/Events/Oracle/IO
/Events/Oracle/Slot or Connector
/Events/Oracle/Component
/Events/Oracle/FRU
/Events/Oracle/Power Consumption
These can be created from the GUI by following the Events menu item in the left
navigation bar and then clicking Add New Organizer in the menu to the left of
Subclasses.
However, it is also possible to do this using the tool zendmd, which is essentially
a Python interpreter with preloaded Zenoss classes [44] (this is just a skeleton script;
the full version can be found on the CD in the directory scripts as the file
createEventClasses.py):
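import Globals
from transaction import commit
from Products.ZenUtils.ZenScriptBase import ZenScriptBase

# Connect to the Zenoss object database (DMD).
dmd = ZenScriptBase(connect=True).dmd

# Event classes to create; the list is abbreviated here.
event_classes = [
    '/Events/Oracle',
    '/Events/Oracle/Voltage',
    ...
]

# Create each event class, then commit the transaction.
for ec in event_classes:
    dmd.Events.manage_addOrganizer(ec)
commit()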
As a result, we now have all event classes we need in place and can proceed to
the event mappings creation.
The recommended procedure for creating Event Class mappings is to have the Zenoss
SNMP daemon receive all possible notifications and then create the mappings
from the GUI. These mappings can then be modified, again from the GUI [39].
However, if we do that for just one notification, we can observe that the following
attributes are present (filled values are in parentheses); the rest is to be filled in
manually:
Name: An identifier for this event class mapping. Not important for matching
events.
Event Class Key: Must match the incoming event's eventClassKey field
for this mapping to be considered as a match for events.
Sequence: Sequence number of this mapping, among mappings with an
identical event class key property. Go to the Sequence tab to alter its position.
Rule: Provides a programmatic secondary match requirement. It takes a
Python expression. If the expression evaluates to True for an event, this
mapping is applied.
Regex: The regular expression match is used only in cases where the rule
property is blank. It takes a Perl Compatible Regular Expression (PCRE).
If the regex matches an event's message field, then this mapping is applied.
Transform: Takes Python code that will be executed on the event only if
it matches this mapping. For more details on transforms, see the section
titled Event Class Transform.
Explanation: Free-form text field that can be used to add an explanation
field to any event that matches this mapping.
Resolution: Free-form text field that can be used to add a resolution field
to any event that matches this mapping.
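How the Event Class Key, Rule and Regex attributes interact can be illustrated with the sketch below. It is a simplified model and not the Zenoss implementation: mappings are plain dictionaries, Python's re module stands in for PCRE, and the rule is evaluated with eval, which must never be done with untrusted input.

```python
import re


def mapping_matches(mapping, evt):
    """Decide whether one mapping (a dict) applies to an event object."""
    if mapping.get("eventClassKey") != evt.eventClassKey:
        return False
    rule = mapping.get("rule")
    if rule:
        # The rule is a Python expression that can refer to the event as 'evt'.
        return bool(eval(rule, {"__builtins__": {}}, {"evt": evt}))
    regex = mapping.get("regex")
    if regex:
        # The regex is consulted only when the rule is blank.
        return re.search(regex, getattr(evt, "message", "")) is not None
    # Neither rule nor regex: the key match alone is sufficient.
    return True
```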
Although we could possibly enter all mappings using the GUI, this would be error
prone and not very efficient. Luckily, as Zenoss is based on Zope, every GUI action
has a corresponding Python function that can be called.
To manipulate event classes, we first need to get the object that represents them.
This can be done with the following method:
dmd.Events.getOrganizer(name)
eventClassKey and id shall be set to the translated name of the SNMP notification.
example shall be set to snmp trap <name>.
transform shall contain Python code that will modify the received event's text,
severity and possibly set other values so that clearing will work.
explanation and resolution may contain text explaining the nature of the
event.
Most of the traps from SUN-HW-TRAP-MIB will have processing similar to this
(please note that although MIBs do specify a user-friendly mapping of integers to
names, Zenoss does not use these mappings):
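A transform of this kind might look like the sketch below. It is wrapped in a function so that it can be tried outside of Zenoss, where transform code operates directly on an object named evt. The attribute names on evt (sunHwTrapComponentName, sunHwTrapThresholdType, sunHwTrapSensorValue, sunHwTrapThresholdValue) and the integer-to-name table are assumptions and must be checked against the MIB; severity 5 is Critical on Zenoss' scale.

```python
# Illustrative integer-to-name table; the real values must be taken from
# the MIB, since Zenoss does not translate them.
THRESHOLD_TYPES = {1: "upper", 2: "lower"}


def transform(evt):
    """Rewrite a threshold trap into a readable event (sketch only)."""
    kind = THRESHOLD_TYPES.get(int(evt.sunHwTrapThresholdType), "unknown")
    # Using the sensor name as the component lets later events for the
    # same sensor be correlated with this one.
    evt.component = evt.sunHwTrapComponentName
    evt.summary = "%s: %s threshold exceeded (value %s, threshold %s)" % (
        evt.sunHwTrapComponentName, kind,
        evt.sunHwTrapSensorValue, evt.sunHwTrapThresholdValue)
    evt.severity = 5  # Critical
    return evt
```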
Other notifications will have similar processing. How do we put all this together?
Let us put together an algorithm:
When this is done, one may end up with the following script. Of course, this is not a
complete script; the full version is present on the CD. First, we need to prepare a list
of notifications, together with their Event Classes:
definitions = []
Here, hw_thr_assert and hw_thr_deassert are strings that contain the
template for the transformation script to be input into Zenoss.
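Since the original listing is abbreviated, the following sketch shows one way such a list could be built. The template texts, the helper add_definition, the dictionary keys and the two notification names are illustrative assumptions; the event class path follows the classes created earlier, and severity 0 corresponds to a cleared event.

```python
# Hypothetical transform templates, filled in per notification.
hw_thr_assert = (
    "evt.summary = '%(what)s threshold exceeded'\n"
    "evt.severity = %(severity)d\n"
)
hw_thr_deassert = (
    "evt.summary = '%(what)s threshold no longer exceeded'\n"
    "evt.severity = 0\n"
)

definitions = []


def add_definition(trap, event_class, template, **values):
    """Append one notification with its event class and filled-in transform."""
    definitions.append({
        "trap": trap,
        "event_class": event_class,
        "transform": template % values,
    })


# Illustrative notification names; take the real ones from the MIB.
add_definition("sunHwTrapVoltageCritThresholdExceeded",
               "/Events/Oracle/Voltage", hw_thr_assert,
               what="Voltage critical", severity=5)
add_definition("sunHwTrapVoltageCritThresholdDeasserted",
               "/Events/Oracle/Voltage", hw_thr_deassert,
               what="Voltage critical")
```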
When we have the definitions array filled with transformation rules, we
can cycle through them and create mappings in Zenoss:
import Globals
from transaction import commit
from Products.ZenUtils.ZenScriptBase import ZenScriptBase
dmd = ZenScriptBase(connect=True).dmd
commit()
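The loop itself is not shown in the skeleton above; the full script is on the CD. One possible shape of it is sketched below under stated assumptions: getOrganizer is the call named earlier in this section, createInstance is assumed to be the method that adds a mapping to an event class, and the dictionary keys (trap, event_class, transform, explanation, resolution) are hypothetical. The caller is expected to call commit() afterwards.

```python
def create_mappings(dmd, definitions):
    """Create one event class mapping per definition (sketch only)."""
    created = []
    for d in definitions:
        # Look up the event class, then add a mapping named after the trap.
        organizer = dmd.Events.getOrganizer(d["event_class"])
        inst = organizer.createInstance(d["trap"])  # assumed API
        inst.eventClassKey = d["trap"]
        inst.example = "snmp trap %s" % d["trap"]
        inst.transform = d["transform"]
        inst.explanation = d.get("explanation", "")
        inst.resolution = d.get("resolution", "")
        created.append(inst)
    return created
```

Because the function only relies on getOrganizer and createInstance, it can be exercised against simple stand-in objects before being pointed at a live Zenoss database.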
Finally, we may want to add a new manufacturer and a list of products. This again
can be done from the GUI or from the command line using zendmd.
However, the syntax here is not as easy as in the first example, so for the purpose of
this project, products were created by hand using the GUI.
Manufacturer Oracle was added to Zenoss, and a list of servers was created:
Even though scripting the creation of the mappings saved us a considerable amount
of time, the script inevitably may not be able to generate all messages and severities
7.5 Testing
The optimal approach to testing would be to create automation that simulates
failures on physical machines, which would in turn respond with notifications. A
semi-manual check would then be required to confirm that the integration works as
expected.
However, due to time constraints and the unavailability of all testing machines, a
different approach was chosen. One server (Oracle Sun SPARC Enterprise T5220
Server) was configured to send notifications from the system controller and the MASF
agent to the IP address of the host running Zenoss with this integration. Hard drives,
power supplies and fans were then removed and reinstalled to verify that traps are
received and cleared.
As of now, the integration has just basic functionality. The following paragraphs
describe possible new features to be developed, possibly as future work of the author.
Testing framework. To ensure this software works, a complete automated testing
framework supporting physical servers needs to be developed and regularly run.
Better clearing mechanism. Right now, due to the way Zenoss handles clearing
events (i.e. only events with cleared severity can clear others), notifications
ending with Deassert have the severity cleared. This may not be accurate, because
even if the sensor reading drops below the non-recoverable threshold, its reading is
now critical and not OK.
Polling. This would mean developing a Zenoss plugin that would discover
and model the server using data obtained by periodically reading MIB data.
Model updates from traps. Instead of, or in addition to, writing to the event console
when an SNMP notification is received, a previously obtained model of the server could
either be updated or a reread of all data could be forced. This of course requires
functional polling; to function properly, the model will need to be updated from time
to time anyway, just to make sure that an SNMP notification wasn't lost en route.
Graphing and reporting. Based on data obtained by previous two extensions, it
would be possible to implement graphing and reporting, showing for example tem-
perature trends, and more importantly power consumption.
8 Conclusion
This project was partially research and partially implementation oriented. As a re-
sult, a brief yet hopefully useful description of system management motivations, tech-
nologies and software was given.
In addition, a basic but functional integration into an open-source system management
tool was developed and tested (albeit only in a limited way), by which this project
fulfilled its assignment.
The author implemented a new and previously unknown (or at least not publicly
described) way to create Event Class mappings programmatically.
However, the former idea of a complete monitoring solution that would do
polling, graphing and notifications simultaneously was not realized. Nonetheless,
although this solution does not use all features of Zenoss, there is room for
improvement, as described earlier.
References
[24] Sun Microsystems, Inc. Advanced Lights Out Management (ALOM) CMT v1.4
Guide. (2007a). 819-7991-10.
[25] Sun Microsystems, Inc. Embedded Lights Out Manager Administration Guide For
the Sun Fire X2200 M2 and Sun Fire X2100 M2 Servers. (2009). 819-6588-14.
[26] Oracle, Inc. Oracle Integrated Lights Out Manager (ILOM) 3.0 Getting Started
Guide. (2010c). 820-5523-11.
[27] Oracle, Inc. Oracle VM Server for SPARC. (2010e). (formerly LDOMS).
[28] Sun Microsystems, Inc. Sun SNMP Management Agent for Sun Fire and Netra
Systems. (2004).
[29] Oracle, Inc. Sun Server Management Agents 2.0 User's Guide. (2010b).
821-1610.
[30] K. McCloghrie and A. Bierman, RFC2737: Entity MIB (Version 2) (IETF, 1999).
Obsoleted by RFC 4133.
[31] Intel, HP, NEC, and Dell Platform Event Trap Format Specification. v1.0.
[32] Oracle, Inc. Auto Service Request (ASR) v2.6: Installation and Operations
Guide. (2010a). http://wikis.sun.com/display/ASRSO/Home.
[33] D. Laurie IPMItool. (2007). http://ipmitool.sourceforge.net/.
[34] Oracle, Inc. IPMItool. (2010d). http://www.sun.com/system-
management/tools.jsp.
[35] Sun Microsystems, Inc., PSARC 2008/119 sun4v /dev/bmc (Sun Microsystems,
Inc., 2008). (not available publicly).
[36] Zenoss, Inc. Zenoss Administration. (2010b).
[37] Zenoss, Inc. Zenoss Developer's Guide. (2010c).
[38] Zenoss, Inc., Zenoss 2.5 source code documentation (Zenoss, Inc., 2010a).
[39] J. Curry Zenoss Event Management. (2010). version 3.
[40] J. Curry, Creating Zenoss ZenPacks (Jane Curry, 2009a).
[41] J. Curry Crafting Zenoss Core users for events and zProperties. (2009b). draft.
[42] Sun Microsystems, Inc. Monitoring Sun Servers in an IBM Tivoli Enterprise
Console Environment. (2009b).
[43] Sun Microsystems, Inc. Monitoring Sun Servers in an IBM Tivoli
Netcool/OMNIbus Environment. (2009a).
[44] N. Brockett, batchaddlocations.py (Zenoss, Inc., 2009).
A CD Contents
As a part of this project, a CD was created. It contains the following files and
directories: