
Czech Technical University in Prague

Faculty of Electrical Engineering

Bachelor's Project
Sun servers open-source
software systems management

Ondej Jakubk

Supervisor: Ing. Josef Hajas

Study Program: Electrical Engineering and Information Technology


Computer Engineering

May 27, 2010


Acknowledgement

I would like to thank my family, my friends and my colleagues for their insight,
support and wisdom. I am truly grateful for being surrounded by such brilliant people.

Declaration

I hereby declare that I have completed this project independently and that I have
listed all the literature and publications used.
I have no objection to the use of this work in compliance with §60 of Act
No. 121/2000 Sb. (Copyright Act), and with the rights connected with the Copyright Act,
including the changes in the act.

In . . . . . . . . . . . . . . . . . . . . . . . on . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Abstrakt (translated from Czech)

The aim of this bachelor's thesis is to analyze the available software products for
systems management, both commercial and open, to analyze the possibilities of
integration with servers made by Oracle (Sun), and to implement an integration
solution in a selected tool.
The analysis also includes a theoretical part focused on the usefulness of systems
management, the methods used for data acquisition, and the protocols used for
monitoring and managing servers.

Abstract

The objective of this bachelor's project is to analyze available systems management
products, both commercial and open-source. It analyzes the possibilities of
integrating them with servers made by Oracle (Sun), and its result is an integration
into a selected software product.
The analysis also includes a theoretical part focused on the benefits of systems
management, the available methods of data acquisition, and the protocols that are
used for monitoring and managing servers.
Contents

1 Introduction

2 Systems management software
2.1 Commercial offerings
2.2 Open-source offerings

3 Protocols for system management
3.1 Simple Network Management Protocol
3.1.1 Monitoring over SNMP
3.1.2 Important terms related to SNMP
3.1.3 Management Information Base
3.1.3.1 ASN.1
3.2 Intelligent Platform Management Interface
3.3 Web-Based Enterprise Management
3.4 Other protocols
3.4.1 Remote shell access
3.4.2 Other protocols

4 Approaches to system management
4.1 Way of communication
4.1.1 In-band communication
4.1.2 Out-of-band communication
4.1.3 Side-band communication
4.2 By means of data gathering
4.2.1 Active monitoring
4.2.2 Passive monitoring
4.2.3 Combination of active and passive monitoring
4.3 Final comparison

5 Sensors and components

6 Management interfaces of Oracle Sun servers
6.1 System controllers
6.2 Command-line interface
6.3 SNMP
6.3.1 Oracle Sun MIBs
6.3.1.1 Origin and purpose of these MIBs
6.3.1.2 Notifications
6.3.1.3 Polled data
6.4 IPMI
6.5 Other interfaces

7 Zenoss integration
7.1 Choosing an approach
7.2 Development environment
7.3 Important design decisions
7.3.1 Event classes
7.3.2 Per-trap mapping vs. default mapping
7.4 Development steps
7.4.1 Compiling MIBs
7.4.2 Creating Event classes
7.4.3 Creating Event mappings
7.4.4 Adding products
7.4.5 Final modifications
7.5 Testing
7.6 Future extension

8 Conclusion

A CD Contents

1 Introduction

Systems management has become a very important topic in almost every organisation
that depends on IT services. It encompasses the entire life cycle of IT infrastructure,
including, for example, tracking and documenting requirements, purchasing and renewing
equipment, license management, and fault and risk monitoring. While systems management
has, in some way, always been present in the IT departments of mid-size to big
enterprises, the approach to it was often defined in a company-specific way, with no
standardization.
However, many companies now span a number of countries or even continents.
For all but the biggest companies, it would be very inefficient to invest in the
development of a complete in-house solution for systems management. These companies
rely on third-party solutions that offer a cheaper, well tested and supported alternative.
Decentralization of IT resources is a very important driver of the need for systems
management. It has become quite common to have more than one datacenter, often
in remote locations, possibly quite far apart from each other, so that in case of an
accident at or near one of them, the operations of a company can continue relatively
uninterrupted (by accident we mean here either a natural phenomenon, like flooding,
storm or fire, or an act of ill will, such as a terrorist attack). Because IT support
may not always be present on site, an advance warning of possible component failures
is very important. Some, albeit not all, system management software suites can even
tie individual systems, groups of systems or even components to a service, so when a
failure is imminent, one can see which services are in jeopardy.
Businesses of today rely on IT more than ever before. Even a minute-long outage
can cost thousands of dollars. Therefore, some companies (notably telecommunication
companies, banks, etc.) build systems with a certain level of redundancy, so that when
one system fails, another system takes over in a reasonable amount of time and the
interruption is barely noticeable. System management is necessary in this case, as it
provides information about the nature of the failure and helps with selecting and
migrating to a different system.
Computing power (in the sense of CPU processing speeds, RAM and storage sizes,
etc.) keeps growing and its price is falling. However, workloads vary so much that a
processing node is often underutilized, to the point where the cost of the power it
consumes exceeds the value of the work it performs.
This led to the rebirth of one area of the IT industry: virtualization. To a certain
degree, virtualization has been possible since 1967, at that time on the IBM CP-40.
However, the main reason back then was to enable various software to run unmodified
or simultaneously (computers were batch oriented and most software was not
designed for any level of multitasking). Now, the reasons for virtualization are
consolidation, power consumption reduction and control of expenses.
The availability of relatively cheap but powerful commodity hardware has led to a
new architecture of IT: instead of renting a dedicated machine (although this is still
possible), one can rent virtual machines, running on a possibly very different set of
hardware. With a properly set up infrastructure (Fibre Channel or iSCSI disk arrays,
virtualization software supporting live migration, etc.), it is possible to achieve
very high availability and reliability.
However, cheaper systems are built from cheaper components that are more prone to
failure, so the need for proper monitoring is high. With proper software, the
migration of virtual machines in case of a hardware malfunction can be automated.
Power consumption monitoring is a very important part of systems management.
With power becoming more expensive, careful monitoring of power consumption in
relation to the tasks performed is required to manage the costs of one's IT operations
or to properly bill customers (the latter applies specifically to cloud computing
customers).
This bachelor's project focuses on one area of systems management: system health
monitoring. With the above in mind, we aim for a clear design that will allow the
features described above to be implemented later or connected with features that are
already in place.
The objective is to design and implement a Zenoss extension (also known as a ZenPack)
that allows the user to discover, monitor and report the system health status of some
Oracle Sun servers. Zenoss was chosen because it is a very advanced integration
platform with features such as graphing, so that future extensions, like recording
and analyzing power consumption trends, can be implemented. The selection was made in
an unpublished work by the author, available separately [1].
2 Systems management software

In this chapter, an incomplete list of both commercial and non-commercial software
used for systems management is presented. Where possible, the manageability features
of Oracle servers under these particular software solutions are also described.
While there are many software solutions available from various vendors, only a
few are listed in this section, just to give a brief overview of the features currently
available. The objective is to make the resulting integration with open-source software
comparable to already existing integrations.

2.1 Commercial offerings

The following commercial products have been used by the author to manage Oracle
Sun servers:

CA Unicenter NSM
HP Operations Manager
IBM Director
IBM Tivoli Enterprise Console
IBM Tivoli NetCool OMNIbus

All of these products can do passive monitoring: they listen for events received
through SNMP traps, system logs or some other mechanism (like direct database entry,
command line tool execution, etc.).
The Tivoli Enterprise Console, also known as TEC, is one of the oldest systems
management packages. It relies on the Tivoli Management Framework, which also
provides a way to install other extensions and patches. TEC itself has a rather simple
GUI written in Java, but the backend consists of many helper programs, usually written
in C. TEC does passive monitoring only: it waits for events, and those events are
processed by an internal engine (some of its parts are based on the Prolog language).
This software package, however, requires a preinstalled database system.


Figure 2.1 IBM Tivoli Enterprise Console with graphed amount of incoming events

NetCool OMNIbus is similar to TEC, but it has a more modern GUI. Being a
product obtained through an acquisition, it is not written in Java but in a compiled
language. It uses a totally different language for writing custom extensions and, as
one of the few, it has its own database bundled.
Operations Manager, Director and Unicenter NSM are products of different companies,
but they have one feature in common: they support active polling. Other than
that, they offer similar features, and all of them can receive and process notifications
from Oracle Sun servers.
The following features are present in all integrations with these products:

Translating SNMP traps and notifications into user readable form.
Removing duplicates of events.
Having events with lower severity automatically close events with higher severity.
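
The last two features in the list can be illustrated with a minimal Python sketch. It is not taken from any of the products above: the `Event` and `EventConsole` names, the numeric severity scale (0 meaning "clear") and the choice of deduplication key are assumptions made only for this example.

```python
from dataclasses import dataclass

# Assumed severity scale for this sketch: 0 = clear, higher numbers = worse.
CLEAR = 0


@dataclass
class Event:
    device: str
    component: str
    severity: int
    message: str
    count: int = 1


class EventConsole:
    """Keeps one open event per (device, component, severity, message)."""

    def __init__(self):
        self.open_events = {}

    def receive(self, event):
        if event.severity == CLEAR:
            # A clear event closes every open event of the same component.
            stale = [key for key in self.open_events
                     if key[:2] == (event.device, event.component)]
            for key in stale:
                del self.open_events[key]
            return
        key = (event.device, event.component, event.severity, event.message)
        if key in self.open_events:
            # Duplicate: do not add a new row, only bump the counter.
            self.open_events[key].count += 1
        else:
            self.open_events[key] = event
```

Receiving the same fan failure twice leaves a single open event with `count == 2`; a later clear event for the same component removes it from the console.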

Figure 2.2 TEC showing new events

Integrations that support polling can usually at least display the state of system LEDs;
some (CA Unicenter NSM) can display a hierarchy of sensors.

2.2 Open-source offerings

In the open-source market, there are currently the following major products:

Nagios
OpenNMS
Zabbix
Zenoss

Figure 2.3 CA Unicenter NSM showing hierarchy of sensors

Nagios is the oldest and most mature open-source product. It is very scalable and well
documented, but its web GUI lacks some modern features, which makes it very fast,
albeit sometimes not very user friendly.
It is written mainly in C, which is another reason for its high speed. Monitoring data
is obtained by running checks, which are either built-in or user supplied scripts
called plugins; Nagios processes and evaluates their exit code and (optionally) any
output.
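
As an illustration of such a plugin, the following Python sketch follows the usual Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN), with a single status line optionally followed by performance data after a `|` character. The temperature check itself, its thresholds and the function names are hypothetical and serve only as an example.

```python
import sys

# Exit codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_temperature(celsius, warn=60.0, crit=75.0):
    """Return (exit_code, status_line); thresholds are illustrative defaults."""
    if celsius is None:
        return UNKNOWN, "TEMP UNKNOWN - no reading available"
    perfdata = f"temp={celsius:.1f};{warn:.1f};{crit:.1f}"
    if celsius >= crit:
        return CRITICAL, f"TEMP CRITICAL - {celsius:.1f} C | {perfdata}"
    if celsius >= warn:
        return WARNING, f"TEMP WARNING - {celsius:.1f} C | {perfdata}"
    return OK, f"TEMP OK - {celsius:.1f} C | {perfdata}"


def main(argv):
    """Read the value from the command line, print one line, return the code."""
    try:
        reading = float(argv[1])
    except (IndexError, ValueError):
        reading = None
    code, line = check_temperature(reading)
    print(line)
    return code
```

When saved as a standalone script, the plugin would end with `sys.exit(main(sys.argv))`, so that Nagios can read the result from the process exit code.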
Checks can be run either locally or remotely using a tool called NRPE (Nagios
Remote Plugin Executor). In addition to having Nagios run a check actively (see
subsection 4.2.1), one can also feed data into Nagios asynchronously (see
subsection 4.2.2). For more information, please see www.nagios.org or [2].
Figure 2.4 Nagios showing status of services (image from www.nagios.org)

OpenNMS is another network monitoring/management software package. While
Nagios achieves portability across different platforms by using C as its programming
language, OpenNMS is written in Java, which also makes it very portable. It requires
a backing database. It provides the user with a more modern GUI; otherwise its
features are mostly comparable to the others.
From [3]:
Zabbix is an enterprise-class open source distributed monitoring solution.
Zabbix is software that monitors numerous parameters of a network and
the health and integrity of servers. Zabbix uses a flexible notification
mechanism that allows users to configure e-mail based alerts for virtually any
event. This allows a fast reaction to server problems. Zabbix offers excellent
reporting and data visualisation features based on the stored data. This makes
Zabbix ideal for capacity planning.

Figure 2.5 OpenNMS event list

Zabbix supports both polling and trapping. All Zabbix reports and statistics,
as well as configuration parameters, are accessed through a web-based
front end. A web-based front end ensures that the status of your network and
the health of your servers can be assessed from any location. Properly
configured, Zabbix can play an important role in monitoring IT infrastructure.
This is equally true for small organisations with a few servers and for large
companies with a multitude of servers.
Zabbix is written in C and PHP and requires a backing database.
Finally, we look at Zenoss, which is our integration platform. The official
documentation [4] says:
Zenoss is today's premier open source IT management solution. Through
integrated monitoring, it enables you to manage the status and health of your
infrastructure through a single, Web-based console.
The power of Zenoss starts with its in-depth Inventory and Configuration
Management Database (CMDB). Zenoss creates this database by discovering
managed resources (servers, networks, and other devices) in your IT
environment. The resulting environment model provides a complete inventory of
your key systems, down to the level of resource components (interfaces,
services, and processes, and installed software.)

With the model built, you can use Zenoss' integrated availability and
performance monitoring features to monitor and report on all aspects of your IT
infrastructure. Zenoss also provides events and fault management features
that tie into the CMDB. These features help drive operational efficiency and
productivity by automating many of the notification, alerts, escalation, and
remediation tasks you perform each day.
Zenoss is written in Python and is based on the Zope application platform. Like most
of the previously mentioned software products, it requires a database, specifically
MySQL.

Figure 2.6 Zenoss with list of manufacturers


3 Protocols for system management

Systems management can be thought of as a network application. As such, it needs
one or more protocols that allow the user to gather data (for a description of data
gathering methods, please see chapter 4). These protocols differ in their complexity,
reliability and verbosity.
Some devices may also implement two or more protocols simultaneously, but the
amount of data exposed may not be the same, even for the same device. Also, the level
of support for these protocols varies considerably (e.g. very few software packages
support IPMI out-of-the-box). In this chapter we describe some of the protocols most
commonly used for systems management.

3.1 Simple Network Management Protocol

Taken from Wikipedia [5]:


Simple Network Management Protocol (SNMP) is a UDP-based network protocol.
It is used mostly in network management systems to monitor network-attached
devices for conditions that warrant administrative attention. SNMP
is a component of the Internet Protocol Suite as defined by the Internet
Engineering Task Force (IETF). It consists of a set of standards for network
management, including an application layer protocol, a database schema, and a
set of data objects.
SNMP exposes management data in the form of variables on the managed
systems, which describe the system configuration. These variables can then
be queried (and sometimes set) by managing applications.
Although in the early days of the internet the term network device mostly meant a
computer, the specification is designed to be largely device-independent, so devices
such as

servers
routers
racks
switches
wireless access points
uninterruptible power supplies

can be monitored. Since SNMP can be implemented even on very small devices, it is
also available on devices such as air conditioning controllers.
Currently, SNMP exists in three versions (the years of standardization by the Internet
Engineering Task Force are given in parentheses):

SNMP v1 (1988-1990) [6-8]
SNMP v2c (1993)
SNMP v3 (2002)

Even though the latest version of SNMP brings very important new features, like
authentication and encryption, it is still not supported by some network management
software suites.

3.1.1 Monitoring over SNMP

A network infrastructure that implements monitoring contains two important software
components: the agent and the network management software, also known as NMS.
The agent implements the SNMP protocol and uses it to expose data. The structure of
the data is defined using a Management Information Base (MIB, see below). Vendors
usually choose to define their MIBs very broadly, so an agent implementing a
particular MIB may not make use of all its structures.
The network management software also uses the MIB to gather and translate the data
it gets from the agent, and it performs further processing, such as statistics,
error notification and automated error processing.
SNMP supports both active and passive monitoring. In active monitoring, the NMS
uses SNMP requests (gets or sets) to read data or set configuration parameters
on the managed device directly. When monitoring passively, the NMS only listens for
SNMP data coming from the managed device (SNMP uses two terms, trap and
notification; they are often used interchangeably, although the first term refers to
SNMP v1 and the latter to SNMP v2c and v3). Version 2c also specifies an inform
packet, which differs from a trap or notification in that it makes the NMS send a
confirmation when such a packet is received. However, this mechanism is rarely used.

SNMP is a datagram protocol, and therefore data can be lost en route. This is
especially important when using passive monitoring: network elements such as routers
can cause UDP packets to be lost, and in the case of a fatal error (by fatal error we
mean an error that causes the monitored device to power off) the notification
may not be received at all, so the error is only discovered through some other
malfunction (typically a segment of the network being down, or a service like a
database or web server being inaccessible).

3.1.2 Important terms related to SNMP

When working with SNMP based technologies, one often comes across the following
terms:

OID
varbind
table
scalar
index

OID is an abbreviation for object identifier. It is represented as a dotted n-tuple of
integers (MIBs actually describe the textual representation of these OIDs).
Varbind stands for variable binding. It consists of an OID and its value, which
can be another OID, a number, a string, or any other data structure expressible
using ASN.1.
A scalar value is defined in a MIB and is always referenced using a single OID.
A table is defined in a MIB too, but to access a row in a column, one must append
an index after the OID of the column. A table is simply a set of columns.
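
These terms can be made concrete with a short Python sketch. The helper names (`parse_oid`, `VarBind`, `scalar_instance`, `table_cell`) are hypothetical; the two OIDs are the well-known `sysDescr` scalar from MIB-II and the `ifDescr` column of the interface table. The sketch also assumes the usual SNMP convention that the single instance of a scalar is addressed by appending `.0` to its OID.

```python
from typing import Any, NamedTuple, Tuple

Oid = Tuple[int, ...]


def parse_oid(text: str) -> Oid:
    """Turn the dotted textual form into a tuple of integers."""
    return tuple(int(part) for part in text.strip(".").split("."))


def format_oid(oid: Oid) -> str:
    return ".".join(str(arc) for arc in oid)


class VarBind(NamedTuple):
    """A variable binding: an OID together with its value."""
    oid: Oid
    value: Any


SYS_DESCR = parse_oid("1.3.6.1.2.1.1.1")      # scalar: sysDescr
IF_DESCR = parse_oid("1.3.6.1.2.1.2.2.1.2")   # table column: ifDescr


def scalar_instance(oid: Oid) -> Oid:
    """Scalars have exactly one instance, addressed with a trailing 0."""
    return oid + (0,)


def table_cell(column: Oid, *index: int) -> Oid:
    """A table cell is addressed as the column OID followed by the row index."""
    return column + tuple(index)
```

For example, `format_oid(table_cell(IF_DESCR, 3))` gives `1.3.6.1.2.1.2.2.1.2.3`, the description of the interface with index 3, and a varbind carrying it could be written as `VarBind(table_cell(IF_DESCR, 3), "eth2")`, where the value is made up for illustration.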

3.1.3 Management Information Base

As mentioned above in subsection 3.1.1, there is a special format that describes
the data sent over SNMP. The format of a MIB is derived from ASN.1 (see
subsection 3.1.3.1). Formally, it is defined in [9], which states:

Management information is viewed as a collection of managed objects, residing
in a virtual information store, termed the Management Information Base
(MIB). Collections of related objects are defined in MIB modules. These modules
are written using an adapted subset of OSI's Abstract Syntax Notation
One, ASN.1 [10]. It is the purpose of this document, the Structure of
Management Information (SMI), to define that adapted subset, and to assign a set of
associated administrative values.

3.1.3.1 ASN.1

Abstract Syntax Notation One is one of many approaches to describing data
structures. What makes it stand out is that, besides allowing the structure to be
specified, it also describes its encoding and decoding into various formats (ranging
from binary formats to XML).
ASN.1 is an international standard adopted by the International Telecommunication
Union (ITU) and by ISO/IEC. It has been standardized as [10-13]. Due to its
versatility, ASN.1 and its hierarchical data model are used in other application
protocols as well, including internet telephony (H.323) and directory services (LDAP).
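
To give an idea of what such a binary encoding looks like, the following Python sketch encodes an object identifier using the Basic Encoding Rules (BER), the encoding used by SNMP on the wire. The function names are hypothetical and the sketch handles only the short length form (contents up to 127 bytes), which is an assumption made to keep the example small.

```python
def encode_base128(number):
    """Encode a non-negative integer in base 128, most significant group
    first, with the high bit set on every byte except the last one."""
    chunks = [number & 0x7F]
    number >>= 7
    while number:
        chunks.append((number & 0x7F) | 0x80)
        number >>= 7
    return bytes(reversed(chunks))


def ber_encode_oid(oid):
    """Return tag (0x06), length and contents of a BER-encoded OID."""
    if len(oid) < 2:
        raise ValueError("an OID needs at least two arcs")
    # The first two arcs share a single value: 40 * first + second.
    body = encode_base128(40 * oid[0] + oid[1])
    for arc in oid[2:]:
        body += encode_base128(arc)
    if len(body) > 127:
        raise ValueError("long-form lengths are not handled in this sketch")
    return bytes([0x06, len(body)]) + body
```

With this sketch, `ber_encode_oid((1, 3, 6, 1, 2, 1, 1, 1, 0)).hex()` yields `06082b06010201010100`: the tag `06`, the length `08`, the byte `2b` for the arcs 1.3, and one byte for each remaining arc.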

3.2 Intelligent Platform Management Interface

Rather than being a single protocol specification, IPMI specifies a full set of
physical interfaces to a system controller, a communication protocol and a data
representation. It is specified in [14], a standard designed by a consortium of
computer manufacturers led by Intel. Quoting [14]:
The IPMI specifications define standardized, abstracted interfaces to the
platform management subsystem. IPMI includes the definition of interfaces for
extending platform management between board within the main chassis, and
between multiple chassis.
The term platform management is used to refer to the monitoring and
control functions that are built in to the platform hardware and primarily used
for the purpose of monitoring the health of the system hardware. This
typically includes monitoring elements such as system temperatures, voltages,
fans, power supplies, bus errors, system physical security, etc. It includes
automatic and manually driven recovery capabilities such as local or remote
system resets and power on/off operations. It includes the logging of
abnormal or out-of-range conditions for later examination and alerting where the
platform issues the alert without aid of run-time software. Lastly it includes
inventory information that can help identify a failed hardware unit.

3.3 Web-Based Enterprise Management

A modified excerpt from [15]:


WBEM is a set of management and Internet standard technologies developed
to unify the management of distributed computing environments, facilitating
the exchange of data across otherwise disparate technologies and platforms.
It consists of a core set of standards developed by DMTF (Distributed
Management Task Force), which includes the Common Information Model (CIM),
CIM-XML, CIM Query Language, WBEM Discovery using Service Location
Protocol (SLP) and WBEM Universal Resource Identifier (URI) mapping. In
addition, the DMTF has developed a WBEM Management Profile template,
allowing for simplified profile development to deliver a complete, standalone
definition for the management of a particular system, subsystem, service or
other entity.
WBEM is extensible, facilitating the development of platform-neutral,
reusable infrastructure, tools and applications. In addition to its use by
vendors, end users and the open source community, WBEM is enabling other
industry organizations to build on its foundation in areas including Web
services, security, storage, grid and utility computing.
The openness of the WBEM specifications led to the development of several
implementations, notably OpenPegasus [16] and WMI (Windows Management
Instrumentation). WMI does not rely on Web Services, but rather on COM objects and
RPC calls.

WBEM is now part of many operating systems: apart from Windows' WMI, it is
present in most enterprise Linux distributions and in commercial Unices, like Oracle
Solaris and HP-UX.

3.4 Other protocols

3.4.1 Remote shell access

System management has traditionally used a particularly simple approach: a serial
line, or its alternatives, telnet or secure shell access to the system controller or
to the system itself.
The system controller on most server platforms offers a broad range of system
management possibilities. Besides power control and console control, it also gives the
system administrator the ability to display the status of sensors and to list system
events.

# ssh root@myhost
Password:
Waiting for daemons to initialize...

Daemons ready

Sun(TM) Integrated Lights Out Manager

Version 3.0.6.1.d r48331

Copyright 2009 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.

-> show /SYS product_name

/SYS
Properties:
product_name = SPARC-Enterprise-T5220

Figure 3.1 Output from service console



Although the output is optimized for human reading and not for programmatic
analysis, there are well established tools that can parse this output (e.g. expect
[17]) and feed the resulting data to system management software.
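
As a rough illustration of the parsing step, the following Python sketch extracts the properties from console output like that in Figure 3.1. It works on text that has already been captured; logging in to the system controller is left out. The function name is hypothetical, and the assumption that any other line ending with a colon starts a different section of the listing is made only for this example.

```python
import re

# Captured output, as shown in Figure 3.1.
ILOM_SAMPLE = """\
-> show /SYS product_name

/SYS
Properties:
product_name = SPARC-Enterprise-T5220
"""

PROPERTY_RE = re.compile(r"^\s*(\w+)\s*=\s*(.*?)\s*$")


def parse_properties(text):
    """Collect 'name = value' lines that follow a 'Properties:' header."""
    properties = {}
    in_properties = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Properties:":
            in_properties = True
            continue
        if not in_properties:
            continue
        match = PROPERTY_RE.match(line)
        if match:
            properties[match.group(1)] = match.group(2)
        elif stripped.endswith(":"):
            # Assumed: another header line ends the properties section.
            in_properties = False
    return properties
```

Applied to the sample, the function returns a dictionary with the single key `product_name`, which a monitoring tool could then store as an inventory attribute of the device.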
This technique applies not only to system controllers, but also to BIOSes and even
operating system command line utilities. There are a few Zenoss extensions (ZenPacks)
that parse text output in this way to deliver information on processes, CPU
load, storage status and more.

# cat /proc/partitions
major minor #blocks name

8 0 312571224 sda
8 1 309917916 sda1
8 2 1 sda2
8 5 2650693 sda5

Figure 3.2 Output of cat command
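
The output in Figure 3.2 can be turned into structured data with a few lines of Python. This is only a sketch of the text-parsing technique, not code from any ZenPack; the function name and the dictionary layout are assumptions. On Linux, the `#blocks` column is given in units of 1 KiB.

```python
# Captured output, as shown in Figure 3.2.
PARTITIONS_SAMPLE = """\
major minor  #blocks  name

   8        0  312571224 sda
   8        1  309917916 sda1
   8        2          1 sda2
   8        5    2650693 sda5
"""


def parse_partitions(text):
    """Return one dictionary per partition line, skipping header and blanks."""
    partitions = []
    for line in text.splitlines():
        fields = line.split()
        # Data lines have exactly four fields and start with a number.
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        major, minor, blocks, name = fields
        partitions.append({
            "major": int(major),
            "minor": int(minor),
            "blocks": int(blocks),   # size in 1 KiB blocks
            "name": name,
        })
    return partitions
```

For the sample above, the first entry describes the whole disk `sda` with 312571224 blocks, which is roughly 298 GiB.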

3.4.2 Other protocols

In addition to the protocols listed above, some other protocols are used for system
management. One of the most mature is the syslog protocol.
The Unix system log protocol is specified in [18]. It was designed with networking in
mind, so although it is generally used on the local host, it is possible to set up the
daemon to filter and forward messages to a network host, where further processing
can be done. Usually, traditional syslog will not record the originating host name, so
either a special daemon is needed or the system logging daemon needs a special
configuration.
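
The following Python sketch shows what a forwarded message can look like on the wire. It assumes the conventional BSD syslog layout, in which the priority value is computed as facility times 8 plus severity and is followed by a timestamp, a host name and a tagged message; putting the host name into the message explicitly is one way of preserving the originating host. The function names and the example values are hypothetical.

```python
import socket
from datetime import datetime

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def build_syslog_message(facility, severity, hostname, tag, text, when=None):
    """Format one message: <PRI>Mmm dd hh:mm:ss hostname tag: text."""
    when = when or datetime.now()
    priority = facility * 8 + severity
    # The day of month is padded with a space, not with a zero.
    stamp = f"{MONTHS[when.month - 1]} {when.day:2d} {when:%H:%M:%S}"
    return f"<{priority}>{stamp} {hostname} {tag}: {text}"


def send_syslog(message, server, port=514):
    """Send the message as a single UDP datagram; delivery is not confirmed."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("ascii", "replace"), (server, port))
```

With facility 16 (local0) and severity 4 (warning), the priority is 132, so a message built for 27 May, 12:00:00 starts with `<132>May 27 12:00:00`. Because the transport is UDP, `send_syslog` returns without knowing whether the message arrived.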
Being a very old protocol, syslog has almost no security (besides facilities like
rejecting a host that is not in a list), and by generating a flood of messages it is
possible to overload the daemon or fill the space in the /var/log filesystem, which may
lead to unexpected failures.
Commercial products (especially those that contain or can be used with their own
agents on remote hosts) also use various RPC mechanisms. The most common are the
following:

ONC RPC (Open Network Computing Remote Procedure Call) [19]


CORBA (Common Object Request Broker Architecture) [20]
SOAP (Simple Object Access Protocol) [21]
XML-RPC (XML Remote Procedure Call) [22]

A description of these protocols is beyond the scope of this project; for further
information please consult the references. In the case of proprietary software,
details about the usage of these protocols may not be fully known, so using them as a
communication protocol with custom software may be very challenging.
4 Approaches to system management

In this chapter we will describe possible approaches to system management and compare
them in terms of protocol requirements, generated network traffic and reliability.
Possibly the simplest approach to system management (more specifically, system
health monitoring) is simply to wait until a device stops working, rendering some
service or services unusable. While this is possible (indeed, the author has observed
such an approach in an educational institution), there is no advance warning, and
therefore such an approach is only feasible in environments where setting up monitoring
would be more expensive than repairing failed systems.

4.1 Way of communication

To be able to monitor any system, there must be a way to connect to it. In systems
management, we usually use one of the following four communication channels:

local only
in-band communication
out-of-band communication
side-band communication

By local communication we usually mean non-network communication with the monitored
system. This may involve manually connecting a serial console (e.g. a laptop with a
serial line) or a display, keyboard and mouse. Watching status LEDs in person can
also be used for a quick check of system status. For the purpose of this project, we
will not consider this a viable method of system monitoring. All other communication
channels are described below.

4.1.1 In-band communication

In-band communication is a way of system monitoring and management communication
where the monitoring data is sent over the same network channel as production
data (e.g. web traffic).


This implies that the operating system on the monitored device has to support the
handling of management traffic (usually, this is accomplished by running a so-called
agent). It also means that management traffic occupies (at least partially) useful
bandwidth and that the agent uses some CPU cycles.
On the other hand, this type of communication poses no additional requirements
on the existing network infrastructure: no additional cabling is required and
no changes to network switches and routers need to be made. Especially when dealing
with many servers, the savings on network infrastructure may be significant.
One significant drawback of this approach is that without a running operating system,
management may not be possible (although servers with Wake-on-LAN capability
can at least be turned on remotely).

4.1.2 Out-of-band communication

Out-of-band communication is complementary to in-band communication. It uses
its own network port or, in some setups, a serial line connected to a network terminal
server.
Monitoring capabilities therefore do not depend on a running operating system,
nor does the monitoring traffic affect production network bandwidth and CPU load.
Depending on the system controller (this term is used mainly in connection with
SPARC systems; other terms in use are BMC, baseboard management controller,
and SP, service processor), additional features may be offered to the system
administrator, for example console redirection, storage redirection and management,
or firmware update. Power control is one of the basic features.
This type of communication requires additional cabling and switching, so the
resulting network infrastructure is denser and also more expensive. On the other hand,
the system controller does not use any special network features, so very low cost
commodity switches can be used.
Security of this dedicated management network is of vital concern to the user.
A breach may lead to disruption of management traffic, and it may be possible to
overload the system controller. After breaking into the system controller, an
adversary could not only take the entire system down (possibly damaging production
data), but might also be able to boot a totally different operating system from
redirected storage, leading to a data leak or intentional corruption. Of course,
booting a different operating system using a direct (i.e. production network) breach
is also possible, but this channel is expected to be much more secure (strong
passwords, firewall, etc.). A separate network, however, may lead to the temptation to
keep default passwords, so it is very important to develop and enforce security
guidelines with the same strictness as the guidelines applying to operating system and
network security.
In conclusion, the drawback of this approach is higher network infrastructure cost,
but for setups requiring additional features like storage redirection, this approach
is beneficial.

4.1.3 Side-band communication

Side-band communication combines the best features of both communication methods
described above. It usually involves a system controller that uses the same network
port as the production network but operates in a separate virtual LAN (VLAN).
Features are usually comparable to those of out-of-band communications, yet
there are some savings in network infrastructure. Setting up network components
to correctly route information based on VLAN information may be more complicated
than other means.
Finally, not all system controllers support this type of communication, so unless
there is a larger number of servers supporting it, investing time into setting up
side-band monitoring in addition to any of the previously mentioned ways is
probably not a worthwhile effort.

4.2 By means of data gathering

4.2.1 Active monitoring

By active monitoring we mean a setup in which the monitoring station (i.e. a box
running monitoring software) actively queries managed (monitored) devices.
Certain protocols (like IPMI) support only this type of monitoring, others (like
SNMP) support both active and passive.
During active monitoring, the following data (although not all of it may be
available) is usually gathered and/or updated at regular time intervals:

list of hardware components with their statuses
list of sensors with current values, thresholds and statuses
overall system health status

Depending on the verbosity of the data obtained and on the time intervals, active
monitoring can cause significant network traffic (this may be undesirable, especially
when using in-band communication). However, the amount of data transmitted may be
regulated by selecting only a subset of data (e.g. checking the system status and
reading an extended set of data only when the status changes).
An advantage of active monitoring is reliability: even when using non-reliable data
transfer (the UDP protocol used with SNMP), the monitoring station can usually
detect missing data and request it again.
Another major advantage is the ability to gather statistically relevant data to be
stored and processed (such as power consumption, network port traffic etc.). Advanced
features of monitoring software can include graphing and reporting, which can in
turn be used to consolidate computing resources in a power-efficient way.
This type of monitoring is usually supported by most network devices, ranging
from servers to low-cost switches.
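The two ideas above, requesting missing data again and reading the extended data set only after a status change, can be sketched in a few lines of Python. This is a minimal illustration and not the implementation of any particular monitoring package: `read_status` and `read_details` are hypothetical callables standing in for real SNMP or IPMI queries, and a lost reply is modelled as `None`.

```python
def poll_with_retry(fetch, attempts=3):
    """Call fetch() up to `attempts` times and return the first reply.

    A lost datagram is modelled as fetch() returning None (an assumption
    made for this sketch, not the behaviour of a specific library).
    """
    for _ in range(attempts):
        reply = fetch()
        if reply is not None:
            return reply
    return None


def poll_device(read_status, read_details, last_status):
    """Read the cheap overall status; fetch details only if it changed."""
    status = poll_with_retry(read_status)
    if status is None:
        return "unreachable", None
    details = None
    if status != last_status:
        # Status changed since the last cycle: read the extended data set.
        details = poll_with_retry(read_details)
    return status, details
```

Keeping the retry logic separate from the polling policy means the same helper could wrap any query, whatever protocol is used underneath.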

4.2.2 Passive monitoring

Passive monitoring is the opposite (and complementary) approach to active
monitoring: this time, it is the responsibility of the monitored device to report a status
change to the monitoring software. Based on the received information, the monitoring
station performs some action, either predefined or defined by the user. Actions can
range from operator notification using paging or SMS to automatic failure correction
(such as starting a migration of virtual machines).
However, when using non-reliable data transport (UDP), a passive notification may
not be received at all. Also, especially when using the SNMP protocol, the management
station does not usually send a reception confirmation. Multiple switches en route can
adversely affect datagrams, causing a message to be delayed, received out of order
or lost entirely. To prevent this, some management and monitoring software
can listen for SNMP notifications in the local network and forward them to the master
management host using some reliable protocol (in most software this is implemented
as RPC: either the original ONC RPC, a web service call or a proprietary protocol).
A major advantage is that very little network traffic is generated, and the
method is also very light on CPU usage (neither the agent/system controller nor the
monitoring station processes huge amounts of data).
This method may not be supported by all devices.

4.2.3 Combination of active and passive monitoring

When both above mentioned approaches are combined, possibly the most reliable
monitoring system can be built. However, not all monitoring packages allow these
two approaches to be combined.
The mode of operation is as follows:

1. Monitoring station reads all data using the active approach (i.e. full repository).
2. Monitored hosts issue notifications based on their status changes.
3. Monitoring station updates its data either by:
a. using solely data from the passive notification
b. refreshing all data from the appropriate monitored device
4. Once in a while, monitoring station refreshes all data (just in case a notification
was lost).
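The four steps above can be sketched as a small Python class. This is a hedged illustration under stated assumptions, not the design of any existing monitoring package: `poll_all` is a hypothetical callable that returns the full set of sensor states as a dictionary, and the periodic refresh is counted in abstract "ticks" rather than real time.

```python
class CombinedMonitor:
    """Toy model of combined active and passive monitoring."""

    def __init__(self, poll_all, full_refresh_every=10):
        self.poll_all = poll_all                  # hypothetical full active poll
        self.full_refresh_every = full_refresh_every
        self.ticks = 0
        self.state = dict(poll_all())             # step 1: read the full repository

    def on_notification(self, sensor, new_state, trust_payload=True):
        """Steps 2 and 3: react to a notification sent by a monitored host."""
        if trust_payload:
            self.state[sensor] = new_state        # 3a: use the notification data only
        else:
            self.state = dict(self.poll_all())    # 3b: refresh everything

    def tick(self):
        """Step 4: periodic full refresh in case a notification was lost."""
        self.ticks += 1
        if self.ticks % self.full_refresh_every == 0:
            self.state = dict(self.poll_all())
```

The choice between 3a and 3b trades network traffic against the risk of acting on incomplete notification data.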

4.3 Final comparison

To choose correctly between the various approaches to monitoring, it is best to
have these methods compared in tables:

Feature                      In-band           Out-of-band   Side-band

OS independent               no                yes           yes
Communication port           shared            separate      shared
Uses host CPU                yes               no            no
Special net. requirements    none              yes, cabling  yes, setup
Display/storage redirection  needs OS support  yes           yes
Power management             limited           yes           yes

Table 4.1 Comparison of communication methods



Feature                  Active           Passive           Combination

Comm. initiator          management host  monitored device  both
Network traffic          high             low               medium
Reliability              high             lower             highest
Stat. data available     yes              no                yes
Mgmt software support    medium           very high         very low
Managed devices support  high             lower

Table 4.2 Comparison of data acquisition methods

The selection in a particular setup will depend on the available software, the number
and type of devices, the current network infrastructure hierarchy and also the time and
budget allotted.
5 Sensors and components

Before we can get deeper into the actual data presented by Oracle Sun system
controllers and agents, we need to define and explain terms that are connected with a
server.
A component is any functional part of the server. Components may find themselves
in a number of states:

present
absent
functioning
about to malfunction
malfunctioning
unknown

A very closely related term is sensor. Sensors are usually connected with
components, although they may be connected with a whole system. There are
fundamentally two types of sensors:

physical (e.g. voltage, fan speed, etc.)
virtual (e.g. system is OK)

The difference is that virtual sensors are computed from physical sensors.
It should be noted that for some virtual sensors, the underlying physical sensors may
be hidden.
Physical sensors usually detect values being out of range or simple true/false
conditions. Some types of physical sensors:

button sensor (power buttons, chassis intrusion detection)
fan speed sensor
current sensor
presence sensor
temperature sensor
voltage sensor

Among virtual sensors are those whose condition is based on the state of other sensors
(e.g. a power sensor measuring in watts will be calculated from the appropriate voltage
and current sensors) or on a condition detected by software. For example:

memory ECC error sensor
OK/not-OK sensor
power sensor

Some sensors (mostly physical) have thresholds set up. A threshold is a value
which the measured value must reach and cross for the sensor to change its state.
Usually, only sensors that measure continuous values (numeric sensors, the opposite
being discrete sensors) have defined thresholds:

non-critical
critical
non-recoverable

When a non-critical threshold is crossed, a notification is usually generated,
but the condition is not severe and it will not impact the function of the system. Staying
beyond a critical threshold may affect reliability and endurance. Crossing a
non-recoverable threshold usually signals that something has gone very
wrong, and the system is immediately shut down (although this can be modified and
sometimes disabled).
Also, thresholds can be low and high: for example, a temperature sensor measuring
ambient temperature has all six thresholds defined (high temperatures are as
undesirable as freezing temperatures).
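To make the six-threshold scheme concrete, the following Python sketch classifies a numeric reading against optional low and high thresholds. It is only an illustration: the function name, the state labels and the ambient temperature limits in the usage example are invented for this sketch, and the strict comparison (a value equal to a threshold has not yet crossed it) is an assumption.

```python
# Threshold levels, checked from the most severe to the least severe.
LEVELS = ("non_recoverable", "critical", "non_critical")


def classify(value, low=None, high=None):
    """Return the state of a numeric sensor for the given reading.

    `low` and `high` map level names to threshold values; a missing key
    means that threshold is not defined for the sensor.
    """
    low = low or {}
    high = high or {}
    for level in LEVELS:
        upper = high.get(level)
        lower = low.get(level)
        if upper is not None and value > upper:
            return level + "_high"
        if lower is not None and value < lower:
            return level + "_low"
    return "ok"


# Made-up ambient temperature limits in degrees Celsius.
ambient_low = {"non_critical": 10, "critical": 5, "non_recoverable": 0}
ambient_high = {"non_critical": 35, "critical": 40, "non_recoverable": 45}
print(classify(37, ambient_low, ambient_high))
```

Checking the most severe level first ensures that a reading beyond several thresholds is reported with the highest applicable severity.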
Discrete sensors have only a certain set of states they can have. Here is an in-
complete list of discrete values certain sensors can have:

disabled
memory error detected
OK/fail
present/absent

Both kinds of sensors have so-called assertions and deassertions. These two are
opposite to each other. An assertion means that the sensor assumes some state (usually
an error state); a deassertion means that the sensor leaves the state that was previously
asserted.
However, this may sometimes be tricky. Let us see an example. We have a sensor
HDD0 (the names are usually longer, but for the sake of the example let us keep this one)
that has the following states:

Device Present
Device Absent
Hot Spare
Rebuild In Progress

and for all of them, both assertion and deassertion are enabled. In this particular
example, having the sensor in "Device Present Assert" means that the particular device
is present. Similarly, "Device Absent Assert" means that the device has been
removed.
There is, however, one more approach: having the sensor in "Device Absent
Deassert" or "Device Present Deassert". These mean
the same things as the states in the previous paragraph: the device has been inserted (is
no longer absent) and the device has been removed (is no longer present),
respectively. Any integration dealing with sensors must be aware of this and preferably
should translate incoming notifications into one common format and discard the less
common and more confusing one.
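A translation into one common format could look like the following Python sketch. It is a minimal illustration under assumptions: the table of opposite states covers only the HDD0 example above, and the decision to drop deassertions without a known opposite is a choice made for this sketch.

```python
# Pairs of states where "X Deassert" means the same as "Y Assert".
OPPOSITE_STATE = {
    "Device Present": "Device Absent",
    "Device Absent": "Device Present",
}


def normalize(state, asserted):
    """Translate a sensor event into assertion-only form.

    Returns the name of the state that is now asserted, or None when a
    deassertion has no known assertion equivalent and is discarded.
    """
    if asserted:
        return state
    return OPPOSITE_STATE.get(state)


print(normalize("Device Absent", asserted=False))  # same meaning as "Device Present Assert"
```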
6 Management interfaces of Oracle Sun servers

Since this project focuses on systems management of Oracle Sun servers, we first
need to describe the management capabilities of these servers.
Oracle (and previously Sun) has a very broad portfolio of servers. However, for
this project, we will focus on the following hardware families:

Oracle Sun Fire x86 Servers (X2000 and X4000 series)
Oracle Sun SPARC Enterprise Server (T1000, T2000 and T5000 series)
Oracle Sun Blade Server Modules (X6000 and T6000 series)
older Sun Fire Servers (SPARCs, V210 for example)

The work will be done primarily on the latest available servers (i.e. not End-of-Life ones).
Although it may seem a waste of time to also target servers no longer in production,
it is the author's belief that these servers may still be present, especially in educational
institutions, where their performance is still sufficient and where an open-source tool
for monitoring will be more than beneficial.

6.1 System controllers

All the servers mentioned above have a special, independent on-board computer that
controls power, monitors environmental and system characteristics (voltages, device
presence, fan speeds etc.) and reports them using the methods described below. This
computer is called the system controller on SPARCs and the service processor on x86
servers.
On the Oracle Sun servers mentioned above, one may encounter the following
versions of system controllers:

Advanced Lights Out Manager (ALOM) [23 and 24]
Embedded Lights Out Manager (eLOM) [25]
Integrated Lights Out Manager (ILOM) [26]

ALOM is the oldest of the three, and one can find it only on older SPARC servers.
There are two versions, ALOM and ALOM-CMT, the first being used on sun4u
platforms and the latter on servers with the UltraSPARC T1 processor (these
processors can run several threads in parallel, which is also called Chip
Multithreading, hence the abbreviation CMT).
ALOM has only a command-line interface and can send e-mail to the administrator
in the event of a malfunction; newer versions of ALOM-CMT also support the SNMP
protocol. There is no web GUI, though. ALOM is primarily out-of-band (using a serial
line or its own network port), but it can be configured from within Solaris using the
scadm(1M) command. Features are pretty much standard:

power control
serial console redirection
logical domains (on CMT machines, [27])
environment monitoring
listing, disabling and enabling components

eLOM, on the other hand, can be found only on older x86 platforms. It offers a
command-line interface, an SNMP interface and a web interface. In addition to the
features listed for ALOM (except the logical domains), eLOM has these additional features:

graphical console redirection
storage redirection

ILOM is the latest and actively developed system controller software. It can be
found on both SPARC and x86 servers, and it offers everything ALOM and eLOM offer
together.

6.2 Command-line interface

A command-line interface is available on all three system controllers. However,
the syntax of the commands differs considerably (to mitigate this for veteran SPARC
administrators, ILOM on SPARC can be run in ALOM-compatible mode, so that most
commands and possibly even scripts these administrators know or have written will
work as expected). Please see the examples:

# ssh root@alom-server

Copyright 2008 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.

Sun(tm) Advanced Lights Out Manager CMT v1.7.6

Please login: admin
Please Enter password: *****

sc> showhost
Sun-Fire-T2000 System Firmware 6.7.6 2009/10/29 16:06

Host flash versions:
OBP 4.30.4 2009/08/19 07:24
Hypervisor 1.7.3.a 2009/10/29 15:50
POST 4.30.4 2009/08/19 07:47

Figure 6.1 ALOM example: information about server

The command-line interface can be accessed over the following interfaces:

serial line
telnet (may be disabled for security reasons)
secure shell
internally over OS tool (e.g. scadm(1M))

6.3 SNMP

The SNMP interface is arguably the most used interface for system management. Both
eLOM and ILOM have supported SNMP from their very first versions; ALOM-CMT started
to support SNMP directly relatively late.
However, either due to the absence of an SNMP interface (ALOM-CMT prior to v1.4) or
due to a simple wish to monitor the system in-band, there are so-called agents. There
are currently two:

Monitoring Agent for Sun Fire and Netra Systems (MASF) [28]
Oracle Server Hardware Management Agent [29]

# ssh root@elom-host
root@elom-host's password:

Sun(TM) Embedded Lights Out Manager

Copyright 2004-2006 Sun Microsystems, Inc. All rights reserved.

Version 2.91

Hostname: SUNSP0016365B97FB

IP address: 10.18.141.146

MAC address: 00:16:36:5B:97:FB

System serial number: 0624QC0029

/SP -> show /SP/SystemInfo/ProductInfo

/SP/SystemInfo/ProductInfo
Targets:

Properties:
ProductManufacturer = Sun Microsystems
ProductProductName = Sun Fire X2200 M2
ProductPartlNumber = 1S39U9ZST61
ProductSerialNumber = 0624QC0029
AssetTag =

Target Commands:
show

Figure 6.2 eLOM example: information about server

MASF is available only on SPARC systems, but it supports both ALOM (including the
CMT variant) and ILOM system controllers. On the other hand, the Hardware
Management Agent supports only x86 systems and only those running specific versions
of ILOM.
# ssh root@sparc-ilom
Password:
Waiting for daemons to initialize...

Daemons ready

Sun(TM) Integrated Lights Out Manager

Version 3.0.6.1.d r48331

Copyright 2009 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.

Warning: password is set to factory default.

-> show /SYS

...

Properties:
type = Host System
ipmi_name = /SYS
keyswitch_state = Normal
product_name = SPARC-Enterprise-T5220
product_part_number = 602-3821-08
product_serial_number = BEL07513TT
product_manufacturer = SUN MICROSYSTEMS
fault_state = OK
power_state = On

...

Figure 6.3 ILOM example: information about server

All system controllers supporting SNMP and both agents can be configured to
accept incoming SNMP requests for data (useful when monitoring these systems
actively, also known as polling) and/or to send SNMP traps or notifications
on their own (passive monitoring). However, the format of the data differs considerably
among the types of system controllers and agents. Its structure is important for further
work on the integration with Zenoss, so the data structure (described using MIBs)
will be discussed in the next section.

6.3.1 Oracle Sun MIBs

The format and purpose of a MIB have already been defined (see section 3.1.3 on page 13).
Oracle Sun systems (or more precisely, the system controllers and agents) implement some
of the following MIBs:

ENTITY-MIB
SUN-PLATFORM-MIB
SUN-ILOM-PET-MIB
SUN-HW-TRAP-MIB
SUN-HW-MONITORING-MIB
SUN-ASR-NOTIFICATION-MIB

In the following paragraphs, we will look into these MIBs in more detail.

6.3.1.1 Origin and purpose of these MIBs

ENTITY-MIB is the only MIB that has not been defined by Oracle (formerly Sun). It
is defined in an independent specification [30]. The purpose of the MIB is given as
follows ([30]):
In particular, it (this MIB) describes managed objects used for managing
multiple logical and physical entities managed by a single SNMP agent.
ENTITY-MIB contains structures that (in terms of server management) describe
various components of the server, including details about count and type of processors,
DIMM modules manufacturer etc.
SUN-PLATFORM-MIB is a MIB that extends ENTITY-MIB with details about
operational state. It also contains tables that identify and list system sensors,
together with their thresholds and current values. In addition, this MIB defines
some notifications that can be used to dynamically modify the model of the monitored
system and/or can be translated and displayed to the user. However, these traps do not
carry all the information (such as the type of sensor issuing the warning), so additional
action is required to get such information (typically, this is done using a regular
expression that looks for a certain pattern in sensor names). Using regular expressions is
a quick and functional way, but the author believes the correct approach is to poll the
agent or system controller for the correct sensor type based on the OIDs present in the
received notifications. These two MIBs are supported in MASF (SPARC) and in all ILOMs
and eLOMs.
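The regular-expression approach mentioned above can be sketched as follows. This is only an illustration of the technique: the patterns and the sample sensor names are assumptions made for this sketch, not the naming rules of any particular firmware, and a real integration would need patterns derived from the sensor names its servers actually report.

```python
import re

# Hypothetical name patterns; the first matching pattern wins.
SENSOR_PATTERNS = [
    (re.compile(r"/V_|VOLT", re.IGNORECASE), "voltage"),
    (re.compile(r"/T_|TEMP", re.IGNORECASE), "temperature"),
    (re.compile(r"TACH|FAN", re.IGNORECASE), "fan"),
    (re.compile(r"/I_|CUR", re.IGNORECASE), "current"),
]


def guess_sensor_type(sensor_name):
    """Guess the sensor type from its name; fall back to 'other'."""
    for pattern, sensor_type in SENSOR_PATTERNS:
        if pattern.search(sensor_name):
            return sensor_type
    return "other"


for name in ("/SYS/MB/V_+12V", "/SYS/MB/T_AMB", "/SYS/FT0/F0/TACH"):
    print(name, guess_sensor_type(name))
```

The weakness of this approach is visible in the sketch itself: any sensor whose name does not follow the expected pattern silently ends up as "other", which is why polling for the real sensor type is the more robust option.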
SUN-ILOM-PET-MIB is one of the MIBs that do not use the typical Sun (Oracle)
OID tree; instead it uses the wiredformgmt (Wired for Management) tree. This
is an OID tree reserved by Intel [31] for so-called PETs (Platform Event Traps). These
largely correspond with IPMI and often carry similar data. However, each such trap
carries a computed specific type (a number that identifies the type of trap or
notification being sent). Most NMSes cannot deal with dynamic specific types;
they expect these numbers to be assigned statically and defined in a MIB, and that
is the purpose of this MIB. However, if there is another PET MIB from a different
vendor, the two will share the OID tree and the numbers will collide. Not only will
the names and descriptions of most or all notifications be different, but some may have
a totally different meaning.
SUN-HW-TRAP-MIB was designed relatively recently with a single purpose: to
eliminate the need for regular expression matching or for polling the agent when a trap
is received. Hence, a direct display of these traps is preferred.
SUN-HW-MONITORING-MIB was designed to remove the dependency on ENTITY-MIB
and to provide some more information about the monitored system. It features data
such as cumulative state, which is computed on the monitored host side. The advantage
of this approach is mainly a saving in network traffic: the NMS may poll only a few
values in the MIB and get the full tree only in case something goes wrong. This MIB is
implemented only in the Hardware Management Agent.
SUN-ASR-NOTIFICATION-MIB is currently implemented by the ASR agent.
Description from [32]:
ASR is a secure, scalable, customer-installable software feature of warranty
and SunSpectrum support that provides auto-case generation when specific
hardware faults occur. ASR is designed to enable faster problem resolution by
eliminating the need to initiate contact with Sun for hardware failures,
reducing both the number of phone calls needed and overall phone time required.
ASR also simplifies support operations by utilizing electronic diagnostic data.
When a hardware error is detected, the ASR agent sends details
about the error, together with a unique identifier of the system, to Oracle, where the
data is filtered and entered as a Service Request on behalf of the customer. This saves
time and communication effort. In addition, ASR generates an SNMP notification to
inform the customer about the Service Request being created on their behalf.

6.3.1.2 Notifications

It is not feasible to describe every single notification declared in all MIBs, as that
would make this document excessively long and also very quickly outdated. In this
section, we will describe the basic principles behind notifications in Oracle (Sun)
MIBs.
ENTITY-MIB defines only one notification, entConfigChange. Its sole purpose is
to inform the NMS that a configuration change has occurred and that it should reread
all data.
SUN-PLATFORM-MIB has at present twelve notifications defined. These
notifications were designed to work in cooperation with ENTITY-MIB, and as such each
notification carries an OID that points to the ENTITY-MIB and contains some
additional information. However, this is not practical for integrations that only translate
notifications, so there is an additional varbind, sunPlatNotificationAdditionalInfo,
that contains a human-readable text describing the event that occurred.
SUN-ILOM-PET-MIB was already briefly described. What is interesting about
its notifications is that they contain only one varbind, holding a string of encoded
binary data. This data also includes a sensor name, which is often decoded from
the trap while the rest is discarded, as the meaning of the notification is already given
by the specification.
SUN-HW-TRAP-MIB is the only MIB designed solely for the purpose of sending
traps. As of now, it has seventy-three notifications defined. The names of the notifications
indicate both the type of sensor on which the event occurred and which threshold
was crossed. The additional varbinds carry the full name of the sensor, the threshold
value and the current value. Example:

sunHwTrapVoltageNonCritThresholdExceeded: a non-critical threshold was exceeded
sunHwTrapVoltageOk: the voltage is OK now

Please bear in mind that SNMP is UDP-based, so individual traps may be lost.
Therefore each trap with lower severity (e.g. one suggesting the system is getting into
better condition) should automatically close all previous events with higher severity
that were sent for the same sensor.
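The clearing rule described above can be sketched in Python as follows. The severity names, their ordering and the dictionary used to hold open events are assumptions made for this illustration; they are not taken from the MIBs or from any NMS.

```python
# Assumed severity ranking; higher numbers are more severe.
SEVERITY = {"ok": 0, "non_critical": 1, "critical": 2, "non_recoverable": 3}


def apply_trap(open_events, sensor, severity):
    """Update the open events of one sensor after a trap arrives.

    Every open event with the same or a higher severity is closed, so a
    lost intermediate trap cannot leave a stale event behind. An "ok"
    trap therefore closes everything for that sensor.
    """
    rank = SEVERITY[severity]
    remaining = [s for s in open_events.get(sensor, []) if SEVERITY[s] < rank]
    if rank > 0:
        remaining.append(severity)
    if remaining:
        open_events[sensor] = remaining
    else:
        open_events.pop(sensor, None)
    return open_events


events = {}
apply_trap(events, "V_+12V", "non_critical")
apply_trap(events, "V_+12V", "critical")
apply_trap(events, "V_+12V", "ok")   # closes both earlier events
print(events)
```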
SUN-ASR-NOTIFICATION-MIB has only five notifications:

sunAsrSrCreatedTrap
sunAsrSrCreationInProgressTrap
sunAsrSrUpdatedTrap
sunAsrSrDelayedTrap
sunAsrSrFailureTrap

With these notifications, the NMS can display appropriate messages when a service
request has been created, is being created, has been updated, is delayed or has failed,
respectively.

6.3.1.3 Polled data

ENTITY-MIB contains the following tables:

entPhysicalTable
entLogicalTable
entLPMappingTable
entAliasMappingTable
entPhysicalContainsTable

It also contains the entLastChangeTime scalar value.
Taken from [30]:
The entPhysicalTable contains one row per physical entity, and must
always contain at least one row for an "overall" physical entity, which should
have an entPhysicalClass value of 'stack(11)', 'chassis(3)' or
'module(9)'.
Each row is indexed by an arbitrary, small integer, and contains a
description and type of the physical entity. It also optionally contains the index
number of another entPhysicalEntry indicating a containment relationship
between the two.
The entLogicalTable contains one row per logical entity. Each row is
indexed by an arbitrary, small integer and contains a name, description, and
type of the logical entity. It also contains information to allow access to the
MIB information for the logical entity.
The entLPMappingTable contains mappings between entLogicalIndex
values (logical entities) and entPhysicalIndex values (the physical
components supporting that entity). A logical entity can map to more than
one physical component, and more than one logical entity can map to (share)
the same physical component.
The entAliasMappingTable contains mappings between entLogicalIndex,
entPhysicalIndex pairs and 'alias' object identifier values. This
allows resources managed with other MIBs (e.g., repeater ports, bridge ports,
physical and logical interfaces) to be identified in the physical entity hierarchy.
Note that each alias identifier is only relevant in a particular naming scope.
The entPhysicalContainsTable contains simple mappings between
'entPhysicalContainedIn' values for each container/'containee' relationship
in the managed system. The indexing of this table allows an NMS to
quickly discover the 'entPhysicalIndex' values for all children of a given
physical entity.
Scalar object entLastChangeTime represents the value of sysUpTime
when any part of the Entity MIB configuration last changed.
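The containment relationship described in the quoted text can be turned into a component tree once the entPhysicalContainedIn column has been polled. The Python sketch below assumes the polled data is already available as a plain dictionary mapping each entPhysicalIndex to its entPhysicalContainedIn value (with 0 meaning "not contained in anything"); how the data is fetched over SNMP is deliberately left out, and the sample indexes are invented.

```python
from collections import defaultdict


def build_children(contained_in):
    """Map each container index to the sorted list of its direct children."""
    children = defaultdict(list)
    for index, parent in sorted(contained_in.items()):
        children[parent].append(index)
    return dict(children)


def descendants(children, root):
    """Return all entities below `root` in depth-first, pre-order sequence."""
    found = []
    stack = list(reversed(children.get(root, [])))
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(reversed(children.get(node, [])))
    return found


# Invented sample: chassis 1 contains boards 2 and 3; board 2 contains sensor 4.
sample = {1: 0, 2: 1, 3: 1, 4: 2}
tree = build_children(sample)
print(descendants(tree, 1))
```

This is the same lookup that entPhysicalContainsTable offers directly on the agent side; building it locally merely avoids further queries once the physical table has been read.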
SUN-PLATFORM-MIB is an extension of ENTITY-MIB. Specifically, it augments
entPhysicalTable with information about Oracle/Sun specific equipment
and, most importantly, it adds information about sensors (i.e. when a row in
entPhysicalTable refers to a sensor, the agent implementing the MIB will fill in
details about this sensor, such as sensor type, thresholds and values, in the appropriate
table under the same index as the row in entPhysicalTable).
SUN-HW-MONITORING-MIB is independent of ENTITY-MIB and is
complemented by SUN-HW-TRAP-MIB, which contains the definitions of notifications.
This MIB contains data similar to ENTITY-MIB, but the data is spread among
more tables:

sunHwMonInventoryTable
sunHwNumericVoltageSensorTable
sunHwDiscreteVoltageSensorTable
sunHwNumericCurrentSensorTable
sunHwDiscreteCurrentSensorTable
sunHwNumericPowerDeviceSensorTable
sunHwDiscretePowerDeviceSensorTable
sunHwNumericCoolingDeviceSensorTable
sunHwDiscreteCoolingDeviceSensorTable
sunHwNumericTemperatureSensorTable
sunHwDiscreteTemperatureSensorTable
sunHwNumericProcessorSensorTable
sunHwDiscreteProcessorSensorTable
sunHwNumericMemorySensorTable
sunHwDiscreteMemorySensorTable
sunHwNumericHardDriveSensorTable
sunHwDiscreteHardDriveSensorTable
sunHwNumericIOSensorTable
sunHwDiscreteIOSensorTable
sunHwNumericSlotOrConnectorSensorTable
sunHwDiscreteSlotOrConnectorSensorTable
sunHwNumericOtherSensorTable
sunHwDiscreteOtherSensorTable
sunHwMonIndicatorTable
sunHwMonTotalPowerConsumption

As one can see, this MIB is more fine-grained than ENTITY-MIB. In addition to these
tables, certain values of interest are also directly available as scalars, which radically
simplifies writing management extensions. There are quite a few scalars; only some
are listed below (for a full list and description see the MIB itself, which is well commented):

sunHwMonProductName
sunHwMonProductType
sunHwMonCumulativeSensorAlarmStatus
sunHwMonIndicatorServiceName
sunHwMonIndicatorServiceCurrentStatus

6.4 IPMI

IPMI is supported only in eLOM and ILOM. Utilities that access system controllers
over IPMI (e.g. ipmitool(1M), [33]) can use two connection methods:

out-of-band or side-band over network
locally over KCS interface

While the first is always available, KCS (Keyboard Controller Style) was not
available on SPARC systems until recently; this was caused by a missing driver, not a
hardware defect [35].

6.5 Other interfaces

All of the system controllers can send notifications using e-mail, and they can also
forward events to a system logging daemon running on a remote host. To the author's
knowledge, these interfaces are seldom used.
However, the web interface is used quite often. It offers a quick way to check
server status and server components, and also to upgrade firmware remotely without
having to run a TFTP server.

Figure 6.4 ILOM login screen


7 Zenoss integration

Since we have now described all management protocols, approaches and the interfaces
available on Oracle Sun servers, we can start designing and implementing the Zenoss
integration. The resource materials [36-41] were invaluable and provided all the information
needed for designing and implementing the integration.

7.1 Choosing an approach

Zenoss supports both the active and the passive approach. To be able to actively poll
system controllers or agents for data, it is necessary to develop plugins in Python that
extend the Zenoss object model. While the API is not overly complex and ENTITY-MIB
modelling is already present, it would be time consuming to implement the other MIB
(SUN-HW-MONITORING-MIB), and management capabilities would thus be limited
to system controllers with ILOM and eLOM and to SPARC hosts running MASF.
On the other hand, implementing trap handling is easier, and as a result of
implementing support for SUN-PLATFORM-MIB and SUN-HW-TRAP-MIB notifications
many more platforms will be supported:

Eventually, the desired functionality is that of the existing integrations with IBM Tivoli
Enterprise Console [42] or IBM Tivoli NetCool OMNIbus [43].

7.2 Development environment

The development environment was a VirtualBox virtual machine running Debian
GNU/Linux 5.0 with the Zenoss 2.5.1 stack installed (recently updated to 2.5.2).
Development followed Jane Curry's recommendations [40]: the development tree was
stored outside of Zenoss and versioned in a Mercurial repository.

7.3 Important design decisions

7.3.1 Event classes

Zenoss organizes events into event classes. Certain classes already exist,
such as /Hw/Perf. There were two possible approaches:

1. extend existing event classes
2. create a completely separate namespace with new event classes

While the first approach would let the integration fit seamlessly into an
existing environment (especially helpful when users already have some paging, e-mail
or other notifications set up), the second approach guarantees that there will be no
clashes with the existing setup (unless, of course, the user creates their own event
classes with the same names).
As this integration should not break anything in the end user's setup, it has been
decided to create a completely separate namespace.

7.3.2 Per-trap mapping vs. defaultmapping

When Zenoss receives an event (in this case caused by receiving an SNMP notification),
it tries to process the event using the Event Class Key, which is usually the name of
the SNMP notification (provided the MIB is loaded and compiled). To do that, it
searches its database for Event Class Mappings, which play a role similar to
rules in other software.

Figure 7.1 Zenoss Event Processing

When no mapping is found, Zenoss looks for a defaultmapping that may
process the generated event. Although it would be simpler to develop just one block of
code to process these events, there is a concern that running a larger block of code for
every single notification would make the application much slower. Hence, a decision
to create a mapping for every single SNMP notification has been made.
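The trade-off between per-trap mappings and one default mapping can be illustrated with a small, self-contained Python model. This is a conceptual sketch only and does not use the Zenoss API: the event is modelled as a plain dictionary, and the handler functions and the dispatch table are hypothetical names introduced for the illustration.

```python
def to_voltage(event):
    """Small, specific mapping: only assigns the event class."""
    event["eventClass"] = "/Events/Oracle/Voltage"
    return event


def default_mapping(event):
    """Catch-all mapping; a real one would need code for every trap type."""
    event["eventClass"] = "/Events/Oracle/Other"
    return event


# One entry per SNMP notification name (the Event Class Key).
PER_TRAP = {
    "sunHwTrapVoltageNonCritThresholdExceeded": to_voltage,
    "sunHwTrapVoltageOk": to_voltage,
}


def process(event, per_trap=PER_TRAP, fallback=default_mapping):
    """Look up the mapping by key; use the fallback only when none exists."""
    handler = per_trap.get(event.get("eventClassKey"), fallback)
    return handler(event)


print(process({"eventClassKey": "sunHwTrapVoltageOk"})["eventClass"])
```

With per-trap mappings, each notification runs only the short piece of code written for it, at the cost of maintaining many mappings.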

7.4 Development steps

In this section we will describe the steps taken to develop this integration. There is
one step common to all subsequent steps: once it has been verified that the
described action was successful, the resulting objects are added to the ZenPack (called
ZenPacks.ojakubcik.OracleHwMonitoring), the ZenPack is exported and
then committed to the Mercurial repository.

7.4.1 Compiling MIBs

This is arguably the simplest step. It involves copying the MIBs in use to the location
where Zenoss expects them ($ZENHOME/share/mibs/site). The $ZENHOME
environment variable is set by default for the user zenoss.
Then, as user zenoss, one has to run the command

$ zenmib -v 10

to process the new MIBs and load them into Zenoss.
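
The copying part of this step can also be scripted. The following is a minimal sketch, assuming the MIB files sit together in one source directory; the function name install_mibs is hypothetical and not part of Zenoss. It only copies files, so zenmib still has to be run afterwards as described above.

```python
import os
import shutil


def install_mibs(source_dir, zenhome):
    """Copy every file from source_dir into <zenhome>/share/mibs/site."""
    target_dir = os.path.join(zenhome, "share", "mibs", "site")
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    copied = []
    for file_name in sorted(os.listdir(source_dir)):
        source_path = os.path.join(source_dir, file_name)
        # Skip subdirectories; only plain files are treated as MIBs.
        if os.path.isfile(source_path):
            shutil.copy(source_path, target_dir)
            copied.append(file_name)
    return copied
```

When run as the user zenoss, the second argument would typically be os.environ["ZENHOME"].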

7.4.2 Creating Event classes

Before creating mappings, it is necessary to have all event classes to which we
want to map events. Based on the two MIBs used now, the following classes will
be created:

/Events/Oracle
/Events/Oracle/Voltage
/Events/Oracle/Temperature
/Events/Oracle/Electrical Current
/Events/Oracle/Fan Speed
/Events/Oracle/Other
/Events/Oracle/Power Supply
/Events/Oracle/Fan
/Events/Oracle/Processor
/Events/Oracle/Memory
/Events/Oracle/Hard Drive
/Events/Oracle/IO
/Events/Oracle/Slot or Connector
/Events/Oracle/Component
/Events/Oracle/FRU
/Events/Oracle/Power Consumption

These can be created from the GUI by following the Events menu item in the left
navigation bar and then clicking Add New Organizer in the menu to the left of
Subclasses.
However, it is also possible to do this using the tool zendmd, which is essentially
a Python interpreter with preloaded Zenoss classes [44] (this is just a skeleton script;
the full version can be found on the CD in the directory scripts as the file createEventClasses.py):

import Globals
from transaction import commit
from Products.ZenUtils.ZenScriptBase import ZenScriptBase
dmd = ZenScriptBase(connect=True).dmd

event_classes = [
    '/Events/Oracle',
    '/Events/Oracle/Voltage',
    ...
]

for ec in event_classes:
    dmd.Events.manage_addOrganizer(ec)

commit()

As a result, we now have all event classes we need in place and can proceed to
the event mappings creation.

7.4.3 Creating Event mappings

The recommended procedure for creating Event Class mappings is to have the Zenoss
SNMP daemon receive all possible notifications and then to create the mappings
from the GUI. These mappings can then be modified, again from the GUI [39].
However, if we do that for just one notification, we can observe that the following attributes
are present (values filled in automatically are in parentheses); the rest have to be filled in manually:

Name (SNMP trap name, e.g. sunPlatObjectCreation)
Event Class Key (SNMP trap name, e.g. sunPlatObjectCreation)
Sequence (number, in my case 7)
Rule
Regex
Example (snmp trap sunPlatObjectCreation)
Transform
Explanation
Resolution

The meaning of these fields is given in [36]:

Name: An identifier for this event class mapping. Not important for matching
events.
Event Class Key: Must match the incoming event's eventClassKey field
for this mapping to be considered as a match for events.
Sequence: Sequence number of this mapping, among mappings with an
identical event class key property. Go to the Sequence tab to alter its position.
Rule: Provides a programmatic secondary match requirement. It takes a
Python expression. If the expression evaluates to True for an event, this
mapping is applied.
Regex: The regular expression match is used only in cases where the rule
property is blank. It takes a Perl Compatible Regular Expression (PCRE).
If the regex matches an event's message field, then this mapping is applied.
Transform: Takes Python code that will be executed on the event only if
it matches this mapping. For more details on transforms, see the section
titled Event Class Transform.
Explanation: Free-form text field that can be used to add an explanation
field to any event that matches this mapping.
Resolution: Free-form text field that can be used to add a resolution field
to any event that matches this mapping.
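
The way Event Class Key, Sequence, Rule and Regex interact can be sketched as follows. This is an illustrative model, not the Zenoss implementation: events and mappings are plain dictionaries, the function names are hypothetical, Python's re module stands in for PCRE, and the behaviour when both rule and regex are blank is an assumption.

```python
import re


def mapping_matches(mapping, evt):
    """Decide whether one mapping (a dict) applies to an event (a dict)."""
    if mapping["eventClassKey"] != evt["eventClassKey"]:
        return False
    if mapping.get("rule"):
        # The rule is a Python expression evaluated against the event.
        return bool(eval(mapping["rule"], {"evt": evt}))
    if mapping.get("regex"):
        # The regex is consulted only when the rule is blank.
        return re.search(mapping["regex"], evt.get("message", "")) is not None
    # Assumption: with neither rule nor regex, the key match is enough.
    return True


def first_match(mappings, evt):
    """Try mappings in ascending sequence order and return the first hit."""
    for mapping in sorted(mappings, key=lambda m: m["sequence"]):
        if mapping_matches(mapping, evt):
            return mapping["name"]
    return None
```

Because this integration creates one mapping per notification name, the Event Class Key alone is normally enough to select the mapping, and Rule and Regex can stay empty.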

Although we could enter all mappings using the GUI, this would be error
prone and not very efficient. Luckily, as Zenoss is based on Zope, every GUI action
has a corresponding Python function that can be called.
To manipulate event classes, we first need to get the object that represents them.
This can be done with the following method:

dmd.Events.getOrganizer(name)

where name is the full path to the event class organizer.


Each organizer has a method createInstance that takes one parameter: the
identifier of the created mapping (in our case, this will be the name of the notification).
This method returns an instance of EventClassInst, which we will manipulate
further.
EventClassInst has attributes that correspond to the fields described earlier
(e.g. eventClassKey). After creating the new mapping instance, all we need to do
is to set the corresponding attributes using standard Python syntax and finally commit
everything into ZODB (Zope Object Database) by calling commit().
The following list describes which attributes need to or should be set, and
how:

eventClassKey and id shall be set to the translated name of the SNMP
notification.
example shall be set to snmp trap <name>.
transform shall contain Python code that will modify the received event text,
severity and possibly set other values so that clearing will work.
explanation and resolution may contain text explaining the nature of the
event.

The Transform field, corresponding to the transform attribute, will contain different
Python code for notifications from different MIBs. Some of them may be dropped
automatically:

# Drop this event from the active event list (it goes straight to history)
evt._action = "history"

Most of the traps from SUN-HW-TRAP-MIB will have processing similar to this
(please note that although MIBs do specify a user-friendly mapping of integers to
names, Zenoss does not use these mappings):

# Get interesting attributes
component = getattr(evt, 'sunHwTrapComponentName', None)
threshold_type = getattr(evt, 'sunHwTrapThresholdType', None)
threshold_value = getattr(evt, 'sunHwTrapThresholdValue', None)
reading = getattr(evt, 'sunHwTrapSensorValue', None)
if threshold_type == 1:
    # Upper
    thr_type_text = "upper"
    thr_word = "over"
    thr_compare = ">="
elif threshold_type == 2:
    # Lower
    thr_type_text = "lower"
    thr_word = "below"
    thr_compare = "<="
else:
    # Unknown threshold
    evt._action = "drop"
    evt.severity = 2 # Info
    return
evt.summary = "<Sensor type> sensor %{component}s: reading is ..."
evt.component = component
evt.severity = SEVERITY
evt._action = "status"
# 0 = CLEAR, DEBUG, INFO, WARNING, ERROR, CRITICAL = 5
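
To try this logic outside Zenoss, the same branching can be wrapped in a function and fed a stand-in event object. The sketch below is only a test harness under stated assumptions: FakeEvent and threshold_transform are hypothetical names, and the summary wording is made up for illustration because the real text is elided above.

```python
class FakeEvent(object):
    """Hypothetical stand-in for the event object Zenoss passes as evt."""
    severity = 0
    summary = ""
    component = ""
    _action = "status"


def threshold_transform(evt, severity):
    """Apply the threshold logic shown above to evt and return it."""
    words = {1: ("upper", "over"), 2: ("lower", "below")}
    threshold_type = getattr(evt, "sunHwTrapThresholdType", None)
    if threshold_type not in words:
        # Unknown threshold type: discard the event, as in the transform.
        evt._action = "drop"
        evt.severity = 2
        return evt
    kind, word = words[threshold_type]
    evt.component = getattr(evt, "sunHwTrapComponentName", None)
    # Illustrative wording only; the real summary text is not shown here.
    evt.summary = "%s: reading %s is %s the %s threshold %s" % (
        evt.component,
        getattr(evt, "sunHwTrapSensorValue", None),
        word,
        kind,
        getattr(evt, "sunHwTrapThresholdValue", None),
    )
    evt.severity = severity
    evt._action = "status"
    return evt
```

Setting attributes such as sunHwTrapThresholdType on a FakeEvent instance and calling threshold_transform(evt, 5) shows which summary, severity and action a given trap would produce.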

Other notifications will have similar processing. How do we put all this together?
Let us put together an algorithm:

1. Construct a list of notification names.
2. For each notification, assign an Event Class and severity.
3. Based on predefined templates, generate transformation code for each notification.
4. For each notification, find the appropriate organizer (Event Class) and, based on the
previously obtained information, create a mapping.

When this is done, one may end up with the following script. Of course, this is not the
complete script; the full version is present on the CD. First, we need to prepare a list of
notifications, together with their Event Classes:

definitions = []

# No /Events/Oracle needed, that is added automatically

# Sun HW Trap MIB - threshold notifications
for sensor_short, sensor_type, zen_group in [
        ('Voltage', 'Voltage', '/Voltage'),
        ('Temp', 'Temperature', '/Temperature'), ...
        ]:
    for thr_value, severity, threshold_type in [
            ('Fatal', 5, 'non-recoverable'),
            ('Crit', 4, 'critical'),
            ('NonCrit', 3, 'non-critical')]:
        # Parentheses let the expression continue on the next line
        name = ('sunHwTrap' + sensor_short + thr_value +
                'ThresholdExceeded')
        organizer = zen_group
        transform = hw_thr_assert % {
            'severity': severity,
            'type': sensor_type,
            'threshold_type': threshold_type}
        d = {
            'name': name,
            'organizer': organizer,
            'transform': transform}
        definitions.append(d)

Here, hw_thr_assert and hw_thr_deassert are strings that contain the
templates for the transformation scripts to be input into Zenoss.
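
The templates themselves are not reproduced here. As a rough idea of their shape, the following sketch shows a hypothetical, heavily shortened hw_thr_assert; the summary wording is an assumption and the template on the CD may differ. The point it illustrates is that placeholders such as %(severity)d are filled in when the mapping is generated, while every percent sign that must survive into the transform has to be doubled.

```python
# Hypothetical template; the one shipped on the CD may look different.
# %(...)s placeholders are filled in when the mapping is generated, while
# a doubled percent sign survives as a single one for use at run time.
hw_thr_assert = """\
component = getattr(evt, 'sunHwTrapComponentName', None)
evt.summary = '%(type)s sensor %%s exceeded %(threshold_type)s threshold' %% component
evt.component = component
evt.severity = %(severity)d
"""

transform = hw_thr_assert % {
    'severity': 5,
    'type': 'Voltage',
    'threshold_type': 'non-recoverable',
}

# Catch syntax errors in the generated code before it is stored in Zenoss.
compile(transform, '<transform>', 'exec')
```

Compiling each generated transform before storing it is a cheap way to detect template mistakes early, since a broken transform would otherwise surface only when the corresponding trap arrives.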
Once the definitions list is filled with transformation rules, we
can iterate over it and create the mappings in Zenoss:

for definition in definitions:
    # Organizer paths are relative to /Events/Oracle
    org = dmd.Events.getOrganizer('/Events/Oracle' +
                                  definition['organizer'])
    inst = org.createInstance(definition['name'])
    inst.example = 'snmp trap ' + definition['name']
    inst.transform = definition['transform']

Finally, we need to add some preamble to the script:


import Globals
from transaction import commit
from Products.ZenUtils.ZenScriptBase import ZenScriptBase
dmd = ZenScriptBase(connect=True).dmd

Also, we need to commit the changes to the database:

commit()

7.4.4 Adding products

Finally, we may want to add a new manufacturer and a list of products. This again
can be done from the GUI or from the command line using zendmd.
However, the syntax here is not as easy as in the first example, so for the purpose of
this project, products were created by hand using the GUI.
The manufacturer Oracle was added to Zenoss, and a list of servers was created:

Oracle Sun Fire X2250 Server
Oracle Sun Fire X2270 Server
Oracle Sun Fire X4100 M2 Server
Oracle Sun Fire X4200 M2 Server
Oracle Sun Fire X4600 M2 Server
Oracle Sun Fire X4540 Server
Oracle Sun Fire X4140 Server
etc.

7.4.5 Final modications

Even though scripting the creation of the mappings saved a considerable amount
of time, the script may not generate all messages and severities
correctly. Hence, a walk through the generated mappings is recommended, and
modifying the generated code to make it more efficient for the given purpose is encouraged.
Small modifications were needed especially for the notifications that cover more
than one event (sunHwTrapHardDriveStatus) and for most SUN-PLATFORM-MIB
notifications.

7.5 Testing

The optimal approach to testing would be to create automation that simulates
failures on physical machines, which would in turn respond with notifications.
Semi-manual checking would then be required to confirm that the integration works as
expected.
However, due to time constraints and the unavailability of all testing machines, a
different approach was chosen. One server (Oracle Sun SPARC Enterprise T5220
Server) was configured to send notifications from the system controller and the MASF agent
to the same IP address, that of the host running Zenoss with this integration. Hard drives,
power supplies and fans were then removed and reinstalled to verify that traps are received
and cleared.

7.6 Future extension

As of now, the integration has just basic functionality. The following paragraphs describe
possible new features to be developed, possibly as future work of the author.
Testing framework. To ensure this software works, a complete automated testing
framework supporting physical servers needs to be developed and run regularly.
Better clearing mechanism. Right now, due to the way Zenoss handles clearing
events (i.e. only events with cleared severity can clear others), all notifications
ending with Deassert have a severity of cleared. This may not reflect reality, because
even if the sensor reading drops below the non-recoverable threshold, the reading is still
critical and not OK.
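
The severity a Deassert notification would ideally carry can be written down as a small table. The sketch below is only an illustration of the idea: the names are hypothetical, the numbers follow the severity scale used in the transforms (0 = Clear, 5 = Critical), and the mapping assumes that thresholds are crossed one at a time.

```python
# Severity numbers follow the scale used in the transforms
# (0 = Clear ... 5 = Critical); the mapping itself is an assumption.
SEVERITY_AFTER_DEASSERT = {
    'Fatal': 4,    # below non-recoverable, critical threshold still exceeded
    'Crit': 3,     # below critical, non-critical threshold still exceeded
    'NonCrit': 0,  # below all thresholds, the reading is OK again
}


def deassert_severity(threshold_level):
    """Return the severity a Deassert notification would ideally carry."""
    return SEVERITY_AFTER_DEASSERT[threshold_level]
```

Since only events with cleared severity clear others, using such a table would also require a different way of closing the earlier, more severe event.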
Polling. This would mean developing a Zenoss plugin that would discover
and model the server using data obtained by periodically reading MIB data.
Model updates from traps. Instead of, or in addition to, writing to the event console when
an SNMP notification is received, a previously obtained model of the server could
either be updated or a reread of all data could be forced. This of course requires
functional polling; to function properly, the model will need to be updated anyway
from time to time, just to make sure that an SNMP notification wasn't lost en route.
Graphing and reporting. Based on data obtained by the previous two extensions, it
would be possible to implement graphing and reporting, showing for example
temperature trends and, more importantly, power consumption.
8 Conclusion

This project was partially research and partially implementation oriented. As a
result, a brief yet hopefully useful description of system management motivations,
technologies and software was given.
In addition, a basic but functional integration into an open-source system
management tool was developed and tested (albeit only in a limited way), by which this project
fulfilled its assignment.
The author implemented a new and previously unknown (or at least not publicly
described) way to create Event Class mappings programmatically.
However, the original idea of a complete monitoring solution that would do
polling, graphing and notifications simultaneously was not realized. Nonetheless,
even though this solution does not use all features of Zenoss, there is room for
improvement, as described earlier.

References

[1] O. Jakubčík, Selecting open-source system management solution for integrating
with Sun servers (unpublished, 2009). Available on CD.
[2] E. Galstad Nagios Core Version 3.x Documentation. (2009).
[3] Zabbix SIA, Zabbix 1.8 manual.
[4] Zenoss, Inc., Zenoss: getting started (Zenoss, Inc., 2009).
[5] Wikipedia, Simple network management protocol (2010).
[6] M. Rose and K. McCloghrie, RFC1155: Structure and identification of management
information for TCP/IP-based internets (IETF, 1990).
[7] K. McCloghrie and M. Rose, RFC1156: Management Information Base for network
management of TCP/IP-based internets (IETF, 1990).
[8] J. Case, M. Fedor, M. Schoffstall, and J. Davin, RFC1157: Simple Network Management
Protocol (SNMP) (IETF, 1990).
[9] K. McCloghrie, D. Perkins, and J. Schoenwaelder, RFC2578: Structure of Management
Information Version 2 (SMIv2) (IETF, 1999).
[10] ITU, Abstract Syntax Notation One: Specification of basic notation (ITU, 2002a).
[11] ITU, Abstract Syntax Notation One: Information object specification (ITU,
2002b).
[12] ITU, Abstract Syntax Notation One: Constraint specification (ITU, 2002c).
[13] ITU, Abstract Syntax Notation One: Parameterization of ASN.1 specifications
(ITU, 2002d).
[14] Intel, HP, NEC, and Dell, Intelligent Platform Management Interface Specification
(Intel, 2009). Second generation, v2.0.
[15] DMTF, Inc., Web-based enterprise management (wbem) faqs (DMTF, Inc., 2010).
[16] The Open Group OpenPegasus. (2010). www.openpegasus.org.
[17] D. Libes, The expect home page (Don Libes, 2009). http://expect.nist.gov/.
[18] R. Gerhards, RFC5424: The Syslog Protocol (IETF, 2009).
[19] R. Thurlow, RFC5531 RPC: Remote Procedure Call Protocol Specification Version
2 (IETF, 2009).
[20] Object Management Group, Inc. Common Object Request Broker Architecture
(CORBA) Specification, Version 3.1. (2008).
[21] World Wide Web Consortium SOAP Version 1.2 Part 1: Messaging Framework.
(2007). Second edition.
[22] D. Winer, Xml-rpc specification (xml-rpc.com, 1999).
[23] Sun Microsystems, Inc. Sun Advanced Lights Out Manager (ALOM) 1.6
Administration Guide. (2007b). 819-2445-11.

[24] Sun Microsystems, Inc. Advanced Lights Out Management (ALOM) CMT v1.4
Guide. (2007a). 819-7991-10.
[25] Sun Microsystems, Inc. Embedded Lights Out Manager Administration Guide: For
the Sun Fire X2200 M2 and Sun Fire X2100 M2 Servers. (2009). 819-6588-14.
[26] Oracle, Inc. Oracle Integrated Lights Out Manager (ILOM) 3.0 Getting Started
Guide. (2010c). 820-5523-11.
[27] Oracle, Inc. Oracle VM Server for SPARC. (2010e). (formerly LDOMS).
[28] Sun Microsystems, Inc. Sun SNMP Management Agent for Sun Fire and Netra
Systems. (2004).
[29] Oracle, Inc. Sun Server Management Agents 2.0 User's Guide. (2010b).
821-1610.
[30] K. McCloghrie and A. Bierman, RFC2737: Entity MIB (Version 2) (IETF, 1999).
Obsoleted by RFC 4133.
[31] Intel, HP, NEC, and Dell Platform Event Trap Format Specification. v1.0.
[32] Oracle, Inc. Auto Service Request (ASR) v2.6: Installation and Operations
Guide. (2010a). http://wikis.sun.com/display/ASRSO/Home.
[33] D. Laurie IPMItool. (2007). http://ipmitool.sourceforge.net/.
[34] Oracle, Inc. IPMItool. (2010d). http://www.sun.com/system-
management/tools.jsp.
[35] Sun Microsystems, Inc., PSARC 2008/119 sun4v /dev/bmc (Sun Microsystems,
Inc., 2008). (not available publicly).
[36] Zenoss, Inc. Zenoss Administration. (2010b).
[37] Zenoss, Inc. Zenoss Developer's Guide. (2010c).
[38] Zenoss, Inc., Zenoss 2.5 source code documentation (Zenoss, Inc., 2010a).
[39] J. Curry Zenoss Event Management. (2010). version 3.
[40] J. Curry, Creating Zenoss ZenPacks (Jane Curry, 2009a).
[41] J. Curry Crafting Zenoss Core users for events and zProperties. (2009b). draft.
[42] Sun Microsystems, Inc. Monitoring Sun Servers in an IBM Tivoli Enterprise
Console Environment. (2009b).
[43] Sun Microsystems, Inc. Monitoring Sun Servers in an IBM Tivoli Netcool/OMNIbus
Environment. (2009a).
[44] N. Brockett, batchaddlocations.py (Zenoss, Inc., 2009).
A CD Contents

As a part of this project, a CD was created. It contains the following files and
directories:

Others/: Directory containing other documents.
Project/: Directory containing the PDF file of this project.
RFC/: Directory containing RFCs.
ZenPack/: Directory containing source files for the ZenPack.
README: Description of the files on the CD.
