
Section 5 - Monitoring and Managing the Data Center

Introduction


Welcome to Section 5 of Storage Technology Foundations – Monitoring and Managing the Data
Center.
Copyright © 2006 EMC Corporation. All rights reserved.
These materials may not be copied without EMC's written consent.
EMC believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC
CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND
WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY
DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an
applicable software license.
EMC2, EMC, Navisphere, CLARiiON, and Symmetrix are registered trademarks and EMC
Enterprise Storage, The Enterprise Storage Company, The EMC Effect, Connectrix, EDM,
SDMS, SRDF, Timefinder, PowerPath, InfoMover, FarPoint, EMC Enterprise Storage Network,
EMC Enterprise Storage Specialist, EMC Storage Logix, Universal Data Tone, E-Infostructure,
Access Logix, Celerra, SnapView, and MirrorView are trademarks of EMC Corporation.
All other trademarks used herein are the property of their respective owners.


Section Objectives
Upon completion of this section, you will be able to:
- Describe areas of the data center to monitor
- Discuss considerations for monitoring the data center
- Describe techniques for managing the data center


The objectives for this section are shown here. Please take a moment to read them.


In This Section
This section contains the following modules:
- Monitoring in the Data Center
- Managing in the Data Center


This section contains two modules: Monitoring in the Data Center and Managing in the Data Center.


Apply Your Knowledge


The following modules contain Apply Your Knowledge information (available in the Student Resource Guide):
- Monitoring in the Data Center
- Managing in the Data Center


Please note that certain modules of this section contain Apply Your Knowledge information that is only available in the Student Resource Guide.


Monitoring in the Data Center


After completing this module, you will be able to:
- Discuss data center areas to monitor
- List metrics to monitor for different data center components
- Describe the benefits of continuous monitoring
- Describe the challenges in implementing a unified and centralized monitoring solution in heterogeneous environments
- Describe industry standards for data center monitoring


In this module, you will learn about different aspects of monitoring data center components,
including the benefits of pro-active monitoring and the challenges of managing a heterogeneous
environment (multiple hardware/software from various vendors).


Monitoring Data Center Components


[Figure: clients on an IP network connect to clustered hosts/servers with applications, which connect through HBAs and the SAN to storage array ports; each layer is monitored for health, capacity, performance, and security]

The Business Continuity Overview module discussed the importance of resolving all single
points of failure when designing data centers. Having designed a resilient data center, the next
step is to ensure that all components that make up the data center are functioning properly and
are available on a 24x7 basis. The way to achieve this is by monitoring the data center on a
continual basis.
System monitoring is essential to ensure that the underlying IT infrastructure and business-critical applications are operational and optimized. The main objective is to ensure that the various hosts, network systems, and storage are running smoothly, and to know how heavily loaded each system and component is and how effectively it is being utilized.
The major components within the data center that should be monitored include:
- Servers, databases, and applications
- Networks (SAN and IP networks: switches, routers, bridges)
- Storage arrays
Each of these components should be monitored for health, capacity, performance, and security.


Why Monitor Data Centers


- Availability
  - Continuous monitoring ensures availability
  - Warnings and errors are fixed proactively
- Scalability
  - Monitoring allows for capacity planning/trend analysis, which in turn helps to scale the data center as the business grows
- Alerting
  - Administrators can be informed of failures and potential failures
  - Corrective action can be taken to ensure availability and scalability


Continuous monitoring of health, capacity, performance and security of all data center
components is critical to ensure data availability and scalability. For example, information about
component failures can be sent to appropriate personnel for corrective actions.
Ongoing trends show that the data storage environment continues to grow at a rapid pace.
According to the International Data Corporation (IDC), external storage-system capacity growth
will increase at a compound annual growth rate (CAGR) of approximately 50% through 2007.
This represents a doubling of the current capacity every 2 years or so. Automated monitoring
and alerting solutions are becoming increasingly important.
Monitoring the data center closely and effectively optimizes data center operations and avoids
downtime.


Monitoring Health
- Why monitor health of different components
  - Failure of any hardware/software component can lead to outage of a number of different components
    - Example: a failed HBA could cause degraded access to a number of data devices in a multi-path environment, or loss of data access in a single-path environment
- Monitoring health is fundamental and is easily understood and interpreted
  - At the very least, health metrics should be monitored
  - Typically, health issues need to be addressed at high priority


Health deals with the status/availability of a particular hardware component or software process (e.g., status of a SAN device or port, database instance up/down, HBA status, disk/drive failure).
If a component has failed, it could lead to downtime unless redundancy exists.
Monitoring the health of data center components is very important and is easy to understand and interpret (a component is either available or it has failed). Monitoring for capacity, performance, and security depends on the health and availability of the different components.


Monitoring Capacity
- Why monitor capacity
  - Lack of proper capacity planning can lead to data unavailability and the inability to scale
  - Trend reports can be created from all the capacity data
    - The enterprise is well informed of how IT resources are utilized
- Capacity monitoring prevents outages before they can occur
  - More preventive and predictive in nature than health metrics
    - Based on reports, one knows that a file system is 90% full and that it is filling up at a particular rate
    - If 95% of the ports in a particular SAN fabric are utilized, a new switch should be added if more arrays/servers are to be added to the same fabric


From a monitoring perspective, capacity deals with the amount of resources available.
Examples:
- Available free/used space on a file system or a database table space
- Amount of space left in a RAID group
- Amount of disk space available on storage arrays
- Amount of file system or mailbox quota allocated to users
- Number of available ports in a switch (e.g., 52 out of 64 ports in use, leaving 12 free ports for expansion)


Monitoring Performance
- Why monitor performance metrics
  - Want all data center components to work efficiently/optimally
  - See whether components are pushing performance limits or are being underutilized
  - Can be used to identify performance bottlenecks
- Performance monitoring/analysis can be extremely complicated
  - Dozens of inter-related metrics, depending on the component in question
  - Most complicated of the various aspects of monitoring


Performance monitoring measures the efficiency of operation of different data center components.
Examples:
- Number of I/Os through a front-end port of a storage array
- Number of I/Os to disks in a storage array
- Response time of an application
- Bandwidth utilization
- Server CPU utilization


Monitoring Security
- Why monitor security
  - Prevent and track unauthorized access
    - Accidental or malicious
- Enforcing security and monitoring for security breaches is a top priority for all businesses


Security monitoring prevents and tracks unauthorized access.
Examples of security monitoring are:
- Login failures
- Unauthorized storage array configuration/re-configuration
- Monitoring physical access (via badge readers, biometric scans, video cameras, etc.)
- Unauthorized zoning and LUN masking in SAN environments, or changes to existing zones


Monitoring Servers
- Health
  - Hardware components
    - HBA, NIC, graphics card, internal disk, ...
  - Status of various processes/applications
- Capacity
  - File system utilization
  - Database
    - Table space/log space utilization
  - User quota


Any failure of a hardware component, such as an HBA or NIC, should be immediately detected and rectified. As seen earlier, component redundancy can prevent total outage. Mission-critical applications running on the servers should also be monitored continuously. A database might spawn a number of processes that are required to ensure operations. Failure of any of these processes can cause non-availability of the database. Databases and applications usually have mechanisms to detect such errors and report them.
Capacity monitoring on a server involves monitoring file system space utilization. By continuously monitoring file system free space, you can estimate the growth rate of the file system and predict when it will become 100% full. Corrective action, such as extending the file system, can be taken well ahead of time to avoid a file-system-full condition.
In many environments, system administrators enforce space utilization quotas on users. For example, a user cannot exceed 10 GB of space, or a particular file cannot be greater than 100 MB.
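To make the growth-rate estimate described above concrete, here is a minimal Python sketch; the usage samples and the 500 GB capacity are made-up illustration values, not figures from the course. It fits a straight line to periodic usage readings and extrapolates to the point where the file system would be 100% full.

# Hedged sketch: estimate days until a file system fills, from periodic usage samples.
# The samples and total capacity below are hypothetical illustration values.

SAMPLES = [            # (day, bytes used), as collected by a monitoring agent
    (0, 300 * 10**9),
    (7, 320 * 10**9),
    (14, 345 * 10**9),
]
CAPACITY_BYTES = 500 * 10**9


def days_until_full(samples, capacity):
    """Least-squares linear fit of usage versus time, extrapolated to capacity."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    slope = (sum((t - mean_t) * (u - mean_u) for t, u in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))   # bytes per day
    if slope <= 0:
        return None                                          # not growing
    last_t, last_u = samples[-1]
    return (capacity - last_u) / slope                       # days after the last sample


remaining = days_until_full(SAMPLES, CAPACITY_BYTES)
if remaining is not None:
    print(f"File system projected to reach 100% full in about {remaining:.0f} days")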


Monitoring Servers
- Performance
  - CPU utilization
  - Memory utilization
  - Transaction response times
- Security
  - Login
  - Authorization
  - Physical security
    - Data center access


Two key metrics of performance of servers are the CPU and memory utilization. A continuously
high value (above 80%) for CPU utilization is an indication that the server is running out of
processing power. During periods of high CPU utilization, applications running on the server,
and consequently end-users of the application, will experience slower response times. Corrective
action could include upgrading processors, adding more processors, shifting some applications to different servers, or restricting the number of simultaneous client accesses. Databases,
applications, and file systems utilize Server physical memory (RAM) to stage data for
manipulation. When sufficient memory is not available, data has to be paged in and out of disks.
This process will also result in slower response times.
Login failures and attempts by unauthorized users to execute code or launch applications should
be closely monitored to ensure secure operations.


Monitoring the SAN


- Health
  - Fabrics
    - Fabric errors, zoning errors
  - Ports
    - Failed GBIC, status/attribute change
  - Devices
    - Status/attribute change
  - Hardware components
    - Processor cards, fans, power supplies
- Capacity
  - ISL utilization
  - Aggregate switch utilization
  - Port utilization

Uninterrupted access to data over the SAN depends on the health of its physical and logical
components. The GBICs, power supplies, and fans in switches, along with the cables, are the physical components. Any failure in these must be immediately reported. Constructs such as zones and
fabrics are the logical components. Errors in zoning such as specifying the wrong WWN of a
port will result in failure to access that port. These have to be monitored, reported, and rectified
as well.
By way of capacity, the number of ports on different switches that are currently used/free should
be monitored. This will aid in planning expansion by way of adding more Servers or connecting
to more storage array ports. Utilization metrics at the switch level and port level, along with
utilization of Interswitch Links (ISLs), are also a part of SAN capacity measurements. These can
be viewed as being a part of performance metrics as well.


Monitoring the SAN


- Performance
  - Connectivity ports
    - Link failures
    - Loss of signal
    - Loss of synchronization
    - Link utilization
    - Bandwidth (MB/s or frames/s)
  - Connectivity devices
    - Statistics are usually a cumulative value of all the port statistics


A number of SAN performance/statistical metrics can be used to determine or predict hardware failure (health). For example, an increasing number of link failures may indicate that a port is about to fail. The following metrics describe these failures:
- Link Failures - the number of link failures occurring on a connectivity device port. A high number of failures could indicate a hardware problem (bad port, bad cable, ...)
- Loss of Signal - the number of loss-of-signal events occurring on a connectivity device port. A high number indicates a possible hardware failure.
- Loss of Synchronization - the number of loss-of-synchronization events occurring on a connectivity device port. High counts may indicate hardware failure.
Connectivity device port performance can be measured with the Receive or Transmit Link Utilization metrics. These calculated values give a good indication of how busy the switch port is relative to its assumed maximum throughput. Heavily used ports can cause queuing delays on the host. A simple way to compute such a utilization figure is sketched below.
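Here is a minimal Python sketch of that calculation; the byte-counter readings, polling interval, and the assumed 2 Gb/s link rate are hypothetical values, not figures from the course.

# Hedged sketch: derive a port utilization percentage from two byte-counter readings
# taken poll_interval seconds apart. All values below are illustrative.

ASSUMED_LINK_BYTES_PER_SEC = 2 * 10**9 / 8     # assumed 2 Gb/s link, expressed in bytes/s


def link_utilization(bytes_before, bytes_after, poll_interval,
                     link_speed=ASSUMED_LINK_BYTES_PER_SEC):
    """Receive or transmit utilization as a percentage of assumed maximum throughput."""
    throughput = (bytes_after - bytes_before) / poll_interval   # bytes per second
    return 100.0 * throughput / link_speed


# Counters sampled 60 seconds apart: (9.6e9 - 1.2e9) / 60 = 140 MB/s on a 250 MB/s link
util = link_utilization(1_200_000_000, 9_600_000_000, poll_interval=60)
print(f"Port utilization: {util:.1f}%")        # prints 56.0%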


Monitoring the SAN


- Security
  - Zoning
    - Ensure communication between dedicated sets of ports (HBA and storage ports)
  - LUN masking
    - Ensure that only certain hosts have access to certain storage array volumes
  - Administrative tasks
    - Restrict administrative tasks to a select set of users
    - Enforce strict passwords
  - Physical security
    - Access to the data center should be monitored


SAN Security includes monitoring the fabrics for any zoning changes. Any errors in the zone set
information can lead to data inaccessibility. Unauthorized zones can compromise data security.
User login/authentication to switches should be monitored to audit administrative changes.
Ensure that only authorized users are allowed to perform LUN masking tasks. Any such tasks
performed should be audited for proper authorization.


Monitoring Storage Arrays


- Health
  - All hardware components
    - Front end
    - Back end
    - Memory
    - Disks
    - Power supplies
    - ...
  - Array operating environment
    - RAID processes
    - Environmental sensors
    - Replication processes


Storage arrays typically have redundant components to function when individual components
fail. Performance of the array might be affected during such failures. Failed components should
be replaced quickly to restore optimal performance. Some arrays include the capability to send a
message to the vendor’s support center in the event of hardware failures. This feature is typically
known as “call-home”.
It is equally important to monitor the various processes of the storage array operating
environment. For example, failure of replication tasks will compromise disaster recovery
capabilities.


Monitoring Storage Arrays


- Capacity
  - Configured/unconfigured capacity
  - Allocated/unallocated storage
  - Fan-in/fan-out ratios
- Performance
  - Front-end utilization/throughput
  - Back-end utilization/throughput
  - I/O profile
  - Response time
  - Cache metrics


Physical disks in a storage array are partitioned into LUNs for use by hosts.
- Configured capacity is the amount of space that has been partitioned into LUNs
- Unconfigured capacity is the remaining space on the physical disks
Allocated storage refers to LUNs that have been masked for use by specific hosts/servers.
Unallocated storage refers to LUNs that have been configured, but not yet been masked for host
use.
Monitoring storage array capacity enables you to predict and react to storage needs as they
occur.
Fan-in/fan-out ratios and the availability of unused front-end ports (ports to which no host has yet been connected) are useful when new hosts/servers have to be given access to the storage array.
Performance: Numerous performance/statistical metrics can be monitored for storage arrays.
Some of the key metrics to monitor are the utilization rates of the various components that make
up the storage arrays. Extremely high utilization rates can lead to performance degradation.


Monitoring Storage Arrays


- Security
  - LUN access
    - Ensure that only certain hosts have access to certain storage array volumes
    - Disallow WWN spoofing
  - Administrative tasks
    - Most arrays allow the restriction of various array configuration tasks
      - Device configuration
      - LUN masking
      - Replication operations
      - Port configuration
  - Physical security
    - Monitor access to the data center


World Wide Name (WWN) spoofing is a security concern. For example, an unauthorized host can be configured with an HBA that has the same WWN as an authorized host. If this host is then connected to the storage array via the same SAN, zoning and LUN masking restrictions will be bypassed. Storage arrays have mechanisms in place that can prevent such security breaches.
Auditing array device configuration tasks, as well as replication operations, is important to ensure that only authorized personnel perform them.


Monitoring IP Networks
- Health
  - Hardware components
    - Processor cards, fans, power supplies, ...
  - Cables
- Performance
  - Bandwidth
  - Latency
  - Packet loss
  - Errors
  - Collisions
- Security


Network performance is vital in a storage environment. Monitor network latency, packet loss,
availability, traffic, and bandwidth utilization for:
− I/O (Bandwidth Usage)
− Errors
− Collisions


Monitoring the Data Center as a Whole


- Monitor the data center environment
  - Temperature, humidity, airflow, hazards (water, smoke, etc.)
  - Voltage/power supply
- Physical security
  - Facility access (monitoring cameras, access cards, etc.)


Monitoring the environment of a data center is just as crucial as monitoring the different components. Most electrical/electronic equipment is extremely sensitive to heat, humidity, voltage fluctuations, and so on. Data center layout and design have to account for correct levels of ventilation, accurate control of temperature and humidity, uninterrupted power supplies, and correction of voltage fluctuations. Any changes to the environment should be monitored and reported immediately. Physical security is easy to understand.


End-to-End Monitoring
[Figure: end-to-end view of clients, clustered hosts/servers with applications, IP network, SAN, and storage arrays - a single failure produces multiple symptoms; root cause analysis identifies the failure and its business impact]

A good end-to-end monitoring system should be able to quickly analyze the impact that a single failure can cause. The monitoring system should be able to deduce that a set of seemingly unrelated symptoms is the result of a single root cause. It should also be able to alert on the business impact arising from different component failures.


Monitoring Health: Array Port Failure

[Figure: hosts H1, H2, and H3, each with two HBAs, connect through switches SW1 and SW2 to shared ports on the storage array; with one array port failed, all three hosts are shown as degraded]

Here is an example of the importance of end-to-end monitoring. In this example, three servers (H1, H2, and H3) have two HBAs each and are connected to the storage array via two switches (SW1 and SW2). The three servers share the same storage ports on the storage array.
If one of the storage array ports fails, it will have the following effects on the whole data center:
- Since all servers are sharing the ports, all the storage volumes that were accessed via SW1 will be unavailable.
- The servers will experience path failures. Redundancy enables them to continue operations via SW2.
- The applications will experience reduced (degraded) performance, because the number of available paths to the storage devices has been cut in half.
- If the applications belong to different business units, all of those business units would be affected even though only a single port has failed.
This example illustrates the importance of monitoring the health of storage arrays. By constantly monitoring the array, you can detect the fault as soon as it happens and fix it right away, minimizing the time that applications have to run in a degraded mode.


Monitoring Health: HBA failure


[Figure: the same configuration - hosts H1, H2, and H3 with two HBAs each, switches SW1 and SW2, shared storage array ports; one HBA on H1 has failed, leaving only H1 degraded]

The scenario presented here is the same as the previous one (three servers H1, H2, and H3 have two HBAs each and are connected to the storage array via two switches, SW1 and SW2; the three servers share the same storage ports on the storage array). In this example, if there is a single HBA failure, the server with the failed HBA will experience path failures to the storage devices that it had access to. Application performance on this server will be affected.


Monitoring Health: Switch Failure

[Figure: hosts/servers with applications connect to the storage array through switches SW1 and SW2; with SW1 failed, all hosts are degraded]


In this example, a number of servers (with 2 HBAs each) are connected to the storage array via
two switches (SW1 and SW2). Each server has independent paths (2 HBAs) to the storage array
via switch SW1 and switch SW2.
What happens if there is a complete switch failure of switch SW1?
All the hosts that were accessing storage volumes via switch SW1 will experience path failures.
All applications on the servers will run in a degraded mode. Notice that the failure of a single
component (a switch in this case) has a ripple effect on many data center components.


Monitoring Capacity: Array

[Figure: several hosts, each with two HBAs, connect through switches SW1 and SW2 to the storage array; a new server is to be added. Caption: Can the array provide the required storage to the new server?]


This example illustrates the importance of monitoring the capacity of arrays.
A number of servers (with two HBAs each) are connected to the storage array via two switches (SW1 and SW2). Each server has independent paths (two HBAs) to the storage array, one via switch SW1 and one via switch SW2. Each of the servers has been allocated storage on the storage array.
A new server now has to be deployed, and an application on that server has to be given access to storage devices from the array via switches SW1 and SW2. Monitoring the amount of configured and unconfigured space on the array is critical for deciding whether this is possible. Proactive monitoring will help from the initial planning stages to final deployment.


Monitoring Capacity: Servers File System Space

[Figure: two panels comparing a file system with no monitoring (it simply fills up) and a monitored file system that raises "Warning: FS is 66% Full" and "Critical: FS is 80% Full" alerts, prompting the administrator to extend the file system]


This example illustrates the importance of monitoring capacity on servers.
- On the left is an application server that is writing to a file system without any monitoring of the file system capacity. Once the file system is full, the application will no longer be able to function.
- On the right is a similar setup: an application server is writing to a file system, but in this case the file system is monitored. A warning is issued at 66% full, then a critical message at 80% full. We can take action and extend the file system before the file-system-full condition is reached.
Proactively monitoring the file system can prevent application outages caused by lack of file system space.
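As an illustration of the check such a monitor performs, here is a minimal Python sketch using the standard library's shutil.disk_usage; the mount point is a placeholder, and the 66%/80% thresholds simply mirror the example above.

# Hedged sketch: classify a file system's fullness against warning/critical thresholds.
# The mount point and threshold values mirror the example above and are illustrative.
import shutil

WARNING_PCT = 66
CRITICAL_PCT = 80


def check_filesystem(mount_point):
    usage = shutil.disk_usage(mount_point)        # named tuple: total, used, free (bytes)
    pct_full = 100 * usage.used / usage.total
    if pct_full >= CRITICAL_PCT:
        return f"Critical: {mount_point} is {pct_full:.0f}% full"
    if pct_full >= WARNING_PCT:
        return f"Warning: {mount_point} is {pct_full:.0f}% full"
    return None                                   # nothing to report


message = check_filesystem("/")
if message:
    print(message)   # a real monitor would alert the administrator and extend the FS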


Monitoring Performance: Array Port Utilization


[Figure: hosts H1, H2, and H3 share storage array ports through switches SW1 and SW2; a new server H4 is to be added to the same ports. A line graph plots the combined port utilization of H1 + H2 + H3 against the 100% limit]

This example illustrates the importance of monitoring performance metrics on storage arrays.
Three Servers (H1, H2 and H3) have two HBAs each and are connected to the storage array via
two switches (SW1 and SW2). The three servers share the same storage ports on the storage
array. A new server H4 has to be deployed and must share the same storage ports as H1, H2 and
H3.
To ensure that the new server does not adversely affect the performance of the others, it is
important to monitor the array port utilization. In this example, the utilization for the shared
ports is shown using the green and red lines in the line graph. If the actual utilization prior to
deploying the new server is the green line, then there is room to add the new server. Otherwise,
the deployment of the new server will impact performance of all servers.
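The decision itself comes down to simple arithmetic, as this minimal Python sketch shows; the utilization figures and the 70% planning ceiling are hypothetical, not values from the course.

# Hedged sketch: can a new host share an already-loaded array port?
# The utilization figures and planning ceiling are made-up illustration values.

current_utilization_pct = 45.0    # combined H1 + H2 + H3 load on the shared port
estimated_new_load_pct = 15.0     # expected additional load from H4
PLANNING_CEILING_PCT = 70.0       # leave headroom below 100%

projected = current_utilization_pct + estimated_new_load_pct
if projected <= PLANNING_CEILING_PCT:
    print(f"OK to share the port: projected utilization {projected:.0f}%")
else:
    print(f"Do not share the port: projected utilization {projected:.0f}% "
          f"exceeds the {PLANNING_CEILING_PCT:.0f}% planning ceiling")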


Monitoring Performance: Servers


[Figure: Windows Task Manager CPU and memory usage graphs, with an alert callout: "Critical: CPU Usage above 90% for the last 90 minutes"]


Most servers have tools that allow you to interactively monitor CPU usage. For example, Windows Task Manager displays the CPU and memory usage (as shown above). Interactive tools are fine if only a few servers are being managed. In a data center with potentially hundreds of servers, the tool must be capable of monitoring many servers simultaneously. The tool should send a warning to the system administrator whenever the CPU utilization exceeds a specified threshold.
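A minimal sketch of such a threshold check in Python, assuming the third-party psutil package is installed; the 90% threshold mirrors the callout above, and alerting is reduced to a print statement.

# Hedged sketch: warn when CPU utilization crosses a threshold.
# Assumes the third-party psutil package; the 90% threshold mirrors the example above.
import psutil

CPU_THRESHOLD_PCT = 90


def check_cpu():
    usage = psutil.cpu_percent(interval=1)   # average CPU utilization over one second
    if usage > CPU_THRESHOLD_PCT:
        # a real monitor would e-mail or page the system administrator here
        print(f"Critical: CPU usage {usage:.0f}% exceeds {CPU_THRESHOLD_PCT}%")
    return usage


check_cpu()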


Monitoring Security: Servers

[Figure: three successive login attempts against server H4, with an alert callout: "Critical: Three successive login failures for username 'Bandit' on server 'H4', possible security threat"]


Login failures could be accidental (mistyping) or could be the result of a deliberate attempt to
break into a system. Most servers will usually allow two successive login failures and will not
allow any more attempts after a third successive login failure. In most environments, this
information may simply be logged in a system log file. Ideally, you should monitor for such
security events. In a monitored environment, when there are three successive login failures, a message can be sent to the system administrator to warn of a possible security threat.
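The detection logic itself can be very small. Below is a minimal Python sketch that flags three successive login failures by the same user; the event list, username, and host name are made-up stand-ins for parsed server log entries.

# Hedged sketch: flag three successive login failures per user.
# The event list is a made-up stand-in for parsed server log entries.

EVENTS = [
    ("Bandit", "FAIL"), ("Bandit", "FAIL"), ("Bandit", "FAIL"),
    ("alice", "FAIL"), ("alice", "OK"),
]

MAX_SUCCESSIVE_FAILURES = 3


def detect_login_abuse(events, host="H4"):
    streaks = {}                                  # user -> current failure streak
    alerts = []
    for user, outcome in events:
        if outcome == "FAIL":
            streaks[user] = streaks.get(user, 0) + 1
            if streaks[user] == MAX_SUCCESSIVE_FAILURES:
                alerts.append(f"Critical: {MAX_SUCCESSIVE_FAILURES} successive login "
                              f"failures for username '{user}' on server '{host}'")
        else:
            streaks[user] = 0                     # a successful login resets the streak
    return alerts


for alert in detect_login_abuse(EVENTS):
    print(alert)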


Monitoring Security: Array – Local Replication

[Figure: a storage array shared by two workgroups (WG1 and WG2) through switches SW1 and SW2; a replication command from a WG1 host against WG2 devices triggers the alert: "Warning: Attempted replication of WG2 devices by WG1 user - Access denied"]


This example illustrates the importance of monitoring security breaches in a storage array. A storage array is a shared resource. In this example, the array is being shared between two workgroups. The data of WG1 should not be accessible by WG2; likewise, the data of WG2 should not be accessible by WG1.
A user from WG1 may try to make a local replica of the data that belongs to WG2. Typically, mechanisms will be in place to prevent such an action. If this action is not monitored or recorded in some fashion, the administrator will be unaware that someone is trying to violate security protocols. But if this action is monitored, a warning message can be sent to the storage administrator.


Monitoring: Alerting of Events


- Warnings require administrative attention
  - File systems becoming full
  - Soft media errors
- Errors require immediate administrative attention
  - Power failures
  - Disk failures
  - Memory failures
  - Switch failures


Monitoring systems allow administrators to assign different severity levels to different conditions in the data center. Health-related alerts will usually be classified as critical or fatal, meaning that a failure in a component has immediate adverse consequences. Other alerts can be arranged in a spectrum from Information to Fatal.
Generically:
- Information - useful information requiring no administrator intervention, e.g., an authorized user has logged in
- Warning - administrative attention is required, but the situation is not critical. An example may be that a file system has reached the 75% full mark; the administrator has time to decide what action should be taken
- Fatal - immediate attention is required, because the condition will affect system performance or availability. If a disk fails, for example, the administrator must ensure that it is replaced quickly.
The sources of monitoring messages may include hardware components, such as servers and storage systems, and software components, such as applications.
Continuous monitoring, in combination with automated alerting, enables administrators to:
- Reactively respond to failures quickly
- Proactively avert failures by looking at trends in utilization and performance
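As a minimal illustration of how these severity levels might be expressed in a monitoring script, here is a Python sketch; the event names, thresholds, and dispatch actions are hypothetical.

# Hedged sketch: map monitored conditions to Information / Warning / Fatal severities.
# Event names, thresholds, and dispatch actions are illustration values.
from enum import Enum


class Severity(Enum):
    INFORMATION = 1
    WARNING = 2
    FATAL = 3


def classify(name, value):
    """Assign a severity to a simple (name, value) monitoring event."""
    if name == "disk_failed" and value:
        return Severity.FATAL
    if name == "fs_percent_full" and value >= 75:
        return Severity.WARNING
    return Severity.INFORMATION


def dispatch(name, value):
    severity = classify(name, value)
    if severity is Severity.FATAL:
        print("page the on-call administrator:", name, value)
    elif severity is Severity.WARNING:
        print("e-mail the administrator:", name, value)
    else:
        print("log only:", name, value)


for event in [("user_login", "authorized"), ("fs_percent_full", 78), ("disk_failed", True)]:
    dispatch(*event)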


Monitoring: Challenges
[Figure: a heterogeneous data center - servers (UNIX, Windows, mainframe), databases (Oracle, Informix, MS SQL), applications, SAN and IP networks (Cisco, McData, Brocade), and storage from many vendors (EMC, Hitachi, NetApp, HP, IBM, Sun) spanning DAS, NAS, SAN, CAS, and tape library units]



The core elements of the data center are the storage arrays, networks, servers, databases, and applications.
- Storage could be NAS, CAS, DAS, SAN-attached, or tape/disk library units
- The network consists of the SAN and the IP network
- Servers could be open systems (UNIX or Windows) or mainframe. There are numerous vendors who supply these data center components
The challenge is to monitor and manage each of these components. Typically, each vendor provides monitoring/management tools for its own components. As a consequence, in order to successfully monitor and manage a data center, administrators must learn multiple tools and terminologies. In an environment where multiple tools are in use, it is almost impossible to get a complete picture of what is going on from a single screen.
Most data center components are inter-related (e.g., a Sun host is connected to an EMC storage array via a Cisco SAN). In an ideal world, the monitoring tool should be able to correlate the information from all objects in one place, so that you can make an informed decision on any of the metrics being monitored.


Monitoring: Ideal Solution


[Figure: a single monitoring/management engine with one user interface collecting information from storage arrays (DAS, NAS, SAN, CAS, tape library units), SAN and IP networks, servers (UNIX, Windows, mainframe), databases, and applications]


The ideal solution for monitoring all data center objects from all vendors would be a monitoring/management engine that can gather information on all the objects and manage them all via a single user interface.
The engine should also be able to perform root cause analysis and indicate how individual component failures affect various business units.
- Single interface to monitor all objects in the data center
- Root cause analysis - multiple symptoms may be triggered by a single root cause
- Shows how individual component failures affect various business units
- Should have a mechanism to inform administrators of events via e-mail, page, SNMP traps, etc.
- Should provide the ability to generate reports


Without Standards
- No common access layer between managed objects and applications - vendor specific
- No common data model
- No interconnect independence
- Multi-layer management difficulty
- Legacy systems can not be accommodated
- No multi-vendor automated discovery
- Policy-based management is not possible across entire classes of devices
[Figure: separate, disconnected management silos - network management, applications management, host management, storage management, database management - with interoperability as the missing piece]


SAN Administrators have often been faced with the dilemma of integrating multi-vendor
hardware and software under a single management umbrella. It is relatively easy for
administrators to monitor individual switches. But, monitoring a set of switches together and
correlating data is a more complex challenge.
Users and administrators want the flexibility to select the most suitable products for a particular
application or set of applications and then easily integrate those products into their computing
environments. Traditionally this has not been possible for the reasons listed above.
Without standards, policy-based management is not possible across entire classes of devices.
This poses a big dilemma for diverse environments.


Simple Network Management Protocol (SNMP)


- SNMP
  - Meant for network management
  - Inadequate for complete SAN management
- Limitations of SNMP
  - No common object model
  - Security - only newer SAN devices support v3
  - Positive response mechanism
  - Inflexible - no auto-discovery functions
  - No ACID (Atomicity, Consistency, Isolation, and Durability) properties
  - Richness of canonical intrinsic methods
  - Weak modeling constructs


Until recently, the Simple Network Management Protocol (SNMP) has been the protocol of choice, used quite effectively to manage multi-vendor SAN environments. However, SNMP, being primarily a network management protocol, is inadequate when it comes to providing detailed treatment of the fine-grained elements in a SAN. Some of the limitations of SNMP are shown here. While SNMP still retains a predominant role in SAN management, newer and emerging standards may change this.
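For reference, here is a minimal Python sketch of the kind of poll an SNMP-based tool performs, assuming the third-party pysnmp package (4.x high-level API); the switch address, community string, and interface index are placeholders.

# Hedged sketch: read one switch interface's operational status over SNMPv2c.
# Assumes the third-party pysnmp package (4.x hlapi); the host, community string,
# and interface index are placeholder values.
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

error_indication, error_status, error_index, var_binds = next(
    getCmd(SnmpEngine(),
           CommunityData('public', mpModel=1),                 # SNMPv2c community
           UdpTransportTarget(('switch.example.com', 161)),
           ContextData(),
           ObjectType(ObjectIdentity('IF-MIB', 'ifOperStatus', 1)))
)

if error_indication:
    print("SNMP query failed:", error_indication)
else:
    for var_bind in var_binds:
        print(' = '.join(x.prettyPrint() for x in var_bind))   # e.g. ...ifOperStatus.1 = up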


Storage Management Initiative (SMI)


- Created by the Storage Networking Industry Association (SNIA)
- Integration of diverse multi-vendor storage networks
- Development of more powerful management applications
- Common interface for vendors to develop products that incorporate the SMI-S management interface technology
- Key components
  - Inter-operability testing
  - Education and collaboration
  - Industry and customer promotion
  - Promotions and demonstrations
  - Technology center
  - SMI specification
  - Storage industry architects and developers
[Figure: management applications connect through an integration infrastructure and object model mapping to the SMI-S interface, built on CIM/WBEM technology (platform independent, distributed, automated discovery, security, locking, object oriented); beneath it sit per-device object models (MOFs) for tape libraries, switches, arrays, and many other devices, plus vendor-unique features/functions]


The Storage Networking Industry Association (SNIA) has been engaged in an initiative to
develop a common, open storage and SAN management interface based on the Distributed
Management Task Force’s (DMTF) Common Information Model. This initiative is known as the
Storage Management Initiative (SMI).
One of the core objectives of this initiative is to create a standard that will be adopted by all
Storage and SAN vendors, hardware and software alike, that will bring about true
interoperability and allow administrators to manage multi-vendor and diverse storage networks
using a single console or interface.
The Storage Management Initiative Specification (SMI-S) offers substantial benefits to users
and vendors. With SMI-S, developers have one complete, unified and rigidly specified object
model, and can turn to one document to understand how to manage the breadth of SAN
components. Management application vendors are relieved of the tedious task of integrating
incompatible management interfaces, letting them focus on building management engines that
reduce cost and extend functionality. And device vendors are empowered to build new features
and functions into subsystems.
SMI-S-compliant products will lead to easier, faster deployment and accelerated adoption of
policy-based storage management frameworks. A test suite developed by the SNIA will certify
compliance of hardware components and management applications with the specification.
Certified components also will be subjected to rigorous interoperability testing in an SMI
laboratory.


Storage Management Initiative Specification (SMI-S)
- Based on:
  - Web Based Enterprise Management (WBEM) architecture
  - Common Information Model (CIM)
- Features:
  - A common interoperable and extensible management transport
  - A complete, unified and rigidly specified object model that provides for the control of a SAN
  - An automated discovery system
  - New approaches to the application of the CIM/WBEM technology
[Figure: the Storage Management Interface Specification sits between the management tools used by management users (storage resource management - performance, capacity planning; container management - volume management, media management, removable media; data management - file system, database manager, backup and HSM) and the managed objects, both physical (removable media, tape drives, disk drives, robots, media sets, enclosures, host bus adapters, switches) and logical (volumes, clones, snapshots, zones)]


SMI-S forms a layer that resides between managed objects and management applications. The following features of SMI-S provide the key to simplifying SAN management:
- Common data model: SMI-S is based on Web Based Enterprise Management (WBEM) technology and the Common Information Model (CIM). SMI-S agents interrogate a device, such as a switch, host, or storage array, extract the relevant management data from CIM-enabled devices, and provide it to the requester.
- Interconnect independence: SMI-S eliminates the need to redesign the management transport and lets components be managed using in-band or out-of-band communications, or a mix of the two. SMI-S offers further advantages by specifying the CIM-XML over HTTP protocol stack and utilizing the lower layers of the TCP/IP stack, both of which are ubiquitous in today's networking world.
- Multilayer management: SMI-S has been developed to work with server-based volume managers, RAID systems, and network storage appliances, a combination that most storage environments currently employ.
- Legacy system accommodation: SMI-S has been developed to incorporate the management mechanisms in legacy devices with existing proprietary interfaces through the use of a proxy agent. Other devices and subsystems can also be integrated into an SMI-S network using embedded software or a CIM object manager.
- Automated discovery: SMI-S-compliant products announce their presence and capabilities to other constituents. Combined with the automated discovery systems in WBEM to support object model extension, this will simplify management and give network managers the freedom to add components to their SAN more easily.
- Policy-based management: SMI-S includes object models applicable across entire classes of devices, which lets SAN managers implement policy-based management for entire storage networks.


Common Information Model (CIM)


- Describes the management of data
- Details requirements within a domain
- Information model with required syntax


The Common Information Model (CIM) is the language and methodology for describing
management data.
Information used to perform tasks is organized or structured to allow disparate groups of people
to use it. This can be accomplished by developing a model or representation of the details
required by people working within a particular domain. Such an approach can be referred to as
an information model.
An information model requires a set of legal statement types or syntax to capture the
representation, and a collection of actual expressions necessary to manage common aspects of
the domain.
A CIM schema includes models for systems, applications, Networks (LAN), and devices. The
CIM schema will enable applications from different developers on different platforms to
describe management data in a standard format so that it can be shared among a variety of
management applications.


Web Based Enterprise Management (WBEM)


Web Based Enterprise Management (WBEM) is a set of management and Internet standard architectures developed by the Distributed Management Task Force (DMTF) to unify the management of enterprise computing environments, traditionally administered through management stacks such as SNMP and CMIP.
WBEM provides the ability for the industry to deliver a well-integrated set of standards-based management tools leveraging emerging web technologies.
The DMTF has developed a core set of standards that make up WBEM: a data model (the CIM standard), an encoding specification (the xmlCIM encoding specification), and a transport mechanism (CIM Operations over HTTP).
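To make the CIM-XML-over-HTTP idea concrete, here is a minimal Python sketch that uses the third-party pywbem package to enumerate instances of a standard CIM class from a WBEM-enabled device; the URL, credentials, namespace, and the assumption that the listed properties are populated are all placeholders.

# Hedged sketch: query a CIM-enabled device over WBEM (CIM-XML operations over HTTP).
# Assumes the third-party pywbem package; the URL, credentials, and namespace are
# placeholder values for some CIM-enabled array, switch, or host.
import pywbem

conn = pywbem.WBEMConnection('https://array.example.com:5989',
                             ('monitor_user', 'secret'),
                             default_namespace='root/cimv2')

# Enumerate a standard CIM class; each returned instance carries the device's properties
# (here we assume ElementName and OperationalStatus are populated by the provider).
for instance in conn.EnumerateInstances('CIM_ComputerSystem'):
    print(instance['ElementName'], instance['OperationalStatus'])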


Enterprise Management Platforms (EMPs)


- Graphical applications
- Monitoring of many (if not all) data center components
- Alerting of errors reported by those components
- Management of many (if not all) data center components
- Can often launch proprietary management applications
- May include other functionality
  - Automatic provisioning
  - Scheduling of maintenance activities
- Proprietary architecture


Enterprise Management Platforms (EMPs) are complex applications, or suites of applications, that simplify the tasks of managing and monitoring data center environments.
They monitor data center components such as network switches, SAN switches, and hosts, and alert the user of any problems with those components. At a minimum, the icon associated with the component in the GUI will change color to indicate its condition. Other forms of alerting, such as email or paging, may also be used.
In addition to the monitoring functionality, management functionality is usually included as
well. This may take the form of ‘native’ management by code embedded into the EMP, or may
involve launching the proprietary management utility supplied by the manufacturer of the
component.
Other included functionality often allows easy scheduling of operations that must be performed
on a regular basis, as well as provisioning of resources such as disk capacity.


Module Summary
Key points covered in this module:
- It is important to continuously monitor data center components to support the availability and scalability initiatives of any business
  - Components include the servers, SAN, network, and storage arrays
- The four areas of monitoring:
  - Health
  - Capacity
  - Performance
  - Security
- There are attempts to define a common monitoring and management model


These are the key points covered in the module. Please take a moment to review them.


Apply Your Knowledge


Upon completion of this topic, you will be able to:
- Describe how EMC ControlCenter can be used to monitor the data center



EMC ControlCenter Architecture


User Interface Tier
- Console (many)
- Optional applications
Agent Tier
- Master Agent (1)
- Application Agents (many)
Infrastructure Tier
- Server (one)
- Repository (one)
- Store (many)

EMC ControlCenter is a multi-tiered application, with multiple hosts running processes at each
tier to support monitoring and management functions. The three tiers are:
- User Interface Tier - the ControlCenter Console is an application that runs on a host and provides the main user interface for monitoring and managing the storage environment.
- Infrastructure Tier - the ControlCenter Server, Repository, and one or more Stores provide central data storage and agent coordination at this tier. Infrastructure components can be installed on different hosts to allow a single infrastructure to scale to manage a large data center environment. The ControlCenter Server and Store(s) are processes that run on a (usually) dedicated host, referred to as the Infrastructure host. The Repository is a database on the Infrastructure host.
- Agent Tier - agents are responsible for gathering data about, and managing, the different objects in the storage environment. Objects in a storage environment can be physical (such as hosts, storage arrays, and SAN switches) or logical (such as databases and file systems). Each host in the storage environment that needs to be monitored/managed via ControlCenter must have one Master Agent and one Host Agent specific to that host type. The hosts can also have other agents to monitor/manage physical objects connected to them or logical objects residing on them.
ControlCenter commands are passed from the console to the ControlCenter Server over a
TCP/IP network, for execution. The ControlCenter Server then either retrieves the requested
data from the Repository and returns it to the Console for display, or forwards the command to
an agent. Agents pass the data they collect from the customer environment to the Store, which
writes it to the Repository. Agents can also collect transient data, such as alerts and real-time
performance data; they pass this directly to the ControlCenter Server for display on the console.


EMC ControlCenter Console


- Primary interface through which the storage environment is viewed and managed
- Java-based application supported on Windows and Solaris platforms
- Objects managed by various agents are organized into groups such as Storage, Hosts, and Connectivity
- Information about an object can be retrieved by the Console from the Repository or in real-time directly from the agent
- Any command issued for the object is passed from the Console to the ControlCenter Server and handled appropriately
- There can be several Consoles spread across the network

The ControlCenter console is the main management and monitoring interface for ControlCenter.
It is a Java application supported on a Windows or Solaris host.
The Console retrieves most of its information from data stored in the Repository via the
ControlCenter Server. Property and configuration information about managed objects is
reported by the agents to the Repository on a periodic basis, and a simple database query
retrieves the data for use in Console displays. To provide immediate updates, real-time data
such as object status changes or alert information is passed from the Agent directly to running
Consoles via the ControlCenter Server. In all cases, the ControlCenter Server manages Console
information gathering and presentation.


EMC ControlCenter Server


- The ControlCenter Server is the primary interface between the Console and the ControlCenter infrastructure
- The ControlCenter Server provides a diverse collection of services, including:
  - Web Applications Server - used for installing the Java Console
  - Security and access management, such as licensing, login, authentication, and authorization
  - Communication with the Console
  - Alert and event management
  - Real-time statistics
  - Object management, to maintain a list of managed objects
  - Agent management, to maintain a list of available agents
- The ControlCenter Server retrieves data from the Repository for display by the Java and Web Console
- User-initiated real-time data requests from some agents are also handled by the ControlCenter Server
- Balances Agent-to-Store communication based on workload

The ControlCenter Server handles data transfers between the Console and the Infrastructure.
Much of the information presented to the Console is retrieved from stored records in the
Repository. Real-time status updates, and alerts, are generated by the Agents and transferred
directly to the Console by the ControlCenter Server.


EMC ControlCenter Repository


- Licensed, embedded Oracle 9i database that holds current and historical information about the managed environment
- The ControlCenter Server executes transactions on the Repository to retrieve information requested by the Console
- Store(s) populate the Repository with persistent data from the agents
- The Repository requires minimal user interaction or maintenance. The database has restricted access and can be updated only by ControlCenter applications


The ControlCenter Repository is an Oracle database, used for storing information about the
managed environment and the objects therein. Data is entered and retrieved only through
ControlCenter components: the Server retrieves information for user displays, while the Store
populates the database with information from the agents.
The Repository is a protected database—users can not directly access or change the tables or
records. All access, including administration rights, is reserved by ControlCenter components.
An automated task backs up the database daily.


EMC ControlCenter Store


y Store receives the data sent by the agents, processes the
data and updates the Repository
y There can be multiple Stores in the environment,
providing load balancing, scaling, and failover

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 48

The ControlCenter Store records data delivered by the Agents into the Repository. The
Infrastructure can have multiple Stores. The ControlCenter Server load balances multiple Stores
by choosing the Store with the lowest load for each Agent transaction. This provides failover as
well, since a new Store can be chosen at any time. ControlCenter scalability for large
environments is in part achieved by adding multiple Stores as the number of managed objects
grows.
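The Store-selection behavior described above can be pictured with a short sketch. This is a minimal illustration only, not ControlCenter code; the Store names, load figures, and up/down flags are hypothetical.

# Minimal sketch of load-balanced Store selection with failover, assuming each
# Store reports a current load figure and an up/down status (hypothetical data).
stores = [
    {"name": "StoreA", "load": 12, "up": True},
    {"name": "StoreB", "load": 7,  "up": True},
    {"name": "StoreC", "load": 3,  "up": False},   # a failed Store is simply skipped
]

def choose_store(stores):
    """Return the least-loaded Store that is still running."""
    candidates = [s for s in stores if s["up"]]
    if not candidates:
        raise RuntimeError("No Store available to accept agent data")
    return min(candidates, key=lambda s: s["load"])

print(choose_store(stores)["name"])   # -> StoreB

Because the choice is made per transaction, a Store that fails or becomes overloaded is naturally avoided on the next selection.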


EMC ControlCenter Agents


y Master agent:
– One per host
– Manages other agents on the host – start/stop,
monitor agent status and health

y ControlCenter Agents:
– Run on hosts to collect data and monitor object
health
– Generate alerts
– Multiple agents can exist on a host
– Pass information to the ControlCenter Store and
the ControlCenter Server

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 49

Agents monitor and issue management commands to objects. An agent of some type is needed
for any object-related activity or information.
A Master Agent must be running on any host that has any other ControlCenter Agents. The
Master Agent starts, stops, and monitors the status of the other agents.
The other ControlCenter Agents typically monitor and manage objects. Their primary function
is to scan their managed object(s) at regular intervals set by ControlCenter Data Collection
Policies (DCPs). Data is typically reported to a Store for addition to the Repository. Agents can
also route information through the ControlCenter Server directly to a Console for immediate
updates.
Most Agents are very specific in their focus. A Storage Agent for CLARiiON can only manage
CLARiiON arrays, for instance. Many Agents can monitor multiple objects at the same time.


EMC ControlCenter Support for Storage Arrays


The following Storage Arrays are supported by EMC ControlCenter
y EMC Symmetrix
y EMC CLARiiON
y EMC Centera
y EMC Celerra and Network Appliance NAS servers
y EMC Invista
y Hitachi Data Systems (including the HP and Sun resold versions)
y HP Storageworks
y IBM ESS
y SMI-S (Storage Management Initiative Specification) compliant
arrays
© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 50

A large number of enterprise storage arrays are supported by ControlCenter, including EMC
Symmetrix, CLARiiON, Centera, Invista, and the Celerra NAS (Network Attached Storage)
server. Some level of support is also available for other vendors’ storage, including Network
Appliance NAS servers, Hitachi Data Systems arrays, Hewlett-Packard StorageWorks arrays,
and IBM ESS (Shark) arrays.
Further support is provided for any array that is SMI-S (Storage Management Initiative
Specification) compliant. This new SNIA (Storage Networking Industry Association) initiative
specifies a standard set of storage management commands. Arrays that are SMI-S compliant
can be managed in the same way by one client—in this case, a ControlCenter agent. EMC and
many other vendors are already building SMI-S compliance into their arrays.


EMC ControlCenter support for SAN Devices


The following SAN devices are supported by ControlCenter
y EMC Connectrix
y Brocade
y McData
y Cisco
y Inrange (CNT)
y IBM Blade Server (IBM-branded Brocade models only)
y Dell Blade Server (Dell-branded Brocade models only)

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 51

A variety of SAN devices can be managed with ControlCenter, including any EMC Connectrix
switches. Fibre Channel switches from other vendors such as Brocade, McData, Cisco, and
Inrange are also supported. Re-branded Brocade switches sold by IBM and Dell are also
supported.


EMC ControlCenter Support for Hosts


The following hosts are supported by ControlCenter
y Dedicated Host agents
– Microsoft Windows
– Hewlett-Packard HP-UX
– IBM AIX
– IBM mainframe
– Linux
– Novell Netware
– Sun Solaris
y Proxy management via Common Mapping Agent (CMA)
– Compaq Tru64
– Fujitsu-Siemens BS2000
– Windows, Solaris, AIX, Linux, and HP-UX hosts can also be monitored by
Common Mapping Agent proxy

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 52

Most host management is handled by dedicated ControlCenter agents. Usually, an operating-
system-specific agent must be installed on each host that you want to manage.
Common Mapping Agent allows the management of hosts for which dedicated agents are not
provided. Functionality provided by the Common Mapping Agent is limited in comparison to
that provided by dedicated agents.


EMC ControlCenter Support for Database and Backup


The following databases are supported by ControlCenter
y Dedicated database agent
– Oracle
– DB2 on mainframe
y Proxy management via Common Mapping Agent (CMA)
– SQL Server
– Sybase
– Informix
– DB2
y Dedicated backup agent
– EMC EDM
– IBM Tivoli
– EMC Networker
– Veritas Netbackup

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 53

ControlCenter can monitor several types of database and backup applications. Most are
managed by dedicated agents that can manage only one type of application. However, the
Common Mapping Agent can also manage several types of databases by proxy.


Discovery of Managed Objects by Agents


y Automatic Discovery: Many agents discover data objects
automatically
y Assisted Discovery: These agents must discover their
objects by administrator action
– Common Mapping Agent
– Database Agent for Oracle
– Fibre Channel Connectivity Agent
– Storage Agents for CLARiiON, Centera, Invista, NAS, SMI, HP
StorageWorks, HDS and ESS

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 54

Many ControlCenter agents automatically discover the object they monitor as soon as they are
installed and started (e.g. Symmetrix, SDM, and Host Agents). The agent collects information
about the object and forwards it to the Store. The Store then populates the Repository with this
information.
Some Objects, listed under Assisted Discovery, must be manually discovered. Typically, this
happens when the agent must monitor the object via a network connection. An administrator
can issue a manual discover command through the Console (the appropriate agent must be
installed and running first). The dialogs available under this menu allow the administrator to
choose the discovered object type, enter information (network address, version, etc.), and
monitor the results of the discovery.


Data Collection Policies (DCP)


y Formal set of statements used to manage the data
collected by ControlCenter agents
y Policies specify the data to collect and the frequency of
collection
y ControlCenter agents have predefined collection policy
definitions and templates
– Default definitions can be easily modified, or new definitions can
be created from the templates provided

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 55

Data Collection Policies are a formal set of statements that define how ControlCenter Agents
gather and report configuration information about the objects that they manage. Data Collection
Policies define which objects should be monitored and with what frequency they should be
polled. By default, all agents of the same type discover their information at the same time every
day. The defaults can be changed very easily to suit the environment.
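To make the idea of a Data Collection Policy concrete, here is a hedged sketch of what such a policy captures: the object type, the metrics collected, and the polling schedule. The policy name, fields, and intervals are hypothetical stand-ins and do not reflect actual ControlCenter policy syntax.

from datetime import datetime, timedelta

# Hypothetical representation of a Data Collection Policy (illustrative only).
dcp = {
    "policy_name": "Daily-Array-Config",       # assumed name, not a real default
    "object_type": "storage_array",
    "metrics": ["configuration", "capacity"],
    "interval_hours": 24,                       # poll once a day by default
    "start": datetime(2006, 1, 1, 2, 0),        # staggered start time for this agent type
}

def next_collections(policy, count=3):
    """Return the next few collection times implied by the policy."""
    step = timedelta(hours=policy["interval_hours"])
    return [policy["start"] + i * step for i in range(count)]

for t in next_collections(dcp):
    print(dcp["policy_name"], t)

Changing the interval or start time in a definition like this is the kind of adjustment an administrator would make to spread collection load across the day.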


Console View of the Storage Environment

[Screenshot: Topology View – a server with dual HBAs (WWNs shown) connected through a SAN switch to a storage array’s front-end directors and ports]

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 56

After discovery, the managed objects can be displayed in a number of ways in the ControlCenter
Console. Shown here is the Topology View. This view shows the Server with its two HBAs (and
their WWNs), connected to a Storage Array via a SAN switch.


Alerts - Overview
y Why Alert? - Data availability
– Monitor and report on events that could lead to application
outages
– Every ControlCenter agent can monitor a number of metrics
¾30 agents and 700+ alerts

y Alert categories
– Health
¾Examples - Database instance up/down, Symmetrix service
processor down, Connectivity device port status
– Capacity
¾Examples - File System Space, File/Directory Size Change
– Performance
¾Examples – Symmetrix Total Hit %, Host CPU Usage

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 57

Alerts are categorized by ControlCenter as being related to health, capacity, or performance.


Alerts can be configured and customized by the administrator. Customization includes:
y Setting threshold values to trigger alerts
y Assigning severity (from Information to Fatal) based on threshold values
y Specifying different means of Notification of Alerts
y Including or excluding objects to be monitored for Alert conditions
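A short sketch can show how threshold values and severities from the list above fit together. The metric (file system utilization) and the threshold numbers are hypothetical examples, not ControlCenter defaults; the 1–5 severity scale follows the Alerts View described later in this module.

# Sketch of threshold-based alert severity assignment (illustrative thresholds).
SEVERITY = {1: "Fatal", 2: "Critical", 3: "Warning", 4: "Minor", 5: "Information"}

def file_system_alert(percent_full):
    """Map a file-system utilization reading to an alert severity."""
    if percent_full >= 95:
        return 1               # Fatal: imminent application outage
    if percent_full >= 80:
        return 2               # Critical
    if percent_full >= 66:
        return 3               # Warning
    return 5                   # Information only

for reading in (50, 70, 85, 96):
    sev = file_system_alert(reading)
    print(f"FS {reading}% full -> severity {sev} ({SEVERITY[sev]})")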


Alert Notification
Notification capabilities
y Messages are directed to the ControlCenter console by
default
y Messages can be directed to a Management Framework
via Integration Gateway (SNMP) – governed by
Management Policy associated with the Alert
y E-mail notification as specified in the Management Policy

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 58

Listed are the notification capabilities of ControlCenter. In addition to these, one can specify
custom scripts to be executed upon encountering different alert conditions.
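The routing of a triggered alert to the Console, an SNMP-based management framework, e-mail recipients, or a custom script can be sketched as below. The policy fields, framework name, addresses, and handler logic are hypothetical; this is not the ControlCenter Management Policy format.

# Minimal sketch of alert-notification routing driven by a management policy
# (all names and fields are illustrative assumptions).
def notify(alert, policy):
    actions = [f"console: severity {alert['severity']} - {alert['message']}"]  # default destination
    if policy.get("snmp_trap"):
        actions.append(f"snmp trap -> {policy['framework']}")     # via Integration Gateway
    if policy.get("email"):
        actions.append(f"e-mail -> {', '.join(policy['email'])}")
    if policy.get("script"):
        actions.append(f"run custom script {policy['script']}")
    return actions

alert = {"object": "FS /data on host01", "message": "File system 85% full", "severity": 2}
policy = {"snmp_trap": True, "framework": "enterprise-mgmt-framework",
          "email": ["storage-admins@example.com"]}
for action in notify(alert, policy):
    print(action)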


EMC ControlCenter Console View of Alerts


[Screenshot: Alerts View – columns include Alert state, Severity, Object Name, and Message]

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 59

This is the Alerts View. It is the main functional view for alerts, enabling you to quickly find
the cause of the alert, which object is affected, and what the level of the alert is. The columns in
this view include:
y Alert State: Yellow Bell – New alert (Text will be in Bold font). Gray Bell – Acknowledged
or Assigned Alert (Text is in normal font).
y Severity: Ranges from 1 to 5. 1 = Fatal, 2= Critical, 3 = Warning, 4 = Minor, 5 =
Information
y Object Name: Host, storage array, network component (such as a switch), or other managed
object for which the alert triggered.
y Message: A description of the condition that caused the alert. Look here for information
about the specific resources affected.


Managing in the Data Center


After completing this module, you will be able to:
y Describe individual component tasks that would have to
be performed in order to achieve overall data center
management objectives
y Explain the concept of Information Lifecycle Management

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 60

The objectives for this module are shown here. Please take a moment to review them.


Managing Key Data Center Components


[Diagram: clients connect over an IP network to clustered hosts/servers running applications (with a keep-alive IP link between cluster nodes); the hosts connect through HBAs and a SAN to storage arrays. Management areas shown: Availability, Capacity, Performance, Security, and Reporting.]
© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 61

In the module on Monitoring, we learned about the importance of monitoring the various data
center components for Health, Capacity, Performance, and Security. In this module, we will
focus on the various management tasks that need to be performed in order to ensure that
Capacity, Availability, Performance, and Security requirements are met.
The major components within the data center to be managed are:
y IP Networks
y Servers and all applications and databases running on the servers
y Storage Area Network (SAN)
y Storage Arrays
Data Center Management can be broadly categorized as Capacity Management, Availability
Management, Security Management, Performance Management and Reporting. Specific
management tasks could address one or more of the categories. For example, a LUN Masking task
addresses Capacity (storage capacity is provided to a specific host), Availability (if a device is
masked via more than one path, a single point of failure is eliminated), Security (masking
prevents other hosts from accessing a given device), and Performance (if a device is accessible
via multiple paths, host-based multipathing software can improve performance by load
balancing).


Data Center Management


y Capacity Management
– Allocation of adequate resources

y Availability Management
– Business Continuity
¾ Eliminate single points of failure
¾ Backup & Restore
¾ Local & Remote Replication

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 62

Capacity Management ensures that there is adequate allocation of resources for all applications
at all times. Capacity Management involves tasks that need to be performed on all data center
components in order to achieve this goal. Let us take the example of allocating storage from an
intelligent storage array to a new application that will be deployed on a new server (we will
explore this specific example in much more detail later in this module). To achieve this
objective the following tasks would have to be performed on the storage array, the SAN and on
the server:
y Storage Array: Device configuration, LUN Masking
y SAN: Unused Ports, Zoning
y Server: HBA Configuration, host reconfiguration, file system management,
application/database management
Availability Management ensures business continuity by eliminating single points of failure in
the environment and ensuring data availability through the use of backups, local replication and
remote replication. Backup, local and remote replication have been discussed in Section 4 –
Business Continuity. Availability management applies to all data center components.
In this example, of a new application/server, availability is achieved as follows:
y Server: At least two HBAs, multi-pathing software with path failover capability, Cluster,
Backup.
y SAN: Server is connected to the storage array via two independent SAN Fabrics, SAN
switches themselves have built-in redundancy of various components.
y Storage Array: Devices have some RAID protection, Array devices are made available to the
host via at least two front-end ports (via independent SAN fabrics), Array has built-in
redundancy for various components, local and remote replication, backup.


Data Center Management, continued


y Security Management
– Prevent unauthorized activities or access

y Performance Management
– Configure/Design for optimal operational efficiency
– Performance analysis
¾ Identify bottlenecks
¾ Recommend changes to improve performance

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 63

Security Management prevents unauthorized access to, and configuration tasks on, the data center
components. Unauthorized access to data is prevented as well. In the new application/server deployment
example, security management is addressed as follows:
y Server: Creation of user logins, application/database logins, user privileges.
Volume/Application/Database management can only be performed by authorized users.
y SAN: Zoning (restricts access to front-end ports by specific HBAs). Administrative/Configuration
operations can only be performed by authorized users.
y Storage Array: LUN Masking (restrict access to specific devices by specific HBAs).
Administrative/Configuration operations can only be performed by authorized users. Replication
operations are restricted to authorized users as well.
Performance Management ensures optimal operational efficiency of all data center components.
Performance analysis of metrics collected is an important part of performance management and can be
complicated because data center components are all inter-related. The performance of one component will
have an impact on other components. In the new application/server deployment example performance
management will involve:
y Server: Volume Management, Database/Application layout, writing efficient applications, multiple
HBAs and multi-pathing software with intelligent load balancing.
y SAN: Design sufficient ISLs in a multi-switch fabric. Fabric design – core-edge, full mesh, partial
mesh …
y Storage Array: Choice of RAID type and layout of the devices (LUNs) on the back-end of the array,
choice of front-end ports (are the front-end ports being shared by multiple servers, are the ports
maxed out), LUN Masking of devices on multiple ports for multi-pathing.


Data Center Management, continued


y Reporting
– Encompasses all data center components and is used to provide
information for Capacity, Availability, Security, and Performance
Management
– Examples
¾ Capacity Planning
™ Storage Utilization
™ File System/Database Tablespace Utilization
™ Port usage
¾ Configuration/Asset Management
™ Device Allocation
™ Local/Remote Replica
™ Fabric configuration – Zone and Zonesets
™ Equipment on lease/rotation/refresh
¾ Chargeback
™ Based on Allocation or Utilization
¾ Performance reports
© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 64

Reports can be generated for all data center components. Data center reports can be used for
trend analysis, capacity planning, chargeback, basic configuration information, etc.


Scenario 1 – Storage Allocation to a New Server

[Diagram: storage allocation tasks across Host, SAN, and Array – file/database management, file system management, and volume management on the host; zoning on the SAN; allocating volumes to hosts, assigning volumes to ports, and configuring new volumes on the array. Storage states progress from Unconfigured and Configured on the array, through Mapped, Reserved, and Host Allocated, to Volume Group Allocated, Host Used, and File System/Database Used.]

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 65

Let us explore the various management tasks with the help of an example. Let us assume that a
new server has to be deployed in an existing SAN environment and has to be allocated storage
from a storage array. The allocated storage is to be used by an application which uses a
relational database. The database uses file systems. The picture breaks down the individual
allocation tasks. We will explore the individual tasks in the next few slides.
Storage Array Management
y Configure new volumes on the array for use by the new server
y Assign new volumes to the array front end ports
SAN Management
y Perform SAN Zoning – Zone the new server’s HBAs via redundant fabrics to the front-end
ports of the storage array
y Perform LUN Masking on the storage array – Give the new server access to the new
volumes via the array front end ports
Host Storage Management
y Configure HBAs on new server
y Configure server to see new devices after zoning and LUN Masking is done
y Volume Management (LVM tasks)
y File System Management
y Database/Application Management

Array Management – Allocation Tasks


y Configure new volumes (LUNs)
– Choose RAID type, size and number of volumes
– Physical disks must have the required space available

y Assign volumes to array front end ports


– This is automatic on some arrays while on others this step must be
explicitly performed
[Diagram: an intelligent storage system – host connectivity into the front end, cache, back end, and physical disks; LUN 0 and LUN 1 are presented from RAID 0, RAID 1, and RAID 5 sets]

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 66

As we learned previously, the physical disks at the backend of the storage array are not directly
presented as LUNs to a Host. Typically, a RAID Group or RAID set would be created and then
LUNs could be created within the RAID set. These LUNs are then eventually presented to a
host. These LUNs appear as physical disks from a host point of view. The space on the array
physical disks that has not been configured for use as a host LUN is considered un-configured
space and can be used to create more LUNs.
Based on the storage requirements, configure enough LUNs of the required size and RAID type.
On many arrays, when the LUN is created, it is automatically assigned to the Front End ports of
the array. On some arrays, the LUNs have to be explicitly assigned to array front end ports – this
operation is called Mapping.
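The two array-side steps can be modeled with a short sketch: carve LUNs out of a RAID set’s usable capacity, then map them to front-end ports. This is a conceptual illustration only; the RAID group, LUN sizes, and port names are hypothetical, and real arrays perform these steps through their own management tools.

# Illustrative model of array allocation: create LUNs from unconfigured space,
# then explicitly map them to front-end ports (hypothetical names and sizes).
raid_set = {"name": "RAID5_Group_1", "usable_gb": 500, "configured_gb": 0}
front_end_ports = ["FA-1A:0", "FA-2A:0"]      # assumed port identifiers
luns = []

def create_lun(raid_set, size_gb):
    """Configure a new LUN if unconfigured space remains in the RAID set."""
    free = raid_set["usable_gb"] - raid_set["configured_gb"]
    if size_gb > free:
        raise ValueError(f"Only {free} GB of unconfigured space left")
    lun = {"id": len(luns), "size_gb": size_gb, "ports": []}
    raid_set["configured_gb"] += size_gb
    luns.append(lun)
    return lun

def map_lun(lun, ports):
    """Explicitly assign (map) a LUN to array front-end ports."""
    lun["ports"] = list(ports)

for size in (100, 100):                        # two 100 GB LUNs for the new server
    map_lun(create_lun(raid_set, size), front_end_ports)
print(luns)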


Server Management – HBA Configuration


y Server must have HBA hardware installed and configured
– Install the HBA hardware and the software (device driver) and
configure

[Diagram: the new server with two HBAs, the HBA driver, and multi-path software installed]

y Optionally install multi-pathing software


– Path failover and load balancing

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 67

The installation and configuration of the HBA hardware and software (device driver) has to be
performed before the server can be connected to the SAN. Multi-pathing software can be optionally
installed. Most enterprises would opt to use multi-pathing because of availability requirements.
Multi-pathing software can also perform load balancing, which will help performance.


SAN Management – Allocation Tasks


y Perform Zoning
– Zone the HBAs of the new server to the designated array front end
ports via redundant fabrics
¾ Are there enough free ports on the switch?
¾ Did you check the array port utilization?
[Diagram: the new server’s two HBAs connect through switches SW1 and SW2 to front-end ports on the storage array]

y Perform LUN Masking


– Grant the HBAs on the new server access to the LUNs on the array
© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 68

Zoning and LUN Masking operations have been discussed in detail in the section on FC SAN.
Zoning tasks are performed on the SAN Fabric. LUN Masking operations are typically
performed on the storage array.
The switches should have free ports available for the new server. Check the array port utilization
if the port is shared between many servers.
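The two access-control layers discussed here can be pictured together in a small sketch: a zone is a set of WWNs allowed to communicate on the fabric, and LUN masking grants a specific HBA access to specific LUNs behind a front-end port. All WWNs, zone names, and LUN numbers below are made up for illustration.

# Conceptual sketch of zoning plus LUN masking (hypothetical identifiers).
zone_new_server = {
    "name": "newserver_array_fab_A",
    "members": {"10:00:00:00:c9:aa:bb:01",    # HBA0 of the new server
                "50:06:04:8a:cc:dd:ee:01"},   # array front-end port on fabric A
}

lun_masking = {
    # (HBA WWN, array port) -> LUNs the host is allowed to see
    ("10:00:00:00:c9:aa:bb:01", "50:06:04:8a:cc:dd:ee:01"): {0, 1},
}

def host_can_access(hba_wwn, array_port, lun):
    """The host sees a LUN only if zoning and masking both allow it."""
    zoned = {hba_wwn, array_port} <= zone_new_server["members"]
    masked = lun in lun_masking.get((hba_wwn, array_port), set())
    return zoned and masked

print(host_can_access("10:00:00:00:c9:aa:bb:01", "50:06:04:8a:cc:dd:ee:01", 0))  # True
print(host_can_access("10:00:00:00:c9:aa:bb:99", "50:06:04:8a:cc:dd:ee:01", 0))  # False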


Server Management – Allocation


y Reconfigure Server to see new devices
y Perform Volume Management tasks
y Perform Database/Application tasks

[Diagram: on the server, a volume group (VG) containing logical volumes (LV), file systems (FS), the database (DB), and the application (App), accessed through two HBAs]

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 69

Reconfigure Server to see new devices
y Bus rescan or a reboot
Perform Volume Management tasks
y Create Volume Groups/Logical Volumes/File Systems
− # of Logical Volumes/File Systems depends on how the database/application is to be laid out
Database/Application tasks
y Install the database/application on the Logical Volumes/File Systems that were created
y Start up the database/application
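The host-side sequence can be sketched as below, assuming a Linux server using LVM2; the device names, volume group name, logical volume size, and mount point are hypothetical, and the commands are printed rather than executed so the ordering of the tasks is the point, not a definitive procedure.

# Hedged sketch of the host-side allocation steps on a Linux/LVM2 host
# (hypothetical device names; commands are printed, not run).
new_devices = ["/dev/sdc", "/dev/sdd"]          # LUNs visible after zoning/masking

steps = [
    "echo '- - -' > /sys/class/scsi_host/host0/scan    # rescan the bus (or reboot)",
    *[f"pvcreate {dev}" for dev in new_devices],        # prepare physical volumes
    f"vgcreate appvg {' '.join(new_devices)}",          # create the volume group
    "lvcreate -L 150G -n applv appvg",                  # create a logical volume
    "mkfs -t ext3 /dev/appvg/applv",                    # create the file system
    "mount /dev/appvg/applv /app01",                    # mount it for the application
]

for step in steps:
    print(step)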


Scenario 2 – Running out of File System Space


Solutions
y Offload non-critical data
– Delete non-essential data
– Move older/seldom used data to other media
¾ ILM/HSM strategy
¾ Easy retrieval if needed
y Extend File System
– Operating System and Logical Volume Manager dependent
– Management tasks seen in Scenario 1 will apply here as well
[Diagram: a file system with alert callouts “Warning: FS is 66% Full” and “Critical: FS is 80% Full”]

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 70

In this scenario, we will explore data center management tasks that you would possibly have to
execute to prevent a file system from getting 100% full.
When a file system is running out of space, either:
y Actively perform tasks which offload data from the existing file system (keep the file system
the same size)
− Delete unwanted files
− Offload files that have not been accessed for a long time to tape or to some other media
from which they can be easily retrieved if necessary
y Extend the file system to make it bigger
− Considerations for extending file systems
¾ Dynamic extension of file systems is dependent on the specific operating system or
logical volume manager (LVM) in use
− The possible tasks to extend file systems are discussed in more detail in the next slide
In reality, a good data center administrator should constantly monitor file systems and offload
non-critical data and also be ready to extend the file system, if necessary.


Scenario 2 – Running out of File System Space, continued

[Flow chart: extending a file system]
y Correlate the file system with its Volume Group or Disk Group
y Is there free space available in the VG?
– Yes: Is the file system being replicated? If no, execute the command to extend the file system – done. If yes, also perform tasks to ensure that the larger file system and Volume Group are replicated correctly.
– No: Does the server have additional devices available? If yes, execute the command to extend the VG, then extend the file system as above.
y If not, does the array have configured LUNs that can be allocated? If yes, allocate LUNs to the server.
y If not, does the array have unconfigured capacity? If yes, configure new LUNs.
y If not, identify/procure another array.


© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 71

The steps/considerations prior to the extension of a file system have been illustrated in the flow
chart. The goal is to increase the size of the file system to avoid application outage. Other
considerations revolve around local/remote replication/protection employed for the application.
For instance, if the application is protected via remote/local replication and a new device is
added to the Volume Group, ensure that this new device is replicated as well.
The steps include:
y Correlate the file system to the logical volume and volume group if an LVM is in use
y If there is enough space in the volume group – extend the file system
y If the volume group does not have space – does the server have access to other devices
which can be used to extend the volume group – extend the volume group – extend the file
system
y If the server does not have access to additional devices – allocate additional devices to the
server – many or all of the steps discussed in scenario 1 will have to be used to do this
(configure new LUNs on array, LUN mask, reconfigure server to recognize new devices –
extend volume group – extend file system)
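The same decision flow can be expressed as a short sketch. The function name, capacity figures, and returned action strings are hypothetical; this is simply a compact restatement of the flow chart, not a tool.

# Sketch of the file-system-extension decision flow (illustrative inputs).
def plan_fs_extension(vg_free_gb, needed_gb, spare_host_devices,
                      array_free_luns, array_unconfigured_gb, replicated):
    """Return the ordered actions needed before the file system can be extended."""
    plan = []
    if vg_free_gb < needed_gb:                   # the VG cannot absorb the growth
        if spare_host_devices:
            pass                                 # host already sees usable devices
        elif array_free_luns:
            plan.append("allocate existing LUNs to the server (zoning / LUN masking)")
        elif array_unconfigured_gb >= needed_gb:
            plan.append("configure new LUNs, then allocate them to the server")
        else:
            plan.append("identify/procure another array")
        plan.append("extend the volume group with the new device(s)")
    plan.append("extend the file system")
    if replicated:
        plan.append("ensure the larger file system and VG are replicated correctly")
    return plan

print(plan_fs_extension(vg_free_gb=5, needed_gb=20, spare_host_devices=[],
                        array_free_luns=["LUN 12"], array_unconfigured_gb=400,
                        replicated=True))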


Scenario 3 – Chargeback Report


[Diagram: hosts/servers with applications (VG/LV/FS/DB/App stacks) connect through switches SW1 and SW2 to storage arrays – Production devices (green), Local Replica devices (blue), and a Remote Replica array (red)]

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 72

Scenario 3: In this scenario, we will explore the various data center tasks that will be necessary
to create a specific report.
A number of servers (50 – only 3 shown in the picture), each with 2 HBAs, are connected to a
Storage Array via two switches SW1 and SW2. Each server has independent paths (2 HBAs) to
the storage array via switch SW1 and switch SW2. Applications are running on each of the
servers; array replication technology is used to create local and remote replicas. The Production
devices are represented by the green devices, local replica by the blue devices and the remote
replicas by the red devices.
A report documenting the exact amount of storage used by each application (including that used
for local and remote replication) has to be created. The amount of raw storage used must be
reported as well. The cost of the raw storage consumed by each application must be billed to the
application owners. A sample report is shown in the picture. The report shows the information
for two applications. Application Payroll_1 has been allocated 100 GB of storage. Production
volumes are RAID 1 volumes hence the raw space used by the production volumes is 200 GB.
Local replicas are on unprotected (no fault tolerance) volumes, hence raw space used by local
replicas is 100 GB. The remote replicas are on RAID5 (5 disk group) volumes, hence raw space
used for remote replicas is 125 GB.
What are the various data center management steps to perform in order to create such a report?


Scenario 3 – Chargeback Report – Tasks


– Correlate Application Æ File Systems Æ Logical Volumes Æ Volume
Groups Æ Host Physical Devices Æ Array Devices (Production)
– Determine Array Devices used for Local Replication
– Determine Array Devices used for Remote Replication
– Determine storage allocated to application based on the size of the
array devices

Example – [Diagram: the application’s VG/LV/FS/DB maps to Source Vol 1 and Source Vol 2 in Array 1; Local Replica Vol 1 and Vol 2 are in Array 1, and Remote Replica Vol 1 and Vol 2 are in the Remote Array]

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 73

The first step in determining the chargeback costs associated with an application is to correlate
the application with the array devices that are in use. As indicated in the picture, trace the
application to the file systems, logical volumes, volume groups, and eventually to the array
devices. Since the applications are being replicated, determine the array devices used for local
replication and the array devices used for remote replication. In the example shown, the
application is using “Source Vol 1&2” (in Array 1). The replication devices are “Local Replica
Vol 1&2” (in Array 1) and “Remote Replica Vol 1&2” (in the Remote Array).
Keep in mind that this can change over time. As the application grows, more file systems and
devices may be used. Thus, before a new report is generated, the correlation of application to the
array devices should be done to ensure that the most current information is used.
After the array devices are identified, the amount of storage allocated to the application can be
easily computed. In this case “Source Vol 1&2” are each 10GB in size. Thus the storage
allocated to the application is 20GB (10+10). The allocated storage for replication would be
20GB for local and 20GB for remote. The allocated storage is the actual storage that can be
used, it does not represent the actual raw storage used by the application. To determine the raw
space, determine the RAID protection that is used to the various array devices.

Monitoring and Managing the Data Center - 73


Copyright © 2006 EMC Corporation. Do not Copy - All Rights Reserved.

Scenario 3 – Chargeback Report – Tasks, continued


– Determine RAID type for Production/Local Replica/Remote Replica
devices
– Determine the total raw space allocated to application for
production/local replication/remote replication
– Compute the chargeback amount based of price/raw GB of storage
– Repeat steps for each application and create report
– Repeat the steps each time the report is to be created
(weekly/monthly)

Example:
2 Source Vols = 2*10GB RAID 1 = 2* 20GB raw = 40GB
2 Local Replica Vols = 2*10GB = 2*10GB raw = 20GB
2 Remote Replica Vols = 2*10 GB RAID 5 = 2*12.5 GB raw = 25GB
Total raw storage = 40+20+25 = 85GB
Chargeback cost = 85*0.25/GB = 21.25
© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 74

To determine the raw space, review the steps displayed on the slide using the example listed.
Determine RAID type for Production/Local Replica/Remote Replica devices. In the example
shown, production devices are 10GB RAID 1, Local replica devices are 10GB with no
protection, and remote replica devices are 10GB RAID 5 (5 disk group) devices. Determine the
total raw space allocated to application for production, local replication, and remote replication.
Based on the values from step 1, you can determine that the total raw space used by the
application is 85GB. (Total raw storage = 40+20+25 = 85GB). Compute the chargeback amount
based on price per raw GB of storage. Based on the cost per GB of storage (for the example this
equals .25/GB), the chargeback cost can be computed. (Chargeback cost = 85*0.25/GB = 21.25).
Repeat these steps for each application and create a report. Repeat the steps each time the report
is to be created (weekly/monthly).
The exercise would have to be repeated for every single application in the enterprise in order to
generate the required report. These tasks can be done manually. Manual creation of the report
may be acceptable if only one or two applications exist. The process can become extremely
tedious if many applications exist. The best way to create this report would be to automate these
various tasks.
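One way to automate the arithmetic is sketched below. It simply reproduces the worked example: raw capacity depends on the RAID protection of each device class (RAID 1 mirrors the data; a 5-disk RAID 5 set adds one parity disk of overhead), and the chargeback is raw GB times a price per GB. The overhead factors and the 0.25 price come from the example; the function and dictionary names are assumptions.

# Reproduces the chargeback arithmetic of the worked example above.
RAW_FACTOR = {"RAID 1": 2.0, "unprotected": 1.0, "RAID 5 (5-disk)": 1.25}

def chargeback(devices, price_per_raw_gb):
    """devices: list of (allocated_gb, raid_type) tuples for one application."""
    raw_gb = sum(gb * RAW_FACTOR[raid] for gb, raid in devices)
    return raw_gb, raw_gb * price_per_raw_gb

payroll_1 = [
    (10, "RAID 1"), (10, "RAID 1"),                     # 2 production volumes
    (10, "unprotected"), (10, "unprotected"),           # 2 local replica volumes
    (10, "RAID 5 (5-disk)"), (10, "RAID 5 (5-disk)"),   # 2 remote replica volumes
]

raw, cost = chargeback(payroll_1, price_per_raw_gb=0.25)
print(f"raw = {raw} GB, chargeback = {cost}")   # raw = 85.0 GB, chargeback = 21.25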


Information Lifecycle Management


y Information Management Challenges
y Information Lifecycle
y Information Lifecycle Management
– Definition
– Process
– Benefits
– Implementation

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 75

Information Lifecycle Management (ILM) is a key approach for assuring availability, capacity,
and performance. Let’s look at some of the aspects of ILM.


Key Challenges of Information Management


Information growth is relentless
y CHALLENGE 1 – Scaling infrastructure within budget constraints
y CHALLENGE 2 – Scaling resources to manage complexity
Information is more strategic than ever
y CHALLENGE 3 – Access, availability, and protection of critical information assets at optimal cost
y CHALLENGE 4 – Reducing risk of non-compliance
Information changes in value over time
y CHALLENGE 5 – Ability to prioritize information management based on data value

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 76

Companies face three key challenges related to information management:


Strong growth of information:
y Post-dot com rate of growth is around 50%, driven by digitization, increased use of e-mail,
etc.
y Just planning for growth can take up to 50% of storage resources
y Meeting growth needs has increased the complexity of a customer environment
Information is playing a more important role in determining business success:
y New business applications provide more ways to extract a competitive advantage in the
marketplace, e.g., companies like Dell, WalMart, and Amazon, where, at the heart of their
respective business models, is the strategic use of information.
Finally, information changes in value, and many times not necessarily in a linear fashion.
y For example, customers become inactive, reducing the need for account information;
pending litigation makes certain information more valuable, etc.
y Understanding the value of information should be at the heart of managing information in
general


The Information Lifecycle

[Diagram: Sales Order Application Example – the value of information changes over time. Events along the timeline: new order record, order processing, orders fulfilled, warranty claim, warranty voided. Lifecycle stages: Create, Access, Protect, Migrate, Archive, Dispose.]

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 77

Information that is stored on a computer has a different value to a company, depending on how
long it has been stored. In the above example, the sales order has differing
value to the company from the time that it is created to the time that the warranty is eventually
voided.
In a typical sales example as this one, the value of information is highest when a new order is
created and processed. After order fulfillment, there is potentially less need to have real-time
access to customer/order data, unless a warranty claim or other event triggers that need.
Similarly, after the product has entered EOL, or after the account is closed, there is little value in
the information and it can be disposed of.


Information Lifecycle Management Definition

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 78

Information Lifecycle Management is a strategy, not a product or service in itself; further, this
strategy is proactive and dynamic in helping plan for IT growth as it relates to business needs,
and reflects the value of information in a company.
A successful information lifecycle management strategy must be:
y Business-centric by tying closely with key processes, applications, and initiatives of the
business
y Centrally managed, providing an integrated view into all information assets of the business,
both structured and unstructured
y Policy-based, anchored in enterprise-wide information management policies that span all
processes, applications, and resources
y Heterogeneous, encompassing all types of platforms and operating systems
y Aligned with the value of data, matching storage resources to the value of the data to the
business at any given point in time


Information Lifecycle Management Process


Policy-based Alignment of Storage Infrastructure with Data Value
y Classify data / applications based on business rules
y Implement policies with information management tools
y Manage storage environment
y Tier storage resources to align with data classes
AUTOMATED and FLEXIBLE: a storage infrastructure that is Application and Lifecycle Aware
© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 79

The process of implementing and continually refining an Information Lifecycle
Management strategy consists of four activities:
y Classify data and applications on the basis of business rules and policies to enable
differentiated treatment of information
y Implement policies with information management tools—from creation to disposal of data
y Manage the environment with integrated tools that interface with multi-vendor platforms,
and reduce operational complexity
y Tier storage resources to align with data classes - storing information in the right type of
infrastructure based on its current value.
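A hedged sketch of the first and last activities is shown below: data is classified by its current business value, and the class determines the storage tier it should sit on. The classification rule, tier names, and thresholds are illustrative assumptions, not part of any product.

# Hypothetical classification/tiering rule in the spirit of the activities above.
def classify(record_age_days, open_event):
    """Classify a sales-order-style record by its current business value."""
    if open_event:                 # e.g. order processing or warranty claim is active
        return "high"
    if record_age_days <= 90:
        return "medium"
    return "low"

TIER_FOR_CLASS = {
    "high":   "tier-1 (replicated enterprise array)",
    "medium": "tier-2 (lower-cost array)",
    "low":    "tier-3 (archive / tape)",
}

for age, active in [(5, True), (30, False), (400, False)]:
    cls = classify(age, active)
    print(f"age={age}d active={active} -> class={cls} -> {TIER_FOR_CLASS[cls]}")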


Information Lifecycle Management Benefits


Information growth is relentless
1. Improve utilization of assets through tiered storage platforms
2. Simplify and automate management of information and storage infrastructure
Information is more strategic than ever
3. Provide more cost-effective options for access, business continuity and protection
4. Ensure easy compliance through policy-based management
Information changes in value over time
5. Deliver maximum value at lowest TCO by aligning storage infrastructure and management with information value

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 80

Implementing an ILM strategy delivers key benefits that directly address the challenges of
information management.
y Improved utilization, by the use of tiered platforms, and increased visibility into all
enterprise information
y Simplified management by integration of process steps and interfaces to individual tools in
place today, and by increased automation
y A wider range of options for backup, protection, and recovery to balance the need for continuity
with the cost of losing specific information
y Painless compliance by having better control upfront in knowing what data needs to be
protected and for how long
y Lower TCO while meeting required service levels through aligning the infrastructure and
management costs with information value so resources are not wasted or complexity
introduced by managing low-value data at the cost of high-value data


Path to Enterprise – Wide ILM

[Diagram: applications and their data consolidating across three steps]
Step 1 – Automated Networked Storage (Networked Tiered Storage)
y Enable networked storage
y Automate environment
y Classify applications / data
Step 2 – ILM for Specific Applications (Application-specific ILM)
y Define business policies for various information types
y Deploy ILM components into principal applications
Step 3 – Cross-Application ILM (Enterprise-wide ILM)
y Implement ILM across applications
y Policy-based automation
y Full visibility into all information
Lower cost through increased automation

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 81

Implementing ILM enterprise wide will take time, and no one believes it can be done
instantaneously. A three step roadmap to enterprise-wide ILM is illustrated.
y Steps 1 and 2 are tuned to products and solutions available today, with the goal to be “ILM-
enabled” across a few enterprise-critical applications. In step 1, the goal is to get the
environment to an automated networked storage environment. This is the basis for any
policy-based information management. The value of tiered storage platforms can be
exploited manually. In fact, many enterprises are already in this state.
y Step 2 takes ILM to the next level with detailed application/data classification and linkage to
business policies. While done in a manual way, the resultant policies can be automatically
executed with tools for one or more applications, resulting in better management and
optimal allocation of storage resources.
y Step 3 of the vision is to automate more of the “front-end” or classification and policy
management activities so as to scale to a wider set of enterprise applications. It is consistent
with the need for more automation and greater simplicity of operations.


Module Summary
Key points covered in this module:
y Individual component tasks that would have to be
performed in order to achieve overall data center
management objectives were illustrated
– Allocation of storage to a new application server
– Running out of file system space
– Creating a chargeback report

y Concept of Information Lifecycle Management

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 82

These are the key points covered in this module. Please take a moment to review them.


Apply Your Knowledge


Upon completion of this topic, you will be able to:
y Describe how EMC ControlCenter can be used to
manage the Data Center

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 83

The next set of slides illustrates a small subset out of a vast set of management tasks that can be
performed using EMC ControlCenter. Representative examples of EMC Symmetrix and
CLARiiON array configurations, as well as reporting are presented.


Array Configuration – Symmetrix

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 84

The drop-down menus shown allow users to perform an exhaustive set of array configuration
tasks on the Symmetrix. The two tasks listed in the module – namely Configure new volumes
and Assign volumes to array front end ports can be accomplished via the menu choices Logical
Device Configuration and SDR Device Mapping respectively.


Array Configuration - CLARiiON

Some configuration options are only


available when you select a specific
RAID Group, Storage Group or LUN

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 85

Shown in this slide are the array configuration options for the EMC CLARiiON array. These
include creation of new RAID groups, Binding LUNs within a RAID group, as well as certain
operations on a RAID group.


SAN Management – Zoning Operations

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 86

As seen previously in this module, zoning and LUN masking are the two key SAN management
allocation tasks. Zoning tasks such as creating/modifying/deleting zones and zone sets can be
performed from EMC ControlCenter. Fabric level management tasks such as
activating/deactivating zoning and importing zone sets can also be performed.


Array Device Masking - Symmetrix

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 87

Shown here is an example of LUN masking on the EMC Symmetrix array from EMC
ControlCenter. The menu driven user interface lets the administrator grant access to devices to
different hosts connected to the Symmetrix via the same front-end director and port.


Array Device Masking - CLARiiON

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 88

This slide shows the LUN masking operation for a CLARiiON array. A host connected to a
CLARiiON array is given access to LUNs by placing the host and the desired LUNs in the same
Storage Group.


Allocation Reports
How much storage is allocated, and where is it?

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 89

EMC ControlCenter StorageScope is a part of the EMC ControlCenter software. StorageScope
provides extensive reports on various components of the data center. Shown here are examples
of two such reports – one from the host’s perspective (top) and one from the array’s perspective
(bottom).
Host-based allocation data shows how much storage has been allocated to a host, breaking the
measures down by type:
y Primary or Replica (mirror or copy storage)
y Logical structure allocation like volume groups, logical volumes, filesystems, and databases
(not shown in this example)
Array-based allocation data shows how much storage has been allocated for use. Many
categories of allocated storage are reported:
− Unconfigured (raw) or Configured (configured for use by hosts or the array itself)
− Allocated (presented to hosts or used by the array itself)
− Primary, Local Replica (same-array mirror or copy storage), or Remote Replica (cross-
array mirror)
The array based report shows corresponding information for each of the arrays managed by
EMC ControlCenter.


Hosts Chargeback Report

y Chargeback summarizes host-accessible storage devices, filesystems, and databases,
independently dragged to the group

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 90

As seen previously in this module, Chargeback reporting is a key task in managing the data
center. EMC ControlCenter StorageScope can generate summary reports on the amount of
storage allocated to hosts, groups of hosts, or functional organizations which own many groups
of hosts.


Summary
This topic introduced a very small subset of data center
management tasks that can be performed using EMC
ControlCenter.

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 91


Section Summary
Key points covered in this section:
y Areas of the data center to monitor
y Considerations for monitoring the data center
y Techniques for managing the data center

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 92

This completes Section 5 – Monitoring and Managing the Data Center.


Please take a moment to review the key points covered in this section.


Course Summary
Key points covered in this course:
y Storage concepts and architecture
y Evolution of storage and storage environments
y Logical and physical components of storage systems
y Storage technologies and solutions
y Core data center infrastructure elements and activities for
monitoring and managing the data center
y Options for Business Continuity

© 2006 EMC Corporation. All rights reserved. Monitoring and Managing the Data Center - 93

This completes the Storage Technology Foundations training. Please take a moment to review
the key points covered in this course.
