
Top 14 Considerations for Addressing Data Center Facilities Management Risks


Meeting operational challenges in the data center requires
organization, planning, and focus
By Stephen Burgess

Data center facilities managers face enormous pressure every day. The challenge of operating a complex and
ever-changing facility is a considerable one, especially considering the increasing business demands and budget
pressures prevalent in the industry. Yet, successful data center facilities managers continue to meet the constant
challenge. Uptime Institute has compiled these 14 considerations for reducing risk in the data center that
facilities managers can embrace to identify and minimize problems affecting operations.

1. Overtime
Sustained overtime rates of 10% or more can produce chronically overworked facility personnel, which
correlates very strongly with increased rates of incidents, including outages and even serious injury or loss of life.
Staffing a facility properly and correctly aligning workloads to the real needs of the facility is the best way to
eliminate chronic fatigue-inducing OT, maximize personnel safety, and minimize the potential for outages, given
the vast majority of them are due to operator error.
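
As a rough illustration of the 10% threshold, the sketch below computes a sustained overtime rate from weekly timesheet totals. The definition of the rate (overtime hours divided by regular hours over the whole window), the names used, and the sample figures are assumptions made for illustration; the article does not define how the rate is measured.

```python
from dataclasses import dataclass

OVERTIME_RISK_THRESHOLD = 0.10  # the 10% figure cited in the text


@dataclass
class WeekHours:
    regular: float   # scheduled hours worked that week
    overtime: float  # hours worked beyond the schedule


def sustained_overtime_rate(weeks):
    """Overtime hours as a fraction of regular hours over the whole window."""
    regular = sum(w.regular for w in weeks)
    overtime = sum(w.overtime for w in weeks)
    if regular <= 0:
        raise ValueError("no regular hours recorded")
    return overtime / regular


def is_chronically_overworked(weeks):
    """True when the window's overtime rate meets or exceeds the threshold."""
    return sustained_overtime_rate(weeks) >= OVERTIME_RISK_THRESHOLD


# Hypothetical four-week window for one technician.
window = [WeekHours(40, 6), WeekHours(40, 4), WeekHours(40, 5), WeekHours(40, 3)]
rate = sustained_overtime_rate(window)
print(f"overtime rate: {rate:.3f}, at risk: {is_chronically_overworked(window)}")
```

Measuring over a multi-week window rather than a single week keeps one busy maintenance weekend from being mistaken for a chronic staffing problem.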

Trying to save money by running a facility with a very lean staff is among the riskiest decisions a data center
owner can make, given how small the cost of facility operations is relative to the cost of the facility
and that of the IT assets it supports.

2. Critical Spares
Ensuring the ready availability of spare parts in the event of a loss of infrastructure redundancy or availability is
essential for mission critical facilities such as data centers. These critical spare parts can either be stored on-site
or can be provided by vendors. When there is vendor dependency, due diligence should be applied to ensure the
availability of the required spare parts. This can be stipulated in the service level agreements (SLAs) of the
service and maintenance contracts held by the data center owner.

Developing a comprehensive critical spares inventory starts with a single point of failure (SPOF) analysis of the
data center design. Most high-quality data centers actually do not have high-impact SPOF points, so identifying
which failures would reduce redundancy is what is really important. For example, if an uninterruptible power
supply (UPS) module unexpectedly goes into static bypass, there is typically no loss of critical load; rather, the
impact is a loss of redundancy of fully conditioned, battery-backed power. The availability of a UPS critical
spares kit will dramatically reduce the time spent at the reduced redundancy level.

An effective critical spares kit should include large circuit breakers or automatic/manual/static transfer
switches (ATS/MTS/STS). With breakers in particular, the need for a critical spare often manifests itself during
scheduled maintenance, such as during triennial or 5-year maintenance and inspections that use primary
injection testing. This is particularly important in older facilities, where a large, expensive breaker may be
difficult to procure, if it can be located at all.

Finally, inventory management is critical to maintaining any critical spares kit. The inventory should be a
detailed asset list with robust controls for ensuring parts readiness (such as breaker certifications) and timely
replenishment when any part is taken from the inventory.
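
The inventory controls described above can be sketched as a small data structure with a readiness check. This is a minimal illustration, not a prescribed system: the field names, the minimum stock levels, the certification dates, and the sample parts are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class SparePart:
    name: str
    on_hand: int
    minimum: int                         # assumed minimum stock level
    cert_expires: Optional[date] = None  # e.g. a breaker test certification


def readiness_issues(inventory, today):
    """Return (part name, issue) pairs that need attention."""
    issues = []
    for part in inventory:
        if part.on_hand < part.minimum:
            issues.append((part.name, "replenish"))
        if part.cert_expires is not None and part.cert_expires < today:
            issues.append((part.name, "recertify"))
    return issues


# Hypothetical critical spares kit.
inventory = [
    SparePart("UPS power module", on_hand=1, minimum=1),
    SparePart("4000 A main breaker", on_hand=1, minimum=1,
              cert_expires=date(2023, 1, 31)),
    SparePart("STS control board", on_hand=0, minimum=1),
]
issues = readiness_issues(inventory, today=date(2024, 6, 1))
for name, issue in issues:
    print(f"{name}: {issue}")
```

Running the check whenever a part is drawn from the kit, and again on a fixed schedule, is one way to catch both missed replenishment and lapsed certifications.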

3. Diesel Fuel
Ensuring a reliable supply of usable diesel fuel can be a source of numerous concerns. These include:

Suppliers. For most data centers, formal contracts should be in place with at least three local suppliers.
These contracts must have well-defined service level agreements (SLA) to guarantee fuel delivery quantity
and time minimums.

Certificate of Conformance. Every fuel supplier should be required to maintain and comply with a Certificate
of Conformance to ensure any fuel delivered conforms to the ASTM D975 standard. Additional language
should be included that forbids any contamination with biofuels.

Fuel quality and polishing. Given that most fuel must be stored on site for a very long time (even decades),
fuel quality must be maintained. Data center managers should polish the fuel (multi-stage filtering
and circulation) regularly, test it regularly, remove water, and manage additives. An independent
laboratory that specializes in diesel fuel should be retained to address phenomena such as stratification and
self-polymerization, unless this service is included in generator service and maintenance contracts. Fuel
quality can be maintained by:

Using it

Polishing it with permanently installed systems

Having a vendor polish it with an on-site visit no less frequently than annually

Having a vendor remove the fuel and replace it with fresh fuel conforming to ASTM D975.

Acceptance testing. Any fuel received on site should be sampled with tools such as a bacon bomb, taking
samples at several depths once the fuel truck has been parked for at least 15 minutes to let the fuel adequately
settle. Classic tests such as the visual beaker test (bright and clear) should be performed for any fuel
delivery. This scrutiny, along with regular lab samples sent to an independent laboratory and the fuel
vendors' compliance with their Certificates of Conformance, will help ensure that contaminant-free and
chemically correct fuel is always delivered by the fuel vendors.

Correct fuel filter size. The simplest of errors, an incorrect filtration specification (micron size) for fuel filters,
has caused some of the most dramatic data center failures. After changing fuel filters, data center managers
should load bank the engine generators at 100% load to guard against fuel starvation. This also
validates the quality of the fuel.
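
The acceptance testing steps described in this section can be expressed as a simple delivery checklist. The sketch below is illustrative only: the class and field names are hypothetical, and reading "several depths" as at least two samples is an assumption, not a figure from the article.

```python
from dataclasses import dataclass

MIN_SETTLE_MINUTES = 15  # settling time cited in the text
MIN_SAMPLE_DEPTHS = 2    # assumption: "several depths" read as at least two


@dataclass
class DeliveryCheck:
    settle_minutes: float           # time the truck was parked before sampling
    sample_depths: int              # number of distinct depths sampled
    bright_and_clear: bool          # result of the visual beaker test
    conformance_cert_on_file: bool  # supplier's Certificate of Conformance


def acceptance_failures(check):
    """Return the reasons a delivery should be held (empty list if none)."""
    failures = []
    if check.settle_minutes < MIN_SETTLE_MINUTES:
        failures.append("fuel not settled for 15 minutes before sampling")
    if check.sample_depths < MIN_SAMPLE_DEPTHS:
        failures.append("samples not taken at several depths")
    if not check.bright_and_clear:
        failures.append("failed visual bright-and-clear test")
    if not check.conformance_cert_on_file:
        failures.append("no Certificate of Conformance on file")
    return failures


# Hypothetical delivery sampled too soon after the truck parked.
delivery = DeliveryCheck(settle_minutes=10, sample_depths=3,
                         bright_and_clear=True, conformance_cert_on_file=True)
failures = acceptance_failures(delivery)
print("accept" if not failures else f"hold delivery: {failures}")
```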

4. Emergency Operating Procedures (EOPs)
EOPs should be developed for the ten most likely and high-impact abnormal conditions. These are pre-approved,
fully scripted responses for abnormal high-impact conditions that could reasonably occur.

Most modern data centers do not actually require a physical response to unexpected abnormal conditions. The real
purpose of the EOP is to verify the condition of the facility and correctly escalate and report it. The other essential
purpose of a well-developed EOP library is to ensure that facility operators do not try to be heroes, which often makes
matters worse and can endanger personnel. Training personnel to follow EOPs helps prevent the hero response.

Eight essential EOPs might include:

Loss of municipal power

Loss of municipal water

Activation of the fire alarm, including sustained level three detection, charged pipes, or a dry agent dump event

Recovery from emergency power off (EPO) activation

Loss of controls/PLC (programmable logic controller) or automation of either mechanical or electrical systems

Loss of chilled water flow

Generator failure to start

UPS in static bypass

5. Drills
Having well-written EOPs means little if facility personnel are not familiar with them. The best way to maintain
a high level of operational readiness is to regularly simulate all the scenarios addressed by the site's EOP library.
These simulations are usually referred to as site drills. The more realistic the site drills, the better. Site drills are
important refresher training that should be conducted in any data center.

In a live data center, there is usually very limited opportunity, if any, to replicate the actual infrastructure
conditions that warrant the use of EOPs. Many data center owners are uncomfortable with the notion of abruptly
disconnecting pumps, chillers, computer room air conditioners, and other equipment to trigger authentic
BMS alarms that personnel must then interpret and answer with the appropriate EOPs. Scheduled
pull-the-plug tests are among the few exceptions.

Given this limitation, effective drilling requires the use of visual aids and props to safely simulate abnormal
conditions or behavior of real infrastructure. For example, a combination of printouts of building management
system (BMS)/emergency power management system (EPMS) graphics, switchgear annunciators, and
human-machine interface (HMI) screens, with various signs and markings that can be taped to computer screens,
panel boards, and equipment can help simulate abnormal conditions that are anticipated by the site's EOP library.

The operations team should drill using the actual procedures in use at the facility. This produces a detailed historical
document that accurately measures the performance of the drill. Any drill conducted should produce one or more
completely filled out EOPs for the scenario. These documents should be filed and retained as formal site training.

Scheduling and performing formal site drills must consider any scheduled maintenance activity, meaning it needs
full visibility to data center operations management and approval by the formal change management process and
policies established to control all activities in the data center facility environment.

6. A Procedure-Based Control Methodology
Any and all interaction with data center facility infrastructure should be done according to pre-approved,
detailed, and fully vetted procedures. These include:

Methods of procedure (MOP). A detailed and scripted activity for formally scheduled and approved
preventive and corrective maintenance activities. MOPs ideally capture all details about the purpose of
the maintenance and everyone involved with it. A good MOP has very detailed steps to complete the activity,
including time stamps, initial blocks, and signature fields.

Standard operating procedures (SOP). Any routine interaction that involves a basic change of state or
configuration of the infrastructure, often to support planned maintenance, should be controlled with a well-
written SOP. SOPs share many features of the MOP, such as time stamp and operator-annotated steps.

Many data centers require procedure libraries that include hundreds of documents. Such a large collection
of documents requires a formal policy that defines how these documents are written, reviewed, and formally
approved for use. These policies should also address revision and formatting processes and controls.

Finally, SOPs and MOPs are meaningless if they are not followed. Procedure deviation is a major cause of
incidents and outages. Experienced facility technicians can become cavalier and complacent, especially with
the repetition of large maintenance evolutions. Therefore, it is crucial that management enforce strict
adherence to the steps in all procedures and provide training to ensure the procedures are understood.

7. Safety Program
Any facility or portfolio must have a safety program that complies with the requirements of the local authority
having jurisdiction (AHJ/LAHJ). Having a current NFPA 70E-compliant program is especially important for data
centers (OSHA defers to NFPA
70E for electrical safety in the workplace). This includes a complete, fully tested personal protective equipment
(PPE) kit and associated lockout-tagout (LOTO) kit for hazardous energy isolation from both electrical and
mechanical sources. Fully formalized and AHJ/LAHJ compliant safety programs entail writing various program
definitions, policies, and procedures that explicitly define how safety is managed and administered for the facility.

8. Short Circuit Coordination Study (SCCS) and Arc Flash Assessment
A facility must have a current SCCS and associated arc-flash hazard assessment with arc-flash stickers correctly
placed in all areas of the environment. All breakers must be verified to have trip unit settings set to those
recommended in the SCCS.

9. Battery Monitoring System
Analogous to fuel for engine generators, having a UPS means nothing if the batteries do not respond when the
UPS input voltage goes away or out of tolerance (loss of city power or severe power quality problems). Using
a battery monitoring system that gives real-time condition and predictive maintenance capabilities with
associated alarming is the best way to achieve full confidence in the UPS batteries. If no battery monitoring
system exists, then quarterly battery inspections should be performed with industry standard tools. This is
especially important for valve regulated lead-acid (VRLA) absorbent glass mat (AGM) batteries because the cells
usually fail open. One open cell in a 40-jar string renders the whole string useless.
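
The reason a single open cell is so damaging follows from the series wiring of a battery string: current must pass through every jar. The sketch below models that reasoning under a deliberately simplified assumption (each jar is either intact or failed open); the function names and the three-string plant are hypothetical.

```python
def string_available(jar_is_open):
    """A series string can deliver current only if no jar has failed open."""
    return not any(jar_is_open)


def usable_strings(strings):
    """Count the strings in a battery plant that can still carry load."""
    return sum(1 for jars in strings if string_available(jars))


# Hypothetical plant with three 40-jar strings, one jar failed open.
healthy = [False] * 40
one_open = [False] * 39 + [True]
plant = [healthy, one_open, healthy]
print(f"usable strings: {usable_strings(plant)} of {len(plant)}")
```

In this model, 39 healthy jars contribute nothing once the fortieth opens, which is why per-jar monitoring or inspection matters more than a string-level voltage reading.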

Real-time data provided by contemporary battery monitoring systems not only validates the availability of the
battery plant but also allows very accurate measurement of its capacity and its expected end-of-life replacement
date, typically extending VRLA-type battery retention by 25% or more. Such an extension of battery utilization
amounts to a very significant operational cost deferral given the multimillion-dollar value of many data center
battery installations.

Battery spares should be purchased from the same batch as the battery installation and should be kept in
the same environmental and charging conditions as those connected to the UPS itself, so that the spares age
and degrade at the same rate as the batteries in use. In this way, when a battery develops unacceptably high
internal resistance and must be changed, the replacement battery has functional characteristics very similar or
nearly identical to those of the other batteries in the string. This ensures that the replacement does not upset or
unbalance the charging voltage applied to the other batteries in the string.

The battery monitoring system should ideally be extended to the UPS battery spares and the batteries used for starting
engine generators. Using real-time, condition-based maintenance rather than the time-interval replacement that is
common produces confidence in these batteries. Using a battery monitoring system for such components generates
reliable expectations from them, results in their maximum utilization, and reduces maintenance.

Deploying a high-quality battery monitoring system does not preclude physical inspections of the battery
plant, which should include visual checks on all battery connections and connector fastener torque checking. A
combination of a battery monitoring system and periodic physical inspection will ensure the maximum reliable
utilization of a data center's battery plant.

10. Training
Training is a complicated topic that can cover many components and activities. The only formal training
curriculum in many data centers relates to corporate compliance (how to be a company employee), not actual
facilities activities or knowledge. This is because many facilities rely on informal on-the-job training (OJT).
While this approach can be effective, it means that achieving fully qualified staff depends on a large number
of undocumented quality variables, with the quality "fully qualified" being a largely subjective determination.
Informal OJT may also be deficient in key areas because it is a largely reactive approach.

At the minimum, a formalized training program and curriculum can be divided into two main categories: operational
readiness and planned activities. Formal training includes mastering the facility's sequence of operations (SOO) for
electrical and mechanical systems and the integrated system SOO related to how all systems work together in concert.
This training often involves studying the alarms generated by the controls, BMS, and EPMS to respond correctly
to them, often leading to the use of an EOP for critical impact alarms. Studying a facility's SOOs and the alarms the
monitoring systems generate can enable the staff to correctly respond to any abnormal facility condition.

Formal training related to planned activities should focus on things like access control, vendor escort and
supervision, and the use of procedures to conduct what are mostly preventive maintenance activities. Thus,
this training might include policy review, courses, and materials focused on the use of procedures, where the
approved procedures are located, how to write a procedure, the use of the change management system, the use of
the maintenance management system, the basis of the maintenance program, the navigation of the BMS/EPMS,
and other shift presence and site rounds requirements.

11. Maintenance
A high-quality maintenance program keeps equipment in like-new condition and maximizes its reliability,
performance, and lifespan. At a minimum, all major equipment assets should be maintained to original
equipment manufacturer (OEM) recommendations. Expanding maintenance considerations to include
standards and guidance from ASHRAE, International Electrical Testing Association (NETA), National Electrical
Manufacturers Association (NEMA), Institute of Electrical and Electronics Engineers (IEEE), National Fire
Protection Association (NFPA), ASTM International, and American National Standards Institute (ANSI);
design engineer recommendations; and authorized contractor recommendations further enhances the
maintenance standard of the facility. Once the facility team is fully informed, service and maintenance contracts
can be configured beyond the conservative and sometimes excessive recommendations from the OEMs.

Maintenance should be performed at the lowest frequency that still maintains good equipment condition,
minimizes abnormal behavior, and maximizes the efficiency and life of the asset, typically monthly, quarterly,
semi-annually, or annually. Often this frequency can be lower than OEM recommendations, which
can be overly conservative.

Since scheduled maintenance usually involves some direct manipulation of equipment, facilities should be wary
of maintenance-induced failure, a phenomenon associated with unnecessary interactions with equipment that
increases the potential for human error and incidents. The minimum frequency of interaction with equipment
should be the level of interaction that captures its condition and keeps the asset in like-new condition. Any
greater frequency is excessive, offers no real benefit to the equipment, consumes personnel resources, and
increases risk of incidents.

In one case, a data center with 100 large air handling units (AHUs) determined that there was no real benefit
to performing monthly or quarterly preventive maintenance inspections, so those were removed from the
maintenance calendar and replaced by enhanced semi-annual inspections that still kept the equipment in like-
new condition but greatly reduced workload and unnecessary interaction with the equipment, allowing those
resources to be better applied elsewhere in the environment.

The industry currently follows several dominant maintenance methodologies, with most plans combining
traditional condition-based maintenance, run-to-fail, and predictive maintenance. Because of their sheer
size and high levels of redundancy and resiliency, some very large data centers may find it cost effective to let
some asset classes operate until they begin to show degraded performance, at which point maintenance can
be performed to restore the normal operating condition. Such approaches have to be carefully considered in
order to ensure risk is appropriately addressed. Ultimately the goals of any maintenance plan should be the
elimination of incidents due to abnormal equipment behavior or excessive interaction with the equipment using
the most cost-effective approach.

Deferred maintenance, or skipping of maintenance due to scheduling or resource issues, must be aggressively
avoided, especially when the deferral is a consequence of pushback against intrusive or redundancy-
reducing maintenance from the IT organization. Ultimately postponing important maintenance can be
counterproductive. Any deferred maintenance should be recorded, tracked, and communicated to IT asset
stakeholders to ensure it gets appropriate managerial visibility and resolution.

Predictive maintenance programs such as infrared scanning of power distribution systems, vibration analysis
of rotating assemblies, and lubrication oil analysis are powerful ways of getting advanced warning of potential
equipment degradation. Predictive maintenance can capture potential problems early, well before they begin
to impair the performance of critical equipment. The key to predictive maintenance is creating an equipment
baseline and then trending the data being collected in order to detect unusual rates of rise for degraded
condition indicators.
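
The baseline-and-trend idea can be sketched with a least-squares slope: fit the rate of rise over a baseline period, fit it again over recent readings, and flag the asset when the recent rate is well above the baseline. The vibration figures, the factor of 2, and the minimum slope floor below are arbitrary illustrative assumptions, not thresholds from the article or from any standard.

```python
def rate_of_rise(times, values):
    """Least-squares slope of values against times (units per time step)."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two paired readings")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    denom = sum((t - mean_t) ** 2 for t in times)
    if denom == 0:
        raise ValueError("readings must span more than one point in time")
    numer = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    return numer / denom


def is_unusual(baseline_slope, recent_slope, factor=2.0, floor=0.05):
    """Flag a recent slope above both factor x baseline and a small floor.

    Both factor and floor are illustrative assumptions.
    """
    return recent_slope > max(factor * baseline_slope, floor)


# Hypothetical monthly bearing vibration readings (mm/s) for one pump.
baseline = rate_of_rise([0, 1, 2, 3, 4, 5], [2.0, 2.1, 2.2, 2.3, 2.4, 2.5])
recent = rate_of_rise([6, 7, 8, 9], [2.6, 3.0, 3.4, 3.8])
print(f"baseline {baseline:.2f}/month, recent {recent:.2f}/month, "
      f"unusual: {is_unusual(baseline, recent)}")
```

Comparing slopes rather than absolute values is what lets a trend raise a warning while every individual reading is still within its alarm limit.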

A well-formulated maintenance program requires a maintenance management system, or MMS. An effective
MMS contains all the asset information and the scheduling, approval, and tracking information needed to
complete all recurring and corrective maintenance activities. The MMS can be flat-file or computer based, with
the primary benefit of a computer-based MMS being resource tracking and administration (staff hours and work
orders completed on time) coupled with a relational database that can quickly access all aspects of scheduled and
recorded maintenance activities. Whether computerized or not, a key requirement for any MMS is the capture
and accessibility of maintenance history per asset. This facilitates the ability to clearly trend any maintenance per
asset as well as meet SLA compliance requirements and client due diligence information requests.
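
As a minimal sketch of the per-asset history requirement, the example below groups completed work orders by asset and computes an on-time completion rate. It is not a model of any particular MMS product; the record fields, asset identifiers, and dates are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class WorkOrder:
    asset_id: str
    task: str
    due: date
    completed: date


def history_by_asset(orders):
    """Group completed work orders by asset, oldest completion first."""
    grouped = defaultdict(list)
    for order in sorted(orders, key=lambda o: o.completed):
        grouped[order.asset_id].append(order)
    return dict(grouped)


def on_time_rate(orders):
    """Fraction of work orders completed on or before their due date."""
    if not orders:
        raise ValueError("no work orders to evaluate")
    return sum(1 for o in orders if o.completed <= o.due) / len(orders)


# Hypothetical records.
orders = [
    WorkOrder("AHU-01", "semi-annual inspection", date(2024, 7, 15), date(2024, 7, 15)),
    WorkOrder("UPS-A", "battery inspection", date(2024, 2, 1), date(2024, 2, 5)),
    WorkOrder("AHU-01", "semi-annual inspection", date(2024, 1, 15), date(2024, 1, 14)),
]
history = history_by_asset(orders)
print(len(history["AHU-01"]), "records for AHU-01")
print(f"on-time rate: {on_time_rate(orders):.2f}")
```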

12. Access Control and Vendor Supervision
Only authorized personnel should be allowed into critical infrastructure areas; therefore, an access control
policy and some type of physical system must be in place to control traffic into the facility, with measures in
place to keep access lists current and enforced. Vendors must also be screened, qualified, and supervised
based on area and activity in the facility. The standard approach to vendors is complete supervision in addition
to formal compliance with the facility's policy documents, often referred to as critical facility
or data center house rules, which list and define allowed and non-allowed activities and what to do in the case of
abnormal situations or emergencies.

13. SOO, Integrated Systems Testing (IST), and Major Switchgear Validation
Most normal, steady-state automation is continuously verified in any running data center; however, the most
important automation is often merely assumed to work. Specifically, in the event of a loss of municipal power,
many data centers are stressed in a way that hasn't happened since the facility was originally commissioned.
Coupled with lack of preparation due to poor EOPs and failure to drill, loss of utility power can be a
make-or-break moment for a data center.

Maintenance programs often overlook preventive maintenance inspections of the switchgear automation,
which should cover protective relays, power quality meters (PQMs), ATS/MTS/STS programming, firmware
revisions, and the programmable logic controllers (PLCs) used in switchgear and generator paralleling switchgear
lineups. Additionally, operator interaction with the human-machine interface (HMI) and other high-level
normal-mode override functions can change the originally intended configuration of the automation settings over time.

Without a regular (at least annual) pull-the-plug (PTP) test, neither the automation nor the switchgear itself
is validated to perform as expected. Many data centers are averse to the PTP test, with IT departments and
customers pushing back on any such testing with the mistaken idea that such testing is not needed and exposes
them to unneeded risk.

In addition to regularly performing a PTP test, there are many routine checks of the PLC environment that
should be conducted as regularly as any other scheduled maintenance of major infrastructure assets.

14. Change Management
A robust change management system should be put in place for any activity that crosses pre-established level of
risk (LOR) criteria. The change management system should include a formal review process based on a
well-defined LOR matrix that captures and ranks all activities that can occur at the data center. Basically, any activity
with real potential for impact on the data center must be formally scheduled and then approved by accountable
persons in the data center facilities and IT organizations, before any such scheduled activities can occur.
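
The approval gate described above can be sketched as a lookup against an LOR matrix followed by a check that the activity is scheduled and approved by both organizations. The matrix entries, the numeric risk levels, and the approval threshold below are hypothetical placeholders; a real site would define its own.

```python
from dataclasses import dataclass, field

# Hypothetical LOR matrix: activity -> level of risk (1 = lowest).
LOR_MATRIX = {
    "site rounds": 1,
    "filter change on redundant AHU": 2,
    "UPS module maintenance": 3,
    "pull-the-plug test": 4,
}
APPROVAL_THRESHOLD = 2  # assumed: this LOR and above requires formal approval
REQUIRED_APPROVERS = {"facilities", "IT"}


@dataclass
class ChangeRequest:
    activity: str
    scheduled: bool = False
    approvals: set = field(default_factory=set)


def may_proceed(request):
    """Allow work only if it is below the threshold or scheduled and approved."""
    lor = LOR_MATRIX[request.activity]  # unlisted activities raise KeyError
    if lor < APPROVAL_THRESHOLD:
        return True
    return request.scheduled and REQUIRED_APPROVERS <= request.approvals


request = ChangeRequest("UPS module maintenance", scheduled=True,
                        approvals={"facilities"})
print(may_proceed(request))  # False: IT has not yet approved
request.approvals.add("IT")
print(may_proceed(request))  # True: scheduled and approved by both
```

Raising an error for activities missing from the matrix is a deliberate choice in this sketch: work that has never been ranked should not default to the lowest risk level.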

Stephen Burgess is a consultant with Uptime Institute Professional Services. He performs reviews and
assessments for Tier topology design and constructed facility certifications, and assessments for Operational
Sustainability certifications and the M&O Stamp of Approval, and he teaches the Accredited Tier Specialist
(ATS) course.
