
VMware HA / DRS

Solution Scenarios on
Dell PowerEdge Servers

By Scott Hanson
Dell Enterprise Technology Center

Dell | Enterprise Technology Center


dell.com/techcenter

April 2007
Contents
Executive Summary .......................................................................................................3
Introduction ....................................................................................................................4
HA / DRS Concepts ........................................................................................................5
HA Concepts.............................................................................................................6
HA Setup ........................................................................................... 6
HA Isolated Hosts.............................................................................. 6
HA Heartbeat Network ...................................................................... 6
DRS Concepts ..........................................................................................................7
DRS Automation Features ................................................................ 7
DRS Resource Distribution ............................................................... 8
DRS Affinity Rules............................................................................. 9
HA / DRS Scenarios .....................................................................................................10
Lab Setup................................................................................................................10
Planned Maintenance ............................................................................................10
Unplanned Outages ...............................................................................................12
Dynamic VM Movement with Workload Spikes ...................................................13
Proactive Maintenance and Power Management ................................................13
Conclusion....................................................................................................................14

Figures
Figure 1 - VMware Cluster and DRS Automation Levels ................................................................ 7
Figure 2 - DRS Resource Distribution Graph .................................................................................. 8
Figure 3 - Unbalanced Cluster......................................................................................................... 8
Figure 4 - DRS Affinity Rules........................................................................................................... 9
Figure 5 - Dialog Box for Confirmation of Maintenance Mode ...................................................... 11

Talk Back
Tell us how the Dell Enterprise Technology Center can help your organization better simplify,
utilize, and scale enterprise solutions and platforms. Send your feedback and ideas to
setc@dell.com.



Section 1
Executive Summary
Servers are increasing in performance at an exponential rate, and customers are
discovering that they can run many applications on a single server through the
use of virtualization. This consolidation also means that a single physical
server failure can take down multiple virtual servers at once.
To mitigate this risk, it makes sense to use industry standard based Dell™
PowerEdge™ servers running VMware® Infrastructure 3 to create a virtualization
farm or virtualization cluster to host these virtual servers and the applications
running on them. This solution provides server virtualization with load-balancing
and high availability features that make it possible to manage a large number of
virtual systems across a cluster and also provide a mechanism to make the
applications running on these virtual machines highly available in the case of a
physical server failure.
Using several scenarios, this paper will evaluate the setup, use, and performance
of VMware Infrastructure High Availability (HA) and Distributed Resource
Scheduler (DRS) using Dell PowerEdge 2950 and 1955 Servers.
Enterprise class reliability and functionality can be obtained by using VMware’s
HA and DRS Solution on Dell PowerEdge servers. This solution provides cost
effective enterprise class computing with easy to use availability features.



Section 2
Introduction
In the previous paper “Advantages of Dell PowerEdge 2950 Two Socket Servers
over Hewlett-Packard Proliant DL 585 G2 Four Socket Servers for
Virtualization”1, it was shown how Dell two socket servers can be a better
solution for virtualization than HP four socket servers when considering
performance, price/performance, or performance per watt.
The central comparison in this previous paper was three Dell PowerEdge 2950s
against two HP DL 585 G2 servers to examine them from a cluster point of view
because most customers are not running virtualization on a single server. It can
be argued that using more servers in a VMware Infrastructure 3 (VI3) cluster
solution increases overall availability. With smaller more cost effective servers
used as building blocks for the virtualization cluster, it is easier to recover from a
server failure than if those building blocks were larger systems with more VMs.
The two-socket server is an ideal base for building VI3 clusters as there are not
as many “eggs in one basket”. However, it would be best if there were a
mechanism to restart the virtual machines from the failed hosts on remaining
hosts in the VMware farm. Using VMware's HA (High Availability) and DRS
(Distributed Resource Scheduler), it is possible to obtain this functionality.
VMware HA allows virtual machines on failed ESX server hosts to restart on
surviving ESX hosts. The DRS solution uses system algorithms and user
created rules to determine the optimal placement of the virtual machines.
In order to better understand the management and performance of these VI3
clusters, the Dell TechCenter set up and documented the results from the
following scenarios with PowerEdge 2950 and 1955 servers:
• Planned Maintenance
• Unplanned Outages
• Dynamic VM Movement with Workload Spikes
• Proactive Maintenance and Power Management

1. http://www.dell.com/downloads/global/power/dell2socket_vs_hp4socket_vmware.pdf



Section 3
HA / DRS Concepts
A cluster in VMware ESX is simply a collection of hosts with shared resources
and a shared management interface. Any host added to the cluster will have its
resources added to the cluster’s resource pool.
VMware High Availability is a licensable feature that can be added to the cluster.
VMware HA allows virtual machines in the cluster to restart on other hosts in the
cluster in the event of a host failure. This failover occurs automatically within the
cluster.
VMware Distributed Resource Scheduler is also a licensable feature that can be
added to the cluster. VMware DRS collects resource usage information for all
hosts and virtual machines in the cluster and generates recommendations for VM
placement. These recommendations can be automatically applied or applied
through administrator interaction with the VirtualCenter Client console.
VMware HA and DRS are separate features but are typically used together and
complement each other in the cluster.
For detailed information on the topics discussed in this section, please refer to
Sections 4 through 8 of “VMware’s Virtual Infrastructure 3 Resource
Management Guide”2

2. http://www.vmware.com/pdf/vi3_esx_resource_mgmt.pdf



HA Concepts
It is important to understand the difference between VMware HA and traditional HA
failover solutions. Traditional HA solutions such as Microsoft Cluster Server
(MSCS) focus on keeping an application alive during a failure. In order to obtain
that level of availability it is necessary to duplicate much of the hardware. The
standby boxes in the traditional HA solution are typically not running any
workload. When the primary server fails, only then does the standby server pick
up the work of the primary. Users would typically experience no loss of
connectivity, or very minimal downtime.
VMware HA handles a failover situation by restarting VMs on remaining hosts in
the cluster. When a host fails with VMware HA, all VMs on that host are powered
off and then restarted on another host. When this happens, users must wait for
the server to go through the boot process before they can regain connectivity.
Traditional HA and VMware HA can be used together to provide a higher level of
availability. Setup of this configuration is covered in VMware’s documentation,
“Setup for Microsoft Cluster Service”3

HA Setup
VMware HA is set up in the VirtualCenter interface through the click of a
checkbox after setting up a cluster. This action installs an agent on each ESX
Host in the cluster. This agent is responsible for sending heartbeats to all the
other nodes in the cluster. If the agents on the remaining nodes stop receiving
heartbeats from a host, that host is considered failed and the process of restarting
its VMs on other hosts begins. VirtualCenter is only required for the initial setup of HA. VMware HA will
function without VirtualCenter running in the environment.
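For readers who prefer to script this step rather than use the checkbox, the sketch below shows one possible way to toggle the equivalent setting through the VirtualCenter API. It is a minimal sketch assuming pyVmomi, the Python SDK for the VMware vSphere/VI API, which postdates this paper; the server address, credentials, and cluster name are placeholders for your own environment.

```python
# Minimal sketch, assuming pyVmomi; server, credentials, and the cluster
# name "HA-DRS Test" are placeholders. Certificate handling is omitted.
import time
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="virtualcenter.example.com",
                  user="administrator", pwd="password")
content = si.RetrieveContent()

# Locate the cluster by name anywhere below the root folder.
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "HA-DRS Test")
view.DestroyView()

# Enabling dasConfig is the API equivalent of the HA checkbox; it triggers
# installation of the HA agent on every host in the cluster.
spec = vim.cluster.ConfigSpecEx()
spec.dasConfig = vim.cluster.DasConfigInfo(enabled=True)
task = cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)

while task.info.state not in ("success", "error"):
    time.sleep(1)
print("HA enabled" if task.info.state == "success" else task.info.error)

Disconnect(si)
```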

HA Isolated Hosts
If a host in the cluster becomes isolated, the default action for VMware HA is to
power down the VMs on the isolated host. This allows the surviving hosts in the
cluster to restart those VMs. If the VMs were allowed to keep running on the
isolated host, VMFS disk locking would prevent them from being started on other
hosts in the cluster.

HA Heartbeat Network
VMware HA monitors the status of the nodes in the cluster by sending and receiving
network packets called heartbeats between all nodes. If a node does not respond to
heartbeat packets within 15 seconds, it is considered failed. Therefore, it is
important to have robust network connections for the heartbeats. A best practice
is to install redundant physical network adapters in each host of the cluster. You
can then either combine them with NIC teaming or set up two service console
interfaces to enable a redundant heartbeat connection.

3. http://www.vmware.com/pdf/vi3_vm_and_mscs.pdf



DRS Concepts
VirtualCenter, using DRS, manages all the resources of a VMware cluster. The
memory and CPU resources of each individual ESX host become part of a global
resource pool that all VMs in the cluster can use. DRS provides automatic
resource optimization and dynamic movement of VMs through VMware’s
VMotion technology.

DRS Automation Features


With DRS, when you create a new VM in the cluster, you do not specify a
specific host. DRS handles the initial placement of the VM at power-on,
depending on the automation level that has been specified for DRS:
• Manual – DRS will recommend a host when the VM is powered on. DRS will not
move the VM during normal operations; it only makes recommendations in the
VirtualCenter console.
• Partially Automated – DRS will automatically choose the host for the VM when
it is powered on. DRS will not move the VM during normal operations; it only
makes recommendations in the VirtualCenter console.
• Fully Automated – DRS will automatically choose the host for the VM when it is
powered on and will automatically migrate VMs during normal operations to
optimize resource usage. Automatic migration is controlled by a Migration
Threshold slider that the user can set from conservative to aggressive.
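The automation level can also be set programmatically. Below is a minimal sketch assuming pyVmomi, the Python SDK for the vSphere/VI API, which postdates this paper; the connection details and cluster name are placeholders, and the value 3 for the migration threshold is simply the default middle position of the slider.

```python
# Minimal sketch, assuming pyVmomi; connection details and the cluster
# name "HA-DRS Test" are placeholders. Certificate handling is omitted.
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="virtualcenter.example.com",
                  user="administrator", pwd="password")
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "HA-DRS Test")
view.DestroyView()

# defaultVmBehavior accepts "manual", "partiallyAutomated", or
# "fullyAutomated"; vmotionRate (1-5) maps to the Migration Threshold
# slider, with 3 being the default middle setting.
spec = vim.cluster.ConfigSpecEx()
spec.drsConfig = vim.cluster.DrsConfigInfo(
    enabled=True,
    defaultVmBehavior="fullyAutomated",
    vmotionRate=3)
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
Disconnect(si)
```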
Figure 1 shows an example cluster named HA-DRS Test. The HA-DRS Test Settings
box shows the cluster set to an automation level of Partially Automated. The
General box in the Summary tab also shows the total memory and CPU resources
for the cluster; these totals are the sum of the resources of the three hosts
in the cluster.

Figure 1 - VMware Cluster and DRS Automation Levels



When a VM is created within a cluster enabled for DRS, it is not associated with
any particular host until it is powered on. DRS then places the VM on a host in
the cluster according to the automation level and the current resource distribution.

DRS Resource Distribution


DRS keeps track of the CPU and memory utilization percentages of the hosts in
the cluster and the percent of entitled resources delivered. DRS uses these
statistics to determine optimal placement of VMs within the cluster.
The DRS Resource Distribution graph is located on the Summary tab of the
VirtualCenter management console for the cluster. An example is shown below
in Figure 2.
Figure 2 shows an example of a balanced cluster. The "number of hosts" vertical
axis is dynamic and will change as host utilization changes. In this example, all
three hosts are within the 40-50% memory utilization range and the 0-10% CPU
utilization range, so the cluster is balanced and DRS does not need to move VMs.
If utilization changes, the bars will split and show different ranges.

Figure 2 - DRS Resource Distribution Graph

Figure 3 shows a cluster in an unbalanced condition. Notice that the maximum of
the "number of hosts" axis has changed to 2. Two of the hosts are in the 60-70%
memory utilization range and one is in the 20-30% range. In general, DRS will
move VMs between hosts to try to bring the bars closer together, with a single
bar representing a completely balanced cluster. In this example, DRS would move
some VMs from the two hosts in the 60-70% memory utilization range to the host
in the 20-30% range, increasing utilization on that host, decreasing it on the
other two hosts, and bringing the bars closer together.

Figure 3 - Unbalanced Cluster
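The actual DRS placement algorithm is internal to VirtualCenter, but the rebalancing intuition described above can be illustrated with a few lines of Python. This is a simplified sketch only, not the real DRS algorithm (which also weighs entitled versus delivered resources, migration cost, and the migration threshold), and the host loads and VM sizes are made-up figures chosen to resemble Figure 3.

```python
# Illustrative sketch only -- not the real DRS algorithm. Loads are example
# percentages of host memory; moving a VM transfers its share to the target.
hosts = {"blade1": 65, "blade2": 68, "blade3": 25}        # % memory in use
vms_on = {"blade1": {"vm1": 20, "vm2": 20, "vm3": 25},
          "blade2": {"vm4": 30, "vm5": 38},
          "blade3": {"vm6": 25}}

def spread(loads):
    # Gap between the busiest and idlest host; 0 would appear as a single
    # bar in the resource distribution graph (a perfectly balanced cluster).
    return max(loads.values()) - min(loads.values())

while True:
    busiest = max(hosts, key=hosts.get)
    idlest = min(hosts, key=hosts.get)
    vm, load = min(vms_on[busiest].items(), key=lambda kv: kv[1])
    trial = dict(hosts)
    trial[busiest] -= load
    trial[idlest] += load
    if spread(trial) >= spread(hosts):
        break                     # no single move narrows the gap further
    hosts = trial
    vms_on[idlest][vm] = vms_on[busiest].pop(vm)
    print(f"move {vm}: {busiest} -> {idlest}; utilization now {hosts}")
```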



DRS Affinity Rules
DRS allows the creation of rules to determine if VMs should run on the same
host, or be kept separate from each other. Figure 4 below shows example DRS affinity
rules. In this example, wina1, wina3, and wina5 will always run on the same host
in the cluster. Wina1 and wina2 will always be started on separate hosts in the
cluster.

Figure 4 - DRS Affinity Rules
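Rules like these are normally created in the VirtualCenter client, but they can also be defined through the API. The sketch below assumes pyVmomi, the Python SDK for the vSphere/VI API, which postdates this paper; the connection details are placeholders, and the VM and cluster names follow the Figure 4 example.

```python
# Minimal sketch, assuming pyVmomi; connection details are placeholders and
# the VM/cluster names match the example shown in Figure 4.
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="virtualcenter.example.com",
                  user="administrator", pwd="password")
content = si.RetrieveContent()

def find(vimtype, name):
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vimtype], True)
    obj = next(o for o in view.view if o.name == name)
    view.DestroyView()
    return obj

cluster = find(vim.ClusterComputeResource, "HA-DRS Test")
wina1 = find(vim.VirtualMachine, "wina1")
wina2 = find(vim.VirtualMachine, "wina2")
wina3 = find(vim.VirtualMachine, "wina3")
wina5 = find(vim.VirtualMachine, "wina5")

# Keep wina1, wina3, and wina5 on the same host; keep wina1 and wina2 apart.
together = vim.cluster.AffinityRuleSpec(
    name="keep-together", enabled=True, vm=[wina1, wina3, wina5])
apart = vim.cluster.AntiAffinityRuleSpec(
    name="keep-apart", enabled=True, vm=[wina1, wina2])

spec = vim.cluster.ConfigSpecEx()
spec.rulesSpec = [vim.cluster.RuleSpec(operation="add", info=together),
                  vim.cluster.RuleSpec(operation="add", info=apart)]
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
Disconnect(si)
```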



Section 4
HA / DRS Scenarios

Lab Setup
The PowerEdge 1955 is a dual-socket server that supports Intel® Xeon® 5000,
5100, and 5300 series processors. The Dell test team configured the PowerEdge
1955 with two quad-core Intel Xeon X5355 processors at 2.66 GHz. The
PowerEdge 1955 was configured with 8 GB of memory using 2GB DIMMs.
The PowerEdge 1955 was connected to a storage area network (SAN) with dual-
port QLogic 2462 PCI Express host bus adapters (HBA) and utilized storage on a
Dell/EMC CX3-80 array with twenty 146 GB, 15,000 rpm disks.
The scenarios tested below used VMware Infrastructure 3 as the virtualization
platform; this package includes ESX Server 3 and VirtualCenter 2 as well as the
VMware DRS and VMware HA features. ESX Server allows multiple virtual
machines (VMs) to run simultaneously on a single physical server. Each VM runs
its own OS, which in turn has its own set of applications and services. Because
ESX Server isolates each VM from other VMs on the same physical server just
as physical systems are isolated from one another, administrators have flexibility
in using ESX Server to run different types of applications and operating systems
at the same time. VirtualCenter 2 enables administrators to consolidate control
and configuration of ESX Server systems and VMs, which can improve
management efficiency in large environments.

Planned Maintenance
The first logical scenario to use VMware HA/DRS functionality is with planned
maintenance windows. Servers typically require several BIOS or firmware updates
per year, and many I/O adapters in the system typically require updates as well.
These updates usually require a reboot of the physical server.
With VMware HA/DRS enabled, you simply place the host into Maintenance Mode.
This event triggers DRS to VMotion the VMs running on that host to the remaining
hosts in the cluster based on the DRS rules. The DRS automation level must be set
to Fully Automated for this to occur. If the automation level is set to Manual or
Partially Automated, the host will not enter Maintenance Mode until the VMs have
been manually migrated. The VirtualCenter console uses a dialog box, shown in
Figure 5 below, to remind the administrator about this requirement.



Figure 5 - Dialog Box for Confirmation of Maintenance Mode
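Maintenance mode is usually entered from the VirtualCenter client as shown above, but the same operation can be scripted. Below is a minimal sketch assuming pyVmomi (the Python SDK for the vSphere/VI API, which postdates this paper); the host name and connection details are placeholders. With DRS set to Fully Automated, the running VMs are migrated off automatically while the task runs.

```python
# Minimal sketch, assuming pyVmomi; connection details and the ESX host
# name are placeholders for your environment.
import time
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="virtualcenter.example.com",
                  user="administrator", pwd="password")
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True)
esx_host = next(h for h in view.view if h.name == "blade1.example.com")
view.DestroyView()

# With DRS in Fully Automated mode, the VMs are VMotioned off before the
# host finishes entering maintenance mode.
task = esx_host.EnterMaintenanceMode_Task(timeout=0)
start = time.time()
while task.info.state not in ("success", "error"):
    time.sleep(5)
if task.info.state == "success":
    print(f"maintenance mode entered in {time.time() - start:.0f} seconds")
else:
    print(task.info.error)

# After the update and reboot, bring the host back into service:
# esx_host.ExitMaintenanceMode_Task(timeout=0)
Disconnect(si)
```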
One of the scenarios set up in the Dell TechCenter lab was a simulation of
planned maintenance using 3 blades with 2 VMs on each blade for a total of 6
VMs. One of the three blades was placed into maintenance mode, which caused the
two VMs on that blade to migrate via VMotion to the remaining blades in the
cluster. DRS moved each of the VMs to a different server in the cluster to
maintain balance.
DRS took 1 minute and 10 seconds to migrate the two VMs after the host was
placed in maintenance mode. During this time, users
connected to the two VMs experienced no loss of connectivity. The host in
maintenance mode can then be updated and rebooted with no loss of user
connectivity to the VMs because they are now running on the other servers in the
cluster.
When maintenance on the server is finished and maintenance mode is turned off,
DRS will automatically migrate VMs back to that host to balance the workload.
The team then increased the number of VMs to see how much longer this task would
take with a larger configuration. The number of VMs was increased to 16 per host,
for a total of 48 VMs in the cluster. Each host was then put into maintenance
mode separately and timed. The average time from these tests is compared with
the first test in Table 1 below.

                    One Host with 2 VMs    One Host with 16 VMs
Completion Time     1 min 10 secs          8 mins 15 secs

Table 1 - Time to Enter Maintenance Mode

The increased amount of time for more VMs is due to the fact that VMware queues
the VMotions two at a time. Sixteen VMs therefore require eight waves of
migrations instead of one, which is roughly consistent with the times observed
above. The more VMs on a host, the longer it takes that host to enter maintenance
mode; this should be taken into consideration when planning your cluster design.
A video of this planned downtime scenario can be viewed at the Dell Enterprise
Technology Center website. Please visit www.dell.com/techcenter to download
the demonstration.



Unplanned Outages
One of the biggest advantages of using VMware HA is the ability of VMs to
restart in the event of an unplanned outage.
The VMware HA Agent on each host sends a heartbeat to all the hosts in the
cluster. In the event that the heartbeat is lost, the VMs on that host are restarted
on the remaining hosts in the cluster. DRS affinity rules can also be defined if
you require certain VMs to stay together on a single host or to run on separate
hosts.
To test how VMware HA and DRS function during an unplanned outage scenario,
the Dell TechCenter team used three blades installed in a VI3 cluster with 2 VMs
running on each blade. The unplanned outage was then simulated by physically
removing one of the blade servers from the chassis.
It took about 15 seconds for the VirtualCenter console to detect the host loss,
which corresponds to the HA heartbeat timeout. The VMs were
then automatically restarted on remaining hosts in the cluster. This operation
took approximately 2 minutes and 30 seconds.
The total time that the user experiences application outage will vary depending
on the amount of time it takes to restart the VM and associated applications and
services. Traditional HA solutions such as Microsoft Cluster Server experience
no application outages, or outages in the seconds range. It is recommended that
these solutions be combined with VMware HA to provide a more robust solution
for applications that truly require high-availability.
To get an idea of how long a user might experience the outage, a ping test was
performed while simulating the unplanned outage. A continuous ping was
initiated to a VM on the blade under test. After the blade was pulled from the
chassis, the amount of time from when the ping stopped to when it started again
was measured. This test was performed separately three times on the three
blades in the cluster. The time it took for each ping to recover is shown in the
table below.

                      VM on Blade 1     VM on Blade 2     VM on Blade 3
Ping Recovery Time    5 mins 40 secs    7 mins 28 secs    3 mins 10 secs

Table 2 - Ping Recovery Time

The average time for these three tests was 5 mins and 26 secs. If the
applications being hosted by VMware are not of a mission critical nature and
downtime of a few minutes is acceptable, then a VMware HA only solution may
be a good choice.
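The ping measurements above were taken by hand from a command prompt. A similar measurement can be scripted; the sketch below is one possible approach in Python, using a TCP connection attempt instead of ICMP so it can run without raw-socket privileges. The VM address and port are placeholders.

```python
# Simple downtime probe -- the target address and port are placeholders.
import socket
import time

TARGET = ("wina1.example.com", 3389)   # e.g. RDP port on a Windows VM

def reachable(addr, timeout=1.0):
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False

down_since = None
while True:
    if not reachable(TARGET):
        if down_since is None:
            down_since = time.time()
            print("connectivity lost")
    elif down_since is not None:
        print(f"connectivity restored after {time.time() - down_since:.0f} seconds")
        down_since = None
    time.sleep(1)
```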
This unplanned outage scenario is illustrated in a video that can be viewed at the
Dell Enterprise Technology Center website. Please visit
www.dell.com/techcenter to download the demonstration.



Dynamic VM Movement with Workload Spikes
A strength of DRS is the ability to move VMs around the cluster in response to
dynamic changes in workload. An example would be spikes in internet activity to
an online retailer during the holiday season. Another example is end of month
processing or payroll processing at financial firms. In each case, specific servers
are suddenly much more heavily loaded than under normal conditions. When all
of these systems are running as VMs in a VI3 cluster, DRS can use VMotion to
automatically rebalance the cluster so that performance is optimized.
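When DRS runs in Fully Automated mode it issues these migrations itself; the underlying operation is the same live VMotion an administrator can request by hand. The sketch below shows one way such a migration could be triggered through the API, assuming pyVmomi (the Python SDK for the vSphere/VI API, which postdates this paper); the VM name, host name, and connection details are placeholders.

```python
# Minimal sketch, assuming pyVmomi; VM, host, and connection details are
# placeholders. A powered-on VM on shared storage is migrated live (VMotion).
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="virtualcenter.example.com",
                  user="administrator", pwd="password")
content = si.RetrieveContent()

def find(vimtype, name):
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vimtype], True)
    obj = next(o for o in view.view if o.name == name)
    view.DestroyView()
    return obj

vm = find(vim.VirtualMachine, "wina1")
target = find(vim.HostSystem, "blade3.example.com")

# Relocating a running VM to another host with shared storage performs a
# live migration, the same mechanism DRS uses for its own moves.
vm.RelocateVM_Task(spec=vim.vm.RelocateSpec(host=target))
Disconnect(si)
```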
The Dell TechCenter set up a third lab scenario to test the dynamic movement of
VMs in response to changes in workload. The test used the Dell DVD Store
Version 2 4 to simulate online web orders to 3 VMs in the cluster. The VMs are
named wina1, wina2, and wina3. Wina1 and wina2 are hosted by the first blade in
the cluster, wina3 is hosted by the second blade, and the third blade has no
VMs. A workload was started on wina1 and wina2, which drove CPU utilization of
the first blade to almost 100%. A workload was then started on wina3, which
generated about 50% CPU utilization on the second blade. The third blade in the
cluster had no workload and therefore very low CPU utilization.
DRS recognized that the heavy workload on blade 1 should be spread out across
the cluster. Using VMotion technology, DRS migrated the wina1 VM to Blade 3.
After the DRS migration, each host in the cluster had a workload that was
generating approximately 50% CPU utilization.
The demonstration scenario for this section can be viewed at the Dell Enterprise
Technology Center website. Please visit www.dell.com/techcenter to download
the demonstration.

Proactive Maintenance and Power Management


The Dell Virtualization Solutions Engineering team (www.dell.com/vmware) has
produced a paper that describes how administrators can combine Dell’s
OpenManage Systems Management suite with the cluster resources of VMware
Infrastructure 3 to achieve proactive maintenance that enhances service
continuity and adaptive power utilization that further drives down power and
cooling costs.
The scripts and programs provide a framework to integrate systems
management with VMware VirtualCenter by using the VMware SDK. Users can
download and modify the code to fit their environment.
The title of the whitepaper is “Proactive Maintenance and Adaptive Power
Management using Dell OpenManage Systems Management for VMware DRS
Clusters” by Balasubramanian Chandrasekaran and Puneet Dhawan5
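The details of that integration are in the referenced whitepaper. As a much smaller illustration of the kind of building block involved, the sketch below assumes pyVmomi (the Python SDK for the vSphere/VI API, which postdates this paper) and reports per-host CPU and memory utilization for a cluster, the sort of data a power-management script would consult when deciding whether a lightly loaded host could be evacuated and powered down. The names and the 20% threshold are placeholders, and this is not the Dell-published tooling.

```python
# Minimal sketch, assuming pyVmomi; connection details, cluster name, and
# the 20% threshold are placeholders. This only reports candidates -- it
# does not evacuate or power down any host.
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="virtualcenter.example.com",
                  user="administrator", pwd="password")
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "HA-DRS Test")
view.DestroyView()

for host in cluster.host:
    hw, qs = host.summary.hardware, host.summary.quickStats
    cpu_pct = 100.0 * qs.overallCpuUsage / (hw.cpuMhz * hw.numCpuCores)
    mem_pct = 100.0 * qs.overallMemoryUsage / (hw.memorySize / (1024 * 1024))
    note = "  <- candidate for standby" if cpu_pct < 20 and mem_pct < 20 else ""
    print(f"{host.name}: cpu {cpu_pct:.0f}%, mem {mem_pct:.0f}%{note}")
Disconnect(si)
```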

4. http://linux.dell.com/dvdstore/
5. http://www.dell.com/downloads/global/solutions/prctv.pdf



Section 5
Conclusion
Customers looking for enterprise class reliability and functionality can obtain it
by using VMware's HA and DRS solution on Dell PowerEdge servers, and using more
servers as building blocks in a VMware Infrastructure 3 (VI3) cluster increases
overall availability. To obtain the highest levels of availability, it is
recommended to combine traditional high availability solutions, such as Microsoft
Cluster Server, with these VMware Infrastructure 3 features. This solution
provides cost effective enterprise class computing with easy to use availability
features.

THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS
AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR IMPLIED
WARRANTIES OF ANY KIND.
1 “Advantages of Dell PowerEdge 2950 Two Socket Servers Over Hewlett-Packard Proliant DL 585 G2 Four Socket
Servers for Virtualization” by Todd Muirhead, Dave Jaffe, and Terry Schroeder, Dell Enterprise Product Group, December
2006, http://www.dell.com/downloads/global/power/dell2socket_vs_hp4socket_vmware.pdf.
Dell and PowerEdge are trademarks of Dell Inc. EMC is a registered trademark of EMC Corp. Intel and Xeon are
registered trademarks of Intel Corp. Qlogic is a registered trademark of QLogic Corporation. Microsoft, Microsoft SQL
Server, and Microsoft Windows Server are registered trademarks of Microsoft Corporation. VMware, Virtual Center, and
VMware Infrastructure 3 are registered trademarks of VMware Inc. NetBench is a registered trademark of Ziff Davis Media
Inc, or its affiliates in the U.S. and other countries. Other trademarks and trade names may be used in this document to
refer to either the entities claiming the marks and names or their products. Dell disclaims proprietary interest in the marks
and names of others.
©Copyright 2007 Dell Inc. All rights reserved. Reproduction in any manner whatsoever without the express written
permission of Dell Inc. is strictly forbidden. For more information, contact Dell. Information in this document is subject to
change without notice.

