Many customers are turning to VMware to solve the problem of server sprawl by implementing VMware hypervisors in their server environments. When planning for server consolidation through virtualization, many organizations have a very limited understanding of their existing server environment: the different hardware they have, how that hardware is being used, and which servers will make good virtualization candidates.
In either case, the estimates should be conservative in order to allow room for growth, or for occasional spikes in workload on the server.
Organizations that don't know what capacity they have in their physical environment have a difficult time determining how much capacity will be needed for virtualization.
To make good decisions about capacity planning and consolidation, the project team must begin by obtaining a detailed understanding of the capacity that is currently present. The starting point of any inventory is simply to count the existing resources.
There are a number of ways to collect the data, ranging from manual to fully automated.
In manual data collection, the customer and consultant work together to build an understanding of the inventory from the existing methods used to track servers. The drawback of this approach is that retired servers often get repurposed for other projects, so they stay on the network long after they are no longer being tracked. Manual methods also break down when the method used to account for servers is not kept up to date as changes occur.
Partially automated solutions are agent-based and usually provide a wealth of information about the system. In this method of collection, an agent is installed on each system; the agent collects the data and sends it to a centralized management server. The management console can then query a database for information about each system.
The drawback of an agent-based solution is that systems without agents will be missed in the inventory.
Fully automated solutions search the network and discover servers through various network protocols. For example, servers may be located through a WINS database search, a network broadcast, or a TCP/IP ping sweep.
No matter which method is used, it's important to collect data on the four core resources of the server: CPU, memory, disk, and network. Inventory information on these four key resources needs to be collected so that, when performance information is gathered, the performance data can be correlated with the inventory.
Storage plays a pivotal role in many of the features of vSphere and Virtual Infrastructure 3.5. It is therefore important to gain a thorough understanding of the storage environment in the organization. An understanding of the disk size in megabytes is important for capacity planning purposes, but other factors are important as well.
Are the server's operating system data and program files stored separately from the application data? Does the customer use external storage to store data? What type of storage is used: direct-attached, network-attached, iSCSI, or Fibre Channel SAN?
If external storage is used, find out how much storage capacity is available to
store newly created virtual machines. Some of this information may not be
available through the use of an automated data collection tool and may
require manual investigation.
Module 2
Understand the core performance metrics that you need to collect to make
good capacity decisions.
Understand Load Profiling.
In order to accurately model what future workloads will look like on ESX servers, it's important to capture each server's utilization levels. As in the Understanding Your Environment module, we will focus on the four core resources: CPU, memory, disk, and network.
Planning for virtual machines on an ESX server requires that you look at the
four core resources and size them appropriately. In order to size the ESX
server appropriately, it is important to understand the workload that will be
placed on these core resources.
The slide shows the four core resources in an ESX Server. When sizing an ESX server to run your virtual machines, VMware recommends:
For the CPU, sum the needed cycles for all virtual machines.
For memory, size to the desired RAM maximum for all virtual machines.
For disk, sum the desired sizes of all the virtual machines' virtual disks, plus space for other files such as the virtual machine swap file.
For the NIC, sum the needed bandwidth for all virtual machines.
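The four sizing sums above amount to simple arithmetic. The sketch below uses an invented VM list and an assumed per-VM swap allowance purely for illustration:

```python
# Hypothetical sizing sketch: sum each VM's resource needs to size an ESX host.
# The VM figures below are invented for illustration.
vms = [
    # (cpu_mhz, ram_mb, disk_gb, nic_mbps)
    (600, 1024, 40, 20),
    (900, 2048, 80, 50),
    (400, 1024, 30, 10),
]

SWAP_OVERHEAD_GB = 2  # assumed extra space per VM for swap and other files

cpu_mhz  = sum(v[0] for v in vms)                     # sum of needed CPU cycles
ram_mb   = sum(v[1] for v in vms)                     # sum of desired RAM maximums
disk_gb  = sum(v[2] + SWAP_OVERHEAD_GB for v in vms)  # virtual disks plus swap files
nic_mbps = sum(v[3] for v in vms)                     # sum of needed bandwidth

print(cpu_mhz, ram_mb, disk_gb, nic_mbps)  # 1900 4096 156 80
```

In practice each sum would come from the collected inventory and performance data rather than from fixed numbers.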
All performance monitoring tools collect data in their own unique ways. This
lesson will focus on the core metrics that can aid in understanding future
workloads that will run on the ESX servers.
Move to the semi-qualified candidates next. These are systems that are less
than optimal candidates but can still be virtualized with additional
consideration.
Qualified candidates are those systems that are excellent candidates for virtualization: they have relatively low utilization rates, and their individual configurations present no obvious barriers to virtualization. These systems are known as low-hanging fruit because they offer the highest consolidation rates and the most immediate returns.
The thresholds are important because they leave some spare capacity on the ESX host to handle workload spikes that may occur. The example in the slide makes some basic assumptions about the ESX environment, in particular that Transparent Page Sharing is used on the ESX host. This is why the thresholds are higher for memory than for CPU: it assumes that not all of the memory used in the physical server environment will necessarily be used in the virtual environment.
CPU queue length is the number of threads in the processor queue. This counter counts ready threads, not threads that are running. A sustained processor queue of more than two threads generally indicates processor congestion.
Available memory is the actual RAM utilization of the server, or the amount of memory on the server that is free.
Page file usage is the percentage of the page file being used by the system.
Paging, or memory pages per second, is the amount of data being moved to and from the page file because a specific page was not found in a process's working set or elsewhere in memory.
File cache is the amount of memory the operating system has set aside for the file cache.
When monitoring the disk, monitor the disk I/O operations per second (IOPS) to determine the number of read and write operations being sent to the disk subsystem. Many monitoring tools report only the logical I/O being sent from the operating system to the disk subsystem. When calculating physical I/O, the amount of work the disk controller is actually performing, you must factor in the RAID overhead to get an accurate representation of the I/O that the drive subsystem is actually doing.
I/O speed is the rate at which bytes are transferred to and from the disk drive during read or write operations.
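One way to sketch the logical-to-physical conversion is with the common RAID write-penalty rules of thumb (each logical write on RAID 5 costs about four back-end I/Os, on a mirror about two); actual overhead varies by controller and configuration:

```python
# Rough sketch of converting logical I/O (as reported by the OS) into physical
# I/O at the disk subsystem, using common RAID write-penalty rules of thumb.
RAID_WRITE_PENALTY = {"raid0": 1, "raid1": 2, "raid10": 2, "raid5": 4}

def physical_iops(read_iops, write_iops, raid_level):
    """Reads pass through; each logical write costs extra back-end I/Os."""
    return read_iops + write_iops * RAID_WRITE_PENALTY[raid_level]

# A workload of 300 reads/s and 100 writes/s on RAID 5:
print(physical_iops(300, 100, "raid5"))  # 700
```

The same 400 logical IOPS would cost only 500 physical IOPS on RAID 10, which is why the RAID level matters when sizing the storage target.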
For the network, collect information on the bandwidth of the interface. The
bandwidth is the number of bytes the server has sent to and received from
the network.
To turn the performance data into useful information, it must be correlated
with inventory data. Many organizations attempt to use a performance
monitoring tool by itself. The performance logs paint a picture like the one
shown in the table labelled typical.
For server consolidation and capacity planning, the conclusion that the
utilization is 25 percent is not accurate. When inventory and performance
information are combined, as they are in the table labelled Correlated, the
results give a more useful picture.
When you apply the inventory information, you discover that the actual capacity is 5.2 gigahertz and that 830 megahertz of that capacity is actually being used. This equates to 16% utilization of capacity, a significantly lower number than the non-correlated example yields.
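The correlation step can be reproduced with a hypothetical inventory chosen so that total capacity is 5.2 GHz and 830 MHz is in use, as in the example; the individual server figures below are invented:

```python
# Hypothetical inventory (CPU speed in MHz, observed utilization %) chosen so
# total capacity is 5.2 GHz and 830 MHz is in use, as in the example above.
servers = [(500, 70.0), (500, 60.0), (1000, 10.0), (3200, 2.5)]

naive_pct = sum(u for _, u in servers) / len(servers)   # uncorrelated average
capacity  = sum(mhz for mhz, _ in servers)              # 5200 MHz total capacity
used      = sum(mhz * u / 100 for mhz, u in servers)    # 830 MHz actually in use
correlated_pct = used / capacity * 100                  # utilization of capacity

print(round(naive_pct, 1), round(correlated_pct, 1))  # 35.6 16.0
```

The naive average weights every server equally, so the slow, busy machines dominate; weighting by capacity gives the lower, more useful figure.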
Older, slower CPUs with high utilization cause this skewing to occur. VMware has found that 40% of the servers at a typical client site are slower than 500 MHz. CPU utilization is not the only metric that needs to be correlated with inventory; other examples include CPU queue, page file utilization, paging, and available memory.
When determining the utilization of the system the goal is to capture the
Peak Workload. This does not mean the max observed value. If you have
ever watched a performance monitor while you start up a program you have
seen the processor utilization jump to almost 100% during startup. Every
machine will hit 100% utilization or come close to it at some point or another.
The key is to understand sustained loads.
Now say that these servers are Exchange servers: in the morning, they typically run three to four times hotter than average. The same is true at closing time and after lunch. This is represented by the red line on the chart.
When planning for capacity, if we accounted only for the average utilization of these Exchange servers, we would have a lot of very unhappy users in the morning, at lunch, and at closing time.
If peak load is not considered, we might have thought that combining the load of these eight Exchange servers onto one server was reasonable. However, when peak load is considered, we would never attempt that type of consolidation.
It’s important to capture performance metrics for the CPU, Memory, Disk and
NIC.
Module 4
If you are comfortable with the course material and are ready for the
assessment, close this window and you will see instructions for taking the
quiz in the MyLearn learning management system. To demonstrate
proficiency, you must complete the quiz with a score of 80 percent or better.
If you are reviewing the presentation, you have several choices, depending
on your learning style.
You can simply let the presentation run, and it will play an audio track as the
presentation unfolds. During the presentation you can use the buttons here
to pause, go back, or move forward.
Or, you can skip around the module. If the presentation navigation is visible
in either Outline or Thumb view, you can click on any slide in the presentation
to jump to that slide.
Another option is to read the material rather than listen to the audio track.
Click the Notes tab to view the transcript of the slide being viewed.
Finally, you can use the search tab to locate specific information within the
module.
In the example, a target server is selected from a vendor that the customer wishes to use. The server will have four processor cores running at three gigahertz. The memory size will be twenty-four gigabytes, and the server will have four network ports rated at a gigabit each. For storage, the customer has selected a storage system that can perform 2000 I/O operations per second with a transfer rate of 100 megabytes per second. This is only an example configuration; the configuration your customer selects may differ from the one shown here.
The thresholds in the slide are just an example of how you might define
thresholds. This should not be considered best practice, only a starting
point. In this example, the CPU threshold will be set to 50%. This means that
only 50% of the total processing capacity is available to be used by the
consolidated workload. The other 50% is spare capacity that is available for
spikes in demand or for additional vSphere features like HA.
Some of the other restrictions: memory is set to only 80% of the capacity available on the target, NIC traffic is restricted to 100 megabytes per second, disk I/O is restricted to 1000 I/O operations per second, and transfers to fifty megabytes per second.
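The example thresholds can be expressed as a simple scaling of the target server's raw capacity. This mirrors the slide's illustrative numbers and is a starting point, not a best practice:

```python
# Sketch of the example thresholds above: raw target-server capacity is scaled
# down to the share that consolidated workloads may actually consume.
target = {"cpu_mhz": 4 * 3000, "ram_gb": 24, "disk_iops": 2000, "disk_mbps": 100}
thresholds = {"cpu_mhz": 0.50, "ram_gb": 0.80, "disk_iops": 0.50, "disk_mbps": 0.50}

usable = {k: round(target[k] * thresholds[k], 1) for k in target}
nic_cap_mbps = 100  # the NIC limit in the example is an absolute cap, not a percentage

print(usable)  # {'cpu_mhz': 6000.0, 'ram_gb': 19.2, 'disk_iops': 1000.0, 'disk_mbps': 50.0}
```

Everything above the `usable` line is the spare capacity reserved for demand spikes and features such as HA.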
Take the simple example on the slide: the customer has three groups, Marketing, Sales, and HR. The customer has no problem combining servers from Marketing and Sales on the same hosts, but does not want to combine servers from the HR group with any other servers.
Customers may not want virtual machines grouped together on ESX servers for a number of reasons. It could be that the department that owns a server doesn't want virtual machines from other departments grouped together with its own.
The customer may not want test and development environments mixed with
production environments.
The customer may not want certain application servers to mix with other
application servers due to the function they perform.
The customer may have machines in multiple locations that need to remain
separate.
Any number of other grouping situations exist from customer to customer. Learning these grouping rules early in the planning process is important when performing a consolidation estimate.
In most cases determining grouping rules cannot be done using a tool. This
must be done by interviewing the customer and determining what grouping
requirements the customer has.
Grouping always lowers consolidation ratios. When you don't have to consider grouping, you can simply fit workloads onto the server until you reach the defined thresholds. With grouping, you must keep the grouping requirements in mind as well as the thresholds.
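The interaction between thresholds and grouping can be sketched as a first-fit placement that refuses to mix an isolated group (here "HR") with any other. All group names and MHz figures are hypothetical:

```python
# A minimal first-fit placement sketch honoring both a CPU threshold and a
# grouping rule: "HR" workloads may never share a host with any other group.
ISOLATED = {"HR"}

def compatible(host_groups, group):
    """True if `group` may be placed on a host already running `host_groups`."""
    if group in ISOLATED:
        return host_groups <= {group}    # an isolated group mixes only with itself
    return not (host_groups & ISOLATED)  # others never join an isolated host

def place(workloads, capacity_mhz):
    hosts = []  # each host tracks used MHz and the groups present on it
    for group, mhz in workloads:
        for h in hosts:
            if h["used"] + mhz <= capacity_mhz and compatible(h["groups"], group):
                h["used"] += mhz
                h["groups"].add(group)
                break
        else:  # no existing host fits: start a new one
            hosts.append({"used": mhz, "groups": {group}})
    return hosts

demo = [("Marketing", 800), ("Sales", 700), ("HR", 600), ("Sales", 900), ("HR", 500)]
print(len(place(demo, capacity_mhz=2000)))  # 3
```

Dropping the grouping check packs the same workloads onto two hosts instead of three, which is exactly the consolidation-ratio penalty the text describes.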
Expect that roughly 3 to 4 virtual machines per core can run on an ESX/ESXi host, provided the conditions listed on the slide are met. The estimate also assumes that all virtual machines are single-vCPU virtual machines.
The slides that follow offer some high-level architecture considerations when
sizing these core resources.
CPU capacity is one of the core VMware benefits of consolidation. Many CPUs are underutilized, which allows for easy consolidation. However, ESX imposes its own overhead as well. ESX overhead varies based on three factors: application type, load, and operating system.
Applications that load the CPU due to processing intensity cause the least overhead; those that load the CPU due to disk I/O intensity cause more; and those that load the CPU due to network intensity cause the most.
The higher the CPU Utilization, the higher the overhead on the ESX Server.
Are the machines being consolidated running the same operating system or the same applications? When they are, a significant amount of memory is saved through transparent page sharing.
Is the RAM on the machines being consolidated fully utilized? If not, the virtual machine may not need as much memory as it was given in the physical world.
When measuring RAM to be consolidated, focus on machine similarity and
actual current RAM utilization.
The key to storage capacity planning is having enough storage to contain the aggregated machines' data, but one should also consider future needs for increased storage, virtual machine snapshots, and virtual machine swap files.
What is the average, sustained NIC load, as compared to peak load? How
often do peaks occur, and what is their timing relative to other machines?
Understanding the way ESX uses the CPU, memory, disk, and NIC will benefit you in placing workloads on the ESX Server.
Module 5
This training offers you several methods for completing the module and
navigating through it.
TCO is normally used by IT managers. In almost every case, the lower the TCO, the better.
Direct costs include network and SAN costs, such as adding additional storage or Fibre Channel switches.
Indirect costs are hidden charges accounted for at an aggregate data center
level for costs associated with administering and running servers and not
directly billed per server.
Server Administration
Server Provisioning
In order to calculate the savings in total cost of ownership with VMware, you must understand how much the organization would spend without VMware, then calculate the costs with VMware. These costs should be broken down between direct costs and indirect costs.
For example, say an organization plans to replace all of its servers in the next three years because of rising support costs.
If the organization has 100 servers, and each server costs 4,000 dollars per year in a three-year amortized hardware purchase, including annual support and maintenance contract costs, then the total cost of purchasing all those servers is 1.2 million dollars over the next three years.
If each server has an amortized cost of $1,100 per year for storage and networking, then over the next three years $330,000 will be spent on networking and storage costs.
Therefore, roughly 1.5 million dollars will be spent on hardware alone to replace the existing servers.
Next, you need to calculate the indirect costs. This simple example will only look at the costs associated with administration, power, and cooling.
Assume for this example that ongoing administration costs the organization $2,000 per server per year. Over a three-year period, the cost to the organization would be $600,000.
Data center costs such as power and cooling differ from organization to organization; in this example, assume that power and cooling cost $400 per server per year. Over three years, the cost of power and cooling plus administration would be $720,000.
So in this simple example, including direct and indirect costs, the total cost of ownership without VMware, simply replacing the customer's 100 servers with 100 new servers, would be about 2.2 million dollars.
Next, let's offer the customer an alternative: replacing the 100 servers with virtual machines so we don't have to do a one-for-one server replacement.
The per-server cost remains the same over three years, but since there are fewer servers to purchase, the total server cost is now close to $400,000. The cost of VMware software must now be calculated: licenses for 10 ESX servers plus the license for one vCenter would be around $189,000, at a cost of $5,750 per node.
The per-server network and SAN costs stay the same, but again, since there are fewer servers, the cost drops to about $36,000.
The direct costs with VMware in this example are about $622,000.
As far as indirect costs go, with VMware the per-server costs are the same, but again there are fewer servers to deal with, so the total indirect costs with VMware are about $100,000.
Now let's compare the costs of doing business without VMware and with VMware.
Over the next three years, the cost of doing business without VMware is $2.2 million, whereas the cost of doing business with VMware is about $700,000. This equals a savings of about $1.5 million.
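The worked example above reduces to straightforward arithmetic on the no-VMware side; the with-VMware totals are taken as given, since the example states them directly rather than deriving them:

```python
# The TCO example above, restated as arithmetic. All figures come from the
# example itself and are illustrative only.
YEARS, SERVERS = 3, 100

# Without VMware (direct + indirect costs)
hw      = SERVERS * 4000 * YEARS  # 1,200,000: servers, support, maintenance
san_net = SERVERS * 1100 * YEARS  #   330,000: storage and networking
admin   = SERVERS * 2000 * YEARS  #   600,000: administration
power   = SERVERS * 400  * YEARS  #   120,000: power and cooling
tco_without = hw + san_net + admin + power

# With VMware, the example states the totals directly
tco_with = 622_000 + 100_000      # direct + indirect

print(tco_without, tco_without - tco_with)  # 2250000 1528000
```

The raw totals come out to about $2.25 million without VMware and roughly $1.5 million in savings, matching the rounded figures in the narration.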
Keep in mind that there are many other costs to consider when calculating total cost of ownership. This example simply showed the basics, and its cost estimates should not be relied on for engagements.
The payback period is the amount of time required for the benefits to pay back the cost of the project. Recall that earlier we calculated the cost of doing business with VMware to be $700,000 and the savings to be $1.5 million over a three-year period. The yearly savings, then, are about $500,000. This means that if we divide the investment by the yearly savings, the payback period for this example is about 17 months, or roughly a year and a half.
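The payback calculation is a single division on the figures above:

```python
# Payback period: investment divided by yearly savings, expressed in months.
investment = 700_000                 # three-year cost of doing business with VMware
yearly_savings = 1_500_000 / 3       # $1.5M total savings spread over three years
payback_months = investment / yearly_savings * 12

print(round(payback_months, 1))  # 16.8
```

The division yields about 17 months, consistent with the "roughly a year and a half" reading of the example.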
Which would you rather have me return to you in one year: one hundred dollars or one thousand dollars?
However, before I return that money to you in a year, you must invest something today. So now, which would you rather have me return to you in one year: one hundred dollars if you give me fifty dollars today, or one thousand dollars if you give me six hundred dollars today?
Financially speaking, to answer this you have just calculated an ROI on the two competing offers. For the first, the ROI is one hundred percent, while for the second, the ROI is sixty-seven percent. From an ROI perspective, the first one looks better.
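The two offers above are simple return-on-investment calculations, gain minus cost divided by cost:

```python
# ROI for the two competing offers in the example: (returned - cost) / cost.
def roi_pct(cost, returned):
    return (returned - cost) / cost * 100

print(round(roi_pct(50, 100)))    # 100: invest $50, get $100 back
print(round(roi_pct(600, 1000)))  # 67: invest $600, get $1000 back
```

Even though the second offer returns more dollars, the first offer's return per dollar invested is higher, which is the point the narration makes.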
Keep in mind that ROI is only one financial measure companies use to
compare initiatives and determine winners in the financial competition for
budgeting and funding.
During the meeting, you need to ask questions to elicit all the TCO and ROI inputs that you will need to perform the Virtualization Assessment. Be aware that it is entirely possible that the customer will not be able to answer all of the TCO/ROI questions that you ask. Be sure to leave the organization with the list of questions that were unanswered and follow up with them at a later time.
When performing a TCO / ROI analysis, there are many factors that you
should consider. The list above, while extensive, is not necessarily
comprehensive. Nor are all the items in the list above mandatory. If you are
designing your company's own assessment practice, you will decide for
yourself which items are relevant for the needs of your customers and your
company.