
OPS-CIM1900

Moving from Reactive to Predictive Performance and Systems Management

Bernd Harzog, The Virtualization Practice

#vmworldops

Bernd Harzog Virtualization Performance and Capacity Management Analyst


Analyst and Consultant Focused upon:
Infrastructure Performance and Capacity Management of Virtualized Systems
Application Performance Management
Transaction Performance Management
End User Experience Management

Clients include:
Vendors offering solutions
Enterprises seeking virtualization performance management solutions

Key Findings

Virtualization introduces sharing and dynamic behavior
Agile Development produces rapidly changing applications
Both combine to require new tools, organizations and management processes

Key Trends
The demand for business functionality implemented in software is infinite (therefore so is the backlog)
Innovation in software development tools and platforms
Scaled out open source commodity deployment platforms
Distribution of applications across data centers and private/public clouds
Virtualization of business critical and performance critical applications
More than one hypervisor in the enterprise
Rapidly changing applications running on dynamic platforms

The Software Development Backlog is Infinite


Maybe we should measure it in inches instead of man-months!

Business benefits from automation in software are compelling
Business demand for new software and new updates continues to grow
This is driving innovation in processes, tools, and platforms

Tool and Process Innovation


Not just a Java/.NET world any more. PHP, Ruby, Python and NodeJS matter
The OS is not the platform any more. Modern languages come with runtimes
Cloud Platforms are rapidly evolving to support multiple languages and runtimes
Process innovation driven by Agile and Extreme Programming continues

Distribution of applications across data centers and private/public clouds


From This
Simple three tier architecture (Web, Business Logic, Database) replaced by:

To This
Applications broken into modules with their own Agile teams
Modules scaled out on commodity platforms
Modules distributed across private and public clouds
Applications running on opaque infrastructure

Scaled Out and Open Source Platforms


Cheap hardware (commodity Intel compatible servers)
Large number of servers (both physical and virtual)
Cheap software

More than One Hypervisor


VMware remains the clear technical and market leader in the enterprise
Competing platforms are narrowing the functionality gap
Some enterprises are choosing to tier virtualization platforms just as they tier storage

Your New World


Agile Development creates rapidly changing applications
Built in diverse languages and running on diverse language runtimes
Running on next generation deployment platforms
Deployed on multiple virtualization platforms
Running on scaled out commodity hardware
Located in multiple clouds with multiple owners

Your Cloud, Hybrid Cloud, Public Cloud

Virtualization is Progressing

So How Are We Going To Manage This Mess?

Management Principles for the New World


Start Over. Start with a new Reference Architecture - do not assume that any tool you have purchased automatically makes the cut
Insist upon easy to try, easy to buy, easy to manage, and results in production before purchase
Organize for the successful virtualization of business critical applications
Move from managing Images to managing Models and Objects
Define Performance as Latency and Response time, not Resource Utilization
Manage every application for performance, not just the 5% most painful and important ones
Get Real Time, Deterministic and Comprehensive about Data Collection
Design your management architecture for the distributed cloud case even if you are not there yet
Focus upon preventing problems instead of reacting to them
Employ Predictive technologies to help you
Pick the best vendors and products for the job

The Worse than Useless Test


Apply this test to every single monitoring and management solution in your company:
1. Does it operate on a real-time, continuous, and deterministic basis?
2. Does it support workloads distributed across data centers (yours and ones you rent (cloud))?
3. Can it re-configure itself every time you change something in the environment or in the applications?
4. Can you support it and use it without the continuous presence of on-premise consultants from the vendor of the tool?
5. If it is a monitoring tool, does it focus upon response time and latency?
6. Can you try it, for free, in production, before you buy it or more of it?

If the answer is not Yes to all six, junk the tool and start over
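
The six questions can be recorded as a simple checklist per tool. The following is a minimal sketch, not a scoring method from the session: the `ToolAssessment` class, its field names, and the "keep"/"junk" labels are hypothetical names chosen here for illustration. Question 5 is treated as passed for tools that are not monitoring tools, since the slide only asks it of monitoring tools.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToolAssessment:
    """Answers to the six questions for one tool (field names are hypothetical)."""
    name: str
    real_time_continuous_deterministic: bool
    supports_distributed_workloads: bool
    reconfigures_itself_on_change: bool
    usable_without_vendor_consultants: bool
    focuses_on_response_time_and_latency: bool
    free_production_trial: bool
    is_monitoring_tool: bool = True


def failed_questions(tool: ToolAssessment) -> List[str]:
    """Return a short label for every question the tool does not pass."""
    answers = {
        "real-time, continuous, deterministic": tool.real_time_continuous_deterministic,
        "supports workloads across data centers": tool.supports_distributed_workloads,
        "re-configures itself on change": tool.reconfigures_itself_on_change,
        "usable without on-premise vendor consultants": tool.usable_without_vendor_consultants,
        # Question 5 only applies to monitoring tools.
        "focuses on response time and latency": (
            not tool.is_monitoring_tool or tool.focuses_on_response_time_and_latency
        ),
        "free trial in production": tool.free_production_trial,
    }
    return [label for label, passed in answers.items() if not passed]


def verdict(tool: ToolAssessment) -> str:
    """'keep' only if all six answers are Yes, otherwise 'junk'."""
    return "keep" if not failed_questions(tool) else "junk"
```

Listing the failed questions rather than only the verdict makes it easier to explain to a tool owner why a product did not make the cut.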

Starting Over - Get Rid of ITIL and the CMDB

ITIL is designed to get you to document and slow down the rate of change
Don't tell the Change Control Committee about vMotion!
Your CMDB will never be able to keep up with the rate of change given Agile Development and Dynamic Operations
Every configuration change needs to be tracked in real time, and cross-correlated with performance degradations and resource contention
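
Cross-correlating configuration changes with performance degradations can be sketched as a time-window join. This is an illustrative sketch under stated assumptions, not the behavior of any product named in this session: the `Event` type, the `correlate` function, and the default five-minute window are all hypothetical, and real tools would also need topology awareness (a change on a host can affect many VMs).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since some common epoch
    target: str       # e.g. a VM or host name (hypothetical identifiers)
    detail: str       # free-text description of the change or degradation


def correlate(
    changes: List[Event],
    degradations: List[Event],
    window_s: float = 300.0,
) -> List[Tuple[Event, Event]]:
    """Pair each degradation with every change on the same target that
    happened at most window_s seconds before it."""
    pairs = []
    for degradation in degradations:
        for change in changes:
            elapsed = degradation.timestamp - change.timestamp
            if change.target == degradation.target and 0 <= elapsed <= window_s:
                pairs.append((change, degradation))
    return pairs


changes = [Event(100.0, "vm-1", "vMotion"), Event(100.0, "vm-2", "memory resize")]
degradations = [Event(160.0, "vm-1", "latency up"), Event(1000.0, "vm-2", "latency up")]
print(correlate(changes, degradations))  # only the vm-1 pair falls inside the window
```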

Starting Over - The Big Four (IBM, BMC, HP, CA) are worse than useless
(Graphic: Blind Dinosaur)

The new environment produces requirements that legacy solutions cannot meet
Legacy solutions get broken by virtualization and the cloud
Legacy vendors are not going to be able to acquire themselves out of this mess
Put the dino in a cage and do not let him out: build a new management stack for your virtualization/cloud environment, and isolate the dino to your legacy physical environment

Start Over with a Reference Architecture


Cloud Management Reference Architecture
Management functions (vertical boxes): Cloud Management; Res./Perf./Cap./Cnfg. Mgmt; Image Provisioning/Mgmt; Infrastructure Perf. Mgmt; App Performance Mgmt
Across the stack: Self-Learning Analytics
Infrastructure layers (blue boxes): Applications; Virtualization Platform; Servers; LAN, Switch Fabric and Routers; SAN and Storage

Pick a solution for each layer in the infrastructure (blue boxes)
Implement the functions in each vertical box across the infrastructure layers
Tie it all together with self-learning analytics
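
One way to work with the reference architecture is to treat it as a grid of management functions by infrastructure layers and look for empty cells. The sketch below assumes that layout; the `coverage_gaps` function and the shape of the `tools` mapping are hypothetical, and whether every function really needs a solution at every layer is a judgment call that the `functions` and `layers` parameters leave to the reader.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Layer and function names are taken from the reference architecture slide.
LAYERS = [
    "Applications",
    "Virtualization Platform",
    "Servers",
    "LAN, Switch Fabric and Routers",
    "SAN and Storage",
]
FUNCTIONS = [
    "Res./Perf./Cap./Cnfg. Mgmt",
    "Image Provisioning/Mgmt",
    "Infrastructure Perf. Mgmt",
    "App Performance Mgmt",
]


def coverage_gaps(
    tools: Dict[str, Dict[str, Set[str]]],
    functions: Sequence[str] = FUNCTIONS,
    layers: Sequence[str] = LAYERS,
) -> List[Tuple[str, str]]:
    """tools maps a tool name to {function: set of layers it covers}.
    Returns every (function, layer) cell that no tool covers."""
    covered = set()
    for capabilities in tools.values():
        for function, tool_layers in capabilities.items():
            for layer in tool_layers:
                covered.add((function, layer))
    return [(f, l) for f in functions for l in layers if (f, l) not in covered]


# Hypothetical inventory: one tool covering infrastructure performance on two layers.
inventory = {"tool-a": {"Infrastructure Perf. Mgmt": {"Servers", "SAN and Storage"}}}
for function, layer in coverage_gaps(inventory, functions=["Infrastructure Perf. Mgmt"]):
    print(f"No solution for {function} at layer: {layer}")
```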

Embrace the New Way of Trying, Implementing, and Buying Management Software
The Old Way
Rep takes the CIO to play golf
Enterprise software deal gets signed
Some products work, others don't
People go around the ELA to get the tools they need

The New Way
You get to download and use the software in production first
You prove to yourself that it really does work and add value in your environment
Then (and only then) do you buy it

Organize for Virtualization of Critical Applications, Agility, and Success


Virtualization is Just One Team
Virtualization and Application Operations are THE Teams

Data Center Operations
  Virtual Operations: Systems Engineering, WAN Team, LAN Team, Windows Server Team, Linux Server Team, SAN Team, Storage Team
  Application Operations: Programmer/Analyst Team, Java Server Team, Web Server Team, Database Team
  Support: Tier 3 Support, Tier 2 Support, Tier 1 Help Desk

Virtualization pervades IT Operations, and becomes Virtual Operations
Application Operations is responsible for the performance of every application in production (purchased and custom developed)

Move from Managing Images to Managing Models and Objects


Legacy Image Management Process: Populate the Image
Puppet, Chef, vFabric AppDirector: Assemble the Application

Performance ≠ Resource Utilization. Performance = Response Time & Latency


The Root of All Evil
CPU and Memory are horrible indicators of performance
Latency is the appropriate measure of infrastructure performance
Response Time is the appropriate measure of application performance
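
To make the distinction concrete, response time is measured by timing the work itself rather than by reading CPU or memory counters. The following is a minimal sketch; `measure_response_times` and the stand-in workload are hypothetical names used only for illustration, and a real application would time actual requests rather than a local callable.

```python
import time
from typing import Callable, List


def measure_response_times(operation: Callable[[], object], count: int) -> List[float]:
    """Run the operation count times and return each elapsed time in seconds."""
    samples = []
    for _ in range(count):
        start = time.perf_counter()  # monotonic, high-resolution clock
        operation()
        samples.append(time.perf_counter() - start)
    return samples


def stand_in_workload() -> int:
    # Hypothetical placeholder for a real request to an application.
    return sum(range(10_000))


samples = measure_response_times(stand_in_workload, count=20)
print(f"worst response time: {max(samples) * 1000:.3f} ms")
```

Every individual measurement is kept, so slow outliers remain visible instead of being hidden in a utilization average.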

New Environments = New Performance and Capacity Management Challenges


Public Cloud Challenges
  Cloud vendor is not going to share detailed infrastructure latency metrics with you
  Published metrics (CloudWatch) are basically useless
  Places a premium on high fidelity cloud aware Application Performance Management

Private Cloud/IT as a Service (ITaaS) Challenges
  Constantly changing set of applications
  Requirements and changes business driven
  Service Delivery is fully automated
  Constant application discovery, auto-instantiation and configuration of monitoring

Data Center Virtualization Challenges
  IT takes responsibility for more of the stack
  Business demands service level delivery
  Noisy Neighbor problem
  Dynamic Operations
  Performance = Response Time & Latency

Summarizing The Performance Problem


Right now all of the benefits of virtualization accrue to the team managing the virtual infrastructure
  Reduced Server costs
  Storage consolidation
  Simplified management of IT resources
  IT agility

But, to Application Owners (and their constituents), virtualization is all risk and no reward
  Dedicated hardware is a comfort blanket they are unwilling to give up (server huggers)
  Hardware over-provisioning provides a performance safety net
  Physical system service level agreements do not translate into the world of virtualized systems and the cloud

There is a fundamental disconnect between IT and application owners with respect to risk and reward!

The Ugly Reality of Application Performance on Dedicated Physical Hardware


Very few applications are instrumented for response time
Most of the time performance is inferred from resource utilization metrics collected from the OS
The relationship between performance and capacity is not well understood = over-provisioning
The tiers of the application system are siloed by hardware or software technology
The trouble-shooting process is an ungodly mess: blamestorming meetings (mean time to innocence)
The time required to solve a problem (the duration of the outage or the degraded performance) is unacceptably long

This should be an easy baseline to improve upon!

Configuration, Resource, and Capacity Management


Virtualization Resource and Availability Monitoring
Key Features
1. Host and guest resource utilization monitoring
2. Capacity Mgmt & Planning
3. Used by IT Operations

Example Vendors
Quest (vFoglight), SolarWinds (Hyper9), Veeam, VKernel, VMTurbo, VMware vC OPS, Zenoss

Key Criteria for Resource Based Performance and Capacity Monitoring


Out of the box value: if it is not providing value in 10 minutes, junk it and find something else (autodiscovery is key)
Collect data from vCenter AND the other virtualization platforms that you support or plan to support
Look for the integration of performance management, capacity management, and configuration management
Collecting, dashboarding, alerting, and reporting on vCenter data is commodity functionality; look for value in analytics and automation

Infrastructure Performance (Latency) Management


Infrastructure Performance Management
Key Features
1. Understanding of end-to-end infrastructure performance
2. Capacity management and planning
3. Infrastructure response time is the key metric
4. Used by the team supporting the virtual infrastructure

Layers shown: Servers, Network Fabric, SAN Fabric, Storage

Example Vendors
Confio Software, NetApp (BalancePoint), SevOne, Virtual Instruments, Xangati

Key Criteria for Infrastructure Response Time Solutions


Measure IRT (Infrastructure Response Time): Monitor how long it takes the infrastructure to respond to requests for work, not how much resource it takes
Deterministic: Get the real data, not a synthetic transaction, or an average
Real Time: Get the data when it happens, not seconds or minutes later
Comprehensive: Get all of the data, not a periodic sample of the data
Zero-Configuration (Discovery): Discover the environment and its topology, and keep this up to date in real time
Application (or VM) Aware: Understand where the load is coming from and where it is going
Application Agnostic: Work for every workload or VM type in the environment irrespective of how the application is built or deployed

Example - Infrastructure Performance Management & Real Time Metrics


Knowing whether performance is good or not all of the time requires measuring performance in a comprehensive, deterministic, and real time manner
Averaging good transactions with bad transactions obscures the true nature and impact of the bad transactions
Charts compared: VMware vCenter 5 Minute Average Data vs. Virtual Instruments VirtualWisdom Real Time Data
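
The effect of averaging can be shown with a small worked example. The numbers below are made up for illustration (they are not data from the charts on this slide), and `summarize_interval` is a hypothetical helper: one interval contains 299 fast transactions at 2 ms and a single bad one at 600 ms.

```python
from statistics import mean
from typing import Dict, Sequence


def summarize_interval(latencies_ms: Sequence[float], slow_threshold_ms: float) -> Dict[str, float]:
    """Compare the interval average with what per-transaction data reveals."""
    return {
        "average_ms": mean(latencies_ms),
        "worst_ms": max(latencies_ms),
        "slow_count": sum(1 for value in latencies_ms if value > slow_threshold_ms),
    }


# Hypothetical interval: 299 good transactions and one bad one.
interval = [2.0] * 299 + [600.0]
summary = summarize_interval(interval, slow_threshold_ms=100.0)
print(summary)
```

The average works out to 1198 / 300, or about 3.99 ms, which looks healthy, while the per-transaction view still shows the one 600 ms transaction that a user actually experienced.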

Application Performance Management


Application Performance Management
Key Features
1. Understanding of app response time across the application system
2. Used by Operations and Application Support

(Diagram: Agents)

Example Vendors
AppDynamics, AppFirst, BlueStripe, Correlsense, Compuware (dynaTrace), New Relic, Quest (Foglight), VMware (vF APM)

APM is not just for Custom Applications
Apps Ops = Every Application!

Legacy, Custom Developed Apps (DevOps): CA/Wily, HP Diagnostics, IBM ITCAM, Precise, BMC Patrol
Modern: AppDynamics, New Relic, VMware vFabric APM, dynaTrace, AppFirst, BlueStripe, Confio Software, ExtraHop, VMware vFabric APM
Every App (AppOps): NetIQ, HP BAC, CA Unicenter/Spectrum, Correlsense

Key Criteria for Application Response Time Solutions


Measure Actual Application Response Time: How long did it take, not how much resource it used
Breadth of Application Support: Ideally support every application running in the environment automatically (conflicts with depth)
Depth of Root Cause Diagnostics: Provide deep analysis into the application stack for root cause (conflicts with breadth)
Deterministic: Get the real data, not a synthetic transaction, or an average
Real Time: Get the data when it happens, not seconds or minutes later
Comprehensive: Get all of the data, not a periodic sample of the data
Application Discovery and Topology Mapping: Automatically discover new applications and their topology and keep this up to date automatically and continuously
Analytics and Baselining: Avoid manual thresholds, learn normal behavior and alarm based upon deviations from normal
Public Cloud Ready: Allow applications to be distributed across organizational boundaries, and have monitoring work with no firewall work
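
The baselining criterion (learn normal behavior, alarm on deviations from it) can be illustrated with a deliberately simple statistical stand-in. This sketch uses a mean and standard deviation over a history window; it is an assumption made here for illustration and not the method used by any vendor named in this session. The function name and the `tolerance` parameter are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence


def deviations_from_baseline(
    history: Sequence[float],
    recent: Sequence[float],
    tolerance: float = 3.0,
) -> List[float]:
    """Learn 'normal' from history (at least two values) and return the recent
    values that lie more than tolerance standard deviations from its mean."""
    baseline = mean(history)
    spread = stdev(history)
    return [value for value in recent if abs(value - baseline) > tolerance * spread]


# Hypothetical response times in milliseconds.
history_ms = [100.0, 102.0, 98.0, 101.0, 99.0]
recent_ms = [101.0, 104.0, 180.0]
print(deviations_from_baseline(history_ms, recent_ms))  # only 180.0 is flagged
```

No fixed threshold such as "alarm above 150 ms" appears anywhere: the alarm boundary is derived from the observed behavior, which is the point of the criterion.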

Examples: Dynamic, Continuous, Real-Time Application Response Time

Screenshots: VMware, dynaTrace, AppDynamics, BlueStripe

Pick The Right Vendors


The Virtualization Management Ecosystem

Perf. & Cap Mgmt: Abiquo, AppDynamics, Cirba, Phd Virtual, Splunk, Reflex Systems, SolarWinds, Veeam, VMTurbo, VKernel (Quest), Zenoss, vFabric Hyperic, vC Operations
Infr. Perf. Mgmt: Confio Software, NetApp Balance, SevOne, Virtual Instruments, Xangati
App Perf. Mgmt: AppFirst, BlueStripe, Correlsense, ExtraHop, dynaTrace, New Relic, Netuitive, Prelert, Quest (Foglight), vS App Inf Mgr, vFabric APM
Cloud Management: Cisco/NewScale, Cloupia, Citrix (Cloud.com), Dell VIS Creator, DynamicOps, Embotics, Eucalyptus, Gale Technologies, Nimbula, OpenStack, VirtuStream, vCloud Director
Image Provisioning: Puppet, Opscode (Chef), ScaleXtreme, rPath, App Director
Self-Learning Analytics: Netuitive, Prelert, vCenter Operations

Virtualization Platform (vSphere, VMware vCloud, Hyper-V, KVM, XenServer)

Before You Try to be Predictive.


Instrument your infrastructure for end-to-end latency (Infrastructure Performance Management)
Implement a real-time CMDB that can keep up with the rate of change in your virtual environment
Implement a modern Developer focused APM solution for your critical custom developed applications
Implement an Operations focused APM solution to measure response time for every application
Get as real time, deterministic, and comprehensive as possible with all of your response time and latency metrics
Reorganize and implement an Application Operations function staffed with application domain experts
Operationalize finding and fixing problems in real time
Then and only then try to get truly predictive

Self-Learning Analytics: The only way to do cross-stack Root Cause Analysis

Layers: Applications; Virtualization Platform; Servers; LAN, Switch Fabric and Routers; SAN and Storage
Flow: Real Time, Deterministic and Comprehensive Data -> Self-Learning Analytics -> Root Cause Analysis

The right organization, the right tools, and the right data, combined with the right self-learning analytics, lead to an automated, across the stack Root Cause Analysis process

Evaluation Criteria for Performance Analytics

How automated is the learning (really)
Diversity of accepted data (time series, events)
Frequency and quantity of data inputs
Breadth of plug-ins to the monitoring products you own, or are going to own
Process for learning (handling) normal events
Tradeoffs between false positives (false alarms) and false negatives (you missed something)
Ease of implementation (time and cost)
Quality of the Analysis (can you trust it?)
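
The false positive versus false negative tradeoff can be quantified during a trial by counting alarm outcomes. The sketch below uses the standard precision and recall ratios; applying them to an analytics evaluation is a suggestion made here rather than a procedure from the session, and the function name and the counts in the example are hypothetical.

```python
from typing import Dict


def alarm_quality(true_positives: int, false_positives: int, false_negatives: int) -> Dict[str, float]:
    """precision: share of raised alarms that were real problems.
    recall: share of real problems that raised an alarm."""
    raised = true_positives + false_positives
    actual = true_positives + false_negatives
    return {
        "precision": true_positives / raised if raised else 0.0,
        "recall": true_positives / actual if actual else 0.0,
    }


# Hypothetical trial: 8 real problems caught, 2 false alarms, 8 problems missed.
print(alarm_quality(true_positives=8, false_positives=2, false_negatives=8))
```

In this made-up trial 8 of 10 alarms were real (precision 0.8) but only 8 of 16 problems were caught (recall 0.5), so a tool can look trustworthy per alarm while still missing half of what matters.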

Thank You

