CIM1900-Moving From Reactive To Predictive Performance and Systems Management - Final - US PDF

OPS-CIM1900
Moving from Reactive to Predictive Performance and Systems Management
Bernd Harzog, The Virtualization Practice
#vmworldops
Bernd Harzog Virtualization Performance and Capacity Management Analyst

Analyst and Consultant Focused upon:
Infrastructure Performance and Capacity Management of Virtualized Systems
Application Performance Management
Transaction Performance Management
End User Experience Management
Clients include:
Vendors offering solutions
Enterprises seeking virtualization performance management solutions
Key Findings
Virtualization introduces sharing and dynamic behavior
Agile Development produces rapidly changing applications
Both combine to require new tools, organizations and management processes
Key Trends
The demand for business functionality implemented in software is infinite (therefore so is the backlog)
Innovation in software development tools and platforms
Scaled out open source commodity deployment platforms
Distribution of applications across data centers and private/public clouds
Virtualization of business critical and performance critical applications
More than one hypervisor in the enterprise
Rapidly changing applications running on dynamic platforms
The Software Development Backlog is Infinite

Maybe we should measure it in inches instead of man-months!
Business benefits from automation in software are compelling
Business demand for new software and new updates continues to grow
This is driving innovation in processes, tools, and platforms
Tool and Process Innovation

Not just a Java/.NET world any more. PHP, Ruby, Python and NodeJS matter
The OS is not the platform any more. Modern languages come with runtimes
Cloud Platforms are rapidly evolving to support multiple languages and runtimes
Process innovation driven by Agile and Extreme Programming continues
Distribution of applications across data centers and private/public clouds

From This
Simple three tier architecture (Web - Business Logic - Database) replaced by:
To This
Applications broken into modules with their own Agile teams
Modules scaled out on commodity platforms
Modules distributed across private and public clouds
Applications running on opaque infrastructure
Scaled Out and Open Source Platforms

Cheap hardware (commodity Intel compatible servers)
Large number of servers (both physical and virtual)
Cheap software
More than One Hypervisor

VMware remains the clear technical and market leader in the enterprise
Competing platforms are narrowing the functionality gap
Some enterprises are choosing to tier virtualization platforms just as they tier storage
Your New World

Agile Development creates rapidly changing applications
Built in diverse languages and running on diverse language runtimes
Running on next generation deployment platforms
Deployed on multiple virtualization platforms
Running on scaled out commodity hardware
Located in multiple clouds with multiple owners

Your Cloud, Hybrid Cloud, Public Cloud
Virtualization is Progressing
So How Are We Going To Manage This Mess?
Management Principles for the New World

Start Over
Start with a new Reference Architecture - do not assume that any tool you have purchased automatically makes the cut
Insist upon easy to try, easy to buy, easy to manage, and results in production before purchase
Organize for the successful virtualization of business critical applications
Move from managing Images to managing Models and Objects
Define Performance as Latency and Response Time, not Resource Utilization
Manage every application for performance, not just the 5% most painful and important ones
Get Real Time, Deterministic and Comprehensive about Data Collection
Design your management architecture for the distributed cloud case even if you are not there yet
Focus upon preventing problems instead of reacting to them
Employ Predictive technologies to help you
Pick the best vendors and products for the job
The Worse than Useless Test

Apply this test to every single monitoring and management solution in your company:
1. Does it operate on a real-time, continuous, and deterministic basis?
2. Does it support workloads distributed across data centers (yours and ones you rent (cloud))?
3. Can it re-configure itself every time you change something in the environment or in the applications?
4. Can you support it and use it without the continuous presence of on-premise consultants from the vendor of the tool?
5. If it is a monitoring tool, does it focus upon response time and latency?
6. Can you try it, for free, in production, before you buy it or more of it?
If the answer is not Yes to all six, junk the tool and start over
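The six questions can be recorded as a simple pass/fail checklist. The sketch below is only an illustration: the class name, the field names, and the example tool are assumptions made for this example and do not come from the deck.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ToolAssessment:
    """Answers to the six questions for one tool; True means "Yes"."""
    real_time_continuous_deterministic: bool
    supports_distributed_workloads: bool
    reconfigures_itself_on_change: bool
    usable_without_vendor_consultants: bool
    # Question 5 only applies to monitoring tools; this sketch assumes it is
    # recorded as True for a tool that is not a monitoring tool.
    focuses_on_response_time_and_latency: bool
    free_production_trial: bool


def failed_questions(assessment: ToolAssessment) -> list[str]:
    """Return the names of the questions answered "No"."""
    return [f.name for f in fields(assessment) if not getattr(assessment, f.name)]


def passes_test(assessment: ToolAssessment) -> bool:
    """A tool passes only if every one of the six answers is Yes."""
    return not failed_questions(assessment)


# Hypothetical tool that fails questions 3 and 6.
legacy_tool = ToolAssessment(True, True, False, True, True, False)
print(passes_test(legacy_tool))       # False
print(failed_questions(legacy_tool))  # the two failed question names
```

Keeping the failed question names, rather than only a yes/no verdict, makes it easy to see why a given tool was junked.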
Starting Over - Get Rid of ITIL and the CMDB
ITIL is designed to get you to document and slow down the rate of change
Don't tell the Change Control Committee about vMotion!
Your CMDB will never be able to keep up with the rate of change given Agile Development and Dynamic Operations
Every configuration change needs to be tracked in real time, and cross-correlated with performance degradations and resource contention
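Cross-correlating configuration changes with performance degradations can be sketched as a time-window lookup. This is a minimal illustration, not a description of any product: the `Event` type, the five-minute window, and the rule of matching on the same target are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float   # seconds since some epoch
    target: str        # VM, host, or application the event refers to
    description: str


def changes_preceding(degradation: Event, changes: list[Event],
                      window_seconds: float = 300.0) -> list[Event]:
    """Return changes to the same target that happened at most
    `window_seconds` before the degradation, most recent first."""
    candidates = [
        c for c in changes
        if c.target == degradation.target
        and 0 <= degradation.timestamp - c.timestamp <= window_seconds
    ]
    return sorted(candidates, key=lambda c: c.timestamp, reverse=True)


# Hypothetical events, for illustration only.
changes = [
    Event(1000.0, "vm-web-01", "moved to another host"),
    Event(1200.0, "vm-web-01", "memory limit lowered"),
    Event(1250.0, "vm-db-01", "snapshot taken"),
    Event(100.0, "vm-web-01", "patch applied"),
]
degradation = Event(1300.0, "vm-web-01", "response time degraded")

for change in changes_preceding(degradation, changes):
    print(change.timestamp, change.description)
```

Matching only on the same target is a simplification; contention caused by a neighbor on shared infrastructure would need the topology as well.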
Starting Over - The Big Four (IBM, BMC, HP, CA) are worse than useless
The Big Four = Blind Dinosaur
The new environment produces requirements that legacy solutions cannot meet
Legacy solutions get broken by virtualization and the cloud
Legacy vendors are not going to be able to acquire themselves out of this mess
Put the dino in a cage and do not let him out: build a new management stack for your virtualization/cloud environment and isolate the dino to your legacy physical environment
Start Over with a Reference Architecture

Cloud Management Reference Architecture
Functions (vertical boxes): Cloud Management; Res./Perf./Cap./Cnfg. Mgmt; Image Provisioning/Mgmt; Infrastructure Perf. Mgmt; App Performance Mgmt; Self-Learning Analytics
Infrastructure layers (blue boxes): Applications; Virtualization Platform; Servers; LAN, Switch Fabric and Routers; SAN and Storage
Pick a solution for each layer in the infrastructure (blue boxes)
Implement the functions in each vertical box across the infrastructure layers
Tie it all together with self-learning analytics
Embrace the New Way of Trying, Implementing, and Buying Management Software
The Old Way
Rep takes the CIO to play golf
Enterprise software deal gets signed
Some products work, others don't
People go around the ELA to get the tools they need
The New Way
You get to download and use the software in production first
You prove to yourself that it really does work and add value in your environment
Then (and only then) do you buy it
Organize for Virtualization of Critical Applications, Agility, and Success

Virtualization is Just One Team
Virtualization and Application Operations are THE Teams
Data Center Operations
Virtual Operations: Systems Engineering, WAN Team, LAN Team, Windows Server Team, Linux Server Team, SAN Team, Storage Team
Application Operations: Programmer/Analyst Team, Java Server Team, Web Server Team, Database Team
Support Tier 3, Support Tier 2, Support Tier 1, Help Desk
Virtualization pervades IT Operations, and becomes Virtual Operations
Application Operations is responsible for the performance of every application in production (purchased and custom developed)
Move from Managing Images to Managing Models and Objects

Legacy Image Management Process: Populate the Image
Puppet, Chef, vFabric AppDirector: Assemble the Application
Performance ≠ Resource Utilization; Performance = Response Time & Latency

The Root of All Evil
CPU and Memory are horrible indicators of performance
Latency is the appropriate measure of infrastructure performance
Response Time is the appropriate measure of application performance
New Environments = New Performance and Capacity Management Challenges

Public Cloud Challenges
Cloud vendor is not going to share detailed infrastructure latency metrics with you
Published metrics (CloudWatch) are basically useless
Places a premium on high fidelity, cloud aware Application Performance Management
Private Cloud/IT as a Service Challenges
Constantly changing set of applications
Requirements and changes are business driven
Service Delivery is fully automated
Constant application discovery, auto-instantiation and configuration of monitoring
Data Center Virtualization Challenges
IT takes responsibility for more of the stack
Business demands service level delivery
Noisy Neighbor problem
Dynamic Operations
Performance = Response Time & Latency
Summarizing The Performance Problem

Right now all of the benefits of virtualization accrue to the team managing the virtual infrastructure:
Reduced server costs
Storage consolidation
Simplified management of IT resources
IT agility
But, to Application Owners (and their constituents), virtualization is all risk and no reward:
Dedicated hardware is a comfort blanket they are unwilling to give up (server huggers)
Hardware over-provisioning provides a performance safety net
Physical system service level agreements do not translate into the world of virtualized systems and the cloud
There is a fundamental disconnect between IT and application owners with respect to risk and reward!
The Ugly Reality of Application Performance on Dedicated Physical Hardware

Very few applications are instrumented for response time
Most of the time performance is inferred from resource utilization metrics collected from the OS
The relationship between performance and capacity is not well understood = over-provisioning
The tiers of the application system are siloed by hardware or software technology
The trouble-shooting process is an ungodly mess: blamestorming meetings (mean time to innocence)
The time required to solve a problem (the duration of the outage or the degraded performance) is unacceptably long
This should be an easy baseline to improve upon!
Configuration, Resource, and Capacity Management

Virtualization Resource and Availability Monitoring
Key Features
1. Host and guest resource utilization monitoring
2. Capacity Mgmt & Planning
3. Used by IT Operations
Example Vendors
Quest (vFoglight), Solarwinds (Hyper9), Veeam, VKernel, VMTurbo, VMware vC OPS, Zenoss
Key Criteria for Resource Based Performance and Capacity Monitoring

Out of the box value: if it is not providing value in 10 minutes, junk it and find something else (auto-discovery is key)
Collect data from vCenter AND the other virtualization platforms that you support or plan to support
Look for the integration of performance management, capacity management, and configuration management
Collecting, dashboarding, alerting, and reporting on vCenter data is commodity functionality: look for value in analytics and automation
Infrastructure Performance (Latency) Management

Infrastructure Performance Management
Key Features
1. Understanding of end-to-end infrastructure performance
2. Capacity management and planning
3. Infrastructure response time is the key metric
4. Used by the team supporting the virtual infrastructure
Servers, Network Fabric, SAN Fabric, Storage
Example Vendors
Confio Software, NetApp (BalancePoint), Sevone, Virtual Instruments, Xangati
Key Criteria for Infrastructure Response Time Solutions

Measure IRT: Monitor how long it takes the infrastructure to respond to requests for work, not how much resource it takes
Deterministic: Get the real data, not a synthetic transaction, or an average
Real Time: Get the data when it happens, not seconds or minutes later
Comprehensive: Get all of the data, not a periodic sample of the data
Zero-Configuration (Discovery): Discover the environment and its topology, and keep this up to date in real time
Application (or VM) Aware: Understand where the load is coming from and where it is going
Application Agnostic: Work for every workload or VM type in the environment irrespective of how the application is built or deployed
Example - Infrastructure Performance Management & Real Time Metrics

Knowing whether performance is good or not, all of the time, requires measuring performance in a comprehensive, deterministic, and real time manner
Averaging good transactions with bad transactions obscures the true nature and impact of the bad transactions
Charts compared: VMware vCenter 5 Minute Average Data vs. Virtual Instruments VirtualWisdom Real Time Data
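The effect of averaging can be shown with a few lines of arithmetic. The latency figures below are made up purely for illustration (they are not vCenter or VirtualWisdom data), and the function name and the 100 ms threshold are assumptions of this sketch.

```python
from statistics import mean


def summarize(latencies_ms: list[float], slow_threshold_ms: float) -> dict:
    """Compare the interval average with what the individual transactions show."""
    slow = [x for x in latencies_ms if x > slow_threshold_ms]
    return {
        "average_ms": mean(latencies_ms),
        "worst_ms": max(latencies_ms),
        "slow_count": len(slow),
        "slow_fraction": len(slow) / len(latencies_ms),
    }


# Invented interval: 98 transactions at 5 ms and 2 transactions at 500 ms.
samples = [5.0] * 98 + [500.0] * 2
summary = summarize(samples, slow_threshold_ms=100.0)
print(summary)
```

In this invented interval the average stays under 15 ms, well below the 100 ms threshold, even though two transactions took half a second. Only the per-transaction view exposes them.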
Application Performance Management

Application Performance Management
Key Features
1. Understanding of app response time across the application system
2. Used by Operations and Application Support
Example Vendors
AppDynamics, AppFirst, BlueStripe, Correlsense, Compuware (dynaTrace), New Relic, Quest (Foglight), VMware (vF APM)
APM is not just for Custom Applications
Apps Ops = Every Application!
Legacy Custom Developed Apps (DevOps)
CA/Wily HP Diagnostics IBM ITCAM Precise BMC Patrol
Modern AppDynamics
New Relic VMware vFabric APM dynaTrace AppFirst BlueStripe Confio Software ExtraHop VMware vFabric APM
Every App (AppOps)
NetIQ HP BAC
CA Unicenter/Spectrum Correlsense
Key Criteria for Application Response Time Solutions

Measure Actual Application Response Time: How long did it take, not how much resource it used
Breadth of Application Support: Ideally support every application running in the environment automatically (conflicts with depth)
Depth of Root Cause Diagnostics: Provide deep analysis into the application stack for root cause (conflicts with breadth)
Deterministic: Get the real data, not a synthetic transaction, or an average
Real Time: Get the data when it happens, not seconds or minutes later
Comprehensive: Get all of the data, not a periodic sample of the data
Application Discovery and Topology Mapping: Automatically discover new applications and their topology and keep this up to date automatically and continuously
Analytics and Baselining: Avoid manual thresholds, learn normal behavior and alarm based upon deviations from normal
Public Cloud Ready: Allow applications to be distributed across organizational boundaries, and have monitoring work with no firewall work
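The "Analytics and Baselining" criterion can be illustrated with a very small self-learning baseline. This is a sketch under stated assumptions, not how any listed vendor works: the rolling window, the three-sigma rule, the `min_spread` floor, and the choice not to learn from anomalous samples are all choices made for this example.

```python
from collections import deque
from statistics import mean, stdev


class Baseline:
    """Learn normal response time from recent samples instead of a manual threshold."""

    def __init__(self, window: int = 60, sigmas: float = 3.0,
                 min_samples: int = 10, min_spread: float = 1.0):
        self.history = deque(maxlen=window)  # recent "normal" samples
        self.sigmas = sigmas
        self.min_samples = min_samples
        self.min_spread = min_spread         # floor so a flat history does not alarm on noise

    def observe(self, value: float) -> bool:
        """Return True if `value` deviates from learned normal behavior."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            center = mean(self.history)
            spread = max(stdev(self.history), self.min_spread)
            anomalous = abs(value - center) > self.sigmas * spread
        if not anomalous:
            # Only normal samples update the baseline (a simplification).
            self.history.append(value)
        return anomalous


# Invented response times in milliseconds: steady traffic, one spike, then normal again.
baseline = Baseline()
normal = [100.0, 102.0] * 10
alarms = [baseline.observe(v) for v in normal]
spike_alarm = baseline.observe(400.0)
after_alarm = baseline.observe(101.0)
print(any(alarms), spike_alarm, after_alarm)
```

No threshold such as "alert above 250 ms" is ever configured; the alarm on the spike follows from the deviation relative to what the baseline has seen so far.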
Examples Dynamic, Continuous, Real-Time Application Response Time
VMware
dynaTrace
AppDynamics
BlueStripe
Pick The Right Vendors
Start Over with a Reference Architecture

Cloud Management Reference Architecture
Functions (vertical boxes): Cloud Management; Res./Perf./Cap./Cnfg. Mgmt; Image Provisioning/Mgmt; Infrastructure Perf. Mgmt; App Performance Mgmt; Self-Learning Analytics
Infrastructure layers (blue boxes): Applications; Virtualization Platform; Servers; LAN, Switch Fabric and Routers; SAN and Storage
Pick a solution for each layer in the infrastructure (blue boxes)
Implement the functions in each vertical box across the infrastructure layers
Tie it all together with self-learning analytics
The Virtualization Management Ecosystem
Perf. & Cap Mgmt: Abiquo, AppDynamics, Cirba, Phd Virtual, Splunk, Reflex Systems, SolarWinds, Veeam, VMTurbo, VKernel (Quest), Zenoss, vFabric Hyperic, vC Operations
Infr. Perf. Mgmt: Confio Software, NetApp Balance, Sevone, Virtual Instruments, Xangati
App Perf. Mgmt: AppFirst, BlueStripe, Correlsense, ExtraHop, dynaTrace, New Relic, Netuitive, Prelert, Quest (Foglight), vS App Inf Mgr, vFabric APM
Cloud Management: Cisco/NewScale, Cloupia, Citrix (Cloud.com), Dell VIS Creator, DynamicOps, Embotics, Eucalyptus, Gale Technologies, Nimbula, OpenStack, VirtuStream, vCloud Director
Image Provisioning: Puppet, Opscode (Chef), ScaleXtreme, rPath, App Director
Self-Learning Analytics: Netuitive, Prelert, vCenter Operations
Virtualization Platform (vSphere, VMware vCloud, Hyper-V, KVM, XenServer)
Before You Try to be Predictive.

Instrument your infrastructure for end-to-end latency (Infrastructure Performance Management)
Implement a real-time CMDB that can keep up with the rate of change in your virtual environment
Implement a modern Developer focused APM solution for your critical custom developed applications
Implement an Operations focused APM solution to measure response time for every application
Get as real time, deterministic, and comprehensive as possible with all of your response time and latency metrics
Reorganize and implement an Application Operations function staffed with application domain experts
Operationalize finding and fixing problems in real time
Then and only then try to get truly predictive
Self-Learning Analytics - The only way to do cross-stack Root Cause Analysis
Infrastructure layers: Applications; Virtualization Platform; Servers; LAN, Switch Fabric and Routers; SAN and Storage
Real Time, Deterministic and Comprehensive Data -> Self-Learning Analytics -> Root Cause Analysis
The right organization, the right tools, and the right data, combined with the right self-learning Analytics

Lead to an automated, across-the-stack Root Cause Analysis process
Evaluation Criteria for Performance Analytics

How automated is the learning (really)?
Diversity of accepted data (time series, events)
Frequency and quantity of data inputs
Breadth of plug-ins to the monitoring products you own, or are going to own
Process for learning (handling) normal events
Tradeoffs between false positives (false alarms) and false negatives (you missed something)
Ease of implementation (time and cost)
Quality of the Analysis (can you trust it?)
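The tradeoff between false positives and false negatives can be made concrete by sweeping an alarm threshold over scored samples. The anomaly scores and incident labels below are invented for illustration, and the function name is an assumption of this sketch; a real evaluation would use labeled history from your own environment.

```python
def confusion_counts(scores: list[float], is_incident: list[bool],
                     threshold: float) -> tuple[int, int]:
    """Return (false_positives, false_negatives) when alarming at score >= threshold."""
    false_positives = sum(
        1 for s, incident in zip(scores, is_incident) if s >= threshold and not incident
    )
    false_negatives = sum(
        1 for s, incident in zip(scores, is_incident) if s < threshold and incident
    )
    return false_positives, false_negatives


# Invented anomaly scores and whether each one was a real incident.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.95]
is_incident = [False, False, True, False, False, True, True, True]

for threshold in (0.3, 0.5, 0.9):
    fp, fn = confusion_counts(scores, is_incident, threshold)
    print(f"threshold={threshold}: false alarms={fp}, missed incidents={fn}")
```

On this invented data a low threshold gives two false alarms and no misses, while a high threshold gives no false alarms and three misses. Asking a vendor where their product sits on that curve, and whether you can move it, is the point of the criterion.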
Thank You
