
ITT3241

Operating a More
Reliable Cloud Through
Proactive Incident and
Problem Management

Rich Benoit, VMware, Inc.

Doug Huber, VMware, Inc.

#vmworldittran
Disclaimer

▪ This session may contain product features that are currently under development.
▪ This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product.
▪ Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
▪ Technical feasibility and market demand will affect final delivery.
▪ Pricing and packaging for any new technologies or features discussed or presented have not been determined.

Session Executive Summary

▪ Cloud Transformation
▪ What is Incident and Problem Management?
▪ Why Proactive Incident and Problem Management
▪ Current State
▪ Evolution from Reactive to Proactive Incident and Problem Management
▪ Operational Benefits
▪ Key Performance Indicators

A New Operating Model for the Cloud Era

Reactive → Proactive → Innovative

▪ IT Business Management
▪ People, Culture & Organization
▪ Processes & Control
▪ Software Technology & Architecture

Five Capabilities Which Unlock Cloud Benefits

On-Demand Services
Description: Service catalog with standardized offerings and tiered SLAs, actively managed and governed throughout its lifecycle, and with end-user access via a self-service portal
Major processes impacted:
▪ Request fulfillment
▪ Application development

Automated Provisioning & Deployment
Description: Automated provisioning, release and deployment of infrastructure, platform and end-user compute services
Major processes impacted:
▪ Request fulfillment
▪ Application development
▪ Release and deployment management

Proactive Incident & Problem Management
Description: Monitoring and filtering of events, automatic incident resolution, and problem diagnosis
Major processes impacted:
▪ Incident management
▪ Request fulfillment
▪ Event management
▪ Application development

Security, Compliance & Risk Management
Description: Security, compliance, and risk management policies embedded into standard configurations enabling policy-aware applications and automation of security, audit, and risk management processes
Major processes impacted:
▪ Information security
▪ Policy-based management
▪ Compliance management
▪ Risk management

IT Financial Management (ITFM) for Cloud
Description: IT cost transparency and service-level usage-based ‘showbacks’ or ‘chargebacks’ using automated metering and billing tools
Major processes impacted:
▪ Financial management
▪ Supplier management
▪ Demand management

Let's Agree Upon a Definition…

Although these process areas are based on ITSM principles at Level 1 of
the maturity model, by Level 3 what had been a very manual, reactive set
of processes is beginning to be automated and to become much more
proactive in nature. This is Cloud Operations.

Incident Management
Focuses on how to handle performance problems or outages. The
primary focus of Incident Management is to manage the incident until it is
resolved.

Problem Management
Focuses on identifying the root causes of repetitive and high-priority incidents.
Once root causes have been identified, a plan of action is generated
that will ideally repair the underlying problem. If the problem can't be fixed,
additional monitoring and event management handling may be
implemented in an attempt to minimize or eliminate future occurrences of
the problem.
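
The hand-off between the two processes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any VMware tool: the `Incident` record, its field names, the `problem_candidates` helper, and the default repeat threshold of 3 are all hypothetical. The sketch flags a problem candidate whenever the same symptom recurs on a configuration item, or whenever any incident is high priority.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; the field names are assumptions."""
    ci: str        # affected configuration item, e.g. a host or service
    symptom: str   # short symptom label
    priority: int  # 1 = highest priority


def problem_candidates(incidents, min_repeats=3, high_priority=1):
    """Return sorted (ci, symptom) pairs that merit root cause analysis.

    A pair qualifies if it recurs at least `min_repeats` times, or if any
    of its incidents has priority `high_priority` or more urgent.
    """
    counts = Counter((i.ci, i.symptom) for i in incidents)
    repeated = {key for key, n in counts.items() if n >= min_repeats}
    urgent = {(i.ci, i.symptom) for i in incidents if i.priority <= high_priority}
    return sorted(repeated | urgent)


incidents = [
    Incident("db01", "slow queries", 3),
    Incident("db01", "slow queries", 3),
    Incident("db01", "slow queries", 2),
    Incident("web02", "outage", 1),
    Incident("app03", "disk full", 4),
]
candidates = problem_candidates(incidents)
print(candidates)
```

Here `db01` qualifies because the same symptom repeats three times, and `web02` qualifies because of a single priority-1 outage; the one-off `app03` incident stays with Incident Management only.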

Cloud Operations unlocks the benefits of Cloud

Efficiency
Free up as much as 25% of labor operating costs through standardization, automation, and streamlining operations (1)

Agility
Reduce time to market for new innovations and increase flexibility

Reliability
Ensure data and application security, compliance, availability, and recoverability to employees and customers

(1) Your savings may vary

Why Proactive Incident and Problem Management

Cloud is changing how resources are shared and consumed today

Proactive Management
▪ Analyze
▪ Optimize
▪ Forecast

Intelligent Automation
▪ Performance
▪ Capacity
▪ Configuration

Comprehensive Visibility
▪ Health
▪ Risk

Efficiency, Agility, Reliability

Typical Process Flow Today

How Did We Get Here?

Constant Evolution of the Tools…


Generation 1: Focused on infrastructure management inside siloed technology domains.

IT architectures were a simple 3-tier hierarchy.

[Diagram: Reactive Incident Management, Help Desk, Reactive Problem Management, Administrators]

How Did We Get Here?

Constant Evolution of the Tools…


Generation 2: Added a focus on ITIL and on developing a framework around IT
management processes that could be implemented in tools.

IT architectures morphed into a full mesh.

[Diagram: Reactive Incident Management, Level 1 Support, Level 2 Support, Level 3 Support, Reactive Problem Management]

How Did We Get Here?

Constant Evolution of the Tools…


Generation 3: Virtualization quickly exposed how a lack of IT governance and process
results in sprawl and higher costs, which has led to more focus on end-to-end process
automation to realize the value of ITOM investments.

Complexity exploded.

[Diagram: Reactive Incident Management, Automated Workflows, Interactive Workflows, Reactive Problem Management, Level 1 Support, Level 2 Support, Level 3 Support]

Cloud Ops
▪ Automating incident and problem management in the data
center is the key to becoming proactive
▪ Intelligent analytics and control continuously:
• Assess the thousands of performance metrics and available
capacity across the entire IT stack
• Consider all business and physical constraints
• Drive the necessary actions to tune and maintain the environment
in an optimal operating state
▪ Instead of alerting you when problems
occur, or are about to occur, they:
• Optimize performance, maximize infrastructure
efficiencies, and reduce operational costs
• Prevent events/alerts from happening
• Keep the environment in a “healthy” state
▪ Intelligently prioritize resources and
automatically scale up or down as
performance and business
demand fluctuate
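
One way to picture "acting before the alert" is a control decision based on a forecast rather than on the current reading. The sketch below is a simplified illustration only: the function names, the 80%/30% CPU bounds, and the use of a plain least-squares trend line are assumptions for the example, not the analytics of any particular product.

```python
def linear_forecast(samples, steps_ahead):
    """Fit a least-squares line to evenly spaced samples and extrapolate."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    intercept = mean_y - slope * mean_x
    # The last sample sits at index n - 1; look steps_ahead beyond it.
    return intercept + slope * (n - 1 + steps_ahead)


def scaling_decision(cpu_history, steps_ahead=3, high=80.0, low=30.0):
    """Decide on the predicted value, not on the latest measurement."""
    predicted = linear_forecast(cpu_history, steps_ahead)
    if predicted >= high:
        return "scale_up"
    if predicted <= low:
        return "scale_down"
    return "hold"


rising = [50, 55, 60, 65, 70]   # latest reading (70) is still below 80
decision = scaling_decision(rising)
print(decision)
```

With the rising series, the latest reading of 70% would not trip an 80% alert, but the trend reaches the bound within three intervals, so the sketch asks for capacity before performance is affected.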
What is Proactive Incident and Problem Management?

Reactive: Ensure and Restore Service Levels (Problem)
• Monitor: Slow performance
• Isolate: Config issue
• Remediate: Rollback change

Proactive: Optimize for Efficiency and Cost (Maintenance)
• Plan: Utilization / forecast
• Optimize: Reclaim capacity
• Automate: Orchestrate changes
Operational Benefits Across Three Dimensions
Dimensions and Benefits Examples

Efficiency
▪ Simplified infrastructure management via abstraction, policy-based automation, and “app awareness”
▪ Less complex monitoring requirements due to smarter alerting of potential issues
▪ Fewer resources needed for labor-intensive processes as a result of automation

Agility
▪ Faster time to awareness when a critical cloud service issue arises
▪ Greater speed in addressing faults or performance-based issues
▪ Capacity scaled prior to impacting performance or business requirements

Reliability
▪ Improved quality of service and experience for consumers due to a reduction in downtime (planned and unplanned)
▪ Greater adherence to SLAs (e.g., availability, latency)
▪ Fewer events turning into incidents or problems through a proactive approach

Proactive Incident and Problem Management OPEX Savings

Incident Management Lifecycle Savings
▪ Manage/Resolve incidents
▪ Proactive alerts reduce costs 30-40%

Change Lifecycle Savings
▪ Manage changes to apps/infrastructure
▪ “Before/after” analysis reduces change-related incidents 30-40%

Incident Management Savings
▪ Managing Service Desk issues (Incidents)
▪ Manual threshold elimination reduces erroneous tickets by 50-60%

Problem Management Savings
▪ Closing problems after systems restored, includes root cause analysis
▪ Root cause analysis improves problem closure by 30%

Source: Reducing Operational Expense with Virtualization and Systems Management - Enterprise Management Associates
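
To make "manual threshold elimination" concrete, the sketch below compares a hand-set static threshold with a band learned from the workload's own history. It is an assumption-laden illustration: the cited report does not describe a method, and the mean-plus-three-standard-deviations rule, the function names, and the sample data are all made up for the example.

```python
from statistics import mean, stdev

STATIC_THRESHOLD = 80.0  # hypothetical hand-set CPU alert level (percent)


def learned_band(history, k=3.0):
    """Upper bound of normal behavior: mean plus k sample std deviations."""
    return mean(history) + k * stdev(history)


def is_anomalous(value, history, k=3.0):
    """True only when a reading exceeds what this workload normally does."""
    return value > learned_band(history, k)


# A batch server that routinely runs hot: normal for it, noisy for a static rule.
history = [88, 90, 86, 91, 89, 87, 92, 90]
readings = [89, 91, 99]

static_tickets = [v for v in readings if v > STATIC_THRESHOLD]
learned_tickets = [v for v in readings if is_anomalous(v, history)]
print(static_tickets, learned_tickets)
```

The static rule opens a ticket for every reading, while the learned band (about 95% for this history) flags only the 99% reading. That difference is the kind of erroneous-ticket reduction the slide refers to, though real savings depend on the environment.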

Business Impact and KPIs

Frequency of Interruption
KPI: Service impact covers a number of metrics including application response time, unscheduled and scheduled downtime, and frequency of security breaches.
IMPACT: IT staff costs (including overtime and on-call costs for support), business unit staff costs (including time lost when systems are down), and lost revenue from downtime.

Mean Time to Repair (MTTR)
KPI: Widely used to measure the time between a fault occurring and the fault being fixed. A good measure of how quickly IT responds to problems occurring in managed systems.
IMPACT: Less downtime reduces lost productivity from idle business unit staff. Fewer IT resources are diverted from strategic project work to break-fix activity.
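
Both KPIs are simple to compute once outage start and fix times are recorded. The following is a minimal sketch: the outage data is invented, the function names are hypothetical, and normalizing frequency to a 30-day window is an assumption rather than a definition taken from the slides.

```python
from datetime import datetime, timedelta


def mttr(outages):
    """Mean time to repair: average of (fixed - occurred) over all faults."""
    if not outages:
        raise ValueError("no outages recorded")
    total = sum((fixed - occurred for occurred, fixed in outages), timedelta())
    return total / len(outages)


def interruptions_per_30_days(outages, period_days):
    """Frequency of interruption, normalized to a 30-day window."""
    return len(outages) * 30 / period_days


# (fault occurred, fault fixed) pairs observed over a 60-day period
outages = [
    (datetime(2013, 1, 3, 9, 0), datetime(2013, 1, 3, 10, 30)),   # 90 min
    (datetime(2013, 1, 12, 14, 0), datetime(2013, 1, 12, 14, 30)),  # 30 min
    (datetime(2013, 1, 25, 22, 0), datetime(2013, 1, 26, 0, 0)),   # 120 min
]
print(mttr(outages), interruptions_per_30_days(outages, 60))
```

Tracking the two numbers together matters: fewer interruptions with a longer MTTR, or the reverse, can leave total downtime unchanged.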

Business Impact and KPIs

Ease of Management
KPI: Overall ease of management and functionality that is unique to cloud. Most enterprises report daily management tasks are easier, or at least the same, in a virtual environment.
IMPACT: Provisioning is easier with templates than with traditional software installation; availability is easier to ensure with resource pooling and live migration.

Faster Service Deployment
KPI: Reducing the time to deploy, re-deploy or move a business service is one of the most widely reported OpEx reduction outcomes.
IMPACT: Zero-downtime migration eliminates both application downtime costs for business users and overtime payments for out-of-hours migrations. Faster, cheaper lifecycle.

In Closing

▪ Proactive incident and problem management is enabling IT transformation in support of your cloud solutions

▪ Utilizing proactive cloud health monitoring capabilities through learned behavior analytics and methods to help your organization attain its IT and business goals

▪ Identified Key Performance Indicators to evaluate and measure your journey to cloud operations

▪ Identified savings opportunities in IT Operating Expenses, while improving cloud services availability and quality

Learn more about VMware Cloud Solutions

▪ Maximize the power of cloud computing to:
• Deliver new IT services that fuel business growth
• Transform IT into a source of innovation
• Dramatically improve IT efficiency, agility and reliability

▪ Develop key capabilities in your organization with VMware Cloud Operations Services
• Advisory, education, and remediation services
• Insight, prioritized recommendations, and expert guidance to transform operational processes, organizational structures, and financial models

vmware.com/cloud
Additional IT Transformation Tracks
SESSION ID | TITLE | DAY | TIME

ITT1918 | Is My Organization Ready to Reap the Benefits of the Cloud? | Monday | 12:30 PM
ITT3237 | From Reactive to Innovative: The Journey to the Cloud Crosses People, Process, Technology and Measurement | Monday | 02:00 PM
ITT3245 | VMware on VMware: Our Journey to the Cloud (Part 1) | Monday | 05:00 PM
ITT3244 | Planning and Measuring the Impact of Cloud: IT Metrics that Matter | Tuesday | 11:00 AM
ITT3242 | Managing Cloud Security, Compliance, and Risk Management | Tuesday | 12:30 PM
ITT1953 | Advice From Your Peers: How to Best Run and Manage a Cloud Environment | Tuesday | 02:00 PM
ITT3243 | Delivering IT Financial Management for Cloud | Tuesday | 05:00 PM
ITT3238 | Taking Your Workloads to the Cloud: Why, How, and When? | Wednesday | 08:30 AM
ITT3246 | VMware on VMware: How the Virtualization Leader is Moving to the Cloud (Part 2) | Wednesday | 10:00 AM
ITT3241 | Operating a More Reliable Cloud Through Proactive Incident and Problem Management | Wednesday | 11:30 AM
ITT3239 | On-Demand IT: Leveraging Cloud for Efficient Self-Service IT | Wednesday | 04:00 PM
ITT3240 | From Weeks to Hours: Automated Provisioning and Deployment | Thursday | 12:30 PM

FILL OUT A SURVEY

EVERY COMPLETE SURVEY IS ENTERED INTO DRAWING FOR A $25 VMWARE COMPANY STORE GIFT CERTIFICATE