
Detecting Discontinuities in Large-Scale Systems

Haroon Malik (Postdoctoral Fellow), Ian John Davis (Research Associate), Michael Godfrey (Associate Professor)
Software Architecture Group (SWAG), University of Waterloo, Waterloo, Canada

Serge Mankovskii (Research Staff), Douglas Neuse (Infrastructure Management)
CA Technologies, USA

Datacenters Require

Forecasting Steps
1. Determine purpose
2. Select technique
3. Prepare data
4. Prepare forecast
5. Monitor forecast

Challenges in preparing data:
(a) Large volumes of performance data, (b) limited time, (c) domain knowledge

Discontinuities
[Figure: magnitude over time (days), with anomalies and a discontinuity marked]

Discontinuities
Reasons:
1. Company merger
2. Hardware upgrade
3. Software change (new release)
4. Workload change
5. Promotional customers

Symptoms:
[Figure: panels (a) to (d), with time points T1, T2, T3 and the transition period marked]

Why Care About Discontinuities?
Measurements taken before the discontinuity can skew the forecast.
Detecting a discontinuity provides analysts with a reference point to retrain their forecasting models and make necessary adjustments.

We propose an automated approach to help analysts identify discontinuities in performance data.

Steps Involved in the Proposed Approach
Input: Performance logs
Approach:
1. Data preparation
2. Metric selection
3. Anomaly detection
4. Discontinuity identification
Output: Report (discontinuities)

1. Data Preparation
Performance logs from production contain noise:
o Missing counters
o Empty counters
o Counters with different numerical ranges
We used statistical techniques to filter noise in the data.
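The slides do not say which statistical techniques were used, so the following is only a minimal sketch of what such filtering could look like. It assumes counters arrive as lists with `None` for missing samples, drops empty and mostly-missing counters, and rescales the rest to z-scores so that different numerical ranges become comparable. The function name `prepare_counters`, the `max_missing` cut-off, the mean-filling of gaps, and the sample data are all hypothetical.

```python
import statistics

def prepare_counters(counters, max_missing=0.2):
    """Drop unusable counters and rescale the rest to z-scores.

    counters: dict mapping counter name -> list of samples (None = missing).
    max_missing: hypothetical cut-off for the tolerated fraction of gaps.
    """
    prepared = {}
    for name, samples in counters.items():
        present = [v for v in samples if v is not None]
        # Empty counter: nothing was ever recorded.
        if not present:
            continue
        # Mostly-missing counter: too many gaps to trust.
        if 1 - len(present) / len(samples) > max_missing:
            continue
        mean = statistics.fmean(present)
        std = statistics.pstdev(present)
        # Fill the remaining gaps with the counter's mean (an assumption).
        filled = [mean if v is None else v for v in samples]
        # Z-score so counters with different ranges become comparable.
        prepared[name] = [(v - mean) / std if std > 0 else 0.0 for v in filled]
    return prepared

# Made-up sample data for illustration only.
logs = {
    "cpu_pct": [10.0, 12.0, None, 11.0, 13.0, 12.0],
    "mem_bytes": [4.0e9, 4.2e9, 4.1e9, 4.3e9, 4.2e9, 4.4e9],
    "disk_queue": [None, None, None, None, None, None],
    "net_errors": [None, None, None, 1.0, None, None],
}
clean = prepare_counters(logs)
print(sorted(clean))
```

In this sketch only `cpu_pct` and `mem_bytes` survive: `disk_queue` is empty and `net_errors` has too many gaps.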

2. Metric Selection
Production logs contain thousands of counters, many of which are:
o Highly correlated
o Invariants
o Configuration constants
We used Principal Component Analysis (PCA) to select important metrics.
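How the PCA output is turned into a set of selected metrics is not described on the slide. One possible reading, sketched below under that assumption, is to drop constant counters, compute each remaining counter's loading on the first principal component, and keep the counters whose loading is large. The component is found by power iteration on the correlation matrix to stay within the standard library. The names `first_component_loadings` and `select_metrics`, the `min_loading` threshold, and the sample data are hypothetical.

```python
import math
import statistics

def standardize(series):
    """Rescale a series to zero mean and unit variance."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    return [(v - mean) / std for v in series]

def first_component_loadings(counters, iterations=200):
    """Loading of each counter on the first principal component.

    Constant counters (invariants, configuration constants) are dropped
    first because they carry no variance.
    """
    names = [n for n, s in counters.items() if statistics.pstdev(s) > 0]
    z = [standardize(counters[n]) for n in names]
    n_obs = len(z[0])
    # Correlation matrix of the standardized counters.
    corr = [[sum(a * b for a, b in zip(zi, zj)) / n_obs for zj in z] for zi in z]
    # Power iteration converges to the dominant eigenvector, unless the
    # start vector happens to be orthogonal to it.
    vec = [1.0] * len(names)
    for _ in range(iterations):
        nxt = [sum(c * v for c, v in zip(row, vec)) for row in corr]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return dict(zip(names, vec))

def select_metrics(counters, min_loading=0.3):
    """Keep counters whose absolute loading reaches the threshold."""
    loadings = first_component_loadings(counters)
    return sorted(n for n, w in loadings.items() if abs(w) >= min_loading)

# Made-up sample data for illustration only.
logs = {
    "cpu_pct": [10, 20, 30, 40, 50, 60, 70, 80],
    "requests": [105, 198, 310, 395, 510, 590, 705, 800],
    "threads": [12, 19, 33, 41, 48, 62, 69, 83],
    "jitter": [5, 1, 4, 2, 2, 4, 1, 5],
    "pool_size": [64, 64, 64, 64, 64, 64, 64, 64],
}
selected = select_metrics(logs)
print(selected)
```

Note that this sketch keeps all three correlated counters. Collapsing a correlated group to one representative would need an extra step that the slide does not describe.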

3. Anomaly Detection
Quadratic Modelling
o A quadratic function that minimizes the least-squares error (LSE)
o A greedy algorithm to replace performance counter time series data with fitted quadratic functions
o A cost metric to reflect data fit
o The largest costs suggest the positions in the time series where the most egregious anomalies and discontinuities occur
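The idea of a cost that reflects how well a quadratic fits the data can be illustrated as follows. This is not the greedy algorithm from the slides, whose details are not given. It is a simpler sliding-window sketch, assuming the cost is the sum of squared residuals of a least-squares quadratic fit over each window. The function names, the window width, and the sample series are hypothetical.

```python
def fit_quadratic(ys):
    """Least-squares fit of y = a + b*x + c*x^2 over x = 0..len(ys)-1."""
    xs = range(len(ys))
    # Power sums for the 3x3 normal equations.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[0], s[1], s[2], t[0]],
         [s[1], s[2], s[3], t[1]],
         [s[2], s[3], s[4], t[2]]]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]

def fit_cost(ys):
    """Sum of squared residuals of the best quadratic fit (the cost)."""
    a, b, c = fit_quadratic(ys)
    return sum((y - (a + b * x + c * x * x)) ** 2 for x, y in enumerate(ys))

def window_costs(series, width=7):
    """Cost of a quadratic fit for every sliding window of the series."""
    return [fit_cost(series[i:i + width])
            for i in range(len(series) - width + 1)]

# Made-up series: a smooth trend with one spike injected at index 18.
series = [0.5 * t for t in range(30)]
series[18] += 20.0

costs = window_costs(series, width=7)
worst = max(range(len(costs)), key=costs.__getitem__)
print("highest-cost window starts at index", worst)
```

Windows that do not contain the spike fit almost perfectly and have a cost near zero, while the highest-cost window covers the spike.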

3. Anomaly Detection (Quadratic Model)
[Figure: counter value over time, with the cost of the quadratic model overlaid]

4. Discontinuity Identification
Distribution comparison
o Test for a difference of mean between two populations
o Quantify the difference of mean between the two populations

Difference of Mean Between Two Anomaly Populations
[Figure: % CPU utilization and cost over time, with an anomaly, a discontinuity, and the transition periods marked]
Wilcoxon rank-sum test
H0: the two distributions are the same
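As an illustration of the test named above, here is a minimal standard-library sketch of a two-sided Wilcoxon rank-sum test using the normal approximation. It assumes the two populations are the counter samples on either side of a candidate discontinuity. The function name `rank_sum_test` and the sample values are hypothetical, and no tie correction is applied to the variance.

```python
import math

def rank_sum_test(before, after):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p_value). No tie correction is applied to the variance,
    so the p-value is approximate when many values are tied.
    """
    values = sorted(list(before) + list(after))
    # Average rank for each distinct value (handles ties).
    ranks = {}
    i = 0
    while i < len(values):
        j = i
        while j < len(values) and values[j] == values[i]:
            j += 1
        ranks[values[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n1, n2 = len(before), len(after)
    w = sum(ranks[v] for v in before)  # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p

# Made-up % CPU utilization samples on either side of a candidate point.
before = [22, 25, 21, 24, 23, 26, 22, 24]
after = [41, 44, 39, 43, 42, 45, 40, 43]

z, p = rank_sum_test(before, after)
if p < 0.05:
    print("reject H0: the two distributions differ")
else:
    print("cannot reject H0")
```

For small samples an exact test is preferable to the normal approximation. The 0.05 significance level here is a conventional choice, not one taken from the slides.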


Quantify the Difference of Mean Between Two Populations
Cohen's d: a tunable threshold
[Figure: Cohen's d effect-size scale: trivial, small, medium, large]
o Analysts set the effect size based on their domain trends and the required granularity.
o It acts as a tunable threshold to reduce false-positive identification of discontinuities by our approach.
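To make the thresholding concrete, the sketch below computes Cohen's d with a pooled standard deviation and maps it to the four labels on the scale. The cut-offs 0.2, 0.5 and 0.8 are Cohen's conventional values, not values stated on the slide, and the function names, the default threshold, and the sample data are hypothetical.

```python
import math
import statistics

def cohens_d(before, after):
    """Cohen's d: difference of means divided by the pooled std deviation."""
    n1, n2 = len(before), len(after)
    v1, v2 = statistics.variance(before), statistics.variance(after)
    pooled = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.fmean(after) - statistics.fmean(before)) / pooled

def effect_label(d):
    """Map |d| to a label using Cohen's conventional cut-offs."""
    size = abs(d)
    if size < 0.2:
        return "trivial"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"

def is_discontinuity(before, after, threshold=0.8):
    """Flag a discontinuity only when the effect size reaches the threshold."""
    return abs(cohens_d(before, after)) >= threshold

# Made-up % CPU utilization samples for illustration only.
before = [22, 25, 21, 24, 23, 26, 22, 24]
big_shift = [41, 44, 39, 43, 42, 45, 40, 43]
small_shift = [v + 1 for v in before]

print(effect_label(cohens_d(before, big_shift)),
      is_discontinuity(before, big_shift))
print(effect_label(cohens_d(before, small_shift)),
      is_discontinuity(before, small_shift))
```

Raising `threshold` suppresses smaller shifts such as `small_shift`, which is the trade-off between false alarms and missed discontinuities that the analyst has to tune.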

Subjects of Study
o System: Simulation; Domain: Cloud Computing; Type of Data: Synthetic Data
o System: Open Source (DVD Store); Domain: E-commerce; Type of Data: Performance Tests
o System: Industrial System; Domain: Cloud Computing; Type of Data: Production Data

Fault Injection

Category          Types of Faults
Anomalies         CPU Stress
                  Memory Stress
                  Interfering Workload
Discontinuities   Workload as Multiplicative Factor
                  Change in Transaction Pattern
                  Hardware & Software Upgrade

We had NO prior knowledge of the underlying fault in the data obtained from the industrial system.

Results
[Figure: bar chart of F-measure (axis from 0 to 1); bar values 0.92, 0.83, 0.72]
The proposed technique has high accuracy in detecting discontinuities.


Limitations of Our Approach
Sensitivity
We can tune the sensitivity of our approach by adjusting the effect size.
o Using a large effect size reduces false alarms, but it may cause an analyst to overlook significant discontinuities.
o Analysts have to conduct multiple experiments.
Determining a threshold value is a problem.
Automated techniques generally cannot decide whether an identified discontinuity is important or is noise.

Limitations of Our Approach
Distinguishability
The approach cannot distinguish between
o Overlapping discontinuities and
o Different types of discontinuities.
Analysts have to manually inspect the identified discontinuity and take action.


QUESTIONS

Haroon Malik
malikh@uwaterloo.ca