
Data Mining

Anomaly Detection

Lecture Notes for Chapter 10

Introduction to Data Mining


by
Tan, Steinbach, Kumar

© Tan,Steinbach, Kumar Introduction to Data Mining 05/02/2006 1

Anomaly/Outlier Detection

z What are anomalies/outliers?


– The set of data points that are
considerably different from the
remainder of the data

z Natural implication is that anomalies are


relatively rare
– One in a thousand occurs often if you have lots of data
– Context is important, e.g., freezing temps in July

z Can be important or a nuisance


– 10 foot tall 2 year old
– Unusually high blood pressure
11/29/2007 Introduction to Data Mining 2
Importance of Anomaly Detection

Ozone Depletion History


z In 1985 three researchers (Farman,
Gardiner and Shanklin) were
puzzled by data gathered by the
British Antarctic Survey showing that
ozone levels for Antarctica had
dropped 10% below normal levels

z Why did the Nimbus 7 satellite,


which had instruments aboard for
recording ozone levels, not record
similarly low ozone concentrations?

z The ozone concentrations recorded
by the satellite were so low they
were being treated as outliers by a
computer program and discarded!

Sources:
http://exploringdata.cqu.edu.au/ozone.html
http://www.epa.gov/ozone/science/hole/size.html


Causes of Anomalies

z Data from different classes


– Measuring the weights of oranges, but a few grapefruit
are mixed in

z Natural variation
– Unusually tall people

z Data errors
– 200 pound 2 year old



Distinction Between Noise and Anomalies

z Noise is erroneous, perhaps random, values or


contaminating objects
– Weight recorded incorrectly
– Grapefruit mixed in with the oranges

z Noise doesn’t necessarily produce unusual values or


objects
z Noise is not interesting
z Anomalies may be interesting if they are not a result of
noise
z Noise and anomalies are related but distinct concepts


General Issues: Number of Attributes

z Many anomalies are defined in terms of a single attribute


– Height
– Shape
– Color

z Can be hard to find an anomaly using all attributes


– Noisy or irrelevant attributes
– Object is only anomalous with respect to some attributes

z However, an object may not be anomalous in any one


attribute



General Issues: Anomaly Scoring

z Many anomaly detection techniques provide only a binary


categorization
– An object is an anomaly or it isn’t
– This is especially true of classification-based approaches

z Other approaches assign a score to all points


– This score measures the degree to which an object is an anomaly
– This allows objects to be ranked

z In the end, you often need a binary decision


– Should this credit card transaction be flagged?
– Still useful to have a score
z How many anomalies are there?

Other Issues for Anomaly Detection

z Find all anomalies at once or one at a time


– Swamping
– Masking

z Evaluation
– How do you measure performance?
– Supervised vs. unsupervised situations

z Efficiency

z Context
– Professional basketball team



Variants of Anomaly Detection Problems

z Given a data set D, find all data points x ∈ D with


anomaly scores greater than some threshold t

z Given a data set D, find all data points x ∈ D


having the top-n largest anomaly scores

z Given a data set D, containing mostly normal (but


unlabeled) data points, and a test point x,
compute the anomaly score of x with respect to D
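Once per-point anomaly scores are available, the first two variants reduce to simple selections; a minimal sketch (function names are illustrative, not from the text):

```python
def above_threshold(scores, t):
    """Variant 1: indices of all points with anomaly score greater than t."""
    return [i for i, s in enumerate(scores) if s > t]

def top_n(scores, n):
    """Variant 2: indices of the n points with the largest anomaly scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]

# Example with scores for four points:
print(above_threshold([0.1, 5.0, 0.2, 3.0], 1.0))  # → [1, 3]
print(top_n([0.1, 5.0, 0.2, 3.0], 2))              # → [1, 3]
```

The third variant additionally needs a scoring function for a test point with respect to D; the statistical, distance-based, and density-based techniques on the following slides supply such functions.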


Model-Based Anomaly Detection

z Build a model for the data and see how well each point fits


– Unsupervised
‹ Anomalies are those points that don’t fit well
‹ Anomalies are those points that distort the model
‹ Examples:
– Statistical distribution
– Clusters
– Regression
– Geometric
– Graph
– Supervised
‹ Anomalies are regarded as a rare class
‹ Need to have training data



Additional Anomaly Detection Techniques

z Proximity-based
– Anomalies are points far away from other points
– Can detect this graphically in some cases
z Density-based
– Low density points are outliers
z Pattern matching
– Create profiles or templates of atypical but important
events or objects
– Algorithms to detect these patterns are usually simple
and efficient


Graphical Approaches

z Boxplots or scatter plots

z Limitations
– Not automatic
– Subjective



Convex Hull Method

z Extreme points are assumed to be outliers


z Use convex hull method to detect extreme values

z What if the outlier occurs in the middle of the


data?

Statistical Approaches
Probabilistic definition of an outlier: An outlier is an object that
has a low probability with respect to a probability distribution
model of the data.
z Usually assume a parametric model describing the distribution
of the data (e.g., normal distribution)
z Apply a statistical test that depends on
– Data distribution
– Parameters of distribution (e.g., mean, variance)
– Number of expected outliers (confidence limit)
z Issues
– Identifying the distribution of a data set
‹ Heavy tailed distribution
– Number of attributes
– Is the data a mixture of distributions?
Normal Distributions

[Figure: probability density of a one-dimensional Gaussian (left) and of a
two-dimensional Gaussian (right).]


Grubbs’ Test

z Detect outliers in univariate data


z Assume data comes from normal distribution
z Detect one outlier at a time, remove it,
and repeat
– H0: There is no outlier in data
– HA: There is at least one outlier
z Grubbs’ test statistic:

      G = max_i |X_i − X̄| / s

z Reject H0 if:

      G > ((N − 1)/√N) √( t²_(α/N, N−2) / (N − 2 + t²_(α/N, N−2)) )
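As a pure-Python sketch of the statistic (the critical value needs the t-distribution, e.g. scipy.stats.t.ppf, so only the statistic and the flagged point are computed here):

```python
from statistics import mean, stdev

def grubbs_statistic(data):
    """Return Grubbs' statistic G = max|X_i - Xbar| / s and the index
    of the most extreme point (the outlier candidate)."""
    m, s = mean(data), stdev(data)           # sample mean and sample std dev
    deviations = [abs(x - m) for x in data]
    g = max(deviations) / s
    return g, deviations.index(max(deviations))

g, idx = grubbs_statistic([9.8, 10.1, 10.0, 9.9, 10.2, 15.0])
# The value 15.0 (index 5) is the candidate outlier; compare g against
# the critical value from the formula above, remove the point if H0 is
# rejected, and repeat on the remaining data.
```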
Statistical-based – Likelihood Approach

z Assume the data set D contains samples from a


mixture of two probability distributions:
– M (majority distribution)
– A (anomalous distribution)
z General Approach:
– Initially, assume all the data points belong to M
– Let Lt(D) be the log likelihood of D at time t
– For each point xt that belongs to M, move it to A
‹ Let Lt+1 (D) be the new log likelihood.
‹ Compute the difference, Δ = Lt(D) – Lt+1 (D)
‹ If Δ > c (some threshold), then xt is declared as an anomaly
and moved permanently from M to A

Statistical-based – Likelihood Approach

z Data distribution, D = (1 – λ) M + λ A
z M is a probability distribution estimated from data
– Can be based on any modeling method (naïve Bayes,
maximum entropy, etc)
z A is initially assumed to be uniform distribution
z Likelihood at time t:

Lt(D) = ∏_{i=1}^{N} P_D(xi)
      = ( (1 − λ)^|Mt| ∏_{xi∈Mt} P_Mt(xi) ) ( λ^|At| ∏_{xi∈At} P_At(xi) )

LLt(D) = |Mt| log(1 − λ) + Σ_{xi∈Mt} log P_Mt(xi) + |At| log λ + Σ_{xi∈At} log P_At(xi)
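A one-dimensional sketch of this procedure, assuming M is a Gaussian re-estimated from the points currently in M and A is uniform over the data's range; here a point is moved to A when doing so raises the total log-likelihood by more than c (the sign convention for Δ varies across presentations, and all names are illustrative):

```python
import math

def loglik(M, A, lam, lo, hi):
    """|M| log(1-lam) + sum log P_M  +  |A| log lam + sum log P_A."""
    mu = sum(M) / len(M)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in M) / len(M)) or 1e-9
    ll = len(M) * math.log(1 - lam)
    ll += sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
              - (x - mu) ** 2 / (2 * sigma ** 2) for x in M)   # Gaussian M
    ll += len(A) * (math.log(lam) + math.log(1.0 / (hi - lo)))  # uniform A
    return ll

def likelihood_anomalies(data, lam=0.05, c=2.0):
    lo, hi = min(data), max(data)
    M, A = list(data), []
    moved = True
    while moved:
        moved = False
        base = loglik(M, A, lam, lo, hi)
        for i, x in enumerate(M):
            trial_M, trial_A = M[:i] + M[i + 1:], A + [x]
            if loglik(trial_M, trial_A, lam, lo, hi) - base > c:
                M, A = trial_M, trial_A   # keep the move and rescan
                moved = True
                break
    return A
```

On a small cluster with one far-away value, e.g. `likelihood_anomalies([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])`, moving 100 to A tightens the Gaussian fit of M enough to raise the likelihood past the threshold, so only 100 is returned.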



Strengths/Weaknesses of Statistical Approaches

z Firm mathematical foundation

z Can be very efficient

z Good results if distribution is known

z In many cases, data distribution may not be known

z For high-dimensional data, it may be difficult to estimate
the true distribution

z Anomalies can distort the parameters of the distribution


Distance-Based Approaches

z Several different techniques

z An object is an outlier if a specified fraction of the


objects is more than a specified distance away
(Knorr, Ng 1998)
– Some statistical definitions are special cases of this

z The outlier score of an object is the distance to
its kth nearest neighbor
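A direct sketch of this kth-nearest-neighbor score, using the naive all-pairs computation (function name is illustrative):

```python
import math

def knn_distance_scores(points, k):
    """Outlier score of each point = distance to its kth nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Sorted distances from p to every other point; O(n^2) overall.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
# With k = 2, the isolated point (10, 10) receives by far the largest score.
```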



One Nearest Neighbor - One Outlier

[Figure: outlier score based on distance to the first nearest neighbor;
the isolated point D receives the highest score. Color bar: Outlier Score.]

One Nearest Neighbor - Two Outliers


[Figure: outlier score based on distance to the first nearest neighbor;
the two outliers near D are each other's nearest neighbor, so both
receive low scores. Color bar: Outlier Score.]
Five Nearest Neighbors - Small Cluster
[Figure: outlier score based on distance to the fifth nearest neighbor;
all points in the small cluster near D receive high scores. Color bar:
Outlier Score.]

Five Nearest Neighbors - Differing Density

[Figure: outlier score based on distance to the fifth nearest neighbor
for clusters of differing density. Color bar: Outlier Score.]
Strengths/Weaknesses of Distance-Based Approaches

z Simple

z Expensive – O(n²)

z Sensitive to parameters

z Sensitive to variations in density

z Distance becomes less meaningful in high-


dimensional space


Density-Based Approaches

z Density-based Outlier: The outlier score of an


object is the inverse of the density around the
object
– Can be defined in terms of the k nearest neighbors
– One definition: Inverse of distance to kth neighbor
– Another definition: Inverse of the average distance to k
neighbors
– DBSCAN definition

z If there are regions of different density, this


approach can have problems
Relative Density

z Consider the density of a point relative to that of


its k nearest neighbors
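A sketch of one such relative-density score, taking density as the inverse of the average distance to the k nearest neighbors and reporting (average density of the neighbors) / (density of the point), so a sparse point next to a dense region scores high (function names are illustrative):

```python
import math

def _neighbors(points, i, k):
    """(distance, index) pairs for the k nearest neighbors of point i."""
    d = sorted((math.dist(points[i], points[j]), j)
               for j in range(len(points)) if j != i)
    return d[:k]

def density(points, i, k):
    """Inverse of the average distance to the k nearest neighbors."""
    return k / sum(dist for dist, _ in _neighbors(points, i, k))

def relative_density_score(points, i, k):
    """Average neighbor density divided by the point's own density;
    values well above 1 indicate an outlier relative to its neighborhood."""
    neigh = [j for _, j in _neighbors(points, i, k)]
    return (sum(density(points, j, k) for j in neigh) / k) / density(points, i, k)

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
# With k = 2, the isolated point (8, 8) gets the highest relative-density score,
# while points inside the cluster score near 1.
```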


Relative Density Outlier Scores

[Figure: relative density outlier scores; C scores 6.85, D scores 1.40,
and A scores 1.33. Color bar: Outlier Score.]
Density-based: LOF approach

z For each point, compute the density of its local


neighborhood
z Compute local outlier factor (LOF) of a sample p as the
average of the ratios of the density of sample p and the
density of its nearest neighbors
z Outliers are points with largest LOF value

In the NN approach, p2 is
not considered an outlier,
while the LOF approach finds
both p1 and p2 as outliers

[Figure: example points p1 and p2.]


Strengths/Weaknesses of Density-Based Approaches

z Simple

z Expensive – O(n²)

z Sensitive to parameters

z Density becomes less meaningful in high-


dimensional space



Clustering-Based Approaches

z Clustering-based Outlier: An object
is a cluster-based outlier if it does
not strongly belong to any cluster
– For prototype-based clusters, an
object is an outlier if it is not close
enough to a cluster center
– For density-based clusters, an object
is an outlier if its density is too low
– For graph-based clusters, an object
is an outlier if it is not well connected
z Other issues include the impact of
outliers on the clusters and the
number of clusters
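For the prototype-based case, a minimal sketch that scores each point by its distance to the closest of a set of precomputed cluster centroids (the centroids are assumed given, e.g. from k-means; names are illustrative):

```python
import math

def centroid_outlier_scores(points, centroids):
    """Outlier score = distance from each point to its closest centroid."""
    return [min(math.dist(p, c) for c in centroids) for p in points]

cents = [(0, 0), (10, 10)]
pts = [(0.1, 0), (9.9, 10), (5, 5)]
# (5, 5) sits far from both centroids and gets the largest score.
```

A relative variant divides each distance by, e.g., the median distance of that centroid's own cluster members, which compensates for clusters of different spread.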


Distance of Points from Closest Centroids


[Figure: distance of each point from its closest centroid; C scores 4.6,
D scores 0.17, and A scores 1.2. Color bar: Outlier Score.]
Relative Distance of Points from Closest Centroid
[Figure: relative distance of each point from its closest centroid.
Color bar: Outlier Score.]

Strengths/Weaknesses of Clustering-Based Approaches

z Simple

z Many clustering techniques can be used

z Can be difficult to decide on a clustering


technique

z Can be difficult to decide on number of clusters

z Outliers can distort the clusters

