Professional Documents
Culture Documents
Anomaly Detection
Anomaly/Outlier Detection
Causes of Anomalies
z Natural variation
– Unusually tall people
z Data errors
– 200 pound 2 year old
z Evaluation
– How do you measure performance?
– Supervised vs. unsupervised situations
z Efficiency
z Context
– Professional basketball team
z Proximity-based
– Anomalies are points far away from other points
– Can
C d detect
t t thi
this graphically
hi ll iin some cases
z Density-based
– Low density points are outliers
z Pattern matching
– Create profiles or templates of atypical but important
events
t or objects
bj t
– Algorithms to detect these patterns are usually simple
and efficient
Graphical Approaches
z Limitations
– Not automatic
– Subjective
Statistical Approaches
Probabilistic definition of an outlier: An outlier is an object that
has a low probability with respect to a probability distribution
model of the data.
z Usually assume a parametric model describing the distribution
of the data (e.g., normal distribution)
z Apply a statistical test that depends on
– Data distribution
– Parameters of distribution (e.g., mean, variance)
– Number of expected outliers (confidence limit)
z I
Issues
– Identifying the distribution of a data set
Heavy tailed distribution
– Number of attributes
– Is the data a mixture of distributions?
11/29/2007 Introduction to Data Mining 14
Normal Distributions
One-dimensional
Gaussian
7
0.1
6
0.09
5
0.08
4
0.07
3
0.06
Two-dimensional
2
0.05 Gaussian
y
0 0.04
-1 0.03
-2 0.02
-3 0.01
-4
probability
-5 density
-4 -3 -2 -1 0 1 2 3 4 5
x
Grubbs’ Test
z Data distribution, D = (1 – λ) M + λ A
z M is a probability distribution estimated from data
– Can be based on any modeling method (naïve Bayes,
maximum entropy, etc)
z A is initially assumed to be uniform distribution
z Likelihood at time t:
N ⎛ ⎞⎛ ⎞
Lt ( D ) = ∏ PD ( xi ) = ⎜⎜ (1 − λ )|M t | ∏ PM t ( xi ) ⎟⎟⎜⎜ λ|At | ∏ PAt ( xi ) ⎟⎟
i =1 ⎝ xi ∈M t ⎠⎝ xi ∈At ⎠
LLt ( D ) = M t log(1 − λ ) + ∑ log PM t ( xi ) + At log λ + ∑ log PAt ( xi )
xi ∈M t xi ∈At
Distance-Based Approaches
z The outlier
Th tli score off an object
bj t is
i the
th distance
di t tto
its kth nearest neighbor
D 2
1.8
1.6
1.4
1.2
0.8
0.6
0.4
Outlier Score
11/29/2007 Introduction to Data Mining 21
D
0.5
0.45
0.4
0.35
0.3
0.25
02
0.2
0.15
0.1
0.05
Outlier Score
11/29/2007 Introduction to Data Mining 22
Five Nearest Neighbors - Small Cluster
2
D
1.8
16
1.6
1.4
1.2
0.8
0.6
0.4
Outlier Score
11/29/2007 Introduction to Data Mining 23
D 1.8
1.6
1.4
1.2
0.8
0.6
0.4
0.2
Outlier Score
11/29/2007 Introduction to Data Mining 24
Strengths/Weaknesses of Distance-Based Approaches
z Simple
z Expensive – O(n2)
z Sensitive to parameters
Density-Based Approaches
6.85
6
C
1.40 4
D
1.33
2
A
Outlier Score
11/29/2007 Introduction to Data Mining 28
Density-based: LOF approach
In the NN approach, p2 is
nott considered
id d as outlier,
tli
while LOF approach find
both p1 and p2 as outliers
p2
× p1
×
z Simple
z Expensive – O(n2)
z Sensitive to parameters
4
C
35
3.5
2.5
D 0.17
2
1.5
1.2 1
A
0.5
Outlier Score
11/29/2007 Introduction to Data Mining 32
Relative Distance of Points from Closest Centroid
4
3.5
2.5
1.5
0.5
Outlier Score
11/29/2007 Introduction to Data Mining 33
z Simple