
Data Mining

Anomaly Detection

Lecture Notes for Chapter 10

Introduction to Data Mining


by
Tan, Steinbach, Kumar

© Tan,Steinbach, Kumar Introduction to Data Mining 05/02/2006 1

Anomaly/Outlier Detection

z What are anomalies/outliers?


– The set of data points that are
considerably different from the
remainder of the data

z Natural implication is that anomalies are


relatively rare
– One in a thousand occurs often if you have lots of data
– Context is important, e.g., freezing temps in July

z Can be important or a nuisance


– 10 foot tall 2 year old
– Unusually high blood pressure
11/29/2007 Introduction to Data Mining 2
Importance of Anomaly Detection

Ozone Depletion History


z In 1985 three researchers (Farman,
Gardiner and Shanklin) were
puzzled by data gathered by the
British Antarctic Survey showing that
ozone levels for Antarctica had
dropped 10% below normal levels

z Why did the Nimbus 7 satellite,


which had instruments aboard for
recording ozone levels, not record
similarly low ozone concentrations?

z The ozone concentrations recorded
by the satellite were so low they
were being treated as outliers by a
computer program and discarded!

Sources:
http://exploringdata.cqu.edu.au/ozone.html
http://www.epa.gov/ozone/science/hole/size.html


Causes of Anomalies

z Data from different classes


– Measuring the weights of oranges, but a few grapefruit
are mixed in

z Natural variation
– Unusually tall people

z Data errors
– 200 pound 2 year old



Distinction Between Noise and Anomalies

z Noise is erroneous, perhaps random, values or


contaminating objects
– Weight recorded incorrectly
– Grapefruit mixed in with the oranges

z Noise doesn’t necessarily produce unusual values or


objects
z Noise is not interesting
z Anomalies may be interesting if they are not a result of
noise
z Noise and anomalies are related but distinct concepts


General Issues: Number of Attributes

z Many anomalies are defined in terms of a single attribute


– Height
– Shape
– Color

z Can be hard to find an anomaly using all attributes


– Noisy or irrelevant attributes
– Object is only anomalous with respect to some attributes

z However, an object may not be anomalous in any one


attribute



General Issues: Anomaly Scoring

z Many anomaly detection techniques provide only a binary


categorization
– An object is an anomaly or it isn’t
– This is especially true of classification-based approaches

z Other approaches assign a score to all points


– This score measures the degree to which an object is an anomaly
– This allows objects to be ranked

z In the end, you often need a binary decision


– Should this credit card transaction be flagged?
– Still useful to have a score
z How many anomalies are there?

Other Issues for Anomaly Detection

z Find all anomalies at once or one at a time


– Swamping
– Masking

z Evaluation
– How do you measure performance?
– Supervised vs. unsupervised situations

z Efficiency

z Context
– Professional basketball team



Variants of Anomaly Detection Problems

z Given a data set D, find all data points x ∈ D with


anomaly scores greater than some threshold t

z Given a data set D, find all data points x ∈ D


having the top-n largest anomaly scores

z Given a data set D, containing mostly normal (but


unlabeled) data points, and a test point x,
compute the anomaly score of x with respect to D
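Once per-point anomaly scores are available, the first two variants reduce to simple selections; a minimal sketch (function names are illustrative, not from the text):

```python
def above_threshold(scores, t):
    """Variant 1: indices of all points with anomaly score greater than t."""
    return [i for i, s in enumerate(scores) if s > t]

def top_n(scores, n):
    """Variant 2: indices of the n points with the largest anomaly scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]

# Example with scores for four points:
print(above_threshold([0.1, 5.0, 0.2, 3.0], 1.0))  # → [1, 3]
print(top_n([0.1, 5.0, 0.2, 3.0], 2))              # → [1, 3]
```

The third variant additionally needs a scoring function for a test point with respect to D; the statistical, distance-based, and density-based techniques on the following slides supply such functions.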


Model-Based Anomaly Detection

z Build a model for the data and see how well each point fits


– Unsupervised
‹ Anomalies are those points that don’t fit well
‹ Anomalies are those points that distort the model
‹ Examples:
– Statistical distribution
– Clusters
– Regression
– Geometric
– Graph
– Supervised
‹ Anomalies are regarded as a rare class
‹ Need to have training data



Additional Anomaly Detection Techniques

z Proximity-based
– Anomalies are points far away from other points
– Can detect this graphically in some cases
z Density-based
– Low density points are outliers
z Pattern matching
– Create profiles or templates of atypical but important
events or objects
– Algorithms to detect these patterns are usually simple
and efficient


Graphical Approaches

z Boxplots or scatter plots

z Limitations
– Not automatic
– Subjective



Convex Hull Method

z Extreme points are assumed to be outliers


z Use convex hull method to detect extreme values

z What if the outlier occurs in the middle of the


data?

Statistical Approaches
Probabilistic definition of an outlier: An outlier is an object that
has a low probability with respect to a probability distribution
model of the data.
z Usually assume a parametric model describing the distribution
of the data (e.g., normal distribution)
z Apply a statistical test that depends on
– Data distribution
– Parameters of distribution (e.g., mean, variance)
– Number of expected outliers (confidence limit)
z Issues
– Identifying the distribution of a data set
‹ Heavy tailed distribution
– Number of attributes
– Is the data a mixture of distributions?
Normal Distributions

[Figure: probability density of a one-dimensional Gaussian (left) and of a
two-dimensional Gaussian (right).]


Grubbs’ Test

z Detect outliers in univariate data


z Assume data comes from normal distribution
z Detect one outlier at a time, remove it,
and repeat
– H0: There is no outlier in data
– HA: There is at least one outlier
z Grubbs’ test statistic:

      G = max_i |X_i − X̄| / s

z Reject H0 if:

      G > ((N − 1)/√N) √( t²_(α/N, N−2) / (N − 2 + t²_(α/N, N−2)) )
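As a pure-Python sketch of the statistic (the critical value needs the t-distribution, e.g. scipy.stats.t.ppf, so only the statistic and the flagged point are computed here):

```python
from statistics import mean, stdev

def grubbs_statistic(data):
    """Return Grubbs' statistic G = max|X_i - Xbar| / s and the index
    of the most extreme point (the outlier candidate)."""
    m, s = mean(data), stdev(data)           # sample mean and sample std dev
    deviations = [abs(x - m) for x in data]
    g = max(deviations) / s
    return g, deviations.index(max(deviations))

g, idx = grubbs_statistic([9.8, 10.1, 10.0, 9.9, 10.2, 15.0])
# The value 15.0 (index 5) is the candidate outlier; compare g against
# the critical value from the formula above, remove the point if H0 is
# rejected, and repeat on the remaining data.
```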
Statistical-based – Likelihood Approach

z Assume the data set D contains samples from a


mixture of two probability distributions:
– M (majority distribution)
– A (anomalous distribution)
z General Approach:
– Initially, assume all the data points belong to M
– Let Lt(D) be the log likelihood of D at time t
– For each point xt that belongs to M, move it to A
‹ Let Lt+1 (D) be the new log likelihood.
‹ Compute the difference, Δ = Lt(D) – Lt+1 (D)
‹ If Δ > c (some threshold), then xt is declared as an anomaly
and moved permanently from M to A

Statistical-based – Likelihood Approach

z Data distribution, D = (1 – λ) M + λ A
z M is a probability distribution estimated from data
– Can be based on any modeling method (naïve Bayes,
maximum entropy, etc)
z A is initially assumed to be uniform distribution
z Likelihood at time t:

Lt(D) = ∏_{i=1}^{N} P_D(xi)
      = ( (1 − λ)^|Mt| ∏_{xi∈Mt} P_Mt(xi) ) ( λ^|At| ∏_{xi∈At} P_At(xi) )

LLt(D) = |Mt| log(1 − λ) + Σ_{xi∈Mt} log P_Mt(xi) + |At| log λ + Σ_{xi∈At} log P_At(xi)
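A one-dimensional sketch of this procedure, assuming M is a Gaussian re-estimated from the points currently in M and A is uniform over the data's range; here a point is moved to A when doing so raises the total log-likelihood by more than c (the sign convention for Δ varies across presentations, and all names are illustrative):

```python
import math

def loglik(M, A, lam, lo, hi):
    """|M| log(1-lam) + sum log P_M  +  |A| log lam + sum log P_A."""
    mu = sum(M) / len(M)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in M) / len(M)) or 1e-9
    ll = len(M) * math.log(1 - lam)
    ll += sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
              - (x - mu) ** 2 / (2 * sigma ** 2) for x in M)   # Gaussian M
    ll += len(A) * (math.log(lam) + math.log(1.0 / (hi - lo)))  # uniform A
    return ll

def likelihood_anomalies(data, lam=0.05, c=2.0):
    lo, hi = min(data), max(data)
    M, A = list(data), []
    moved = True
    while moved:
        moved = False
        base = loglik(M, A, lam, lo, hi)
        for i, x in enumerate(M):
            trial_M, trial_A = M[:i] + M[i + 1:], A + [x]
            if loglik(trial_M, trial_A, lam, lo, hi) - base > c:
                M, A = trial_M, trial_A   # keep the move and rescan
                moved = True
                break
    return A
```

On a small cluster with one far-away value, e.g. `likelihood_anomalies([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])`, moving 100 to A tightens the Gaussian fit of M enough to raise the likelihood past the threshold, so only 100 is returned.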



Strengths/Weaknesses of Statistical Approaches

z Firm mathematical foundation

z Can be very efficient

z Good results if distribution is known

z In many cases, data distribution may not be known

z For high-dimensional data, it may be difficult to estimate
the true distribution

z Anomalies can distort the parameters of the distribution


Distance-Based Approaches

z Several different techniques

z An object is an outlier if a specified fraction of the


objects is more than a specified distance away
(Knorr, Ng 1998)
– Some statistical definitions are special cases of this

z The outlier score of an object is the distance to
its kth nearest neighbor
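A direct sketch of this kth-nearest-neighbor score, using the naive all-pairs computation (function name is illustrative):

```python
import math

def knn_distance_scores(points, k):
    """Outlier score of each point = distance to its kth nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Sorted distances from p to every other point; O(n^2) overall.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
# With k = 2, the isolated point (10, 10) receives by far the largest score.
```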



One Nearest Neighbor - One Outlier

[Figure: outlier score based on distance to the first nearest neighbor;
the isolated point D receives the highest score. Color bar: Outlier Score.]

One Nearest Neighbor - Two Outliers


[Figure: outlier score based on distance to the first nearest neighbor;
the two outliers near D are each other's nearest neighbor, so both
receive low scores. Color bar: Outlier Score.]
Five Nearest Neighbors - Small Cluster
[Figure: outlier score based on distance to the fifth nearest neighbor;
all points in the small cluster near D receive high scores. Color bar:
Outlier Score.]

Five Nearest Neighbors - Differing Density

[Figure: outlier score based on distance to the fifth nearest neighbor
for clusters of differing density. Color bar: Outlier Score.]
Strengths/Weaknesses of Distance-Based Approaches

z Simple

z Expensive – O(n²)

z Sensitive to parameters

z Sensitive to variations in density

z Distance becomes less meaningful in high-


dimensional space


Density-Based Approaches

z Density-based Outlier: The outlier score of an


object is the inverse of the density around the
object
– Can be defined in terms of the k nearest neighbors
– One definition: Inverse of distance to kth neighbor
– Another definition: Inverse of the average distance to k
neighbors
– DBSCAN definition

z If there are regions of different density, this


approach can have problems
Relative Density

z Consider the density of a point relative to that of


its k nearest neighbors
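A sketch of one such relative-density score, taking density as the inverse of the average distance to the k nearest neighbors and reporting (average density of the neighbors) / (density of the point), so a sparse point next to a dense region scores high (function names are illustrative):

```python
import math

def _neighbors(points, i, k):
    """(distance, index) pairs for the k nearest neighbors of point i."""
    d = sorted((math.dist(points[i], points[j]), j)
               for j in range(len(points)) if j != i)
    return d[:k]

def density(points, i, k):
    """Inverse of the average distance to the k nearest neighbors."""
    return k / sum(dist for dist, _ in _neighbors(points, i, k))

def relative_density_score(points, i, k):
    """Average neighbor density divided by the point's own density;
    values well above 1 indicate an outlier relative to its neighborhood."""
    neigh = [j for _, j in _neighbors(points, i, k)]
    return (sum(density(points, j, k) for j in neigh) / k) / density(points, i, k)

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
# With k = 2, the isolated point (8, 8) gets the highest relative-density score,
# while points inside the cluster score near 1.
```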


Relative Density Outlier Scores

[Figure: relative density outlier scores; C scores 6.85, D scores 1.40,
and A scores 1.33. Color bar: Outlier Score.]
Density-based: LOF approach

z For each point, compute the density of its local


neighborhood
z Compute local outlier factor (LOF) of a sample p as the
average of the ratios of the density of sample p and the
density of its nearest neighbors
z Outliers are points with largest LOF value

In the NN approach, p2 is
not considered an outlier,
while the LOF approach finds
both p1 and p2 as outliers

[Figure: example points p1 and p2.]


Strengths/Weaknesses of Density-Based Approaches

z Simple

z Expensive – O(n²)

z Sensitive to parameters

z Density becomes less meaningful in high-


dimensional space



Clustering-Based Approaches

z Clustering-based Outlier: An object
is a cluster-based outlier if it does
not strongly belong to any cluster
– For prototype-based clusters, an
object is an outlier if it is not close
enough to a cluster center
– For density-based clusters, an object
is an outlier if its density is too low
– For graph-based clusters, an object
is an outlier if it is not well connected
z Other issues include the impact of
outliers on the clusters and the
number of clusters
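For the prototype-based case, a minimal sketch that scores each point by its distance to the closest of a set of precomputed cluster centroids (the centroids are assumed given, e.g. from k-means; names are illustrative):

```python
import math

def centroid_outlier_scores(points, centroids):
    """Outlier score = distance from each point to its closest centroid."""
    return [min(math.dist(p, c) for c in centroids) for p in points]

cents = [(0, 0), (10, 10)]
pts = [(0.1, 0), (9.9, 10), (5, 5)]
# (5, 5) sits far from both centroids and gets the largest score.
```

A relative variant divides each distance by, e.g., the median distance of that centroid's own cluster members, which compensates for clusters of different spread.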


Distance of Points from Closest Centroids


[Figure: distance of each point from its closest centroid; C scores 4.6,
D scores 0.17, and A scores 1.2. Color bar: Outlier Score.]
Relative Distance of Points from Closest Centroid
[Figure: relative distance of each point from its closest centroid.
Color bar: Outlier Score.]

Strengths/Weaknesses of Clustering-Based Approaches

z Simple

z Many clustering techniques can be used

z Can be difficult to decide on a clustering


technique

z Can be difficult to decide on number of clusters

z Outliers can distort the clusters

