Mining with Rare Cases
Paper by Gary M. Weiss
Presenter: Indar Bhatia
INFS 795
April 28, 2005

Presentation Overview

1. Motivation and Introduction to the Problem
2. Why Rare Cases are Problematic
3. Techniques for Handling Rare Cases
4. Summary and Conclusion

Motivation and Introduction

What are rare cases?
- A case corresponds to a region in the instance space that is meaningful to the domain under study.
- A rare case is a case that covers a small region of the instance space.

Why are they important?
- Detecting suspicious cargo
- Finding sources of rare diseases
- Detecting fraud
- Finding terrorists
- Identifying rare diseases

Classification Problem
- A rare case covers relatively few training examples.
- Example: finding associations between infrequently purchased supermarket items.

Modeling Problem

[Figure: instance space showing one common case and three rare cases, labeled P1, P2, P3]
- Clustering: one common and three rare classes.
- Two-class classification: the positive class contains one common and two rare classes.

- For a classification problem, the rare cases may manifest themselves as small disjuncts, i.e., those disjuncts in the classifier that cover few training examples.
- In unsupervised learning, the three rare cases will be more difficult to generalize from because they contain fewer data points.
- In association rule mining, the problem will be to detect items that co-occur most infrequently.

Modeling Problem

- Current research indicates that rare cases and small disjuncts pose difficulties for data mining, i.e., rare cases have a much higher misclassification rate than common cases.
- Small disjuncts collectively cover a substantial fraction of all examples and cannot simply be eliminated; doing so would substantially degrade the performance of a classifier.
- In the most thorough study of small disjuncts (Weiss & Hirsh, 2000), it was shown that in classifiers induced from 30 real-world data sets, most classifier errors are contributed by the smaller disjuncts.

Why Rare Cases are Problematic

Problems arising from absolute rarity
- The most fundamental problem is lack of data: only a few examples related to rare cases are present in the data set (absolute rarity).
- Lack of data makes it difficult to detect rare cases and, even when they are detected, makes generalization difficult.

Problems arising from relative rarity
- Looking for a needle in a haystack: rare cases are obscured by common cases (relative rarity).
- Data mining algorithms rely on greedy search heuristics that examine one variable at a time. Since the detection of rare cases may depend on the conjunction of many conditions, any single condition in isolation may not provide much guidance.
- For example, consider association rule mining. To find rare associations, the minimum support threshold must be set very low, which causes a combinatorial explosion of candidate itemsets in large datasets.

Why Rare Cases are Problematic

The Metrics
- The metrics used to evaluate classifier accuracy are more focused on common cases. As a consequence, rare cases may be totally ignored.
- Example: consider decision trees. Most decision trees are grown in a top-down manner, where candidate test conditions are repeatedly evaluated and the best one is selected.
- The metrics (e.g., information gain) used to select the best test generally prefer tests that result in a balanced tree where purity is increased for most of the examples (a small worked example follows below).
- Rare cases, which correspond to high-purity branches covering few examples, will often not be included in the decision tree.
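To make this concrete, here is a small illustrative calculation (hypothetical counts, not taken from the paper). A test that perfectly isolates a tiny, pure rare-case branch contributes very little to the weighted information gain, so a broader, impurer split over many examples can still be preferred by the tree-growing metric.

```python
# Illustrative only: hypothetical counts, not data from the paper.
from math import log2

def entropy(pos, total):
    """Binary entropy of a node with `pos` positives out of `total` examples."""
    if pos in (0, total):
        return 0.0
    p = pos / total
    return -p * log2(p) - (1 - p) * log2(1 - p)

def info_gain(parent_pos, parent_total, branches):
    """branches: list of (pos, total) pairs produced by a candidate test."""
    before = entropy(parent_pos, parent_total)
    after = sum(t / parent_total * entropy(p, t) for p, t in branches)
    return before - after

# Node with 1000 examples, 20 of them in the rare (positive) class.
# Test A perfectly isolates a tiny rare case (3 positives, 100% pure branch).
gain_a = info_gain(20, 1000, [(3, 3), (17, 997)])
# Test B only modestly concentrates positives, but does so for many examples.
gain_b = info_gain(20, 1000, [(16, 200), (4, 800)])

print(f"gain of test isolating the rare case: {gain_a:.4f}")
print(f"gain of the broad, impure test:       {gain_b:.4f}")
# The broad test scores higher, so the pure rare-case branch never enters the tree.
```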

Why Rare Cases are Problematic

The Bias
- The bias of a data mining system is critical to its performance. Extra-evidentiary bias is what makes it possible to generalize from specific examples.
- Many data mining systems, especially those used to induce classifiers, employ a maximum-generality bias.
- This means that when a disjunct covering some set of training examples is formed, only the most general set of conditions that satisfies those examples is selected.
- The maximum-generality bias works well for common cases, but not for rare cases/small disjuncts.
- Attempts to address the problems of small disjuncts by selecting a more appropriate bias are considered later.

Why Rare Cases are Problematic

Noisy data
- A sufficiently high level of background noise may prevent the learner from distinguishing between noise and rare cases.
- Unfortunately, there is not much that can be done to minimize the impact of noise on rare cases.
- For example, pruning and overfitting-avoidance techniques, as well as inductive biases that foster generalization, can minimize the overall impact of noise, but because these methods tend to remove rare cases along with noise-generated ones, they do so at the expense of rare cases.

Techniques for Handling Rare Cases

1. Obtain Additional Training Data
2. Use a More Appropriate Inductive Bias
3. Use More Appropriate Metrics
4. Employ Non-Greedy Search Techniques
5. Employ Knowledge/Human Interaction
6. Employ Boosting
7. Place Rare Cases Into Separate Classes

1. Obtain Additional Training Data

- Simply obtaining additional training data will not help much, because most of the new data will also be associated with the common cases and only some of it with rare cases. This may help with problems of absolute rarity but not with relative rarity.
- Only by selectively obtaining additional data for the rare cases can one address the issue of relative rarity. Such a sampling scheme would also help with absolute rarity.
- Unfortunately, this selective sampling approach does not seem practical for most real-world data sets.

2. Use a More Appropriate Inductive Bias

- Rare cases tend to cause small disjuncts to be formed in a classifier induced from labeled data. This is partly due to the bias used by most learners.
- Simple strategies that eliminate all small disjuncts, or that use statistical significance testing to prevent small disjuncts from being formed, have been shown to perform poorly.
- More sophisticated approaches for adjusting the bias of a learner in order to minimize the problem with small disjuncts have been investigated.
- Holte et al. (1989) use a maximum-generality bias for large disjuncts and a maximum-specificity bias for small disjuncts. This was shown to improve the performance of small disjuncts but degrade the performance of large disjuncts, yielding poorer overall performance.

2. Use a More Appropriate Inductive Bias (cont.)

- The approach was later refined to ensure that the more specific bias used to induce the small disjuncts does not affect, and therefore cannot degrade, the performance of the large disjuncts.
- This was accomplished by using different learners for examples that fall into large disjuncts and examples that fall into small disjuncts (Ting, 1994).
- Although this hybrid approach was shown to improve the accuracy of small disjuncts, the results were not conclusive.
- Carvalho and Freitas (2002a, 2002b) essentially use the same approach, except that the set of training examples falling into each individual small disjunct is used to generate a separate classifier.
- In summary, several attempts have been made to perform better on rare cases by using a highly specific bias for the induced small disjuncts. These methods have shown mixed success.

3. Use More Appropriate Metrics

Altering the Relative Importance of Precision vs. Recall:

- Use evaluation metrics that, unlike accuracy, do not discount the importance of rare cases.
- Given a classification rule R that predicts target class C, the recall of R is the percentage of examples belonging to C that it correctly identifies, while the precision of R is the percentage of times the rule is correct (see the sketch below).
- Rare cases can be given more prominence by increasing the importance of precision relative to recall.
- Timeweaver (Weiss, 1999), a genetic-algorithm-based classification system, searches for rare cases by carefully altering the relative importance of precision vs. recall.
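One common way to encode the relative importance of precision vs. recall is the F-beta measure. The sketch below (an illustration, not Timeweaver itself; all counts are hypothetical) shows how varying beta changes which of two rules looks better: a narrow, precise rule covering a rare case or a broad rule with high recall.

```python
# A minimal sketch of trading off precision vs. recall with the F-beta
# measure; the confusion counts below are hypothetical, not from the paper.

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def f_beta(precision, recall, beta):
    """beta < 1 emphasizes precision, beta > 1 emphasizes recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Rule covering a rare case: few predictions, most of them correct.
p1, r1 = precision_recall(tp=8, fp=2, fn=12)
# Broad rule: finds more positives but raises many false alarms.
p2, r2 = precision_recall(tp=15, fp=45, fn=5)

for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: rare-case rule F={f_beta(p1, r1, beta):.3f}, "
          f"broad rule F={f_beta(p2, r2, beta):.3f}")
```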

3. Use More Appropriate Metrics (cont.)

Two-Phase Rule Induction:
- PNrule (Joshi, Agarwal & Kumar, 2001) uses two-phase rule induction to focus on each measure separately.
- The first phase focuses on recall. In the second phase, precision is optimized; this is accomplished by learning to identify false positives among the examples covered by the phase-1 rules.
- In the needle-in-the-haystack analogy, the first phase identifies regions likely to contain the needles, and the second phase learns to discard the strands of hay within those regions.

PN-rule Learning (a rough sketch follows below)

P-phase:
- Learn P-rules that cover positive examples with good support
- Seek good recall

N-phase:
- Learn N-rules that remove the false positives left by the P-phase rules
- Require high accuracy and significant support
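The following is a rough two-phase sketch of the PNrule idea, using scikit-learn decision trees as a stand-in for PNrule's own rule learner (an assumption for illustration; the data is synthetic). Phase 1 over-weights the rare class to favor recall; phase 2 learns to recognize false positives inside the region covered by phase 1.

```python
# A rough two-phase sketch of the PNrule idea, with decision trees standing
# in for PNrule's rule learner (assumption for illustration only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Phase 1 (P-phase): favor recall by over-weighting the rare class, so the
# learned model covers as many true positives as possible.
p_model = DecisionTreeClassifier(max_depth=3, class_weight={0: 1, 1: 20},
                                 random_state=0).fit(X, y)
covered = p_model.predict(X) == 1          # region "likely to contain needles"

# Phase 2 (N-phase): within the covered region, learn to spot false positives
# (examples predicted positive in phase 1 that are actually negative).
X_cov, y_cov = X[covered], y[covered]
n_model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(
    X_cov, (y_cov == 0).astype(int))       # target: "is this a false positive?"

def predict(X_new):
    pred = p_model.predict(X_new)
    fp_flag = n_model.predict(X_new) == 1
    pred[(pred == 1) & fp_flag] = 0        # discard the hay inside the region
    return pred

print("positives kept after both phases:", predict(X).sum())
```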

4. Employ Non-Greedy Search Techniques

- Greedy algorithms make locally optimal choices so that the search remains tractable, but this means they are not globally optimal and can become stuck in local minima.
- Greedy algorithms are not well suited to rare cases, because rare cases may depend on the conjunction of many conditions, and any single condition in isolation may not provide much guidance.
- Mining algorithms for handling rare cases should therefore use more powerful global search methods.
- Recommended solution: genetic algorithms, which operate on a population of candidate solutions rather than a single solution. For this reason GAs are more appropriate for rare cases (Goldberg, 1989; Freitas, 2002; Weiss, 1999; Carvalho and Freitas, 2002). A toy sketch follows below.
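The toy sketch below evolves a population of conjunctive rules (each condition compares one feature against its median) and scores candidates by the F1 of the rule on the rare class. It only illustrates the population/selection/crossover/mutation loop; real systems such as Timeweaver use much richer rule encodings, operators, and fitness functions.

```python
# A toy genetic algorithm over conjunctive rules, sketched for illustration;
# encodings, operators, and fitness are simplistic assumptions, not the
# methods of the cited systems.
import random
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95, 0.05], random_state=1)
thresholds = np.median(X, axis=0)          # one candidate condition per feature

def covers(rule):
    """rule: list over features with values in {-1, 0, +1};
    -1 means feature < median, +1 means feature >= median, 0 means unused."""
    mask = np.ones(len(X), dtype=bool)
    for j, r in enumerate(rule):
        if r == 1:
            mask &= X[:, j] >= thresholds[j]
        elif r == -1:
            mask &= X[:, j] < thresholds[j]
    return mask

def fitness(rule):
    """F1 of the rule when it predicts the rare class on covered examples."""
    mask = covers(rule)
    tp = int((y[mask] == 1).sum())
    if tp == 0:
        return 0.0
    precision = tp / mask.sum()
    recall = tp / (y == 1).sum()
    return 2 * precision * recall / (precision + recall)

rng = random.Random(0)
pop = [[rng.choice((-1, 0, 0, 1)) for _ in range(X.shape[1])] for _ in range(40)]

for generation in range(30):
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:10]                   # selection: keep the best rules
    children = []
    while len(children) < 30:
        a, b = rng.sample(survivors, 2)
        cut = rng.randrange(1, len(a))     # one-point crossover
        child = a[:cut] + b[cut:]
        j = rng.randrange(len(child))      # point mutation
        child[j] = rng.choice((-1, 0, 1))
        children.append(child)
    pop = survivors + children

best = max(pop, key=fitness)
print("best rule:", best, "F1 =", round(fitness(best), 3))
```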

5. Employ Knowledge/Human Interaction

- The interaction and domain knowledge of human experts can be used to mine rare cases more effectively.
- Examples:
  - SAR detection
  - Rare disease detection
  - Etc.

6. Employ Boosting

- Boosting algorithms, such as AdaBoost, are iterative algorithms that place different weights on the training distribution at each iteration (the basic weight update is sketched below).
- Following each iteration, boosting increases the weights associated with the incorrectly classified examples and decreases the weights associated with the correctly classified examples.
- This forces the learner to focus more on the incorrectly classified examples in the next iteration.
- RareBoost (Joshi, Kumar & Agarwal, 2001) applies a modified weight-update mechanism to improve the performance on rare classes and rare cases.
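For reference, the core weight-update step of standard AdaBoost looks roughly like the sketch below; RareBoost's modification (which treats false positives and false negatives differently) is not reproduced here.

```python
# The standard AdaBoost weight update for one boosting round, as a minimal
# sketch; RareBoost modifies this update, which is not shown here.
import numpy as np

def adaboost_round(weights, y_true, y_pred):
    """y_true, y_pred in {-1, +1}; weights is the current training distribution."""
    miss = y_true != y_pred
    err = weights[miss].sum()                      # weighted error of this learner
    err = min(max(err, 1e-10), 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)          # learner's vote in the ensemble
    # Increase weights of misclassified examples, decrease the rest.
    new_w = weights * np.exp(-alpha * y_true * y_pred)
    return new_w / new_w.sum(), alpha              # renormalize to a distribution

# Tiny demonstration: the examples at indices 2 and 4 are misclassified
# and therefore gain weight for the next iteration.
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, 1, 1, -1, -1])
w = np.full(5, 0.2)
w, alpha = adaboost_round(w, y_true, y_pred)
print("alpha =", round(alpha, 3), "new weights =", np.round(w, 3))
```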

7. Place Rare Cases Into Separate Classes

- Rare cases complicate classification because different rare cases may have little in common with one another, making it difficult to assign the same class label to all of them.
- Solution: reformulate the problem so that rare cases are viewed as separate classes.
- Approach (a small sketch follows below):
  1. Separate each class into subclasses using clustering.
  2. Re-label the training examples with the new class labels and learn from them.
  3. Because multiple clustering experiments are used in step 1, step 2 involves learning multiple models.
  4. Combine these models using voting.
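The sketch below follows the four steps with k-means and decision trees as stand-ins for whatever clusterer and learner one prefers (these choices, and the synthetic data, are assumptions for illustration only).

```python
# A minimal sketch of the "rare cases as separate classes" reformulation:
# cluster within each class, relabel, learn several models, and vote.
# k-means and decision trees are stand-ins chosen for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, weights=[0.9, 0.1], random_state=0)

models = []
for k in (2, 3, 4):                        # several clustering "experiments"
    sub_labels = np.empty(len(y), dtype=object)
    for cls in np.unique(y):
        idx = np.where(y == cls)[0]
        clusters = KMeans(n_clusters=k, n_init=10,
                          random_state=0).fit_predict(X[idx])
        for i, c in zip(idx, clusters):    # steps 1+2: relabel with subclass ids
            sub_labels[i] = f"{cls}_{c}"
    clf = DecisionTreeClassifier(random_state=0).fit(X, sub_labels.tolist())
    models.append(clf)                     # step 3: one model per experiment

def predict(x_row):
    # Step 4: map subclass predictions back to the original classes and vote.
    votes = [int(m.predict([x_row])[0].split("_")[0]) for m in models]
    return max(set(votes), key=votes.count)

print("prediction for first example:", predict(X[0]), "true label:", y[0])
```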

Boosting-Based Algorithms

RareBoost
- Updates the example weights differently from standard boosting

SMOTEBoost
- Combination of SMOTE (Synthetic Minority Oversampling Technique) and boosting; a sketch of the SMOTE step follows below
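The core of SMOTE is an interpolation step: each synthetic minority example is placed on the line segment between a minority point and one of its nearest minority neighbors. A minimal sketch (with a toy minority sample) is shown below; SMOTEBoost combines this oversampling with the boosting loop sketched earlier.

```python
# The core interpolation step behind SMOTE, sketched for illustration.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority examples by interpolating neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neigh = nn.kneighbors(X_minority)          # column 0 is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        j = neigh[i][rng.integers(1, k + 1)]      # one of its k nearest neighbors
        gap = rng.random()                        # random point on the segment
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

rare = np.random.default_rng(1).normal(size=(20, 3))   # toy minority class
print(smote(rare, n_new=5).shape)                      # -> (5, 3)
```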

CREDOS

- First uses ripple-down rules to overfit the training data
- Then prunes to improve generalization
- Ripple-down rules are a different mechanism from decision trees

Cost Sensitive Modeling

- Detection rate / false alarm rate alone may be misleading
- Cost factors: damage cost, response cost, operational cost
- Assign costs to TP, FP, TN, FN outcomes
- Define a cumulative cost (see the sketch below)
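A minimal sketch of a cumulative cost score is below. The particular cost values and the way each cost factor is charged (damage for misses, response for every alert, operational overhead per event) are illustrative assumptions, not the slide's exact model.

```python
# A minimal sketch of scoring a detector by cumulative cost rather than
# detection/false-alarm rate alone; all cost values are hypothetical.
def cumulative_cost(tp, fp, tn, fn,
                    damage_cost=100.0,      # cost of a missed intrusion (FN)
                    response_cost=10.0,     # cost of responding to an alert (TP or FP)
                    operational_cost=0.1):  # per-event monitoring overhead
    cost = fn * damage_cost                 # undetected attacks do full damage
    cost += (tp + fp) * response_cost       # every alert triggers a response
    cost += (tp + fp + tn + fn) * operational_cost
    return cost

# Compare two hypothetical detectors over the same 1000 events.
print(cumulative_cost(tp=40, fp=60, tn=880, fn=20))
print(cumulative_cost(tp=55, fp=200, tn=740, fn=5))
```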

Outlier Detection Schemes

- Detect intrusions (data points) that are very different from the normal activities (the rest of the data points)

General steps:
- Identify normal behavior
- Construct a useful set of features
- Define a similarity function
- Use an outlier detection algorithm:
  - Statistics based
  - Distance based
  - Model based

Distance Based Outlier Detection

- Represent data as a vector of features
- Major approaches:
  - Nearest neighbor based
  - Density based
  - Clustering based
- Problem: high dimensionality of the data

Distance Based - Nearest Neighbor

- Compute the distance d to the k-th nearest neighbor (see the sketch below)
- Outlier points:
  - Are located in sparser neighborhoods (too few nearby neighbors)
  - Have d larger than a certain threshold

Mahalanobis-distance based approach
- More appropriate for computing distances when the distribution is skewed

[Figure: 2-D scatter plot with labeled points p1 and p2 illustrating outliers in sparse neighborhoods]
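Below is a minimal sketch of both ideas on synthetic 2-D data: scoring each point by its distance to the k-th nearest neighbor, and a Mahalanobis-distance variant that accounts for feature scale and correlation. The data, k, and the percentile threshold are illustrative choices.

```python
# A minimal sketch of distance-to-k-th-nearest-neighbor outlier scoring,
# plus a Mahalanobis variant; data and thresholds are illustrative only.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(300, 2)),       # dense "normal" region
               rng.normal(6, 1, size=(5, 2))])        # a few far-away points

# Score = distance to the k-th nearest neighbor; sparse neighborhoods score high.
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)
knn_score = dist[:, k]                                 # column 0 is the point itself
threshold = np.percentile(knn_score, 98)
print("k-NN distance outliers:", np.where(knn_score > threshold)[0])

# Mahalanobis distance: accounts for the scale and correlation of the features.
mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
maha = np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))
print("Mahalanobis outliers:", np.where(maha > np.percentile(maha, 98))[0])
```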

Distance Based - Density

Local Outlier Factor (LOF)
- The average ratio of the density of example p's nearest neighbors to the density of p itself
- Compute the density of the local neighborhood for each point
- Compute the LOF (see the sketch below)
- Larger LOF → more likely to be an outlier

[Figure: scatter plot with labeled points p1 and p2 illustrating local density differences]
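A minimal sketch using scikit-learn's LOF implementation on synthetic data is shown below; the cluster layout and n_neighbors value are illustrative choices.

```python
# A minimal sketch of density-based outlier detection with the Local Outlier
# Factor, using scikit-learn's implementation; data is synthetic.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(200, 2)),     # dense cluster
               rng.normal(5, 2.0, size=(50, 2)),      # sparser cluster
               [[10.0, 10.0]]])                        # an isolated point

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                            # -1 marks outliers
scores = -lof.negative_outlier_factor_                 # larger LOF -> more outlying
print("flagged as outliers:", np.where(labels == -1)[0])
print("largest LOF score:", scores.max().round(2))
```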

Distance Based - Clustering

- A radius w of proximity is specified
- Two points x1 and x2 are near if d(x1, x2) < w
- Define N(x) as the number of points that are within w of x
- Points in small clusters → outliers
- Fixed-width clustering is used for speedup (see the sketch below)
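The sketch below shows one simple fixed-width clustering pass: each point joins the first existing center within radius w or starts a new cluster, and points that end up in clusters smaller than a minimum size are flagged as outliers. The radius, minimum size, and single-pass assignment are illustrative simplifications.

```python
# A minimal sketch of fixed-width clustering for outlier detection; the
# radius, minimum cluster size, and single-pass assignment are assumptions.
import numpy as np

def fixed_width_clusters(X, w):
    centers, members = [], []
    for x in X:
        for c, idx in zip(centers, members):
            if np.linalg.norm(x - c) < w:          # "near" means distance < w
                idx.append(x)
                break
        else:
            centers.append(x)                      # start a new cluster
            members.append([x])
    return centers, members

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(200, 2)), [[8.0, 8.0]], [[-7.0, 6.0]]])
centers, members = fixed_width_clusters(X, w=2.0)

min_size = 3
outlier_clusters = [c for c, m in zip(centers, members) if len(m) < min_size]
print(f"{len(centers)} clusters, {len(outlier_clusters)} flagged as outlier clusters")
```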

Distance Based - Clustering (cont.)

K-Nearest Neighbor + Canopy Clustering
- Compute the sum of the distances to the k nearest neighbors
- A small k-NN sum → the point is in a dense region
- Canopy clustering is used for speedup

WaveCluster
- Transform the data into multidimensional signals using the wavelet transform
- Remove the high/low frequency parts
- Remaining parts → outliers

Model Based Outlier Detection

- Similar to probabilistic schemes
- Build a prediction model for normal behavior
- Deviation from the model → potential intrusion

Major approaches:
- Neural networks
- Unsupervised Support Vector Machines (SVMs)

Model Based - Neural Networks

- Use a replicator 4-layer feed-forward neural network (a rough stand-in is sketched below)
- The input variables are also the target outputs during training
- The replicator network (RNN) forms a compressed model of the training data
- Outlyingness is measured by reconstruction error
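The sketch below is a rough stand-in for the replicator idea: a small autoencoder-style MLP from scikit-learn is trained to reproduce its own inputs, and reconstruction error is used as the outlier score. The specific 4-layer replicator architecture from the slide is not reproduced; the narrow hidden layer and the synthetic data are illustrative assumptions.

```python
# A rough stand-in for a replicator neural network: an MLP trained to
# reproduce its own inputs; reconstruction error serves as the outlier score.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(500, 4))              # "normal" training data
test = np.vstack([rng.normal(0, 1, size=(5, 4)),      # normal test points
                  rng.normal(8, 1, size=(3, 4))])     # anomalous test points

scaler = StandardScaler().fit(normal)
Xn = scaler.transform(normal)

# A narrow hidden layer forces a compressed model of the normal behavior.
rnn = MLPRegressor(hidden_layer_sizes=(2,), max_iter=3000,
                   random_state=0).fit(Xn, Xn)

recon = rnn.predict(scaler.transform(test))
errors = np.mean((scaler.transform(test) - recon) ** 2, axis=1)
print("reconstruction errors:", errors.round(2))      # anomalies typically score higher
```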

Model Based - SVMs

- Attempt to separate the entire set of training data from the origin (see the sketch below)
- Regions where most of the data lies are labeled as one class

Parameters:
- Expected outlier rate
  - Works well for high-quality, controlled training data
- Variance of the Radial Basis Function (RBF)
  - Larger variance → higher detection rate but more false alarms
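A minimal sketch using scikit-learn's one-class SVM is shown below; nu plays the role of the expected outlier rate and gamma is related (inversely) to the RBF variance. The data and parameter values are illustrative only.

```python
# A minimal sketch of unsupervised anomaly detection with a one-class SVM
# (RBF kernel); nu ~ expected outlier rate, gamma is inversely related to
# the RBF variance. Data is synthetic, for illustration only.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
train = rng.normal(0, 1, size=(500, 2))              # mostly "normal" traffic
test = np.vstack([rng.normal(0, 1, size=(5, 2)),
                  [[6.0, 6.0], [-7.0, 5.0]]])        # two obvious anomalies

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.1).fit(train)
print(ocsvm.predict(test))                           # +1 = normal, -1 = outlier
```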

Summary and Conclusion

- Rare classes, which result from highly skewed class distributions, share many of the problems associated with rare cases. Rare classes and rare cases are connected.
- Rare cases can occur within both rare classes and common classes, but they are expected to be more of an issue for rare classes.
- Japkowicz (2001) views rare classes as a consequence of between-class imbalance and rare cases as a consequence of within-class imbalance.
- Thus, both forms of rarity are a type of data imbalance.
- The modeling improvements presented in this paper are applicable to both types of rarity.
