Mining with Rare Cases
Paper by Gary M. Weiss
Presenter: Indar Bhatia
INFS 795
April 28, 2005

Presentation Overview

1. Motivation and Introduction to the Problem
2. Why Rare Cases are Problematic
3. Techniques for Handling Rare Cases
4. Summary and Conclusion

Motivation and Introduction

What are rare cases?
- A case corresponds to a region in the instance space that is meaningful to the domain under study.
- A rare case is a case that covers a small region of the instance space.

Why are they important?
- Detecting suspicious cargo
- Finding sources of rare diseases
- Detecting fraud
- Finding terrorists
- Identifying rare diseases

Classification Problem
- A rare case covers relatively few training examples.
- Example: finding associations between infrequently purchased supermarket items.

Modeling Problem

[Figure: instance space showing one common case and three rare cases, labeled P1, P2, P3]
- Clustering: one common and three rare classes.
- Two-class classification: the positive class contains one common and two rare classes.

- For a classification problem, the rare cases may manifest themselves as small disjuncts, i.e., those disjuncts in the classifier that cover few training examples.
- In unsupervised learning, the three rare cases will be more difficult to generalize from because they contain fewer data points.
- In association rule mining, the problem will be to detect items that co-occur most infrequently.

Modeling Problem

- Current research indicates that rare cases and small disjuncts pose difficulties for data mining, i.e., rare cases have a much higher misclassification rate than common cases.
- Small disjuncts collectively cover a substantial fraction of all examples and cannot simply be eliminated; doing so would substantially degrade the performance of a classifier.
- In the most thorough study of small disjuncts (Weiss & Hirsh, 2000), it was shown that in classifiers induced from 30 real-world data sets, most classifier errors are contributed by the smaller disjuncts.

Why Rare Cases are Problematic

Problems arising from absolute rarity
- The most fundamental problem is lack of data: only a few examples related to rare cases are present in the data set (absolute rarity).
- Lack of data makes it difficult to detect rare cases and, even when they are detected, makes generalization difficult.

Problems arising from relative rarity
- Looking for a needle in a haystack: rare cases are obscured by common cases (relative rarity).
- Data mining algorithms rely on greedy search heuristics that examine one variable at a time. Since the detection of rare cases may depend on the conjunction of many conditions, any single condition in isolation may not provide much guidance.
- For example, consider association rule mining. To find rare associations, the minimum support threshold must be set very low, which causes a combinatorial explosion of candidate itemsets in large datasets.

Why Rare Cases are Problematic

The Metrics
- The metrics used to evaluate classifier accuracy are more focused on common cases. As a consequence, rare cases may be totally ignored.
- Example: consider decision trees. Most decision trees are grown in a top-down manner, where candidate test conditions are repeatedly evaluated and the best one is selected.
- The metrics (e.g., information gain) used to select the best test generally prefer tests that result in a balanced tree where purity is increased for most of the examples (a small worked example follows below).
- Rare cases, which correspond to high-purity branches covering few examples, will often not be included in the decision tree.
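To make this concrete, here is a small illustrative calculation (hypothetical counts, not taken from the paper). A test that perfectly isolates a tiny, pure rare-case branch contributes very little to the weighted information gain, so a broader, impurer split over many examples can still be preferred by the tree-growing metric.

```python
# Illustrative only: hypothetical counts, not data from the paper.
from math import log2

def entropy(pos, total):
    """Binary entropy of a node with `pos` positives out of `total` examples."""
    if pos in (0, total):
        return 0.0
    p = pos / total
    return -p * log2(p) - (1 - p) * log2(1 - p)

def info_gain(parent_pos, parent_total, branches):
    """branches: list of (pos, total) pairs produced by a candidate test."""
    before = entropy(parent_pos, parent_total)
    after = sum(t / parent_total * entropy(p, t) for p, t in branches)
    return before - after

# Node with 1000 examples, 20 of them in the rare (positive) class.
# Test A perfectly isolates a tiny rare case (3 positives, 100% pure branch).
gain_a = info_gain(20, 1000, [(3, 3), (17, 997)])
# Test B only modestly concentrates positives, but does so for many examples.
gain_b = info_gain(20, 1000, [(16, 200), (4, 800)])

print(f"gain of test isolating the rare case: {gain_a:.4f}")
print(f"gain of the broad, impure test:       {gain_b:.4f}")
# The broad test scores higher, so the pure rare-case branch never enters the tree.
```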

Why Rare Cases are Problematic

The Bias
- The bias of a data mining system is critical to its performance. Extra-evidentiary bias is what makes it possible to generalize from specific examples.
- Many data mining systems, especially those used to induce classifiers, employ a maximum-generality bias.
- This means that when a disjunct covering some set of training examples is formed, only the most general set of conditions that satisfies those examples is selected.
- The maximum-generality bias works well for common cases, but not for rare cases/small disjuncts.
- Attempts to address the problems of small disjuncts by selecting a more appropriate bias are considered later.

Why Rare Cases are Problematic

Noisy data
- A sufficiently high level of background noise may prevent the learner from distinguishing between noise and rare cases.
- Unfortunately, there is not much that can be done to minimize the impact of noise on rare cases.
- For example, pruning and overfitting-avoidance techniques, as well as inductive biases that foster generalization, can minimize the overall impact of noise, but because these methods tend to remove rare cases along with noise-generated ones, they do so at the expense of rare cases.

Techniques for Handling Rare Cases

1. Obtain Additional Training Data
2. Use a More Appropriate Inductive Bias
3. Use More Appropriate Metrics
4. Employ Non-Greedy Search Techniques
5. Employ Knowledge/Human Interaction
6. Employ Boosting
7. Place Rare Cases Into Separate Classes

1. Obtain Additional Training Data

- Simply obtaining additional training data will not help much, because most of the new data will also be associated with the common cases and only some of it with rare cases. This may help with problems of absolute rarity but not with relative rarity.
- Only by selectively obtaining additional data for the rare cases can one address the issue of relative rarity. Such a sampling scheme would also help with absolute rarity.
- Unfortunately, this selective sampling approach does not seem practical for most real-world data sets.

2. Use a More Appropriate Inductive Bias

- Rare cases tend to cause small disjuncts to be formed in a classifier induced from labeled data. This is partly due to the bias used by most learners.
- Simple strategies that eliminate all small disjuncts, or that use statistical significance testing to prevent small disjuncts from being formed, have been shown to perform poorly.
- More sophisticated approaches for adjusting the bias of a learner in order to minimize the problem with small disjuncts have been investigated.
- Holte et al. (1989) use a maximum-generality bias for large disjuncts and a maximum-specificity bias for small disjuncts. This was shown to improve the performance of small disjuncts but degrade the performance of large disjuncts, yielding poorer overall performance.

2. Use a More Appropriate Inductive Bias (cont.)

- The approach was later refined to ensure that the more specific bias used to induce the small disjuncts does not affect, and therefore cannot degrade, the performance of the large disjuncts.
- This was accomplished by using different learners for examples that fall into large disjuncts and examples that fall into small disjuncts (Ting, 1994).
- Although this hybrid approach was shown to improve the accuracy of small disjuncts, the results were not conclusive.
- Carvalho and Freitas (2002a, 2002b) essentially use the same approach, except that the set of training examples falling into each individual small disjunct is used to generate a separate classifier.
- In summary, several attempts have been made to perform better on rare cases by using a highly specific bias for the induced small disjuncts. These methods have shown mixed success.

3. Use More Appropriate Metrics

Altering the Relative Importance of Precision vs. Recall:

- Use evaluation metrics that, unlike accuracy, do not discount the importance of rare cases.
- Given a classification rule R that predicts target class C, the recall of R is the percentage of examples belonging to C that it correctly identifies, while the precision of R is the percentage of times the rule is correct (see the sketch below).
- Rare cases can be given more prominence by increasing the importance of precision relative to recall.
- Timeweaver (Weiss, 1999), a genetic-algorithm-based classification system, searches for rare cases by carefully altering the relative importance of precision vs. recall.
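One common way to encode the relative importance of precision vs. recall is the F-beta measure. The sketch below (an illustration, not Timeweaver itself; all counts are hypothetical) shows how varying beta changes which of two rules looks better: a narrow, precise rule covering a rare case or a broad rule with high recall.

```python
# A minimal sketch of trading off precision vs. recall with the F-beta
# measure; the confusion counts below are hypothetical, not from the paper.

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def f_beta(precision, recall, beta):
    """beta < 1 emphasizes precision, beta > 1 emphasizes recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Rule covering a rare case: few predictions, most of them correct.
p1, r1 = precision_recall(tp=8, fp=2, fn=12)
# Broad rule: finds more positives but raises many false alarms.
p2, r2 = precision_recall(tp=15, fp=45, fn=5)

for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: rare-case rule F={f_beta(p1, r1, beta):.3f}, "
          f"broad rule F={f_beta(p2, r2, beta):.3f}")
```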

3. Use More Appropriate Metrics (cont.)

Two-Phase Rule Induction:
- PNrule (Joshi, Agarwal & Kumar, 2001) uses two-phase rule induction to focus on each measure separately.
- The first phase focuses on recall. In the second phase, precision is optimized; this is accomplished by learning to identify false positives among the examples covered by the phase-1 rules.
- In the needle-in-the-haystack analogy, the first phase identifies regions likely to contain the needles, and the second phase learns to discard the strands of hay within those regions.

PN-rule Learning (a rough sketch follows below)

P-phase:
- Learn P-rules that cover positive examples with good support
- Seek good recall

N-phase:
- Learn N-rules that remove the false positives left by the P-phase rules
- Require high accuracy and significant support
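The following is a rough two-phase sketch of the PNrule idea, using scikit-learn decision trees as a stand-in for PNrule's own rule learner (an assumption for illustration; the data is synthetic). Phase 1 over-weights the rare class to favor recall; phase 2 learns to recognize false positives inside the region covered by phase 1.

```python
# A rough two-phase sketch of the PNrule idea, with decision trees standing
# in for PNrule's rule learner (assumption for illustration only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Phase 1 (P-phase): favor recall by over-weighting the rare class, so the
# learned model covers as many true positives as possible.
p_model = DecisionTreeClassifier(max_depth=3, class_weight={0: 1, 1: 20},
                                 random_state=0).fit(X, y)
covered = p_model.predict(X) == 1          # region "likely to contain needles"

# Phase 2 (N-phase): within the covered region, learn to spot false positives
# (examples predicted positive in phase 1 that are actually negative).
X_cov, y_cov = X[covered], y[covered]
n_model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(
    X_cov, (y_cov == 0).astype(int))       # target: "is this a false positive?"

def predict(X_new):
    pred = p_model.predict(X_new)
    fp_flag = n_model.predict(X_new) == 1
    pred[(pred == 1) & fp_flag] = 0        # discard the hay inside the region
    return pred

print("positives kept after both phases:", predict(X).sum())
```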

4. Employ Non-Greedy Search Techniques

- Greedy algorithms make locally optimal choices so that the search remains tractable, but this means they are not globally optimal and can become stuck in local minima.
- Greedy algorithms are not well suited to rare cases, because rare cases may depend on the conjunction of many conditions, and any single condition in isolation may not provide much guidance.
- Mining algorithms for handling rare cases should therefore use more powerful global search methods.
- Recommended solution: genetic algorithms, which operate on a population of candidate solutions rather than a single solution. For this reason GAs are more appropriate for rare cases (Goldberg, 1989; Freitas, 2002; Weiss, 1999; Carvalho and Freitas, 2002). A toy sketch follows below.
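The toy sketch below evolves a population of conjunctive rules (each condition compares one feature against its median) and scores candidates by the F1 of the rule on the rare class. It only illustrates the population/selection/crossover/mutation loop; real systems such as Timeweaver use much richer rule encodings, operators, and fitness functions.

```python
# A toy genetic algorithm over conjunctive rules, sketched for illustration;
# encodings, operators, and fitness are simplistic assumptions, not the
# methods of the cited systems.
import random
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95, 0.05], random_state=1)
thresholds = np.median(X, axis=0)          # one candidate condition per feature

def covers(rule):
    """rule: list over features with values in {-1, 0, +1};
    -1 means feature < median, +1 means feature >= median, 0 means unused."""
    mask = np.ones(len(X), dtype=bool)
    for j, r in enumerate(rule):
        if r == 1:
            mask &= X[:, j] >= thresholds[j]
        elif r == -1:
            mask &= X[:, j] < thresholds[j]
    return mask

def fitness(rule):
    """F1 of the rule when it predicts the rare class on covered examples."""
    mask = covers(rule)
    tp = int((y[mask] == 1).sum())
    if tp == 0:
        return 0.0
    precision = tp / mask.sum()
    recall = tp / (y == 1).sum()
    return 2 * precision * recall / (precision + recall)

rng = random.Random(0)
pop = [[rng.choice((-1, 0, 0, 1)) for _ in range(X.shape[1])] for _ in range(40)]

for generation in range(30):
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:10]                   # selection: keep the best rules
    children = []
    while len(children) < 30:
        a, b = rng.sample(survivors, 2)
        cut = rng.randrange(1, len(a))     # one-point crossover
        child = a[:cut] + b[cut:]
        j = rng.randrange(len(child))      # point mutation
        child[j] = rng.choice((-1, 0, 1))
        children.append(child)
    pop = survivors + children

best = max(pop, key=fitness)
print("best rule:", best, "F1 =", round(fitness(best), 3))
```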

5. Employ Knowledge/Human Interaction

- The interaction and domain knowledge of human experts can be used to mine rare cases more effectively.
- Examples:
  - SAR detection
  - Rare disease detection
  - Etc.

6. Employ Boosting

- Boosting algorithms, such as AdaBoost, are iterative algorithms that place different weights on the training distribution at each iteration (the basic weight update is sketched below).
- Following each iteration, boosting increases the weights associated with the incorrectly classified examples and decreases the weights associated with the correctly classified examples.
- This forces the learner to focus more on the incorrectly classified examples in the next iteration.
- RareBoost (Joshi, Kumar & Agarwal, 2001) applies a modified weight-update mechanism to improve the performance on rare classes and rare cases.
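For reference, the core weight-update step of standard AdaBoost looks roughly like the sketch below; RareBoost's modification (which treats false positives and false negatives differently) is not reproduced here.

```python
# The standard AdaBoost weight update for one boosting round, as a minimal
# sketch; RareBoost modifies this update, which is not shown here.
import numpy as np

def adaboost_round(weights, y_true, y_pred):
    """y_true, y_pred in {-1, +1}; weights is the current training distribution."""
    miss = y_true != y_pred
    err = weights[miss].sum()                      # weighted error of this learner
    err = min(max(err, 1e-10), 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)          # learner's vote in the ensemble
    # Increase weights of misclassified examples, decrease the rest.
    new_w = weights * np.exp(-alpha * y_true * y_pred)
    return new_w / new_w.sum(), alpha              # renormalize to a distribution

# Tiny demonstration: the examples at indices 2 and 4 are misclassified
# and therefore gain weight for the next iteration.
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, 1, 1, -1, -1])
w = np.full(5, 0.2)
w, alpha = adaboost_round(w, y_true, y_pred)
print("alpha =", round(alpha, 3), "new weights =", np.round(w, 3))
```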

7. Place Rare Cases Into Separate Classes

- Rare cases complicate classification because different rare cases may have little in common with one another, making it difficult to assign the same class label to all of them.
- Solution: reformulate the problem so that rare cases are viewed as separate classes.
- Approach (a small sketch follows below):
  1. Separate each class into subclasses using clustering.
  2. Re-label the training examples with the new class labels and learn from them.
  3. Because multiple clustering experiments are used in step 1, step 2 involves learning multiple models.
  4. Combine these models using voting.
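The sketch below follows the four steps with k-means and decision trees as stand-ins for whatever clusterer and learner one prefers (these choices, and the synthetic data, are assumptions for illustration only).

```python
# A minimal sketch of the "rare cases as separate classes" reformulation:
# cluster within each class, relabel, learn several models, and vote.
# k-means and decision trees are stand-ins chosen for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1500, weights=[0.9, 0.1], random_state=0)

models = []
for k in (2, 3, 4):                        # several clustering "experiments"
    sub_labels = np.empty(len(y), dtype=object)
    for cls in np.unique(y):
        idx = np.where(y == cls)[0]
        clusters = KMeans(n_clusters=k, n_init=10,
                          random_state=0).fit_predict(X[idx])
        for i, c in zip(idx, clusters):    # steps 1+2: relabel with subclass ids
            sub_labels[i] = f"{cls}_{c}"
    clf = DecisionTreeClassifier(random_state=0).fit(X, sub_labels.tolist())
    models.append(clf)                     # step 3: one model per experiment

def predict(x_row):
    # Step 4: map subclass predictions back to the original classes and vote.
    votes = [int(m.predict([x_row])[0].split("_")[0]) for m in models]
    return max(set(votes), key=votes.count)

print("prediction for first example:", predict(X[0]), "true label:", y[0])
```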

Boosting-Based Algorithms

RareBoost
- Updates the example weights differently from standard boosting

SMOTEBoost
- Combination of SMOTE (Synthetic Minority Oversampling Technique) and boosting; a sketch of the SMOTE step follows below
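The core of SMOTE is an interpolation step: each synthetic minority example is placed on the line segment between a minority point and one of its nearest minority neighbors. A minimal sketch (with a toy minority sample) is shown below; SMOTEBoost combines this oversampling with the boosting loop sketched earlier.

```python
# The core interpolation step behind SMOTE, sketched for illustration.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority examples by interpolating neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neigh = nn.kneighbors(X_minority)          # column 0 is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        j = neigh[i][rng.integers(1, k + 1)]      # one of its k nearest neighbors
        gap = rng.random()                        # random point on the segment
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

rare = np.random.default_rng(1).normal(size=(20, 3))   # toy minority class
print(smote(rare, n_new=5).shape)                      # -> (5, 3)
```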

CREDOS

- First uses ripple-down rules to overfit the training data
- Then prunes to improve generalization
- Ripple-down rules are a different mechanism from decision trees

Cost Sensitive Modeling

- Detection rate / false alarm rate alone may be misleading
- Cost factors: damage cost, response cost, operational cost
- Assign costs to TP, FP, TN, FN outcomes
- Define a cumulative cost (see the sketch below)
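A minimal sketch of a cumulative cost score is below. The particular cost values and the way each cost factor is charged (damage for misses, response for every alert, operational overhead per event) are illustrative assumptions, not the slide's exact model.

```python
# A minimal sketch of scoring a detector by cumulative cost rather than
# detection/false-alarm rate alone; all cost values are hypothetical.
def cumulative_cost(tp, fp, tn, fn,
                    damage_cost=100.0,      # cost of a missed intrusion (FN)
                    response_cost=10.0,     # cost of responding to an alert (TP or FP)
                    operational_cost=0.1):  # per-event monitoring overhead
    cost = fn * damage_cost                 # undetected attacks do full damage
    cost += (tp + fp) * response_cost       # every alert triggers a response
    cost += (tp + fp + tn + fn) * operational_cost
    return cost

# Compare two hypothetical detectors over the same 1000 events.
print(cumulative_cost(tp=40, fp=60, tn=880, fn=20))
print(cumulative_cost(tp=55, fp=200, tn=740, fn=5))
```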

Outlier Detection Schemes

- Detect intrusions (data points) that are very different from the normal activities (the rest of the data points)

General steps:
- Identify normal behavior
- Construct a useful set of features
- Define a similarity function
- Use an outlier detection algorithm:
  - Statistics based
  - Distance based
  - Model based

Distance Based Outlier Detection

- Represent data as a vector of features
- Major approaches:
  - Nearest neighbor based
  - Density based
  - Clustering based
- Problem: high dimensionality of the data

Distance Based - Nearest Neighbor

- Compute the distance d to the k-th nearest neighbor (see the sketch below)
- Outlier points:
  - Are located in sparser neighborhoods (too few nearby neighbors)
  - Have d larger than a certain threshold

Mahalanobis-distance based approach
- More appropriate for computing distances when the distribution is skewed

[Figure: 2-D scatter plot with labeled points p1 and p2 illustrating outliers in sparse neighborhoods]
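Below is a minimal sketch of both ideas on synthetic 2-D data: scoring each point by its distance to the k-th nearest neighbor, and a Mahalanobis-distance variant that accounts for feature scale and correlation. The data, k, and the percentile threshold are illustrative choices.

```python
# A minimal sketch of distance-to-k-th-nearest-neighbor outlier scoring,
# plus a Mahalanobis variant; data and thresholds are illustrative only.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(300, 2)),       # dense "normal" region
               rng.normal(6, 1, size=(5, 2))])        # a few far-away points

# Score = distance to the k-th nearest neighbor; sparse neighborhoods score high.
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)
knn_score = dist[:, k]                                 # column 0 is the point itself
threshold = np.percentile(knn_score, 98)
print("k-NN distance outliers:", np.where(knn_score > threshold)[0])

# Mahalanobis distance: accounts for the scale and correlation of the features.
mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
maha = np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))
print("Mahalanobis outliers:", np.where(maha > np.percentile(maha, 98))[0])
```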

Distance Based - Density

Local Outlier Factor (LOF)
- The average ratio of the density of example p's nearest neighbors to the density of p itself
- Compute the density of the local neighborhood for each point
- Compute the LOF (see the sketch below)
- Larger LOF → more likely to be an outlier

[Figure: scatter plot with labeled points p1 and p2 illustrating local density differences]
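A minimal sketch using scikit-learn's LOF implementation on synthetic data is shown below; the cluster layout and n_neighbors value are illustrative choices.

```python
# A minimal sketch of density-based outlier detection with the Local Outlier
# Factor, using scikit-learn's implementation; data is synthetic.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(200, 2)),     # dense cluster
               rng.normal(5, 2.0, size=(50, 2)),      # sparser cluster
               [[10.0, 10.0]]])                        # an isolated point

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                            # -1 marks outliers
scores = -lof.negative_outlier_factor_                 # larger LOF -> more outlying
print("flagged as outliers:", np.where(labels == -1)[0])
print("largest LOF score:", scores.max().round(2))
```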

Distance Based - Clustering

- A radius w of proximity is specified
- Two points x1 and x2 are near if d(x1, x2) < w
- Define N(x) as the number of points that are within w of x
- Points in small clusters → outliers
- Fixed-width clustering is used for speedup (see the sketch below)
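The sketch below shows one simple fixed-width clustering pass: each point joins the first existing center within radius w or starts a new cluster, and points that end up in clusters smaller than a minimum size are flagged as outliers. The radius, minimum size, and single-pass assignment are illustrative simplifications.

```python
# A minimal sketch of fixed-width clustering for outlier detection; the
# radius, minimum cluster size, and single-pass assignment are assumptions.
import numpy as np

def fixed_width_clusters(X, w):
    centers, members = [], []
    for x in X:
        for c, idx in zip(centers, members):
            if np.linalg.norm(x - c) < w:          # "near" means distance < w
                idx.append(x)
                break
        else:
            centers.append(x)                      # start a new cluster
            members.append([x])
    return centers, members

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(200, 2)), [[8.0, 8.0]], [[-7.0, 6.0]]])
centers, members = fixed_width_clusters(X, w=2.0)

min_size = 3
outlier_clusters = [c for c, m in zip(centers, members) if len(m) < min_size]
print(f"{len(centers)} clusters, {len(outlier_clusters)} flagged as outlier clusters")
```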

Distance Based - Clustering (cont.)

K-Nearest Neighbor + Canopy Clustering
- Compute the sum of the distances to the k nearest neighbors
- A small k-NN sum → the point is in a dense region
- Canopy clustering is used for speedup

WaveCluster
- Transform the data into multidimensional signals using the wavelet transform
- Remove the high/low frequency parts
- Remaining parts → outliers

Model Based Outlier Detection

- Similar to probabilistic schemes
- Build a prediction model for normal behavior
- Deviation from the model → potential intrusion

Major approaches:
- Neural networks
- Unsupervised Support Vector Machines (SVMs)

Model Based - Neural Networks

- Use a replicator 4-layer feed-forward neural network (a rough stand-in is sketched below)
- The input variables are also the target outputs during training
- The replicator network (RNN) forms a compressed model of the training data
- Outlyingness is measured by reconstruction error
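The sketch below is a rough stand-in for the replicator idea: a small autoencoder-style MLP from scikit-learn is trained to reproduce its own inputs, and reconstruction error is used as the outlier score. The specific 4-layer replicator architecture from the slide is not reproduced; the narrow hidden layer and the synthetic data are illustrative assumptions.

```python
# A rough stand-in for a replicator neural network: an MLP trained to
# reproduce its own inputs; reconstruction error serves as the outlier score.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(500, 4))              # "normal" training data
test = np.vstack([rng.normal(0, 1, size=(5, 4)),      # normal test points
                  rng.normal(8, 1, size=(3, 4))])     # anomalous test points

scaler = StandardScaler().fit(normal)
Xn = scaler.transform(normal)

# A narrow hidden layer forces a compressed model of the normal behavior.
rnn = MLPRegressor(hidden_layer_sizes=(2,), max_iter=3000,
                   random_state=0).fit(Xn, Xn)

recon = rnn.predict(scaler.transform(test))
errors = np.mean((scaler.transform(test) - recon) ** 2, axis=1)
print("reconstruction errors:", errors.round(2))      # anomalies typically score higher
```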

Model Based - SVMs

- Attempt to separate the entire set of training data from the origin (see the sketch below)
- Regions where most of the data lies are labeled as one class

Parameters:
- Expected outlier rate
  - Works well for high-quality, controlled training data
- Variance of the Radial Basis Function (RBF)
  - Larger variance → higher detection rate but more false alarms
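A minimal sketch using scikit-learn's one-class SVM is shown below; nu plays the role of the expected outlier rate and gamma is related (inversely) to the RBF variance. The data and parameter values are illustrative only.

```python
# A minimal sketch of unsupervised anomaly detection with a one-class SVM
# (RBF kernel); nu ~ expected outlier rate, gamma is inversely related to
# the RBF variance. Data is synthetic, for illustration only.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
train = rng.normal(0, 1, size=(500, 2))              # mostly "normal" traffic
test = np.vstack([rng.normal(0, 1, size=(5, 2)),
                  [[6.0, 6.0], [-7.0, 5.0]]])        # two obvious anomalies

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.1).fit(train)
print(ocsvm.predict(test))                           # +1 = normal, -1 = outlier
```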

Summary and Conclusion

- Rare classes, which result from highly skewed class distributions, share many of the problems associated with rare cases. Rare classes and rare cases are connected.
- Rare cases can occur within both rare classes and common classes, but they are expected to be more of an issue for rare classes.
- Japkowicz (2001) views rare classes as a consequence of between-class imbalance and rare cases as a consequence of within-class imbalance.
- Thus, both forms of rarity are a type of data imbalance.
- The modeling improvements presented in this paper are applicable to both types of rarity.
