Paper by Gary M. Weiss
Presenter: Indar Bhatia
INFS 795
April 28, 2005
Presentation Overview
Classification Problem
Modeling Problem
[Figure: problem regions P1, P2, P3]
Modeling Problem
The Metrics
The metrics used to evaluate classifier accuracy
focus on common cases. As a consequence, rare
cases may be ignored entirely.
Example:
consider decision trees. Most decision trees are
grown in a top-down manner, where candidate test
conditions are repeatedly evaluated and the best
one is selected.
The metric (e.g., information gain) used to select
the best test generally prefers tests that result
in a balanced tree, where purity is increased for
most of the examples.
Rare cases, which correspond to high-purity
branches covering few examples, will often not be
included in the decision tree.
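This preference can be made concrete with a toy calculation (a sketch; the class counts below are invented for illustration): information gain rewards a balanced split far more than one that carves out a small pure branch for a rare case.

```python
import math

def entropy(pos, neg):
    """Shannon entropy of a two-class node."""
    if pos == 0 or neg == 0:
        return 0.0
    p = pos / (pos + neg)
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def info_gain(parent, splits):
    """Information gain of a test; parent and each split are (pos, neg) counts."""
    n = parent[0] + parent[1]
    remainder = sum((p + q) / n * entropy(p, q) for p, q in splits)
    return entropy(*parent) - remainder

# Parent node: 500 positive, 500 negative examples.
# Test A: a balanced split that increases purity for the common cases.
gain_common = info_gain((500, 500), [(400, 100), (100, 400)])
# Test B: carves out a pure branch covering a rare case (10 examples).
gain_rare = info_gain((500, 500), [(10, 0), (490, 500)])
# Test A wins by a wide margin, so the rare case's branch is not grown.
```

Even though test B produces a perfectly pure branch, it moves so few examples that its gain is tiny, so greedy tree growth selects test A.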
The Bias
The bias of a data mining system is critical to its
performance. The extra-evidentiary bias makes it
possible to generalize from specific examples.
Many data mining systems, especially those used to
induce classifiers, employ a maximum-generality bias.
This means that when a disjunct that covers some set
of training examples is formed, only the most general
set of conditions that satisfies those examples is
selected.
The maximum-generality bias works well for common
cases, but not for rare cases/small disjuncts.
One way to address the problem of small disjuncts is
therefore to select a more appropriate bias.
Noisy data
A sufficiently high level of background noise may
prevent the learner from distinguishing between
noise and rare cases.
Unfortunately, there is not much that can be done
to minimize the impact of noise on rare cases.
For example, pruning and overfitting-avoidance
techniques, as well as inductive biases that foster
generalization, can minimize the overall impact of
noise; but because these methods tend to remove
both rare cases and noise-generated ones, they do
so at the expense of the rare cases.
PN-rule Learning
P-phase: learn P-rules with high support that together cover
most of the positive (rare) examples, even at the cost of
some false positives
N-phase: within the region covered by the P-rules, learn
N-rules that remove those false positives
5. Employ Knowledge/Human Interaction
SAR detection
Rare disease detection
Etc.
6. Employ Boosting
RareBoost: updates the weights differently
SMOTEBoost: a combination of SMOTE (Synthetic
Minority Oversampling Technique) and boosting
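The SMOTE step can be sketched in a few lines (a pure-Python sketch, not the reference implementation; the points and parameters are invented): each synthetic example is an interpolation between a minority point and one of its k nearest minority neighbours.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Minimal SMOTE sketch: interpolate between a random minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random point on the segment x -> nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, 5)
# Each new point lies between two existing minority points.
```

SMOTEBoost then feeds these synthetic minority examples into each boosting round, so later weak learners see a less skewed class distribution.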
CREDOS
Identify normal behavior
Construct a useful set of features
Define a similarity function
Use an outlier detection algorithm
Statistics based
Distance based
Model based
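As an illustration of the distance-based family (a sketch with invented points, not an algorithm from the paper): score each point by the distance to its k-th nearest neighbour, so isolated points receive large scores.

```python
def knn_outlier_scores(points, k=2):
    """Distance-based outlier score: distance to the k-th nearest
    neighbour (larger = more outlying)."""
    scores = []
    for x in points:
        dists = sorted(
            sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
            for y in points
            if y is not x
        )
        scores.append(dists[k - 1])
    return scores

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_outlier_scores(points)
# The isolated point (10, 10) gets by far the largest score.
```
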
Problem: high dimensionality of the data
[Scatter plot: clusters of points with isolated outliers p1 and p2]
WaveCluster
Transform the data into multidimensional signals
using a wavelet transformation
Remove high/low frequency parts
Remaining parts → outliers
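The idea behind the wavelet step can be seen in one dimension (a toy sketch, not WaveCluster itself, which operates on a quantized feature-space grid): a Haar transform splits a signal into low-frequency averages and high-frequency differences, and an isolated spike shows up as one large high-frequency coefficient.

```python
def haar_level(signal):
    """One level of the Haar wavelet transform: pairwise averages
    (low-frequency part) and differences (high-frequency part)."""
    avgs = [(a + b) / 2 for a, b in zip(signal[::2], signal[1::2])]
    diffs = [(a - b) / 2 for a, b in zip(signal[::2], signal[1::2])]
    return avgs, diffs

signal = [1, 1, 1, 1, 9, 1, 1, 1]  # smooth signal with one spike
avgs, diffs = haar_level(signal)
# diffs == [0.0, 0.0, 4.0, 0.0]: the spike dominates the high-frequency part
```
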
Major approaches
Neural networks
Unsupervised Support Vector Machines (SVMs)
Parameters: the origin (the one-class SVM separates
the data from the origin in feature space)