Handling Imbalanced Dataset

Welcome to the
Faculty Development Programme on
Recent Trends in Machine Learning
Sachin Subhash Patil

Q.I.P. Ph.D. Scholar, CSE Dept., WCE
Under the Guidance of

Dr. Shefali Pratap Sonavane
Assoc. Prof., IT Dept., WCE
Topics of Discussion
• Title and its Significance
• Challenges
• Research Contributions:
o Over Sampling Techniques
o Safe-Level based Synthetic Sample (SSS)
o Lowest versus Highest (LVH)
o Addressing Data Characteristics
• Outcomes
Title of Session
Handling of Imbalanced Big Data Sets Classification Using Enriched Over Sampling Techniques
[Figure: diagram with the labels I.H.T., B.D.S., Imbalanced Data Set, Classifier, Imprecise Classification, and Precise Classification]
Imbalanced Data Sets
Significance
• Class imbalance problem
• Classifiers ignore the minority instances while forming rule sets
• Representation of boundaries within class structures
• Skewed data partition
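
The point that classifiers can ignore the minority instances can be illustrated with a minimal sketch. It assumes a hypothetical 95:5 data set and a trivial "always predict the majority class" classifier; the function names are illustrative only and are not taken from the session.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the label that an 'always predict the majority' classifier outputs."""
    return Counter(labels).most_common(1)[0][0]

def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive):
    """Fraction of true `positive` examples that were predicted as `positive`."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    hits = sum(p == positive for p in predictions_for_positives)
    return hits / len(predictions_for_positives)

# Hypothetical 95:5 data set: 0 = majority class, 1 = minority class.
y_true = [0] * 95 + [1] * 5
y_pred = [majority_baseline(y_true)] * len(y_true)

print("accuracy:", accuracy(y_true, y_pred))
print("minority recall:", recall(y_true, y_pred, positive=1))
```

In this sketch the overall accuracy looks high even though no minority instance is recognised, which is why accuracy alone is a poor guide on skewed data.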
Significance
• Numerous real-world applications are affected:
- Software defect detection
- Threat supervision
- Medical judgment
- Web authorization
• Misclassifying rare classes can result in heavy costs
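
To make the idea of heavy misclassification costs concrete, here is a minimal sketch under stated assumptions: correct decisions cost nothing, a missed rare case costs `COST_FN`, and a false alarm costs `COST_FP`. Under those assumptions, predicting the rare class has the lower expected cost whenever its estimated probability p satisfies p * COST_FN > (1 - p) * COST_FP, that is, p > COST_FP / (COST_FP + COST_FN). The scores, labels, and cost values below are hypothetical.

```python
def cost_threshold(cost_fp, cost_fn):
    """Probability above which predicting the rare class has the lower expected cost."""
    return cost_fp / (cost_fp + cost_fn)

def predict(probabilities, threshold):
    """Predict the rare class (1) when its estimated probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

def total_cost(y_true, y_pred, cost_fp, cost_fn):
    """Sum the cost of every missed rare case and every false alarm."""
    cost = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 0:
            cost += cost_fn  # missed rare case
        elif truth == 0 and pred == 1:
            cost += cost_fp  # false alarm
    return cost

# Hypothetical estimates of P(rare class) from some classifier, with true labels.
probabilities = [0.05, 0.15, 0.20, 0.40, 0.70]
y_true = [0, 0, 1, 1, 1]
COST_FP, COST_FN = 1, 9  # assumed: a miss is nine times as costly as a false alarm

threshold = cost_threshold(COST_FP, COST_FN)
default_cost = total_cost(y_true, predict(probabilities, 0.5), COST_FP, COST_FN)
adjusted_cost = total_cost(y_true, predict(probabilities, threshold), COST_FP, COST_FN)

print("cost-based threshold:", threshold)
print("total cost at threshold 0.5:", default_cost)
print("total cost at cost-based threshold:", adjusted_cost)
```

With these assumed costs, lowering the threshold trades one cheap false alarm for two expensive misses.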
Common Approaches
• At Data Level: Re-Sampling
- Oversampling
- Undersampling
- Active Sampling
• At Algorithmic Level:
- Adjusting the misclassification costs
- Adjusting the decision threshold at the tree leaf
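
The two basic data-level options can be sketched as follows. This is plain random oversampling and random undersampling, shown only as the simplest instance of re-sampling; it is not the enriched technique discussed later in the session. The helper names and the 90:10 toy data are assumptions made for illustration.

```python
import random
from collections import Counter

def group_by_class(X, y):
    """Map each class label to the list of feature rows carrying that label."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return groups

def random_oversample(X, y, rng):
    """Duplicate randomly chosen rows of smaller classes up to the largest class size."""
    groups = group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out

def random_undersample(X, y, rng):
    """Keep a random subset of larger classes, down to the smallest class size."""
    groups = group_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        X_out.extend(rng.sample(rows, target))
        y_out.extend([label] * target)
    return X_out, y_out

# Hypothetical 90:10 data set with a single feature per row.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10
rng = random.Random(0)  # fixed seed so the sketch is repeatable

X_over, y_over = random_oversample(X, y, rng)
X_under, y_under = random_undersample(X, y, rng)
print("original:    ", Counter(y))
print("oversampled: ", Counter(y_over))
print("undersampled:", Counter(y_under))
```

Oversampling keeps every original row but repeats minority rows, while undersampling discards majority rows; which trade-off is acceptable depends on the data set.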
Quiz-1
• Which of these is not a type of imbalance scenario?
a) 95:5 b) 80:20
c) 75:25 d) 60:40
Ans.: b
• Which are data-level techniques? (multiple answers possible)
a) Cost Sensitive b) Under Sampling
c) Over Sampling d) Classifier based
Ans.: b and c
• Which is a real-world example of an extremely imbalanced data set?
a) Fraud detection b) Birth rate ratio
Ans.: a
Research Challenges
1. Analysing the structure of classes
2. Extreme class imbalance
3. Classifier’s output adjustment
4. Multi-class imbalanced classification
5. Multi-class Classifiers
6. Multi-instance imbalanced classification
7. Regression in imbalanced scenarios
8. Semi-supervised and unsupervised learning from imbalanced data
9. Learning from imbalanced data streams
10. Imbalanced Big Data
Research Challenges
1. Analysing the structure of classes:
o Predefined groups based on neighbourhood:
- Safe
- Borderline
- Rare
- Outliers
o Incorporating background knowledge about objects into the training procedure of classifiers
o Selecting difficult samples to concentrate on
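
The neighbourhood-based grouping can be sketched as below. It assumes one common convention: look at the k = 5 nearest neighbours of an example and count how many share its class (4 or 5 gives safe, 2 or 3 borderline, 1 rare, 0 outlier). These cut-offs, the function name, and the toy one-dimensional data are assumptions for illustration, not values stated in the session.

```python
import math

def neighbourhood_type(index, X, y, k=5):
    """Label example `index` by how many of its k nearest neighbours share its class."""
    distances = sorted(
        (math.dist(X[index], X[j]), j) for j in range(len(X)) if j != index
    )
    same_class = sum(1 for _, j in distances[:k] if y[j] == y[index])
    if same_class >= 4:
        return "safe"
    if same_class >= 2:
        return "borderline"
    if same_class == 1:
        return "rare"
    return "outlier"

# Hypothetical 1-D data: a minority cluster near 0, one minority point between
# the clusters, one minority point inside the majority cluster, and a majority
# cluster from 10 to 17.
minority = [(0,), (1,), (2,), (3,), (4,), (5,), (7,), (13.5,)]
majority = [(10,), (11,), (12,), (13,), (14,), (15,), (16,), (17,)]
X = minority + majority
y = [1] * len(minority) + [0] * len(majority)

for index in (2, 6, 7):
    print(X[index], "->", neighbourhood_type(index, X, y))
```

A fixed k is itself a simplification; the later point about adapting the neighbourhood size to local density addresses exactly that limitation.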
o Justifying the role of noisy/outlier samples
o Adaptive methods that adjust the size of the analysed neighbourhood according to local densities
- K-NN with a fixed k implicitly assumes a uniform distribution of data
2. Extreme class imbalance:
o Decomposing the problem into sub-problems, each characterized by a reduced imbalance ratio
o Methods to reconstruct a potential class structure
3. Classifier’s output adjustment:
o Analysing the characteristics of each classified example
4. Multi-class imbalanced classification:
o Class overlapping with more than two groups
o Unclearly defined borders
o Change in the difficulty of each sample w.r.t. different classes
5. Multi-class Classifiers:
o Classification without decomposition/resampling (using algorithm-level solutions)
o Enhanced distance-based classifiers and density-based methods (e.g., Hellinger distance in decision trees)
o Exploring local competencies of classifiers and creating sectional decision areas
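
As a hedged illustration of the Hellinger-distance idea for decision trees, the sketch below scores a candidate split by comparing how the minority and the majority class distribute over the branches: the square root of the sum, over branches, of (sqrt(minority fraction in branch) - sqrt(majority fraction in branch)) squared. The function name and the example counts are hypothetical, and this is the two-class form only.

```python
import math

def hellinger_split_score(branches):
    """Score a candidate split.

    `branches` is a list of (minority_count, majority_count) pairs,
    one pair per branch produced by the split.
    """
    total_minority = sum(minority for minority, _ in branches)
    total_majority = sum(majority for _, majority in branches)
    return math.sqrt(sum(
        (math.sqrt(minority / total_minority) - math.sqrt(majority / total_majority)) ** 2
        for minority, majority in branches
    ))

perfect = hellinger_split_score([(10, 0), (0, 990)])      # classes fully separated
useless = hellinger_split_score([(5, 495), (5, 495)])     # both classes split evenly
skewed = hellinger_split_score([(8, 100), (2, 900)])
more_skewed = hellinger_split_score([(8, 1000), (2, 9000)])  # ten times more majority rows

print("perfect split:", perfect)
print("useless split:", useless)
print("same split at two imbalance levels:", skewed, more_skewed)
```

Because the score uses only within-class fractions, multiplying the majority counts by a constant leaves it unchanged, which is the property that makes this criterion attractive on skewed data.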
6. Multi-instance imbalanced classification:
o Labelling the bags of objects and handling bags
- A bag's label does not imply that the bag consists only of objects from the given class
o Global schemes for tackling between-class and within-class imbalance
o New measures for assessing the quality of training bags and selecting the most useful ones
7. Regression in imbalanced scenarios:
o A branch of ML yet to be explored from the imbalanced perspective
o Developing more flexible cost-sensitive regression solutions
o Adapting the penalty according to the degree of importance
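
One simple way to adapt the penalty to the degree of importance is a weighted squared error, sketched below. The importance weights are assumed to be supplied by the user (here a rare, large target value is given weight 5); how such weights should be derived is not specified in the session, and the function name is illustrative.

```python
def weighted_mse(y_true, y_pred, weights):
    """Mean squared error in which each example's error is scaled by its importance."""
    weighted_errors = sum(
        w * (t - p) ** 2 for t, p, w in zip(y_true, y_pred, weights)
    )
    return weighted_errors / sum(weights)

# Hypothetical targets: three common values and one rare, important value.
y_true = [1.0, 1.0, 1.0, 10.0]
y_pred = [1.0, 1.0, 1.0, 6.0]  # the model misses only the rare case

plain = weighted_mse(y_true, y_pred, [1, 1, 1, 1])
importance_aware = weighted_mse(y_true, y_pred, [1, 1, 1, 5])

print("unweighted MSE:", plain)
print("importance-weighted MSE:", importance_aware)
```

The same prediction error is penalised more heavily once the rare target is marked as important, so a learner minimising this loss is pushed to fit rare targets better.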
8. Semi-supervised/unsupervised learning:
o Clustering imbalanced data from various perspectives:
- As a process of group discovery on its own, or
- As a method for reducing the complexity of the problem, or
- As a solution for analysing the minority class structure
o New indexes to measure how well the discovered groups reflect the actual skewed distributions
o Novel unsupervised methods for assessing the distributions and potential difficulty of unlabelled objects
o Active learning strategies to point out the most difficult objects affecting the learned decision boundaries
9. Learning from imbalanced data streams:
o Adaptive methods for skewed real-time objects
o Changes with stream progress (I.R., class status)
o Emergence of new classes and/or fading of old ones
o Active learning methods that reduce the cost of supervision
o Algorithms to extract drift templates (reappearing sources)
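
The change of imbalance ratio and class status as a stream progresses can be tracked with a sliding window, as in the sketch below. It assumes that I.R. denotes the imbalance ratio (largest class count divided by smallest class count); the class name, window size, and synthetic stream are hypothetical.

```python
from collections import Counter, deque

class WindowedImbalance:
    """Track class counts over the most recent `window_size` labels of a stream."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)
        self.counts = Counter()

    def update(self, label):
        # When the window is full, the oldest label is about to be dropped.
        if len(self.window) == self.window.maxlen:
            oldest = self.window[0]
            self.counts[oldest] -= 1
            if self.counts[oldest] == 0:
                del self.counts[oldest]
        self.window.append(label)
        self.counts[label] += 1

    def status(self):
        """Return (majority label, minority label, imbalance ratio), or None
        while fewer than two classes are present in the window."""
        if len(self.counts) < 2:
            return None
        ranked = self.counts.most_common()
        (majority, big), (minority, small) = ranked[0], ranked[-1]
        return majority, minority, big / small

# Synthetic stream: class 1 is rare at first, then becomes the majority.
phase_one = ([0] * 9 + [1]) * 10
phase_two = ([1] * 9 + [0]) * 10

tracker = WindowedImbalance(window_size=50)
for label in phase_one:
    tracker.update(label)
status_after_phase_one = tracker.status()
for label in phase_two:
    tracker.update(label)
status_after_phase_two = tracker.status()

print("after phase one:", status_after_phase_one)
print("after phase two:", status_after_phase_two)
```

A detector built on such a window could trigger re-sampling or re-weighting when the majority and minority roles swap; the window size controls how quickly that change is noticed.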
10. Imbalanced Big Data:
o Big imbalanced data types such as graphs, tensors, video sequences, XML structures, hyperspectral images, associations, etc. (e.g., social networks or computer vision)
o Designing both preprocessing and direct learning algorithms
o Handling heterogeneous and atypical data (Spark, Hadoop)
o Global-scale data partitioning methods (supervising the O.S. process)
o Interpretable classifiers handling massive and skewed data
o When dealing with imbalanced big data, we face one of two possible scenarios:
i. The majority class is massive and the minority class has a small sample size
- Directly related to the problem of extreme imbalance
ii. Imbalance is present, but representatives from both classes are abundant
- Need for an in-depth analysis of the structure of the minority class and its examples
- Analysing the appearance of new types of examples or changes in the properties of already described types
- Addressing complex scenarios that require local analysis of each difficult region and individual solutions for them
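
One concern behind data partitioning for imbalanced big data is that a naive contiguous split can leave some partitions with no minority examples at all. The sketch below contrasts that with a simple stratified assignment that deals the examples of each class round-robin across partitions. It is an illustration under assumed names and toy sizes, not the partitioning method proposed in the session, and it runs on a single machine rather than on Spark or Hadoop.

```python
def contiguous_partitions(labels, n_partitions):
    """Split example indices into equal consecutive chunks (naive approach)."""
    size = len(labels) // n_partitions
    return [list(range(i * size, (i + 1) * size)) for i in range(n_partitions)]

def stratified_partitions(labels, n_partitions):
    """Deal the examples of each class round-robin across the partitions."""
    partitions = [[] for _ in range(n_partitions)]
    seen_per_class = {}
    for index, label in enumerate(labels):
        position = seen_per_class.get(label, 0)
        partitions[position % n_partitions].append(index)
        seen_per_class[label] = position + 1
    return partitions

def minority_counts(partitions, labels, minority_label):
    """Number of minority examples that ended up in each partition."""
    return [sum(labels[i] == minority_label for i in part) for part in partitions]

# Hypothetical data set: 980 majority rows followed by 20 minority rows.
labels = [0] * 980 + [1] * 20

naive = minority_counts(contiguous_partitions(labels, 4), labels, 1)
stratified = minority_counts(stratified_partitions(labels, 4), labels, 1)

print("minority rows per partition, contiguous:", naive)
print("minority rows per partition, stratified:", stratified)
```

With the stratified assignment every partition sees the same class ratio as the full data set, so any per-partition oversampling starts from a representative minority sample.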
Quiz-2
• Can a sample overlap with more than two classes?
a) False b) True
Ans.: b
• Does the skewness of streaming data change as the stream progresses?
a) May be b) May not
c) Yes d) No
Ans.: a and b (mostly)
• Multi-instance imbalanced classification also involves:
a) Multi-classifier b) Labelling the bags of objects
c) Both d) None of these
Ans.: b
Research Contributions