You are on page 1of 6

Enhanced SMOTE Algorithm for Classification of

Imbalanced Big-Data using Random Forest


Reshma C. Bhagat Sachin S. Patil
M. Tech. Student Department of CSE Assistant Professor, Department of CSE
Rajarambapu Institute of Technology Rajarambapu Institute of Technology
Islampur (Sangli), MS, India Islampur (Sangli), MS, India
resham.bhagat3@gmail.com sachin.patil@ritindia.edu

Abstract— In the era of big data, the applications


A. Basics of Data Mining
generating tremendous amount of data are becoming the
main focus of attention as the wide increment of data Data mining is the extraction of predictive information
generation and storage that has taken place in the last few from large data-sets. The various data mining tools are used to
years. This scenario is challenging for data mining predict useful information from available data-sets, helps
techniques which are not arrogated to the new space and organizations to make proactive business-driven decisions.
time requirements. In many of the real world applications, Various data mining techniques are present including
classification of imbalanced data-sets is the point of association rules, decision tree classification, clustering, and
attraction. Most of the classification methods focused on sequence mining and so on. Classification is most popular
two-class imbalanced problem. So, it is necessary to solve application area of data miming. Classification is a process of
multi-class imbalanced problem, which exist in real-world assigning new objects to predefined categories or classes. The
domains. classification plays an important role and is essential tool for
In the proposed work, we introduced a methodology for several organizations and in many individual’s lives.
classification of multi-class imbalanced data. This Classification come into existence to makes our life better and
easier. [4] Machine learning is defined as ʊalgorithmic part
methodology consists of two steps: In first step we used
of data mining process. During the last few years, machine
Binarization techniques (OVA and OVO) for decomposing
learning techniques have been applied to complex real-world
original dataset into subsets of binary classes. In second
problems with the aim of extracting novel and useful
step, the SMOTE algorithm is applied against each subset
knowledge [4] [16].
of imbalanced binary class in order to get balanced data.
Furthermore, many real world applications present classes
Finally, to achieve classification goal Random Forest (RF)
which have insignificant number of samples as compared to
classifier is used. Specifically, oversampling technique is
other class that are considered. This situation is called as class
adapted to big data using MapReduce so that this imbalance problem. Usually, the insignificant samples are
technique is able to handle as large data-set as needed. main focus of study, so it is necessary to classify them
An experimental study is carried out to evaluate the correctly. In machine learning research, learning from
performance of proposed method. For experimental imbalanced data-sets is an issue that has attracted a lot of
analysis, we have used different datasets from UCI attention. The statistical learning methods are suited for
repository and the proposed system is implemented on balanced data sets and may ignore the negligible samples
Apache Hadoop and Apache Spark platform. The results which are important. That‘s why it is required to consider the
obtained shows that proposed method outperforms over features of problem and solve it correctly.
other methods. Another challenge for data mining algorithms is big data.
Keywords— Data mining; Multi-class Imbalanced data;
Oversampling; MapReduce; Machine Learning.
The techniques which are used to deal with big data are
concentrated on gaining fast, scalable and parallel
implementations. To reach this goal, MapReduce framework
I. INTRODUCTION is followed. MapReduce framework is one of the most popular
The appearance of information technology in various filed procedures used for handling big data. The working of this
of our life lead to the massive amount of data storage in framework based on ʊdivide and conquers strategy, where
various formats like records, documents, images, sound the dataset is divided into subsets which are easy to manage
recordings, and scientific data and so on. Now-a-days the data and partial results are combined that are obtained.
is collected from various domains and it requires well defined
methods for extracting knowledge or information from this B. Big Data and Hadoop
data for better decision making. Knowledge discovery in Big data is data that overshoot the processing capacity of
database often called as Data Mining. Data mining came into traditional database systems. The data is too large, moves too
existence due to the perception of ʊwe are data rich but fast or doesn‘t fit into predefined structure of database
information poor.
architecture. From 2012 Big Data has become a hot IT

978-1-4799-8047-5/15/$31.00 2015
c IEEE 403
buzzword and it is defined in terms of 3 V‘s such as Volume, B. Addressing Imbalanced Problem
Velocity and Variety [1] [19]. Several techniques are present to address the classification of
In 2004, the MapReduce programming framework was Imbalanced data [5] [6] [19]. These techniques are categorized
first proposed by Google. It is a platform designed for into various groups:
processing tremendous amounts of data in an extremely
parallel manner. It provides an environment to easily develop 1. Data Level Approach: In which original dataset is
scalable and fault tolerant applications. The Apache Hadoop is modified to get balanced dataset so that it can be
used by standard machine learning algorithms.
an open source project and provides the MapReduce
framework for processing big data [1]. 2. Algorithm Level Approach: In which existing
Apache Mahout is an open source project and runs on top algorithm is modified to launch procedures that can
of Hadoop platform. It provides set of machine learning deal with imbalanced data.
algorithms for clustering, recommendation and classification 3. Cost-sensitive Approach: In which both data level
problems. Mahout contains various implementations of and algorithm level approaches are combining to get
classification models such as Logistic Regression, Bayesian accuracy and reduce misclassification costs.
models, Support Vector Machines, and Random Forest among Furthermore, data level approaches are divided into
others [2]. various groups: Oversampling technique and Undersamping
Like Hadoop, Apache Spark is an open-source cluster technique and hybrid technique. In oversampling technique,
computing framework. It is developed in the AMPLab at UC new data from minority classes are added to original dataset in
Berkeley. It is in-memory based MapReduce paradigm. It order to obtain balanced dataset. In undersampling technique,
performs 100 times faster than Hadoop for certain applications data from majority classes are removed in order to balance the
[3]. datasets. In hybrid technique, previous techniques are
In this work, we present analysis of techniques which are combines to achieve the goal of balanced dataset. Usually, at
used to deal with imbalanced data problem. We present the first oversampling technique is used to create new samples for
enhanced SMOTE algorithm for handling multi-class minority class and then undersampling technique is applied to
imbalanced big data problem. These approaches are evaluated delete samples from majority class [5] [6] [11].
on the basis of potency in correct classification of each The oversampling and undersampling techniques have
instance of each class and time required to build classification some drawbacks. To address this end, Synthetic Minority
model. In order to perform classification, Random Forest Oversampling Technique (SMOTE) is used. The SMOTE
classifier (RF) which is a popular and well-known decision algorithm is a powerful solution to imbalance data problem
tree ensemble method is used. It is proven that RF is scalable, that have shown success in various application domains. The
robust and gives good performance. SMOTE algorithm is an oversampling technique. This
For experimental study, we will focus on MapReduce technique adds synthetic minority class samples to original
based implementation of SMOTE + RF. The experiments dataset to achieve balanced dataset [5] [6].
performed provide the limitations of original multi-
In SMOTE algorithm, minority class is oversampled by
classification algorithm and enhanced SMOTE algorithm.
duplicating samples from minority class. Depending on the
Finally, we evaluate the proposed system based on accuracy
and Geometric Mean for true rates and ȕ-F-Measure - popular oversampling required, numbers of nearest neighbors are
measures in imbalanced domain. randomly chosen. The synthetic data is generated based on
feature space likeliness prevails between existing samples of
minority class. For subset ‫ א‬S, consider K nearest neighbors
II. RELATED WORK for each sample ‫ א‬X. The K-nearest neighbors are the K
elements whose Euclidean distance between itself and have
A. Imbalanced Data Problem smallest weight along the n-dimensional feature space of X.
The samples are generated simply as, randomly select one of
The classification of Imbalanced datasets poses problem
the K-nearest neighbors and multiply corresponding Euclidean
where class distributions have number of examples in one
class is outnumbered by the number of examples in the other distance with random number between [0,1]. Finally, add this
class.[5][6] A class having abundant number of examples value to original value of instance [5].
called majority or negative class and a class having significant Mathematically, it is written as
number of examples called minority or positive class. In
recent years, the imbalanced data problem has become
burning point in industry, academia and government agencies. Where, Ǧ ample from minority class used to generate
This problem present in many real world problems such as synthetic data. - Nearest neighbor for and į - Random
medical diagnosis [8], fraud detection, finances, risk
number between [0, 1]. The generated synthetic data is a point
management, network intrusion, E-mail foldering [12],
on line fragment between under consideration and k-nearest
Software Defect Detection [18] and so on. Additionally, the
positive (minority) class is the class of interest from the neighbors for Ǥ
learning point of view and has great impact when it is not Though SMOTE is popular technique in imbalanced
classified properly. domain, it has some drawbacks including over-generalization,
only applicable for binary class problem, over-sampling rate

404 2015 IEEE International Advance Computing Conference (IACC)


vary with dataset etc. To avoid these drawbacks some data. Final step, the balanced dataset that is generated in the
approaches are defined such as Borderline-SMOTE and Reduce process forms the final output that will be the entry
Adaptive Synthetic Sampling for overgeneralization. data for the RF-Big Data algorithm [19].
Evolutionary algorithms and sampling methods are used to
deal with class imbalanced problem. [12] B. One-vs.-All Approach (OVA)
SMOTE+GLMBoost [13] and NRBoundary-SMOTE based Multi-classes entail additional complexity to data mining
on neighborhood Rough Set Model [15] are used to solve class algorithms, since datasets have overlapped class boundaries,
imbalance problem. The ensemble methods like AdaBoost, which degrade the performance level. In such cases, we
RUSBoost, and SMOTEBoost coupled with SMOTE to solve transform multi-class problem into subset of binary class
imbalanced data problem [8] [11]. problem, which is easy to distinguish, using class binarization
All these approaches are focused on two-class problem. In techniques [7] [10] [14].
Here, we have used One-vs.-All or One-against-All
[10], author proposed a solution for multi-class problem based
approach, which construct single classifier for each classes of
on fuzzy rule classification. In this paper, one-vs.-one
problem by considering samples of current class are positive
binarization technique is used to transform multi-class data
and remaining samples are negative. After performing OVA
into two-class data.
technique, SMOTE is applied for each binary class problem
III. SYSTEM ARCHITECTURE and results of all binary class problems are combined to get
overall result. Then this balanced data is applied to RF
In this section, architecture of proposed system is algorithm for classification.
explained. Fig. 1 demonstrates the working of proposed
system. SMOTE algorithm + one-vs.-all technique used to
balance the imbalanced data. Random Forest is used for
classification purpose.

Fig. 1: Proposed System Architecture

A. SMOTE Algorithm for Big- Data


Fig 3: One-vs.-All Binarization Technique for 4-class problem
The SMOTE algorithm will be implementing for big data
following a Map Reduce design where each Map process
oversamples the minority class and the Reduce process C. Random Forest for Classification
randomizes the output generated by each mapper to form the Ensembles of decision trees (often known as ĴRandom
balanced dataset. Initial step the algorithm performs a Forests‘) have been the most successful general-purpose
segmentation of the input dataset into independent data blocks; algorithm in modern times. Random Forest (RF) is an
replicates and transfers them to other machines. ensemble method of classification. RF was proposed by Leo
Breiman in 2002 to improve the classification of dataset
having small training dataset and large testing dataset i.e. RF
is suited for large number attributes and small number of
observations [16].
RF is scalable, fast and durable approach for classification of
high-dimensional data and can handle continuous data,
categorical data time-to-event data etc. [4].
D. Random Forest for Classification with Big Data
To handle big data, original RF algorithm needs to be
reshaped so it can fruitfully handle big data. The Apache
Mahout provides Inmem and Partial implementation of RF.
Fig. 2: Flowchart of SMOTE Map Reduce design The Mahout operates on the top of Hadoop and uses
Map phase, each map process balances the class MapReduce framework for implementation of machine
distribution in a mapper‘s partition using the basic SMOTE learning algorithms. RF algorithm is working in two phases:
algorithm over the available data. Reduce step collects the the first phase is dedicated to the building of classification
output generated by each mapper and randomizes the final model and second phase is dedicated to prediction of class

2015 IEEE International Advance Computing Conference (IACC) 405


labels related to dataset usingIV.
already generated model. for corresponding data blocks based on voting scheme. In
Final step, outputs from all mapper are collected to form final
Algorithm 1: SMOTE in MapReduce prediction of classification.
Input: (Key: Value) pair, where Key is the bytes and
Value is the value of an instance. IV. EXPERIMENTAL FRAMEWORK
Output (Key’: Value’) pair, where Key’ is any value and
Value’ is the value of instance. In this section we present details of real-world problems
having multi-class imbalanced data, experimentation and
configuration parameters and then we presents mathematical
SMOTE Algorithm MAP (Key, Value) tests used to compare performance of different approaches.
1: instance = INSTANCE (value) A. Datasets and Parameters
2: instances: add (instance) For experimentation we have used datasets from UCI
3: smote = Filter.usefilter(instances, SMOTE) database repository. Table I summarizes the details of selected
4: smoteinstances= smote.run() data-sets includes no of examples (#EX), number of attributes
5: for i = 0 to smote instances.length - 1 do (#ATTR), Class Distribution and number of classes (#CL).
6: EMIT (key, smoteinstances.get(i))
7: end for TABLE I. CHARACTERISTICS OF DATASET
SMOTE algorithm REDUCE (key, values)

1: while values.hasNext ()
2: instance = INSTANCE REPRESENTATION
(values.getValue())
3: smote instances.add(instance)
4: end while
5: final instances RANDOMIZE (smote instances)
6: for i = 0 to final instances:length - 1do
7: EMIT (null, finalinstances.get(i))
8: end for

In model building phase, training dataset is used for


creation of model using MapReduce framework. It consists of
three steps: Initial, Map and Reduce. In initial step, original
training dataset is divided into separate data blocks and these
blocks are copied across all the nodes. In Map phase, each
Map task works parallel and builds group of forest with
corresponding data block. Finally, in the Final step the output
files by all the mappers are parsed to extract the trees. The SMOTE algorithm generates synthetic data based on
assembly of all the trees forms the forest. Euclidean distance as mentioned earlier. While running
SMOTE on data we have set some parameters which are
1. Nearest Neighbors (N): How many nearest
Algorithm 2: SMOTE+OVA neighbors to find. (Default value is 5)
2. ClassValueIndex (C): This parameter sets the
1. Divide training data-set into N binary subsets class values to examples. (Default value is 0)
considering all classes. 3. RandomSeed (K): This parameter sets the number
2. For each binary class I, i=1………N of random seeds to be selected from Data-set.
a. Apply SMOTE preprocessing techniques (Default value is 5)
b. Combine the results of all binary subsets 4. Percentage (P): This parameter sets the amount of
3. Apply RF-BigData algorithm over the balanced oversampling. Its value can be 100, 200, and 300
V. step.
data returned by previous and so on.
4. Compute the accuracy, G-Mean and B-f-Measure.
B. Evaluation in Multi-class Imbalanced Domain
Evaluation is important factor both in analysis of
After building of model is completed, classification is classification performance and guidance to model
started to predict the class labels related to dataset. Again, this construction. The quality of classifier is given in the form of
phase has three stages: Initial, Map and Final. Like first phase, confusion matrix which arranges instances of each class
in Initial step, the segmentation of data is performed and the according to their correct and incorrect identification.
resulting data blocks are duplicated and transferred across all F-Measure:
the nodes. In Map step, each Map task performs classification It provides a simple way to assess the ability of classifier

406 2015 IEEE International Advance Computing Conference (IACC)


to correctly classify active and inactive. It provides a simple TABLE II. COMPARISON OF ACCURACY FOR TESTING AND TRAINING
way to assess the ability of classifier to correctly classify DATASETS WITH BASE METHOD AND PROPOSED METHOD

active and inactive instances by considering sensitivity and Base (OVA) SMOTE+OVA)
specificity into single metric. F-measure gives balanced
Dataset ACCtr ACCts ACCtr ACCts
accuracy and is defined as follows: [5] [19]
Sensitivity (Recall): Landsat 1 0.6534 0.9999 0.7190
It is defined as percentage of positive samples which are Image segment 0.934 0.9339 1 0.9753
correctly classified samples and it is given as:
Lymphography 1 0.9987 1 1
Iris 0.98 0.9943 1 0.9533
Sensitivity= Zoo 1 1 1 1
Car 0.9008 0.9343 0.9867 0.9305
Specificity:
It is defined as percentage of negative samples which Vehicle 0.8156 0.8158 0.9976 0.9976
are correctly classified and it is given as: Waveform 0.8736 0.8697 0.9996 0.9995
Mean 0.9381 0.90 0.9978 0.9469
Specificity=
Precision: TABLE III. COMPARING F-MEASURE VALUE FOR TESTING
DATASET WITH BASE METHOD AND PROPOSED METHOD
It is the proportion of predicted positive cases that were
correctly classified. Dataset BASE(OVA) SMOTE+OVA

Precision= Landsat 0.654 0.702


Image segment 0.924 0.978

For multi-class data, assume K be the number of classes, Lymphography 0.998 1

be the precision of class and be the recall of class. Iris 0.992 0.955
So, G-mean is calculated as: Zoo 1 1
Car 0.882 0.819
G-Mean=
Vehicle 0.752 0.998
F-measure:
Another metric used to assess quality of classifier in Waveform 0.805 0.999
imbalanced domain is -f-measure and it is given as: Mean 0.8758 0.9314

The results obtained shows that both accuracy and F-


F-measure= for all i i=1………K Measure value gives best results in test correspond to one
that is obtained by SMOTE+OVA method.

TABLE IV. COMPARISON OF TIME WITH PROPOSED METHOD


C. Experimental Study ON APACHE SPARK AND APACHE HADOOP
We have performed whole experimentation on three node
hadoop cluster. Each node has Intel 2.5 GHz i3 processor Dataset Time (sec) on Time (sec) on
having 4 GB RAM. The heap size is set to 1024 M for all spark Hadoop
mapper or reducers. Iris 0.0239 27.403
We have showed results for all datasets described in Table
Landsat 0.0138 62.547
I for basic method (base), the multi-classification approach
(OVA) and the proposed classification system Vehicle 0.0144 32.016
(SMOTE+OVA). The results are summarized in Table II. Waveform 0.0127 24.788
To make more intuitive comparison, we have calculated
f-measure value for testing datasets. The results are
summarized in TABLE III From the above table we can see that Apache Spark gives
We have implemented proposed system on Apache spark better results than Apache Hadoop for same datasets. This is
also. The performance of Apache Spark and Apache Hadoop because Apache Spark processes data in memory while
is measured in terms of time. Here, we have used four Apache Hadoop persists back to the disk after a Map and
datasets from Table I. The results are summarized in TABLE Reduce action, so Apache Spark is faster than Apache
IV. The significant results are in bold face. Hadoop.

2015 IEEE International Advance Computing Conference (IACC) 407


V. CONCLUSION level-smote: Safe-level-synthetic minority oversampling technique for
handling the class imbalanced problem. In Advance in Knowledge
In this work, the data preprocessing technique-SMOTE Discovery and Data Mining (pp. 475-482). Springer Berlin Heidelberg.
(Synthetic Minority Oversampling Technique) for multi-class [10] Batuwita, R., & Palade, V. (2009). microPred: effective classification of
imbalanced data is presented. Also, we have used RF pre-miRNAs for human miRNA gene prediction. Bioinformatics, 25(8),
(Random Forest) algorithm as a base, which is decision tree 989-995.
ensemble and known for its good performance. In today‘s [11] Fernandez, A., Del Jesus, M. J., & Herrera, F. (2010). Multi-class
scenario, big data is point of attraction because huge amount imbalanced data-sets with linguistic fuzzy rule based classification
of data that are currently generating. Traditional data mining systems based on pairwise learning. In Computational Intelligence for
techniques are unable to survive with requirements urged by Knowledge-Based Systems Design (pp. 89-98). Springer Berlin
big data. Heidelberg.
We have tested quality of proposed system in terms of [12] Bermejo, P., Gamez, J. A., & Puerta, J. M. (2011). Improving the
accuracy and F-measure. Experimental analysis carried out performance of Naive Bayes multinomial in e-mail foldering by
using various datasets of UCI repository. The results obtained introducing distributed-based balance of datasets. Expert Systems with
shows that SMOTE+OVA algorithm gives good performance Applications, 38(3), 2072-2080.
in the imbalanced data problem. We have tested spark against [13] Garci, S., Triguero, I., Carmona, C. J., & Herrera, F. (2012).
Evolutionary-based selection of generalized instances for imbalanced
hadoop in terms of time spark gives better results as compared
classification. Knowledge-Based Systems, 25(1), 3-12.
to hadoop.
[14] Xiang, H., Yang, Y., & Zhao, S. (2012). Local clustering ensemble
In future, we will try to address the problems- how to fix
learning method based on improved AdaBoost for rare class analysis.
up the oversampling rate as it varies with dataset and how to Journal of Computational Information Systems, 8(4), 1783-1790.
generate N synthetic data subsets for each feature vector in the [15] Fernandez, A., Lopez, V., Galar, M., Del Jesus, M. J., & Herrera, F.
minority class. (2013). Analysing the classification of imbalanced data-sets with
multiple classes: Binarization technique and ad-hoc approaches.
REFERENCES
Knowledge-based systems,42, 97-110.
[1] Apache Hadoop (Accessed in August 2014) http://hadoop.apache.org/ [16] Hu, F., Li, H. (2013). A novel boundary oversampling algorithm based
[2] Apache Mahout (Accessed in August 2014) http://mahout.apache.org/ on neighboured rough set model: NRSBoundary –SMOTE.
[3] Apache Spark (Accessed in August 2014) http:// spark.apache.org/ Mahematical Problems in Engineering, 2013.
[4] Han, J. Kamber, M & Pei, J. (2000). Data Miming: concepts and [17] Han, J., Liu, Y., & Sun, X. (2013, May). A scalable random forest
techniques (the Morgan Kaufmann Series in data management systems). algorithm based on MapReduce. In Software Engineering and Service
Morgan Kaufmann. Science (ICSESS), 2013 4th IEEE International Conference on (pp. 849-
[5] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. 852). IEEE.
(2002). SMOTE: synthetic minority over-sampling technique. Journal of [18] Lopez, V., Fernandez, A., Garcia, S., Palade, V., & Herrera, F. (2013).
artificial intelligence research, 16(1), 321-357. An insight into classification with imbalanced data: Empirical results
[6] Chawla, N. V., Lazarevic, A., Hall, L. O., & Bowyer, K. W., (2003). and current trends on using data intrinsic characteristics. Information
SMOTEBoost: Improving prediction of the minority class in boosting. Sciences, 250, 113-141.
InKnowldge Discovery in Database: PKDD 2003 (pp. 107-119). [19] Park, B. J., Oh, S. K., & Pedrycz, W. (2013). The design of polynomial
Springer Belin Heidelberg. function-based neural network predicators for detection of software
[7] Rifkin, R., & Klautau, A. (2004). In defense of one-vs-all classification. defects. Information Sciences, 229, 40-57.
The Journal of Machine Learning Research, 5, 101-141. [20] Rio, S., Lopez, V., Benitez, J. M., & Herrera, F. (2014). On the use of
[8] Aly, M. (2005). Survey on multiclass classification methods. Neural MapReduce for imbalanced big data using Random Forest. Information
Netw, 1-9. Sciences, 285, 112-137.
[9] Bunkhumpornpat, C., Sinapiromsaran, K., & Lursinsap, C. (2009). Safe-

408 2015 IEEE International Advance Computing Conference (IACC)

You might also like