Parallel sampling from big data with uncertainty distribution
Qing He a, Haocheng Wang a,b,*, Fuzhen Zhuang a, Tianfeng Shang a,b, Zhongzhi Shi a

a Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
b University of Chinese Academy of Sciences, Beijing 100049, China
Abstract
Data are inherently uncertain in most applications. Uncertainty is encountered when an experiment such as sampling is about to proceed, whose result is not known to us in advance and which may lead to a variety of potential outcomes. With the rapid development of data collection and distributed storage technologies, big data have become a bigger-than-ever problem, and dealing with big data with uncertainty distribution is one of the most important issues of big data research. In this paper, we propose a Parallel Sampling method based on Hyper Surface for big data with uncertainty distribution, namely PSHS, which adopts the universal concept of the Minimal Consistent Subset (MCS) of Hyper Surface Classification (HSC). Our inspiration for handling uncertainties in sampling from big data stems from the facts that (1) the inherent structure of the original sample set is uncertain for us, (2) the boundary set formed of all possible separating hyper surfaces is a fuzzy set, and (3) the elements of the MCS are uncertain. PSHS is implemented on the MapReduce framework, which is a current and powerful parallel programming technique used in many fields. Experiments have been carried out on several data sets, including real world data from the UCI repository and synthetic data. The results show that our algorithm shrinks data sets while maintaining an identical distribution, which is useful for obtaining the inherent structure of the data sets. Furthermore, the evaluation criteria of speedup, scaleup and sizeup validate its efficiency.
© 2014 Elsevier B.V. All rights reserved.
Keywords: Fuzzy boundary set; Uncertainty; Minimal consistent subset; Sampling; MapReduce
1. Introduction
In many applications, data contain inherent uncertainty. The uncertainty phenomenon emerges owing to the lack of knowledge about the occurrence of some event. It is encountered when an experiment (sampling, classification, etc.) is about to proceed, whose result is not known to us; it may also refer to a variety of potential outcomes, ways of solution, etc. [1]. Uncertainty can also arise in categorical data: for example, the inherent structure of a given sample set is uncertain for us. Moreover, the role of each sample in the inherent structure of the sample set is uncertain.
* Corresponding author at: Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China.
E-mail addresses: heq@ics.ict.ac.cn (Q. He), wanghc@ics.ict.ac.cn (H. Wang), zhuangfz@ics.ict.ac.cn (F. Zhuang), shangtf@ics.ict.ac.cn (T. Shang), shizz@ics.ict.ac.cn (Z. Shi).
http://dx.doi.org/10.1016/j.fss.2014.01.016
0165-0114/© 2014 Elsevier B.V. All rights reserved.
Fuzzy set theory, developed by Zadeh [2], is a suitable theory that has proved its ability to work in many real applications. It is worth noticing that fuzzy sets are a reasonable mathematical tool for handling the uncertainty in data [3].
With the rapid development of data collection and distributed storage technologies, big data have become a bigger-than-ever problem nowadays. Furthermore, there is a rapid growth in hybrid studies that connect uncertainty and big data together, and dealing with big data with uncertainty distribution is one of the most important issues of big data research. Uncertainty in big data brings an interesting challenge as well as an opportunity. Many state-of-the-art methods can only handle small scale data sets; therefore, parallel processing of big data with uncertainty distribution is very important.
Sampling techniques, which play a very important role in all classification methods, have attracted a great deal of research in the areas of machine learning and data mining. Furthermore, parallel sampling from big data with uncertainty distribution becomes one of the most important tasks in the presence of the enormous amount of uncertain data produced these days.
Hyper Surface Classification (HSC), which is a general classification method based on the Jordan Curve Theorem, was put forward by He et al. [4]. In this method, a model of the hyper surface is obtained by adaptively dividing the sample space in the training process, and then the separating hyper surface is directly used to classify large databases. The data are classified according to whether the number of intersections with a radial is odd or even. It is a novel approach which needs neither a mapping from lower-dimensional space to higher-dimensional space nor a kernel function. HSC can efficiently and accurately classify two and three dimensional data. Furthermore, it can be extended to deal with high dimensional data with dimension reduction [5] or ensemble techniques [6].
In order to enhance HSC performance and analyze its generalization ability, the notion of the Minimal Consistent Subset (MCS) is applied to the HSC method [7]. The MCS is defined as a consistent subset with a minimum number of elements. For the HSC method, the samples with the same category and falling into the same unit which covers at most samples from the same category make an equivalent class. The MCS of HSC is a sample subset combined by selecting one and only one representative sample from each unit included in the hyper surface. As a result, some samples in the MCS are replaceable, while others are not, leading to the uncertainty of the elements in the MCS. Every MCS includes the same number of elements, but the elements may be different samples. One of the most important features of the MCS is that it has the same classification model as the entire sample set and can almost reflect its classification ability. For a given data set, this feature is useful for obtaining the inherent structure, which is uncertain for us. The MCS corresponds to many real world situations, like classroom teaching. Specifically, the teacher explains at length some examples which form the Minimal Consistent Subset of various types of exercises; the students, having been inspired, will then be able to solve the related exercises. However, the existing serial algorithm can only be performed on a single computer, and it is difficult for this algorithm to handle big data with uncertainty distribution. In this paper, we propose a Parallel Sampling method based on Hyper Surface (PSHS) for big data with uncertainty distribution to get the MCS of the original sample set whose inherent structure is uncertain for us. Experimental results in Section 4 show that PSHS can deal with large scale data sets effectively and efficiently.
Traditional sampling methods on huge amounts of data consume too much time or even cannot be applied to big data due to memory limitations. MapReduce was developed by Google as a software framework for parallel computing in a distributed environment [8,9]. It is used to process large amounts of raw data, such as documents crawled from the web, in parallel. In recent years, many classical data preprocessing, classification and clustering algorithms have been developed on the MapReduce framework. The MapReduce framework is provided with dynamic flexibility support and fault tolerance by Google and Hadoop. In addition, Hadoop can be easily deployed on commodity hardware.
The remainder of the paper is organized as follows. In Section 2, preliminary knowledge is described, including the HSC method, the MCS and MapReduce. Section 3 implements the PSHS algorithm on the MapReduce framework. In Section 4, we show our experimental results and evaluate our parallel algorithm in terms of effectiveness and efficiency. Finally, our conclusions are stated in Section 5.
2. Preliminaries
In this section we describe the preliminary knowledge, on which PSHS is based.
2.1. Hyper surface classification
Hyper Surface Classification (HSC) is a general classification method based on the Jordan Curve Theorem in topology.
Theorem 1 (Jordan Curve Theorem). Let X be a closed set in the n-dimensional space R^n. If X is homeomorphic to a sphere S^(n-1), then its complement R^n \ X has two connected components, one called the inside, the other called the outside.
According to the Jordan Curve Theorem, a surface can be formed in an n-dimensional space and used as the separating hyper surface. For any given point, the following classification theorem can be used to determine whether the point is inside or outside the separating hyper surface.
Theorem 2 (Classification Theorem). For any given point x ∈ R^n \ X, x is inside X if and only if the winding number, i.e. the intersection number between any radial from x and X, is odd; and x is outside X if and only if the intersection number between any radial from x and X is even.
The separating hyper surface is directly used to classify the data according to whether the number of intersections with the radial is odd or even [4]. This is a direct and convenient classification method. From the two theorems above, X is regarded as the classifier, which divides the space into two parts, and the classification process is very easy: just count the number of intersections between a radial from the sample point and the classifier X. It is a novel approach that needs no mapping from lower-dimensional space to higher-dimensional space, and HSC needs no kernel function. Furthermore, it can directly solve non-linear classification problems via the hyper surface.
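As an illustration of this classification rule (not part of the original HSC implementation), the following is a minimal Python sketch of the odd/even intersection test in two dimensions, assuming the separating hyper surface is approximated by a closed polygon; the function name is_inside and the polygon representation are illustrative assumptions.

from typing import List, Tuple

Point = Tuple[float, float]

def is_inside(x: Point, boundary: List[Point]) -> bool:
    """Classify a point against a closed polygonal boundary by casting a
    horizontal radial to the right and counting crossings (Jordan Curve
    Theorem: the point is inside exactly when the count is odd)."""
    px, py = x
    crossings = 0
    n = len(boundary)
    for i in range(n):
        (x1, y1), (x2, y2) = boundary[i], boundary[(i + 1) % n]
        if (y1 > py) != (y2 > py):                      # edge straddles the radial's line
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > px:                            # crossing lies on the radial
                crossings += 1
    return crossings % 2 == 1

# Example: a unit square playing the role of the separating surface
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(is_inside((0.5, 0.5), square))   # True  (odd number of intersections)
print(is_inside((1.5, 0.5), square))   # False (even number of intersections)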
2.2. Minimal consistent subset
To handle the problem of the high computational demands of the nearest neighbor (NN) rule, many efforts have been made to select a representative subset of the original training data, like the condensed nearest neighbor rule (CNN) presented by Hart [10]. For a sample set, a consistent subset is a subset which, when used as a stored reference set for the NN rule, correctly classifies all of the remaining points in the sample set. And the Minimal Consistent Subset (MCS) is defined as a consistent subset with a minimum number of elements. Hart's method indeed ensures consistency, but the condensed subset is not minimal, and it is sensitive to the randomly picked initial selection and to the order of consideration of the input samples. After that, a lot of work has been done to reduce the size of the condensed subset [11-16]. The MCS of HSC is defined as follows.
For a finite sample set S, suppose C is the collection of all subsets, and C' ⊂ C is a disjoint cover set for S, such that each element in S belongs to one and only one member of C'. The MCS is a sample subset combined by choosing one and only one sample from each member of the disjoint cover set C'. For HSC, we call samples a and b equivalent if they belong to the same category and fall into the same unit which covers at most samples from the same category. And the points falling into the same unit form an equivalent class. The cover set C' is the union set of all equivalent classes in the hyper surface H. More specifically, let H° be the interior of H and let u be a unit in H°. The MCS of HSC, denoted by S_min|_H, is a sample subset combined by selecting one and only one representative sample from each unit included in the hyper surface, i.e.

    S_min|_H = ⋃_{u ⊂ H°} {choosing one and only one s ∈ u}    (1)
The computation method for the MCS of a given sample set is described as follows:
1) Input the samples, containing k categories and d dimensions. Let the samples be distributed within a rectangular
region.
2) Divide the rectangular region into 10 × 10 × ··· × 10 (d factors of 10) small regions called units.
3) If there are some units containing samples from two or more different categories, then divide them into smaller
units repeatedly until each unit covers at most samples from the same category.
Fig. 1. Fuzzy boundary set.
4) Label each unit with 1, 2, . . . , k, according to the category of the samples inside, and unite the adjacent units with the same label into a bigger unit.
5) For each sample in the set, locate its position in the model, that is, figure out which unit it is located in.
6) Combine the samples that are located in the same unit into one equivalent class; a number of equivalent classes in different layers are thereby obtained.
7) Pick up one and only one sample from each equivalent class to form the MCS of HSC.
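The steps above can be condensed into the following minimal Python sketch of the serial computation, assuming samples are already normalized to [0, 1) and given as (features, label) pairs; the helper names unit_key and minimal_consistent_subset are illustrative, and the merging of adjacent units in step 4 is omitted for brevity.

from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[Tuple[float, ...], int]   # (features in [0, 1), category label)

def unit_key(features: Tuple[float, ...], depth: int) -> str:
    """Identify the unit a sample falls into when every axis has been split
    10-fold `depth` times (steps 2 and 3)."""
    return ":".join("".join(str(int(x * 10**d) % 10) for x in features)
                    for d in range(1, depth + 1))

def minimal_consistent_subset(samples: List[Sample], max_depth: int = 6) -> List[Sample]:
    mcs: List[Sample] = []
    pending = samples
    for depth in range(1, max_depth + 1):
        units: Dict[str, List[Sample]] = defaultdict(list)
        for s in pending:                          # step 5: locate each sample's unit
            units[unit_key(s[0], depth)].append(s)
        pending = []
        for members in units.values():
            labels = {label for _, label in members}
            if len(labels) == 1:                   # pure unit = one equivalent class
                mcs.append(members[0])             # step 7: keep a single representative
            else:
                pending.extend(members)            # step 3: subdivide further
    return mcs

Whichever member of a pure unit is appended to mcs is immaterial for the resulting classifier, which is exactly the source of the uncertainty of the MCS elements discussed next.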
The algorithm above is not sensitive to the randomly picked initial selection or to the order of consideration of the input samples. And some samples in the MCS are replaceable, while others are not. Some close samples within the same category that fall into the same unit are equivalent to each other in building the classifier, and each of them can be picked randomly for the MCS. On the contrary, sometimes there is only one sample in a unit, and this sample plays a unique role in forming the hyper surface. Hence the outcome of the MCS is uncertain for us.
Note that different division granularities lead to different separating hyper surfaces and inherent structures. As seen in Fig. 1, each boundary denoted by a dotted line (l1, l2, l3, etc.) may be used in the division process, and all the possible separating hyper surfaces form a fuzzy boundary set. The samples in the fuzzy boundary set have different memberships for the separating hyper surface used in the division process. Specifically, the samples lying on dotted line l2 have the maximum membership, i.e. 1, for the separating hyper surface, while the samples lying on dotted lines l1 and l3 have uncertain memberships larger than 0.
For a specific sample set, the MCS almost reflects its classification ability. Any addition to the MCS will not improve the classification ability, while every single deletion from the MCS will lead to a loss in testing accuracy. This feature is useful for obtaining the inherent structure, which is uncertain for us. However, all of the operations have to be executed in memory, so when dealing with large scale data sets, the existing serial algorithm will encounter the problem of insufficient memory.
2.3. MapReduce framework
MapReduce, as the framework shown in Fig. 2, is a simplified programming model and computation platform for processing distributed large scale data sets. It specifies the computation in terms of a map and a reduce function. The underlying runtime system automatically parallelizes the computation across large scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
As its name shows, map and reduce are two basic operations in the model. Users specify a map function that
processes a key-value pair to generate a set of intermediate key-value pairs, and a reduce function that merges all
intermediate values associated with the same intermediate key.
All data processed by MapReduce are in the form of key-value pairs. The execution happens in two phases. In the first phase, the map function is called once for each input record. For each call, it may produce any number of intermediate key-value pairs. A map function is used to take a single key-value pair and output a list of new key-value pairs. The types of the output key and value can be different from those of the input key and value. This can be formalized as:
Fig. 2. Illustration of the MapReduce framework: the map is applied to all input records, which generates intermediate results that are aggregated
by the reduce.
    map :: (key1, value1) → list(key2, value2)    (2)
In the second phase, these intermediate pairs are sorted and grouped by key2, and the reduce function is called once for each key. Finally, the reduce function is given all associated values for the key and outputs a new list of values. Mathematically, this can be represented as:

    reduce :: (key2, list(value2)) → (key3, value3)    (3)
The MapReduce model provides a sufficiently high level of parallelization. Since the map function only takes a single record, all map operations are independent of each other and fully parallelizable. The reduce function can be executed in parallel on each set of intermediate pairs with the same key.
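To illustrate the map/reduce contract of Eqs. (2) and (3), here is a minimal in-memory Python sketch using a word-count style job; it only mimics the key-value flow and grouping described above and makes no use of the actual Hadoop API.

from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# map :: (key1, value1) -> list((key2, value2))
def map_fn(offset: int, line: str) -> List[Tuple[str, int]]:
    return [(word, 1) for word in line.split()]

# reduce :: (key2, list(value2)) -> (key3, value3)
def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    return word, sum(counts)

def run_job(records: List[Tuple[int, str]]) -> Dict[str, int]:
    # Phase 1: apply map to every record independently (fully parallelizable).
    intermediate: Dict[str, List[int]] = defaultdict(list)
    for key1, value1 in records:
        for key2, value2 in map_fn(key1, value1):
            intermediate[key2].append(value2)    # shuffle: group by key2
    # Phase 2: apply reduce once per distinct intermediate key.
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

data = [(0, "big data with uncertainty"), (27, "big data sampling")]
print(run_job(data))  # {'big': 2, 'data': 2, 'with': 1, 'uncertainty': 1, 'sampling': 1}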
3. Parallel sampling method based on hyper surface
In this section, the Parallel Sampling method based on Hyper Surface (PSHS) for big data with uncertainty distribution is summarized. Firstly, we give the representation of the hyper surface, inspired by decision trees. Secondly, we analyze the conversion from the serial parts to the parallel parts of the algorithm. Then we explain in detail how the necessary computations can be formalized as map and reduce operations under the MapReduce framework.
3.1. Hyper surface representation
In fact, it is difficult to exactly represent a hyper surface of the R^n space in the computer. Inspired by decision trees,
we can use some labeled regions to approximate a hyper surface. All N input features except the class attribute can
be considered to be real numbers in the range [0, 1). There is no loss of generality in this step. All physical quantities
must have some upper and lower bounds on their range, so suitable linear or non-linear transformations to the interval
[0, 1) can always be found. The inputs being in the range [0, 1) means that these real numbers can be expressed as
decimal fractions. This is convenient because each successive digit position corresponds to a successive part of the
feature space.
Sampling is performed by simultaneously examining the most significant digit (MSD) of each of the N inputs. This either yields the equivalent class directly (a leaf of the tree), or indicates that we must examine the next most significant digit (descend down a branch of the tree) to determine the equivalent class. The next decimal digit then either yields the equivalent class, or tells us to examine the following digit further, and so on. Thus sampling is equivalent to finding the region (a leaf node of the tree) representing an equivalent class, and picking up one and only one sample from each region (a leaf node of the tree) to form the MCS of HSC. As sampling occurs one decimal digit at a time, even numbers with very long digit expansions, such as 0.873562, are handled with ease, because the number of digits required for successful sampling is usually very small. Before data can be sampled, the decision tree must be constructed as follows:
Table 1
Nine samples of 4-dimensional data.
Attribute 1 Attribute 2 Attribute 3 Attribute 4 Category
0.431 0.725 0.614 0.592 1
0.492 0.726 0.653 0.527 2
0.457 0.781 0.644 0.568 1
0.625 0.243 0.672 0.817 2
0.641 0.272 0.635 0.843 2
0.672 0.251 0.623 0.836 2
0.847 0.534 0.278 0.452 1
0.873 0.528 0.294 0.439 2
0.875 0.523 0.295 0.435 2
Table 2
The most significant digits of the 9 samples.
Sample MSD Category
s1 4765 1
s2 4765 2
s3 4765 1
s4 6268 2
s5 6268 2
s6 6268 2
s7 8524 1
s8 8524 2
s9 8524 2
1) Input all sample data, and normalize each dimension of them into [0, 1). The entire feature space is mapped to the inside of a unit hyper-cube, referred to as the root region.
2) Divide the region into sub regions by taking the most significant digit of each of the N inputs. Each arrangement of N decimal digits can be viewed as a sub region.
3) For each sub region, if the samples in it belong to the same class, then label it with the samples' class and attach a flag P, which means this region is pure and we can construct a leaf node. Otherwise turn to step 4).
4) Label this region with the majority class and attach a flag N, on behalf of impurity. Then go to step 2) to get the next most significant digits of the input features, until all the sub regions become pure.
From the above steps, we can get a decision tree that describes the inherent structure of the data set. Every node of this decision tree can be regarded as a rule to classify unseen data. For example, consider the 4-dimensional sample set shown in Table 1.
As all the samples have been normalized into [0, 1), we can skip the first step. Then, we get the most significant digits of every sample, as shown in Table 2.
The samples falling into region (6268) all belong to category 2, which means region (6268) is pure. So we label (6268) with category 2 and attach a flag P, and a rule (6268,2:P) is generated. Region (4765) has 2 samples of category 1 and 1 sample of category 2. So we label it with category 1 and attach a flag N, leading to a new rule (4765,1:N), and we must further divide it into sub regions. Similarly, for region (8524) we can get a rule (8524,2:N) and also have to divide it in the next round.
Table 3 shows the result of taking the next most significant digits of the samples falling in the impure regions. All the sub regions of regions (4765) and (8524) become pure, so we have rules (4375,1:P) and (7293,2:P) for the parent region (8524), and rules (3219,1:P), (9252,2:P) and (5846,1:P) for the parent region (4765). The decision tree can be constructed iteratively in this way. The decision tree having the equivalent function of the generated rules is shown in Fig. 3. We notice that there is no need to construct the decision tree in memory; the rules can be generated straightforwardly, which can be used to design the Parallel Sampling method based on Hyper Surface.
Table 3
The next most significant digits.
Sample MSD Category
s1 3219 1
s2 9252 2
s3 5846 1
s7 4375 1
s8 7293 2
s9 7293 2
Fig. 3. An equivalent decision tree of the generated rules.
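The rule generation walked through above (Tables 1-3) can be sketched in a few lines of Python; the names digit and generate_rules are illustrative, inputs are assumed normalized to [0, 1), and region keys of deeper layers are prefixed with their parent region and separated by ":" as in the mapper of Section 3.3.

from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[Tuple[float, ...], str]   # (normalized features, category)

def digit(x: float, n: int) -> str:
    """The n-th digit of x after the decimal point, for x in [0, 1).
    (A production version would read digits off the decimal string to avoid
    floating point rounding at digit boundaries.)"""
    return str(int(x * 10**n) % 10)

def generate_rules(samples: List[Sample], max_layer: int = 6) -> Dict[str, Tuple[str, str]]:
    """Map each region key to (majority category, 'P' or 'N'), following steps 2)-4)."""
    rules: Dict[str, Tuple[str, str]] = {}
    pending = samples
    for layer in range(1, max_layer + 1):
        regions: Dict[str, List[Sample]] = defaultdict(list)
        for feats, cat in pending:
            key = ":".join("".join(digit(x, d) for x in feats)
                           for d in range(1, layer + 1))
            regions[key].append((feats, cat))
        pending = []
        for key, members in regions.items():
            counts = Counter(cat for _, cat in members)
            majority, _ = counts.most_common(1)[0]
            purity = "P" if len(counts) == 1 else "N"
            rules[key] = (majority, purity)
            if purity == "N":                      # impure region: refine at the next layer
                pending.extend(members)
        if not pending:
            break
    return rules

Applied to the nine samples of Table 1, this produces the first-layer rules (4765,1:N), (6268,2:P) and (8524,2:N) and, for the impure regions, the second-layer rules listed above (with keys such as 8524:7293).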
3.2. The analysis of MCS from serial to parallel
In the existing serial algorithm, the most common operation is to divide a region containing more than one class into smaller regions and then determine whether each sub region is pure or not. If a sub region is pure, the samples that fall into it will not provide any useful information for constructing other sub regions, so they can be removed from the samples. Therefore, determining whether the sub regions sharing the same parent region are pure or not can be executed in parallel. From Section 2 we know that the process of computing the MCS is to construct a multi-branched tree whose function is similar to a decision tree. Therefore, we can construct one layer of the tree per iteration, from top to bottom, until each leaf node, which represents a region, is pure.
3.3. The sampling process of PSHS
Following the analysis above, the PSHS algorithm needs three kinds of MapReduce jobs run in iteration. In the first job, according to the value of each dimension, the map function assigns each sample to the region it belongs to, while the reduce function determines whether a region is pure or not and outputs a string representing the region and its purity attribute. After this job, one layer of the decision tree has been constructed, and we must remove the unnecessary samples that are not useful for constructing the next layer of the decision tree, which is the task of the second job. In the third job, i.e. the sampling job, the task of the map function is to assign each sample to the pure region it belongs to, according to the rules representing pure regions. Since samples in the same pure region are equivalent to each other in building the classifier, the reduce function can randomly pick one of them for the MCS. Firstly, we present the details of the first job.
Map Step: The input data set is stored on HDFS, a file system on Hadoop, as a sequence file of <key, value> pairs, each of which represents a record in the data set. The key is the offset in bytes of this record from the start point of the data file, and the value is a string of the content of a sample and its class. The data set is split and globally broadcast to all mappers. The pseudo code of the map function is shown in Algorithm 1. We can pass some parameters to the job before the map function is invoked. For simplicity, we use dim to represent the dimension of the input features except the class attribute, and layer to represent the level of the tree to be constructed.
In Algorithm 1, the main goal is to get the corresponding region a sample belongs to, which is accomplished in steps 3 to 9. A character ":" is appended after getting the digits of each dimension to indicate that a layer is finished.
Algorithm 1 TreeMapper(key, value)
Input: (key: offset in bytes; value: text of a record)
Output: (key: a string representing a region; value: the class label of the input sample)
1. Parse the string value into an array, named data, of size dim, and its class label, named category;
2. Set string outkey as a null string;
3. for i = 1 to layer do
4.   for j = 0 to dim − 1 do
5.     append outkey with getNum(data[j], i)
6.   end for
7.   if i < layer then
8.     append outkey with ":"
9.   end if
10. end for
11. output(outkey, category)
Algorithm 2 getNum(num, n)
Input: (num: a double variable in [0, 1); n: an integer)
Output: a character representing the n-th digit after the decimal point.
1. i ← n
2. while i > 0 do
3.   num ← num × 10
4.   i ← i − 1
5. end while
6. get the integer part of num and assign it to a variable ret
7. ret ← ret mod 10
8. return the corresponding character of ret
We invoke a procedure getNum(num, n) in the process. Its function is to get the n-th digit of num, as described in Algorithm 2.
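As a concrete, purely illustrative Python counterpart of Algorithms 1 and 2, assuming each record value is a whitespace-separated string of dim normalized features followed by the class label, and ignoring Hadoop-specific I/O:

from typing import List, Tuple

def get_num(num: float, n: int) -> str:
    """Algorithm 2: the n-th digit after the decimal point of num in [0, 1).
    (A production version would read the digit off the decimal string to
    avoid floating point rounding at digit boundaries.)"""
    for _ in range(n):
        num *= 10
    return str(int(num) % 10)

def tree_map(value: str, dim: int, layer: int) -> Tuple[str, str]:
    """Algorithm 1: emit (region key, category) for one input record."""
    fields = value.split()
    data: List[float] = [float(x) for x in fields[:dim]]
    category = fields[dim]
    parts = ["".join(get_num(x, i) for x in data) for i in range(1, layer + 1)]
    return ":".join(parts), category             # layers separated by ":"

# First sample of Table 1 at layer = 2: region key "4765:3219", category "1"
print(tree_map("0.431 0.725 0.614 0.592 1", dim=4, layer=2))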
Reduce Step: The input of the reduce function is the data obtained from the map function of each host. In the reduce function, we count the number of samples of each class. If the class labels of all samples in a region are identical, the region is pure. If a region is impure, we label it with the majority category. First we pass all the class labels, named categories, to the job as parameters, which will be used in the reduce function. The pseudo code for the reduce function is shown in Algorithm 3. Fig. 4 shows the complete job procedure.
When the first job is finished, we get a set of regions that cover all the samples. If a region is impure, we must divide it into sub regions until the sub regions are all pure. Hence, if a region is pure, the samples that fall in it are no longer needed and can be removed. Therefore, the second job can be regarded as a filter whose function is to remove the unnecessary samples that are not useful for constructing the next layer of the decision tree. We should read the impure regions into memory before we can decide whether a sample should be removed or not. We use a variable set to store the impure regions. The second job's mapper can then be described as in Algorithm 4. Hadoop provides a default reduce implementation which outputs the result of the mapper, and this is what we adopt in the second job. The complete job procedure can be seen in Fig. 5.
The first job and the second job run iteratively until all the samples have been removed, in other words until all the rules have been generated. We then get several rule sets, each of which represents a layer of the decision tree. In the sampling job, according to the rules representing pure regions, the map function assigns each sample to the pure region it belongs to. The rules representing pure regions should be read into memory before sampling. A list variable rules is used to store all these rules. The pseudo code for the map function of the sampling job is shown in Algorithm 5.
In the reduce function of the sampling job, we can randomly pick one sample from each pure region for the MCS. The pseudo code of the reduce function is described in Algorithm 6. Fig. 6 shows the complete job procedure.
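A minimal in-memory Python sketch of this sampling job, mirroring Algorithms 5 and 6, is given below; region_key repeats the digit extraction of the mapper above, pure_rules is assumed to be a list of (region key, category) pairs for pure regions, a sample "matches" a rule when its key at the rule's depth equals the rule's key, and the dictionary grouping stands in for the MapReduce shuffle.

import random
from collections import defaultdict
from typing import Dict, List, Tuple

def region_key(value: str, dim: int, layer: int) -> str:
    """Region key of a record (whitespace-separated features then label) at a given depth."""
    feats = [float(x) for x in value.split()[:dim]]
    return ":".join("".join(str(int(x * 10**d) % 10) for x in feats)
                    for d in range(1, layer + 1))

def sampling_map(value: str, dim: int,
                 pure_rules: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Algorithm 5: emit (pure region key, record) when the sample matches a pure-region rule."""
    out = []
    for region, _category in pure_rules:
        layer = region.count(":") + 1                   # depth is encoded in the rule key
        if region_key(value, dim, layer) == region:
            out.append((region, value))
    return out

def sampling_reduce(region: str, records: List[str]) -> str:
    """Algorithm 6: keep one (here randomly chosen) representative per pure region."""
    return random.choice(records)

def sample_mcs(records: List[str], dim: int,
               pure_rules: List[Tuple[str, str]]) -> List[str]:
    grouped: Dict[str, List[str]] = defaultdict(list)   # stands in for the shuffle phase
    for value in records:
        for region, rec in sampling_map(value, dim, pure_rules):
            grouped[region].append(rec)
    return [sampling_reduce(r, recs) for r, recs in grouped.items()]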
Algorithm 3 TreeReducer
Input: (key: a string representing a region; values: the list of class labels of all samples falling in this region)
Output: (key: identical to key; value: the class label of this region plus its purity attribute)
1. Initialize an array count to 0, with size equal to the number of all the class labels;
2. Initialize a counter totalnum to 0 to record the number of samples in this region;
3. while values.hasNext() do
4.   get a class label c from values.next()
5.   count[c]++
6.   totalnum++
7. end while
8. find the majority class max in count and its corresponding index i
9. if all samples belong to max, i.e. totalnum = count[i], then
10.   purity ← P
11. else
12.   purity ← N
13. end if
14. construct value by combining max and purity
15. output(key, value)
Fig. 4. Generating a layer of the decision tree.
Algorithm 4 FilterMapper
Input: (key: offset in bytes; value: text of a record)
Output: (key: identical to value if this sample falls in an impure region; value: a null string)
1. if this sample matches a rule in set then
2.   output(value, "")
3. end if
4. Experiments
In this section, we demonstrate the performance of our proposed algorithm with respect to effectiveness and efficiency by dealing with big data with uncertainty distribution, including real world data from the UCI machine learning repository and synthetic data. Performance experiments were run on a cluster of ten computers: six of them each have four 2.8 GHz cores and 4 GB memory, and the remaining four each have two 2.8 GHz cores and 4 GB memory. Hadoop version 0.20.0 and Java 1.6.0_22 are used as the MapReduce system for all experiments.
Fig. 5. Filter.
Algorithm 5 SamplingMapper
Input: (key: offset in bytes; value: text of a record)
Output: (key: a string representing a pure region; value: identical to value)
1. Set string pureRegion as a null string;
2. for i = 0 to (rules.length − 1) do
3.   if this sample matches rules[i] then
4.     pureRegion ← the string representing the region of rules[i]
5.     output(pureRegion, value)
6.   end if
7. end for
Algorithm 6 SamplingReducer
Input: (key: a string representing a pure region; values: the list of all samples falling in this region)
Output: (key: one random sample of each pure region; value: a null string)
1. Set string samp as a null string;
2. if values.hasNext() then
3.   samp ← values.next()
4.   output(samp, "")
5. end if
Fig. 6. Sampling.
4.1. Effectiveness
First of all, to illustrate the effectiveness of PSHS more vividly and clearly, the following figures are presented. We use two data sets from the UCI repository. The Waveform data set has 21 attributes, 3 categories and 5000 samples. The Poker Hand data set contains 25,010 samples from 10 categories in a ten dimensional space. Both data sets are transformed into three dimensions by using the method in [5].
The serial MCS computation method mentioned in [7] is executed to obtain the MCS of the Poker Hand data set, which is then trained by HSC. The trained model of the hyper surface is shown in Fig. 7. Furthermore, we adopt the PSHS algorithm to obtain the MCS of this data set. For comparison, the MCS of a given sample set obtained by PSHS is denoted by PMCS, while the MCS obtained by the serial MCS computation method is denoted by MCS (the same below). The PMCS is also used for training, and its hyper surface structure is shown in Fig. 8.
From the two figures above, we can see that the hyper surface structures of the MCS and the PMCS are exactly the same. They both have only one sample in each unit. No matter which we choose for training, either MCS or PMCS, we get the same hyper surface maintaining an identical distribution. The same holds for the Waveform data set; the identical hyper surface structures obtained by its MCS and PMCS are shown in Fig. 9.
For a specific sample set, the Minimal Consistent Subset almost reflects its classification ability. Table 4 shows the classification ability of the MCS and the PMCS. All the data sets used here are taken from the UCI repository. From this table, we can see that the testing accuracy obtained from the PMCS is the same as that obtained from the MCS, which means that the PSHS algorithm is totally consistent with the serial MCS computation method.
One notable feature of PSHS, the ability to deal with big data with uncertainty distribution, is shown in Table 5. We obtain the synthetic three dimensional data by following the approach used in [4], and carry out the actual numerical sampling and classification. The sampling time of PSHS is much better than that of the serial MCS computation method, while achieving the same testing accuracy.
4.2. Efficiency
We evaluate the efficiency of our proposed algorithm in terms of speedup, scaleup and sizeup [17] when dealing with big data with uncertainty distribution. We use the Breast Cancer Wisconsin data set from the UCI repository, which contains 699 samples from two different categories. The data set is first transformed into three dimensions by using the method in Ref. [5], and then replicated to get 3 million, 6 million, 12 million, and 24 million samples respectively.
Speedup: In order to measure the speedup, we keep the data set constant and increase the number of cores in the system. More specifically, we first apply the PSHS algorithm on a system consisting of 4 cores, and then gradually increase the number of cores. The number of cores varies from 4 to 32 and the size of the data set increases from 3 million to 24 million. The speedup given by the larger system with m cores is measured as:
    Speedup(m) = (run-time on 1 core) / (run-time on m cores)    (4)
The perfect parallel algorithm demonstrates linear speedup: a system with m times the number of cores yields a speedup of m. In practice, linear speedup is very difficult to achieve because of the communication cost and the skew of the slaves: the slowest slave determines the total time needed, so if not every slave needs the same time, we have a skew problem.
We have performed the speedup evaluation on data sets with different sizes. Fig. 10 demonstrates the results. As the size of the data set increases, the speedup of PSHS becomes approximately linear, especially when the data set is big, such as 12 million and 24 million samples. We also notice that when the data set is small, such as 3 million samples, the performance of the 32-core system is not significantly improved compared to that of the 16-core system, which does not accord with our intuition. The reason is that the time for processing the 3 million sample data set is not much greater than the communication time among the nodes and the time occupied by fault tolerance. However, as the data set grows, the processing time occupies the main part, leading to a good speedup performance.
Scaleup: Scaleup measures the ability to grow both the system and the data set size. It is defined as the ability of an m-times larger system to perform an m-times larger job in the same run-time as the original system. The scaleup metric is:
    Scaleup(data, m) = (run-time for processing data on 1 core) / (run-time for processing m × data on m cores)    (5)
Fig. 7. Poker Hand data set and hyper surface structure obtained by its MCS.
Fig. 8. PMCS and hyper surface structure obtained by PMCS of Poker Hand data set.
Fig. 9. The hyper surface structures obtained by MCS and PMCS of Waveform data set.
Table 4
Comparison of classification ability.
Data set   Sample No.   MCS sample No.   PMCS sample No.   MCS accuracy   PMCS accuracy   Sampling ratio
Iris 150 80 80 100% 100% 53.33%
Wine 178 129 129 100% 100% 72.47%
Sonar 208 186 186 100% 100% 89.42%
Wdbc 569 268 268 100% 100% 47.10%
Pima 768 506 506 99.21% 99.21% 65.89%
Contraceptive Method Choice 1473 1219 1219 100% 100% 82.76%
Waveform 5000 4525 4525 99.84% 99.84% 90.50%
Breast Cancer Wisconsin 9002 1243 1243 99.85% 99.85% 13.81%
Poker Hand 25,010 22,904 22,904 98.29% 98.29% 91.58%
Letter Recognition 20,000 13,668 13,668 90.47% 90.47% 68.34%
Ten Spiral 33,750 7285 7285 100% 100% 21.59%
Table 5
Performance comparison on synthetic data.
Sample No.   Testing sample No.   MCS sample No.   PMCS sample No.   MCS sampling time   PMCS sampling time   MCS testing accuracy   PMCS testing accuracy
1,250,000 5,400,002 875,924 875,924 14 m 21 s 1 m 49 s 100% 100%
5,400,002 10,500,000 1,412,358 1,412,358 58 m 47 s 6 m 52 s 100% 100%
10,500,000 22,800,002 6,582,439 6,582,439 1 h 30 m 51 s 12 m 8 s 100% 100%
22,800,002 54,000,000 12,359,545 12,359,545 3 h 15 m 37 s 25 m 16 s 100% 100%
54,000,000 67,500,000 36,582,427 36,582,427 7 h 41 m 35 s 48 m 27 s 100% 100%
To demonstrate how well PSHS deals with big data with uncertainty distribution when more cores are available, we have performed scalability experiments where we increase the size of the data set in proportion to the number of cores. Data sets of 3 million, 6 million, 12 million and 24 million samples are processed on 4, 8, 16 and 32 cores, respectively. Fig. 11 shows the performance results on these data sets.
Fig. 10. Speedup performance.
Fig. 11. Scaleup performance.
As the data set becomes larger, the scalability of PSHS drops slowly. It always maintains a scaleup value higher than 84%. Obviously, the PSHS algorithm scales very well.
Sizeup: Sizeup analysis holds the number of cores in the system constant and grows the size of the data set. Sizeup measures how much longer it takes on a given system when the data set size is m-times larger than the original data set. The sizeup metric is defined as follows:
    Sizeup(data, m) = (run-time for processing m × data) / (run-time for processing data)    (6)
To measure the sizeup performance, we have fixed the number of cores to 4, 8, 16 and 32 respectively. Fig. 12 shows the sizeup results on different numbers of cores. When the number of cores is small, such as 4 and 8, the sizeup performances differ little. However, as more cores become available, the value of sizeup on 16 or 32 cores decreases significantly compared to that on 4 or 8 cores for the same data sets. The graph demonstrates that PSHS has a very good sizeup performance.
Fig. 12. Sizeup performance.
5. Conclusion
With the advent of the big data era, the demand for processing big data with uncertainty distribution is increasing. In this paper, we present a Parallel Sampling method based on Hyper Surface (PSHS) for big data with uncertainty distribution to get the Minimal Consistent Subset (MCS) of the original sample set whose inherent structure is uncertain. Our experimental evaluation on both real and synthetic data sets showed that our approach not only obtains a hyper surface structure and testing accuracy consistent with the serial algorithm, but also performs efficiently in terms of speedup, scaleup and sizeup. Besides, our algorithm can process big data with uncertainty distribution on commodity hardware efficiently. It should be noted that PSHS is a universal algorithm, but its features may be very different with different classification methods. We will further conduct experiments and refine the parallel algorithm to improve the usage efficiency of computing resources in the future.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Nos. 61035003, 61175052,
61203297), National High-tech R&D Program of China (863 Program) (Nos. 2012AA011003, 2013AA01A606,
2014AA012205).
References
[1] V. Novák, Are fuzzy sets a reasonable tool for modeling vague phenomena?, Fuzzy Sets Syst. 156 (2005) 341–348.
[2] L.A. Zadeh, Fuzzy sets, Inf. Control 8 (1965) 338–353.
[3] D. Dubois, H. Prade, Gradualness, uncertainty and bipolarity: Making sense of fuzzy sets, Fuzzy Sets Syst. 192 (2012) 3–24.
[4] Q. He, Z. Shi, L. Ren, E. Lee, A novel classification method based on hypersurface, Math. Comput. Model. 38 (2003) 395–407.
[5] Q. He, X. Zhao, Z. Shi, Classification based on dimension transposition for high dimension data, Soft Comput. 11 (2007) 329–334.
[6] X. Zhao, Q. He, Z. Shi, Hypersurface classifiers ensemble for high dimensional data sets, in: Advances in Neural Networks – ISNN 2006, Springer, 2006, pp. 1299–1304.
[7] Q. He, X. Zhao, Z. Shi, Minimal consistent subset for hyper surface classification method, Int. J. Pattern Recognit. Artif. Intell. 22 (2008) 95–108.
[8] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM 51 (2008) 107–113.
[9] R. Lämmel, Google's MapReduce programming model – revisited, Sci. Comput. Program. 70 (2008) 1–30.
[10] P. Hart, The condensed nearest neighbor rule, IEEE Trans. Inf. Theory 14 (1968) 515–516.
[11] V. Cerverón, A. Fuertes, Parallel random search and tabu search for the minimal consistent subset selection problem, in: Randomization and Approximation Techniques in Computer Science, Springer, 1998, pp. 248–259.
[12] B.V. Dasarathy, Minimal consistent set (MCS) identification for optimal nearest neighbor decision systems design, IEEE Trans. Syst. Man Cybern. 24 (1994) 511–517.
[13] P.A. Devijver, J. Kittler, On the edited nearest neighbor rule, in: Proc. 5th Int. Conf. on Pattern Recognition, 1980, pp. 72–80.
[14] L.I. Kuncheva, Fitness functions in editing k-NN reference set by genetic algorithms, Pattern Recognit. 30 (1997) 1041–1049.
[15] C. Swonger, Sample set condensation for a condensed nearest neighbor decision rule for pattern recognition, in: Frontiers of Pattern Recognition, 1972, pp. 511–519.
[16] H. Zhang, G. Sun, Optimal reference subset selection for nearest neighbor classification by tabu search, Pattern Recognit. 35 (2002) 1481–1490.
[17] X. Xu, J. Jochen, H. Kriegel, A fast parallel clustering algorithm for large spatial databases, in: High Performance Data Mining, Springer, 2002, pp. 263–290.
