
IPASJ International Journal of Computer Science (IIJCS)

Web Site: http://www.ipasj.org/IIJCS/IIJCS.htm
Email: editoriijcs@ipasj.org
Volume 6, Issue 7, July 2018    ISSN 2321-5992

A Survey on Big Data and Knowledge Acquisition Techniques
A. Mahmoud Soufi 1, A. A. Abd El-Aziz 2 and Hesham A. Hefny 3

1 Dept. of Computer Science, Institute of Statistical Studies and Research,
Cairo University, Cairo, P.O. 12613

2 Dept. of Information Systems & Technology, Institute of Statistical Studies and Research,
Cairo University, Cairo, P.O. 12613;
Dept. of Information Systems, College of Computer & Information Sciences,
Jouf University, Saudi Arabia

3 Vice-Dean, Institute of Statistical Studies and Research,
Cairo University, Cairo, P.O. 12613

Abstract
Big data refers to data sets whose volume, variety and velocity exceed the capacity of traditional computer memory, processors and software, so that current techniques and technologies may not be able to handle their storage and processing. More functionally, big data has been described as large pools of unstructured and structured data that can be captured, communicated, aggregated, stored, and analyzed, and that are now becoming part of every sector and function of the global economy. Because big data in real-life applications grows and changes rapidly, quickly extracting useful information from it for decision making poses a new challenge. This survey therefore navigates through big data design and structure and explores the different techniques used to discover useful knowledge from big data, helping decision makers reach good decisions quickly. The survey is intended to support future research and development work and to raise awareness of the presented approaches.
Keywords: Big data, Hadoop, HDFS, PIG, HIVE, Spark, MapReduce, Knowledge discovery, Data mining, Fuzzy set, Rough set, Decision tree, ID3, C4.5, FCM.

1. INTRODUCTION
The world creates a huge volume of data in various domains every second. According to a report from the International Data Corporation (IDC), the overall created and copied data volume in the world in 2011 was 1.8 ZB (≈ 10^21 bytes), which increased by nearly nine times within five years [1]; consequently, approximately 90% of the data in the world was generated in the last two years.

The torrent of increasing data leads to the term "big data", used to describe data sets with new characteristics that traditional computers and technologies cannot manipulate (collect, process, retrieve and manage). By the term big data we mean a data set containing structured, semi-structured or unstructured data captured from different sources such as machines, social networks and organizations, with a massive volume that may include unreliable and imperfect data, from which useful data must be retrieved and analyzed on demand [2].
Most researchers and data scientists now agree with the characterization of big data using the 3 V's defined by Doug Laney of Gartner, to which two more V's are now often added, as shown in Figure 1 [3, 4].


Figure1. The five (5) V's of big data

• Volume: The sheer volume of the data is both a storage issue and a massive analysis issue.
• Variety: The data is generated from a greater variety of sources and formats and contains multidimensional data fields.
• Velocity: The speed of data generation and processing needed to meet the demands and challenges of growth and development.
• Veracity: Managing the reliability and predictability of inherently imperfect data; accurate analysis depends on the veracity of the source data.
• Value: The extent to which big data generates economically worthy insights and/or benefits through extraction and transformation.

Knowledge acquisition techniques have become very important tools for big data processing, due to their ability to deal with huge amounts of data in different formats and with different characteristics and to prepare the data for fast processing. In this survey we therefore present different knowledge acquisition approaches together with different tools. The rest of the paper is organized as follows: Section 2 presents the research related to Hadoop, the most famous and important tool for big data storage and processing. Section 3 presents the research related to knowledge acquisition techniques such as rough sets, fuzzy sets and traditional data mining techniques. Section 4 discusses the surveyed approaches, and Section 5 concludes the paper.

2. HADOOP
Hadoop is an open source ecosystem developed by the Apache Software Foundation; it was created by Doug Cutting and Mike Cafarella in 2005, originally to support distribution for the Nutch search engine project. Hadoop has two major version lines, hadoop-1.x and hadoop-2.x. Hadoop-2.x was developed to address problems in hadoop-1.x, notably the single point of failure. Another difference is the default block size: hadoop-1.x uses 64 MB blocks while hadoop-2.x uses 128 MB blocks, so 2.x improves data throughput (Figure 2) [5], [6].

Hadoop is:
• Reliable: The software is fault tolerant; it expects and handles hardware and software failures.
• Scalable: Designed for a massive scale of processors, memory, and locally attached storage.
• Distributed: Handles replication and offers a massively parallel programming model, MapReduce.


Figure2. Hadoop Versions

Figure3. Hadoop Components (Cloudera)

Hadoop Features

• Ability to deal with complex data.
• Can deal with structured and unstructured data.
• Supports SQL and SQL-like query languages.
• Supports recursive algorithms.
• Suited to complex geo-spatial investigation and genome sequencing.
• Suited to machine learning.
• Handles data sets so massive that they cannot be handled by traditional database software.
• Not expensive, as Hadoop is installed on clusters of commodity hardware.
• Appropriate when results are not required continuously.
• Fault tolerance.

2.1 HDFS
The Hadoop Distributed File System (HDFS™) is a distributed file system that provides high throughput of data access rather than low-latency data access; moreover, it is designed to run on commodity hardware [6], [7]. Its main characteristics are:

• Resilience to hardware failure
• Streaming data access
• Support for large datasets, with scalability to hundreds or thousands of nodes and high aggregate bandwidth
• Locality of applications to data
• Portability across heterogeneous hardware and software platforms

2.1.1 Namenode and Datanodes


HDFS is designed as a master/slave architecture. An HDFS cluster has a single Namenode, a master server that manages the file system namespace and regulates access to files by clients and other programs. In addition, there are many Datanodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on. HDFS exposes a file system namespace and allows user data to be stored in files; each file is split into one or more blocks, and these blocks are stored in a set of Datanodes. The Namenode executes namespace operations on the file system, such as opening, closing, and renaming files and directories, and is responsible for maintaining the mapping between blocks and Datanodes. The Datanodes serve read and write requests from the file system's clients, and also receive instructions from the Namenode to perform block creation, deletion, and replication.
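To make these roles concrete, the short Python sketch below drives the standard "hdfs dfs" command-line client (assuming a configured Hadoop client on the node; the paths are purely illustrative): the namespace operations are served by the Namenode, while the block reads and writes behind them go to the Datanodes.

# Minimal sketch: issue HDFS namespace operations through the standard "hdfs dfs" CLI.
# Assumes a configured Hadoop client on this node; paths are illustrative only.
import subprocess

def hdfs(*args):
    # Run an "hdfs dfs" sub-command and return its output as text.
    result = subprocess.run(["hdfs", "dfs", *args],
                            check=True, capture_output=True, text=True)
    return result.stdout

hdfs("-mkdir", "-p", "/user/demo")              # namespace operation handled by the Namenode
hdfs("-put", "local_file.txt", "/user/demo")    # file content is written as blocks on Datanodes
print(hdfs("-ls", "/user/demo"))                # listing served from the Namenode metadata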

Figure4. HDFS Architecture

2.2 MapReduce
MapReduce is a processing engine framework designed to work with HDFS and to allow users to write simple applications that manipulate large amounts of data in a distributed manner. The MapReduce engine has two main operations, from which it takes its name: Map and Reduce, two simple operations. MapReduce splits the input file(s) into independent chunks so that the application can process them in parallel. The framework also consists of a single master, called the Job Tracker, and one slave per cluster node, called a Task Tracker. The Job Tracker is responsible for job scheduling, while the Task Trackers monitor tasks and re-execute failed tasks [8].

2.2.1 MapReduce Main Functionality

Map: splits the problem into sub-problems and distributes them over different blocks so that the map runs in parallel; it transforms an input row of key and value into output key/value pairs, and the output can contain more than one pair for the same key.
Map (key1, value1) -> list<key2, value2>
Reduce: combines the sub-problems in a predefined way to obtain the answer to the original problem, returning one pair for every key.
Reduce (key2, list<value2>) -> list<value3>
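To make the two signatures concrete, the following minimal Python sketch imitates the word-count flow in memory; it illustrates the Map/Reduce contract only, is not code for any particular Hadoop API, and the sample lines are made up.

# Minimal in-memory imitation of the Map/Reduce contract above (word count).
# Illustrative only; a real Hadoop job would emit these pairs through the framework.
from collections import defaultdict

def map_phase(line):                 # Map(key1, value1) -> list<key2, value2>
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):       # Reduce(key2, list<value2>) -> list<value3>
    return [sum(values)]

lines = ["big data big value", "big data velocity"]   # made-up input records
grouped = defaultdict(list)          # "shuffle" step: group intermediate pairs by key
for line in lines:
    for key, value in map_phase(line):
        grouped[key].append(value)

counts = {key: reduce_phase(key, values)[0] for key, values in grouped.items()}
print(counts)                        # {'big': 3, 'data': 2, 'value': 1, 'velocity': 1}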

Figure5. MapReduce operation


2.3 Phoenix
Phoenix is a shared-memory MapReduce implementation for data-intensive processing tasks, which can be used for programming multi-core chips as well as shared-memory multiprocessors [24].
2.4 Twister
Twister is an iterative MapReduce runtime. It provides a feature for caching MapReduce tasks, which allows developers to build iterative applications without spending much time on reading and writing large amounts of data in each iteration [25].

2.5 PIG
Pig is an ETL engine built on top of MapReduce that allows writing simple programs instead of hand-coding mappers and reducers. Pig consists of two components: a language called Pig Latin and a runtime environment that executes Pig Latin programs. The main processes are Load, Transform and Dump/Store [9].
Load: tells the Pig Latin program which file to read; the file must already be in HDFS and its location is given to the program.
A = LOAD '/user/cloudera/file' USING PigStorage(':');
Transform: the main function of Pig, a data manipulation phase that allows the user to filter, join, group, order, etc.
B = FOREACH A GENERATE $0, $4, $5;
Dump and Store: commands that emit the result, to the screen in the case of DUMP or to a file in the case of STORE.
DUMP B;
STORE B INTO 'userinfo.out';

Figure6. PIG Architecture

2.6 Hive
Hive is a data warehouse engine used for data mining, analytics, machine learning and data analysis. Hive uses an SQL-like language called HiveQL and can be executed over several engines such as MapReduce, Tez and Spark, with data stored in HDFS or HBase [10].
// Start the Hive engine
beeline -u jdbc:hive2://
-- Create a table, defining column names, types and the storage format
CREATE TABLE userinfo ( uname STRING, pswd STRING, uid INT, gid INT, fullname STRING, hdir STRING,
shell STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ':' STORED AS TEXTFILE;
-- Load data from a file into the created table
LOAD DATA INPATH '/tmp/passwd' OVERWRITE INTO TABLE userinfo;
-- Select from the created table
SELECT uname, fullname, hdir FROM userinfo ORDER BY uname;

Figure7. Hive Architecture


2.7 HBase
Apache HBase was developed by the Powerset company to manipulate large amounts of data, especially for natural language processing; in November 2010 Facebook implemented its new messaging platform over HBase. HBase is a column-oriented database management system that stores data as key-value pairs with a dynamic data model. It is written in Java, runs on top of HDFS, and is not equivalent to a relational database system, so it does not support a structured query language such as SQL. The HBase architecture is similar to the HDFS architecture: the HBase master node is equivalent to the Name Node, handling the cluster, while region servers, like Data Nodes, store portions of the tables and perform the reads and writes of the data [11].

# Start the HBase shell
hbase shell
# Create a table with two column families, then list the tables
create 'emp', 'personal data', 'professional data'
list
# Insert data into the table
put 'emp','1','personal data:name','John'
put 'emp','2','personal data:name','Stephen'
put 'emp','1','professional data:salary','5000'
put 'emp','2','professional data:job','manager'
put 'emp','2','professional data:salary','6000'
# Select (scan) the table
scan 'emp'

Figure8. HBase Architecture

2.8 Spark
Apache Spark is a fast and general engine for large-scale data processing that allows user programs to load data into a cluster's memory and query it repeatedly using multi-stage in-memory computation, making it up to 100 times faster for certain applications [12]. Spark can access many data sources such as HDFS, Cassandra, HBase, and S3; moreover, Spark can run over Mesos, Hadoop or in the cloud, and can also run standalone. Furthermore, it provides rich libraries and APIs such as SQL and DataFrames, Spark Streaming, MLlib (machine learning), GraphX (graphs), and third-party projects, and it allows programs to be written in many programming languages such as Java, Scala, Python and R.
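As a brief illustration, the PySpark sketch below (assuming an available Spark installation; the file path and column name are placeholders) loads a dataset once, caches it in cluster memory, and queries it repeatedly through both the DataFrame API and SQL.

# Minimal PySpark sketch: load data once, cache it in cluster memory, query it repeatedly.
# Assumes an available Spark installation; the file path and column name are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SurveyExample").getOrCreate()

df = spark.read.csv("hdfs:///data/events.csv", header=True, inferSchema=True)
df.cache()                                     # keep the dataset in memory across queries

df.groupBy("category").count().show()          # DataFrame API query
df.createOrReplaceTempView("events")
spark.sql("SELECT category, COUNT(*) AS n FROM events GROUP BY category").show()

spark.stop()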

Figure9. Spark Architecture


3. BIG DATA KNOWLEDGE ACQUISITION TOOLS

In this section, we explore the knowledge acquisition tools that are applied on top of big data frameworks to enhance knowledge processing, since we deal with huge amounts of data that must be cleaned and classified in order to deliver the desired information at the proper time.

3.1. Using Rough Set


Rough set theory is a powerful and popular knowledge acquisition tool. The theory, introduced by Pawlak in the nineteen-eighties, is specifically suitable for dealing with information systems that present data inconsistencies (i.e., objects with different values for the decision attribute but identical values for the condition attributes). Rough sets, which categorize objects based on the indiscernibility of their attribute values, make it possible to deal with incomplete, imprecise or uncertain data [13].
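As a small illustration of these notions (a plain in-memory sketch with a made-up decision table, not the parallel algorithms surveyed below), the following Python snippet builds the indiscernibility classes of a toy information system and derives the lower and upper approximations of one decision class.

# Minimal in-memory sketch of rough set lower/upper approximations.
# The decision table (condition attributes, decision) is made up for illustration.
from collections import defaultdict

table = [(("high", "yes"), "flu"), (("high", "yes"), "flu"),
         (("high", "no"),  "flu"), (("high", "no"),  "cold"),
         (("low",  "no"),  "cold")]

# Indiscernibility classes: objects with identical condition attribute values
classes = defaultdict(set)
for i, (conditions, _) in enumerate(table):
    classes[conditions].add(i)

target = {i for i, (_, decision) in enumerate(table) if decision == "flu"}   # decision class X

lower = set().union(*(c for c in classes.values() if c <= target))   # certainly in X
upper = set().union(*(c for c in classes.values() if c & target))    # possibly in X

print("lower approximation:", lower)   # {0, 1}
print("upper approximation:", upper)   # {0, 1, 2, 3}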

In [14], the authors proposed a parallel rough set module that works over the MapReduce engine to optimize the speed of knowledge acquisition processing with respect to the number of processor cores and the size of the data sets. The experimental analysis shows that the method produces the same results as the sequential methods but with greater speedup as the number of cores and the size of the dataset increase, as presented in Figure 10, where speedup is defined as

Speedup(p) = T1 / Tp

where p is the number of cores, T1 is the execution time on a single core, and Tp is the execution time on p cores.

Figure10. Rough Set Speedup

In [15], the authors produced a prototype implementation of rough sets over the Spark framework to execute the basic rough set operations, including the calculation of the upper and lower approximations. The prototype was tested on a Spark cluster using Amazon's AWS: eight 'm4.4xlarge' machines were used as workers, with 64 GB of memory each. Datasets of various problem sizes, with the number of condition attributes varying from 10 to 10^2 and the number of instances varying from 10^4 to 10^7, were randomly generated; the execution time of the Spark implementation grows approximately quadratically.

In [26], the authors implemented their proposed parallel large-scale knowledge acquisition on several representative MapReduce runtime systems: Hadoop MapReduce, Phoenix and Twister. They applied three rule acquisition methods over the different data sets listed in Table 1 and evaluated them based on accuracy and coverage.

Table1. A description of data sets.

Dataset        Samples      Features  Classes  Size
1  KDD99       4,898,421    41        23       0.48 GB
2  Weka-1.8G   32,000,000   10        35       1.80 GB
3  Weka-3.2G   40,000,000   15        45       3.20 GB
4  Weka-6.4G   80,000,000   15        58       6.40 GB

The experiments proved that Hadoop has better speedup for large data sets with good fault tolerance, and Phoenix is fast for non-large datasets. Twister is an iterative MapReduce runtime, although the applied algorithms did not require iterative operation; even so, Twister is faster than Hadoop and Phoenix in terms of speedup.

In [36], the authors propose a hybrid Particle Genetic Swarm Optimization (PGSO) combined with rough set theory and a support vector machine (SVM) to build a knowledge-based system (Figure 11); the proposed knowledge-based system aims at early detection of ovarian cancer.

Figure11. Knowledge based system for early detection of ovarian cancer.

The experimental results show that the proposed algorithm is more accurate and faster than ANN and Naive Bayes.

3.2. Using Data Mining Algorithms

3.2.1. Decision tree algorithm


A decision tree is a decision support tool used in data mining which uses a tree structure, where each internal node
denotes an assessor (test) on an attribute, each branch represents an attribute value, and each leaf node is a class label.

A decision tree consists of three types of nodes [17]:

• Decision nodes – typically represented by squares
• Chance nodes – typically represented by circles
• End nodes – typically represented by triangles

ID3 is a decision tree algorithm based on information entropy. The core of the algorithm is to use the information entropy of the training sample set as the splitting measure when generating the decision tree. The algorithm proceeds as follows [15, 16]. First, it calculates each attribute's information entropy in turn to select the best attribute as the splitting attribute. Then the data are partitioned into subsets according to the values of the splitting attribute. The above process is applied recursively to each subset until every row is correctly classified.
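The splitting criterion at the heart of this procedure can be sketched in a few lines of Python; the toy weather-style dataset below is made up for illustration and is unrelated to the granular-computing variant discussed next.

# Minimal sketch of the ID3 splitting criterion: entropy and information gain.
# The toy dataset (outlook, temperature) -> play is made up for illustration.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    # Partition the rows by the value of one condition attribute and measure the entropy drop.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

rows   = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, 0))   # 1.0 : outlook separates the classes perfectly
print(information_gain(rows, labels, 1))   # 0.0 : temperature carries no information here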

Figure12. Decision Tree


In [18], the authors improved the classification algorithm (decision tree ID3) by basing it on granular computing; the algorithm's classification accuracy is better than that of traditional methods. At the same time, the ID3 algorithm is parallelized, and experimental verification shows that the parallel algorithm is more efficient and faster than traditional methods.

The experimental analysis of the produced method shows that it yields higher accuracy than linear methods, with greater speedup as the size of the data set increases. Figure 13 displays the speedup ratio for different data set sizes and numbers of cluster nodes (1, 2, 3) [18].

Figure13. Speedup Ratio

The C4.5 algorithm is an extension of the ID3 decision tree algorithm; C4.5 uses the information gain ratio as the default criterion for attribute splitting, so it avoids the ID3 drawback of favoring attributes with many values [34].
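The change with respect to ID3 can be sketched compactly: the information gain is divided by the split information of the attribute, as in the self-contained Python fragment below (a toy illustration of the criterion only).

# Minimal sketch of the C4.5 criterion: information gain normalized by split information.
# Self-contained toy illustration; entropy is redefined here so the fragment stands alone.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr_index):
    values = [row[attr_index] for row in rows]
    partitions = {}
    for value, label in zip(values, labels):
        partitions.setdefault(value, []).append(label)
    gain = entropy(labels) - sum(len(p) / len(labels) * entropy(p) for p in partitions.values())
    split_info = entropy(values)   # large for attributes with many distinct values
    return gain / split_info if split_info > 0 else 0.0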

In [35], the authors developed a parallelized C4.5 decision tree learning algorithm called MR-C4.5-Tree to address the big data challenge using MapReduce; MR-C4.5-Tree also avoids the drawbacks of the traditional C4.5 algorithm when dealing with large data sets. The experimental results of applying the developed algorithm show that performance increases with the number of processors and the data set size, as shown in Figure 14.

Figure14. Size up of the MR-A-S algorithm

3.2.2. K-Means algorithm

K-means is the most widely used partitional clustering algorithm (it partitions n observations into k clusters in which each observation belongs to the cluster with the nearest mean); it performs unsupervised classification of patterns into groups for exploratory data analysis [27].
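For reference, the standard single-machine Lloyd iteration can be sketched as follows in Python with NumPy (illustrative data; the works surveyed next distribute exactly these assignment and update steps over MapReduce).

# Minimal single-machine sketch of Lloyd's k-means iteration (NumPy); data are illustrative.
import numpy as np

def kmeans(points, k, iterations=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its assigned points
        updated = []
        for j in range(k):
            members = points[labels == j]
            updated.append(members.mean(axis=0) if len(members) else centroids[j])
        centroids = np.array(updated)
    return centroids, labels

points = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
centers, assignment = kmeans(points, k=2)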

In [28], the authors enhance the k-means algorithm by developing a k-means++ initialization applied to very large data sets in the Mapper and Reducer phases, respectively. By reducing the number of MapReduce jobs, their algorithm saves a great deal of communication and I/O cost.

In [29], the authors proposed an efficient parallel clustering model based on the k-means algorithm and implemented it over MapReduce; they propose sampling and merging algorithms (WMC, DMC) that reduce the number of generated clusters in a parallel and efficient way.

Experimental results on large real-world datasets and a synthetic dataset demonstrate that the optimized algorithm is efficient and performs better compared with the parallel k-means, k-means|| and stand-alone k-means++ algorithms [31]. The quality of clustering, as well as that of k-means, was validated using the Davies–Bouldin index (DBI) [30].
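For completeness, the Davies–Bouldin index used for this validation can be computed as in the short NumPy sketch below, a direct rendering of the definition in [30] that assumes every cluster is non-empty; lower values indicate more compact, better-separated clusters.

# Minimal NumPy sketch of the Davies-Bouldin index (DBI); lower values are better.
# Assumes every cluster is non-empty; inputs would come from a clustering such as k-means.
import numpy as np

def davies_bouldin(points, labels, centroids):
    k = len(centroids)
    # Scatter: average distance of each cluster's points to its own centroid
    scatter = np.array([np.linalg.norm(points[labels == i] - centroids[i], axis=1).mean()
                        for i in range(k)])
    total = 0.0
    for i in range(k):
        ratios = [(scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(k) if j != i]
        total += max(ratios)
    return total / k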

Table2. Comparison of K-means, Kmeans++, Kmeans||, WMC and DMC clustering algorithms by considering DBI on
different sized data sets
Algorithm Bow (k = 50) Bow (k = 100) House (k = 50) House (k = 100)
K-means 0.0185 0.0167 0.0175 0.0124
K-means++ 0.0161 0.0152 0.0166 0.0113
K-means|| 0.0182 0.0161 0.0176 0.0116
WMC 0.0193 0.0171 0.0186 0.0133
DMC 0.0187 0.0166 0.0179 0.0128

3.3. Using Fuzzy Set Algorithms

3.3.1. Fuzzy c-means Algorithms

Fuzzy clustering (soft clustering) is a clustering approach in which a data point can belong to more than one cluster. Fuzzy c-means (FCM) is one of the most widely used fuzzy clustering algorithms; it is a soft-constraint clustering algorithm in which the membership degrees of a data point across all clusters add up to 1 [19].

General description of the c-means procedure:

• Choose a number of clusters.
• Randomly assign to each point coefficients for being in the clusters.
• Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold):
  • Compute the centroid for each cluster.
  • For each point, compute its coefficients of being in the clusters.

The FCM algorithm is also defined as the constrained optimization of the squared-error distortion [20].
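The two alternating updates behind this optimization (cluster centers from memberships, memberships from distances) can be sketched in NumPy as follows; the fuzzifier m = 2, the random initialization and the toy data are illustrative choices, and this is the plain single-machine algorithm rather than the distributed versions discussed next.

# Minimal single-machine sketch of fuzzy c-means (FCM); data, fuzzifier and seed are illustrative.
import numpy as np

def fcm(points, c, m=2.0, iterations=50, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.random((len(points), c))
    u /= u.sum(axis=1, keepdims=True)            # memberships of each point sum to 1
    for _ in range(iterations):
        um = u ** m
        centers = (um.T @ points) / um.sum(axis=0)[:, None]      # membership-weighted centroids
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Membership update: inverse distance ratios raised to 2/(m-1)
        ratio = (distances[:, :, None] / distances[:, None, :]) ** (2.0 / (m - 1))
        u = 1.0 / ratio.sum(axis=2)
    return centers, u

points = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 4.0])
centers, memberships = fcm(points, c=2)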

In this survey we take as an example an implementation of the FCM algorithm over MapReduce, used for English sentiment classification in a parallel environment [21], [22].

In [21], the authors apply FCM over MapReduce on a Cloudera environment to classify sentiment, transferring the 60,000 English sentences of the training data set into 60,000 vectors in order to reduce the execution time of this task.

The FCM algorithm over MapReduce as a parallel environment reduces the execution time to approximately half of that of FCM in a sequential environment, with an accuracy of about 60% [21].

Table3. The results on testing data set t1

Testing dataset t1        Correct classification   Incorrect classification
Negative   12500          7523                     4977
Positive   12500          7527                     4973
Summary    25000          15050                    9950

3.3.2. Using Fuzzy Sets by Directed Graph Node Similarity

In [32], the authors propose an algorithm to discover similar nodes in huge directed graphs and apply it to the Twitter social network as a case study. In this way, they discover influential Twitter users by computing similar nodes after describing the graph using fuzzy set membership.


The experimental results prove that the algorithm improves performance as well as accuracy. Its notable advantage is that applying the algorithm over only a portion of the data still yields acceptable accuracy compared to the consumed resources and time, as shown in Table 4. Note that when only a portion of the data is taken, the algorithm computes the similarity measure proportionally, so the result must be multiplied by the corresponding multiplier, using the fuzzy membership degree µR(dsui) defined in [32].

Table 4. Algorithm performance improvement results comparison

Portion of followers taken   Discovered users (D)   Precision (p)   Time (T)   p/T    Correction coefficient (c)   RMSE
1                            72                     1               1          1      2.57                         0.09
0.5                          67                     0.93            0.54       1.72   6.13                         0.11
0.25                         58                     0.8             0.28       2.86   12.33                        0.13
0.1                          51                     0.71            0.12       5.92   27.65                        0.18

3.3.3. Using Fuzzy Associative Classifier

In [37], the authors propose the Distributed Associative Fuzzy Frequent Pattern classifier (DAC-FFP). To deal with big data, the proposed scheme builds on the MapReduce paradigm and works in three steps: first, the frequent items are extracted; then, strong and significant classification association rules are generated by defining triangular fuzzy sets; finally, noisy and redundant rules are pruned. Tested on a real-world dataset with 5 million records characterized by 18 input features, the proposed scheme speeds up performance with respect to the number of cores, as shown in the speedup figure below, and also enhances the output accuracy.

Figure13. The speedup of DAC-FFP.
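The triangular fuzzy sets mentioned above can be expressed very compactly; the Python fragment below is a generic sketch of a triangular membership function (the parameters a, b and c, the left foot, peak and right foot, are purely illustrative and are not the partitions actually learned by DAC-FFP).

# Minimal sketch of a triangular fuzzy membership function.
# a = left foot, b = peak, c = right foot; the example parameters are illustrative only.
def triangular(x, a, b, c):
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Example: degree to which the value 7 belongs to a "medium" set on a 0-10 scale
print(triangular(7, a=2.5, b=5.0, c=7.5))   # 0.2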

3.3.4. Using Fuzzy Rule Based Classification

In [38, 39], the authors proposed a Fuzzy Rule Based Classification System (FRBCS) called Chi-FRBCS, adapted to deal with big data through MapReduce. They use two MapReduce processes: the first process builds the knowledge base (KB) from the big data training set, and the second process classifies samples of the big data. In [39], they enhance Chi-FRBCS by modifying the computation of the rule weights to take into account the cost of each class based on its representation in the training set, thereby addressing the class imbalance problem.


Figure14. A flowchart of how the building of the KB is organized in Chi-FRBCS-Big Data CS.

The experiments demonstrate that the performance and accuracy of the model are enhanced over millions of examples and a large number of attributes.

4. DISCUSSION
As described in the big data section, we started with the characteristics of big data (Volume, Variety, Velocity, Veracity, Value), then presented the most famous frameworks used to store this kind of data, especially Hadoop with HDFS (Name Node, Data Node), and also reviewed the different processing engines for big data. This leads us to the most important question: "How can we retrieve valuable, high-quality data in the proper time?" We therefore investigated the knowledge acquisition techniques that can be applied over big data and categorized the algorithms into three categories (rough set algorithms, data mining algorithms and fuzzy set algorithms), as summarized below:

1-Rough Set Algorithms


In [14], the rough set technique is applied over MapReduce to speed up knowledge acquisition from big data using parallel algorithms, but a limitation of MapReduce appears in the combination step, which must run in sequential mode and affects performance. In [15], some rough set operations are implemented over Spark, which increases processing performance. In [26], parallel large-scale knowledge acquisition is proposed on MapReduce, Phoenix and Twister, and the experiments show that Twister is faster than Hadoop and Phoenix in terms of speedup. In [36], a hybrid Particle Genetic Swarm Optimization (PGSO) combined with rough set theory and a support vector machine (SVM) is proposed to build a knowledge-based system, and the proposed algorithm improves both accuracy and performance.

2-Data Mining Algorithms


In [18], the ID3 classification algorithm is improved and, according to the experiments, the proposed algorithm is better than the traditional one; however, this algorithm is not parallelized by default and also keeps the ID3 drawback of favoring attributes with many values. In [35], a parallelized C4.5 decision tree learning algorithm called MR-C4.5-Tree is developed to enhance accuracy and performance and to avoid the ID3 drawbacks. In [28, 29], the k-means algorithm is enhanced and optimized for the MapReduce process by reducing the I/O processing and communication costs.

3-Fuzzy Set Algorithms


In [20, 21], FCM is applied over MapReduce, which improves performance compared with the sequential environment. In [32], an algorithm is proposed that speeds up the discovery of influential users on Twitter, as well as improving accuracy, by computing similar nodes after describing them using fuzzy set membership. In [37], performance is sped up by proposing a distributed associative fuzzy frequent pattern classifier. In [38, 39], a Fuzzy Rule Based Classification System (FRBCS) called Chi-FRBCS is proposed and adapted to MapReduce to enhance performance and accuracy.
We can conclude from the previous sections that the most effective family of algorithms for knowledge acquisition is the rough set family, because it is particularly suited for parallel processing, so it needs no modification of the algorithm as ID3 does and no additional information as fuzzy sets do. Finally, we note a critical concern with MapReduce: it does not always improve the performance of such modules, because it runs the reducing step in sequential mode. In addition, [23] highlights the strengths of rough sets as follows:

• Provides efficient algorithms for finding hidden patterns in data.
• Finds minimal sets of data (data reduction).
• Evaluates the significance of data.
• Generates minimal sets of decision rules from data.
• Offers a straightforward interpretation of the obtained results.
• Does not need any preliminary or additional information about the data for data analysis.
• Most algorithms based on rough set theory are particularly suited for parallel processing.

5. CONCLUSION
Due to the rapid growth of data in real-life applications, we have huge amounts of data in different formats and of varying quality, which must be processed quickly so that useful information can be extracted from it for decision making. In this research we explored big data design and structure and the different techniques used to discover useful knowledge from big data. Our goal is to enhance knowledge acquisition techniques over big data; our future work will focus on studying the different big data processing engines that work in parallel in all phases, and on developing a new rough set algorithm and implementing it over such a tool to enhance performance and accuracy.

References
[1] Min Chen, Shiwen Mao, Yunhao Liu. "Big Data: A Survey". Springer Science+Business Media New York, 2014.
[2] Suthaharan, Shan. “Big Data Classification: Problems and Challenges in Network Intrusion Prediction with
Machine Learning.” SIGMETRICS Perform. Eval. Rev. 41 (4): 70–73(2014).
[3] Gartner, Inc., Pattern-Based Strategy: Getting Value from Big Data. Gartner Group,
http://www.gartner.com/it/page.jsp?id=1731916 (accessed October,2016)
[4] Beulke, D. Big Data Impacts Data Management: The 5 vs. of Big Data (2011).
[5] SK. Jilani Basha, P. Anil Kumar, S. Giri Babu "Storage and Processing Speed for Knowledge from Enhanced
Cloud Computing With Hadoop Frame Work: A Survey", 3:2395-1990, 2016
[6] Dhruba Borthakur. “The Hadoop Distributed File System: Architecture and Design”. The Apache Software
Foundation. 2007.
[7] Apache HDFS. Available at http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-
hdfs/HdfsDesign.html#HDFS_Architecture
[8] Apache MapReduce. Available at https://wiki.apache.org/hadoop/MapReduce
[9] Apache Pig. Available at http://pig.apache.org
[10] Apache Hive. Available at http://hive.apache.org
[11] Apache HBase. Available at http://hbase.apache.org
[12] Apache Spark. http://spark.apache.org

[13] Pawlak, Z.: "Rough sets," International Journal of Computer and Information Sciences, Vol. 11, pp. 341-356 (1982).
[14] J. Zhang, T.Li, Y.Pan, Parallel rough set based knowledge acquisition using map reduce from big data, in:
Proceedings of the1st International Workshop on Big Data, Streams and Heterogeneous Source Mining:
Algorithms, Systems, Programming Models and Applications, Big Mine’12, ACM, New York, NY, USA, 20–27
(2012).
[15] K.M. Huang, H.Y. Chen, K.L. Hsiung, On realizing rough set algorithms with apache spark, in: Proceedings of
The Third International Conference on Data Mining, Internet Computing and Big Data, Konya, Turkey, 2016, pp.
111–112.
[16] Fresku, E. and Anamali, A., "Decision-tree learning algorithm," Journal of Environmental Protection and Ecology, vol. 15, no. 2, 2014, pp. 686-696.
[17] Kamiński, B.; Jakubczyk, M.; Szufel, P. "A framework for sensitivity analysis of decision trees". Central
European Journal of Operations Research. (2017).
[18] L. Ping, W. Zhenggang, Z. Hao, Y. Junping, and T. Qiu, Parallel Decision Tree Algorithm Based on Granular
Computing The Open Automation and Control Systems Journal,7, 2015, pp. 873-878
[19] Fuzzy Clustering in Wolfram Research at
http://reference.wolfram.com/legacy/applications/fuzzylogic/Manual/12.html.
[20] Havens, T.C., Bezdek, J.C., Leckie, C., Hall, L.O., Palaniswami, M.: Fuzzy c-Means Algorithms for Very Large
Data. IEEE Trans. Fuzzy System 20(6),2012, pp. 1130–1146
[21] Phu, V.N.; Dat, N.D.; Tran, V.T.N.; Chau, V.T.N.; Nguyen, T.A. Fuzzy C-means for english sentiment
classification in a distributed system. Appl. Intell. 2016, 1–22.


[22] Cai, W., Chen, S., Zhang, D.: A Multiobjective Simultaneous Learning Framework for Clustering and
Classification. IEEE Trans. on Neural Networks 21(2),2010, 185–200
[23] Bharill, N., Tiwari, A.: Handling Big Data with Fuzzy Based Classification Approach. In: Jamshidi, M.,
Kreinovich, V., Kacprzyk, J. (eds.) Advance Trends in Soft Computing. STUDFUZZ, vol. 312,2014, pp. 219–227.
Springer, Heidelberg
[24] J. Talbot, R.M. Yoo, C. Kozyrakis, Phoenix++: modular MapReduce for shared-memory systems, in: Proceedings
of the second international workshop on MapReduce and its applications, MapReduce ’11, ACM, New York, NY,
USA,2011, pp. 9–16.
[25] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.H. Bae, J. Qiu, G. Fox, Twister: a runtime for iterative
MapReduce, in: Proceedings of the 19th ACM International Symposium on High Performance Distributed
Computing, HPDC ’10, ACM, New York, NY, USA,2010, pp. 810–818.
[26] J. Zhang, J. Wong, T. Li, Y. Pan, A comparison of parallel large-scale knowledge acquisition using rough set
theory on different MapReduce runtime systems, Int. J. Approx. Reason. 55 (2014) 896–907.
[27] Celebi, M.E., Kingravi, H.A., Vela, P.A.: A Comparative Study of Efficient Initialization Methods for the K-
means Clustering Algorithm. Expert Syst. Appl. 40(1),2013, 200–210
[28] B. Natarajan, P. Chellammal, Enhanced K-Means++ Clustering for Big Data with MapReduce, Vol. 3, Issue 1, 2015, pp. 228-231.
[29] Cui, X., Zhu, P., Yang, X., Li, K., & Ji, C., Optimized big data K-means clustering using MapReduce. The
Journal of Supercomputing, 2014, 70(3),1249–1259.
[30] Davies DL, Bouldin DW, A cluster separation measure[J]. IEEE Trans Pattern Anal Mach Intel, 1979 ,2:224–227
[31] Bahmani B, Moseley B, Vattani A et al, Scalable k-means++[J]. Proc VLDB Endow,2012, 5(7):622–633
[32] Jocić, M., Pap, E., Szakál, A., Obradović, Dj. & Konjović, Z. Managing Big Data Using Fuzzy Sets by Directed Graph Node Similarity. Acta Polytechnica Hungarica, 2017, Vol. 14, No. 2.
[33] V. D. Blondel, A. Guajardo, M. Heymans, P. Senellart, and P. Van Dooren, “A Measure of Similarity between
Graph Vertices: Applications to Synonym Extraction and Web Searching,” SIAM Rev. 2004, Vol. 46, No. 4, pp.
647-666.
[34] Wu X, Kumar V, Quinlan JR, et al. Top 10 algorithms in data mining. Knowledge and information systems.
2008,14(1):1–37.
[35] Mu Y, Liu X, Yang Z, Liu X. A parallel C4.5 decision tree algorithm based on MapReduce. Concurrency Computat: Pract Exper. 2017;29:e4015. https://doi.org/10.1002/cpe.4015.
[36] Yasodha P, Ananthanarayanan NR. Analysing big data to build knowledge-based system for early detection of
ovarian cancer. Indian Journal of Science and Technology. 2015 Jul; 8(14). DOI:
10.17485/ijst/2015/v8i14/65745.
[37] Pietro Ducange, Francesco Marcelloni, and Armando Segatori. A MapReduce-based fuzzy associative classifier for big data. In Adnan Yazici, Nikhil R. Pal, Uzay Kaymak, Trevor Martin, Hisao Ishibuchi, Chin-Teng Lin, João M. C. Sousa, and Bülent Tütmez, editors, FUZZ-IEEE, pages 1–8. IEEE, 2015.
[38] S. del Río, V. López, J. M. Benítez and F. Herrera. A MapReduce approach to address big data classification problems based on the fusion of linguistic fuzzy rules. International Journal of Computational Intelligence Systems, 8(3):422–437, 2015.
[39] Victoria López, Sara del Río, José Manuel Benítez, and Francisco Herrera. Cost-sensitive linguistic fuzzy rule-based classification systems under the MapReduce framework for imbalanced big data. Fuzzy Sets and Systems, 258:5–38, 2015.

AUTHORS
Mahmoud Soufi is a Master's student in Computer and Information Sciences at the Institute of Statistical Studies and Research, Cairo University. He works as associate IT department manager at Intesa Sanpaolo Private Banking (2018).

A. A. Abd El-Aziz completed his Ph.D. degree in June 2014 in Information Science & Technology at Anna University, Chennai-25, India. He received his B.Sc. and Master of Computer Science in 1999 and 2006, respectively, from the Faculty of Science, Cairo University. He is now a lecturer in the ISSR, Cairo University, Egypt, and has 11 years of teaching experience at Cairo University. His research interests include database systems, database security, Internet of Things and XML security.


Hesham A. Hefny received the B.Sc., M.Sc. and Ph.D. degrees, all in Electronics and Communication Engineering, from Cairo University in 1987, 1991 and 1998, respectively. He is currently a professor of Computer Science at the Institute of Statistical Studies and Research (ISSR), Cairo University, and the vice dean of graduate studies and research of ISSR. Prof. Hefny has authored more than 150 papers in international conferences, journals and book chapters. His major research interests include computational intelligence (neural networks, fuzzy systems, genetic algorithms, swarm intelligence), data mining, and uncertain decision making. He is a member of the following professional societies: IEEE Computer, IEEE Computational Intelligence, and IEEE Systems, Man and Cybernetics.
