
International Journal of Advanced Research in Engineering Applications,

Volume 1, Issue 2, 9-17, 2014

Data Mining in the World of BIG Data - A Survey


Kalpit R. Chandpa1, Jignasa N. Patel2
Department of Computer Science and Information Technology
Shri Sad Vidya Mandal Institute of Technology
Bharuch 392-001, Gujarat, India

Abstract - The rapid development and popularization of the internet, together with technological advancement, have introduced massive amounts of data that continue to increase daily. The very large amounts of data generated, collected, stored and transferred by applications such as sensors, smart mobile devices, cloud systems and social networks have put us in the era of BIG data: data of huge size, with complex and unstructured types, coming from many origins. Converting these BIG data into useful information is essential, and the technique for discovering hidden interesting patterns and knowledge insights in BIG data is known as BIG data mining. BIG data raises many problems and challenges related to handling, storing, managing, transferring, analyzing and mining, but it also provides new directions and a wide range of opportunities for research and information extraction, and shapes the future of technologies such as data mining in the form of BIG data mining. In this paper, we present the concepts of BIG data and BIG data mining, discuss the problems of BIG data mining and of traditional data mining techniques when dealing with BIG data, list new research directions for BIG data mining, and compare traditional data mining algorithms with some BIG data mining algorithms, which will be useful for the future of BIG data mining technology.
Keywords - BIG data, Data mining, BIG data mining, Challenges for BIG data mining, BIG data mining algorithms, Parallel mining, MapReduce.

1 INTRODUCTION

We are living in the world of data, whether it is the data of a social networking site on which we share our photos and videos regularly, our favorite shopping site, or a knowledge bank such as Wikipedia from which we can get information at any time. In the last 10 years, advancement in information technology has made it easy to generate and update data continuously; for example, vast numbers of photos and videos are uploaded daily. Rapid growth in the internet and cloud computing has also played a great role in expanding data. Data is constantly generated by the use of the internet, as well as by companies that generate and update large amounts of information from sensors, computers and automated processes.
A recent study estimated that every minute, Google receives over 2 million queries, e-mail users send over 200 million messages, YouTube users upload 48 hours of video, Facebook users share over 680,000 pieces of content, and Twitter users generate 100,000 tweets [1].
A dataset or database includes large amounts of data, but not all of it is always necessary; sometimes we need to extract only specific, useful information from the dataset for our requirements or for analysis. This means a huge amount of data needs to be converted into useful information, which requires an efficient process or methodology so that we can uncover hidden information and insights from the dataset effectively. Such an efficient process or methodology is called data mining.

www.ijarea.org
Wherever data is collected, information is latent in it, so converting this data into useful and meaningful information is essential. This suggests mining from data; in our concern, it is mining from BIG data, or in other words BIG data mining: extracting useful information and revealing hidden knowledge from fast, continuous, heterogeneous and large amounts of data for analysis purposes. BIG data mining is a new and very active area of research, and it provides a wide range of opportunities, challenges and many new directions for the future of data mining technology in dealing with BIG data.
BIG data mining offers advanced tools and techniques for managing, handling, storing, analyzing and interpreting in the context of BIG data. Typical data mining techniques fall short when dealing with BIG data because of major issues that arise when existing techniques are applied to it. Hence, BIG data mining is a crucial methodology for mining BIG data in a very efficient way.
This paper is organised as follows: Section 2 introduces BIG data, Section 3 discusses BIG data mining, and Section 4 presents a comparison between traditional data mining algorithms and BIG data mining algorithms.

2 BIG DATA
2.1 Why BIG Data?
There is no single definition of BIG data, but it is a term used to identify any collection of data sets so large and complex that it becomes difficult to manage and process using typical data management tools or traditional data processing applications. BIG data means data of large size, from heterogeneous, autonomous sources with distributed and decentralized control, and it seeks to explore complex and evolving relationships among data.
The expansion of data will never stop. As shown in Figure 1, per the 2011 IDC Digital Universe Study, 130 exabytes of data were generated and stored in 2005. This amount reached 1,227 exabytes in 2010 and is expected to grow at 45.2% to 7,910 exabytes in 2015 [2].

Figure 1: Continuous growth of DATA in last 10 years [2]

All this suggests the rise of BIG data: data of massive size, growing continuously, and coming from heterogeneous, autonomous sources. Handling, managing, processing, computing, analyzing and extracting useful information or exact knowledge from a BIG data set is therefore a challenging task.
2.2 Definition
According to the above facts about BIG data, BIG means:
B - Bigger in size
I - Increasing continuously
G - Greatly complex
So we can say that data which is Bigger in size, Increasing continuously and Greatly complex is called BIG data. The best examples of BIG data are social networking sites and Wikipedia. These three attributes of BIG data make it more challenging than normal data.
As BIG data is bigger in size, storing, handling and managing it is a challenging task. As BIG data increases continuously, that is, it is updated daily and expands regularly, processing and computing with it is tough. As BIG data is greatly complex, information sharing, information retrieval and privacy are big concerns with BIG data.
2.3 BIG Data Characteristics
Based on the above-mentioned attributes, we can define the characteristics of BIG data as: Volume, Velocity and Variety.
1) Volume: Volume is a matter of size. It is an abstract term related to the size of the data sets or databases, and is the most visible characteristic for identifying BIG data. For example, data sets with large volume include educational databases, medical data sets and financial data.
2) Velocity: Velocity is a matter of speedy and streaming data. It concerns how frequently data is generated and stored, how quickly it is transferred and retrieved, and it relates to the flow of the data.
3) Variety: Variety is a matter of the structure of data and its origins. It means data comes from a variety of sources and in a variety of types. For example, data collected from social networking sites comes from a variety of heterogeneous sources and in different types such as text, image, audio and video. The data comes from multiple sources and in heterogeneous formats such as multimedia, text, blogs, emails, sensors, etc.
Besides these, two other characteristics were introduced by [3]: Value and Veracity.
4) Value: Value is a matter of the cost associated with data while generating, transferring, analyzing, etc. While data is being generated, collected and analyzed from different quarters, it is important to state that today's data has some cost.
5) Veracity: Veracity is a matter of the noise associated with data, in the sense of meaningless data. Data preprocessing is a must: there is a need to check the accuracy of the data by eliminating noise through methodologies such as data pedigree and sanitization. This ensures data quality, so that decisions made from the collected data are accurate and effective.

3 BIG DATA MINING


Data mining means extracting useful information from a dataset. In the case of mining BIG data, there are many new challenges and opportunities. Even though BIG data contains greater value, including useful hidden knowledge and more valuable information, it poses big challenges for extracting this hidden knowledge and insight. Traditional data mining and established knowledge discovery methodologies for conventional datasets were not designed for BIG data and may fall short when dealing with it. The problems with typical data mining techniques when applied to BIG data stem from their inadequate scalability and parallelism. Furthermore, existing data mining approaches suffer great difficulties when required to handle BIG data issues such as heterogeneity, volume, dynamics, speed, privacy, accuracy and trust.
Some efforts are being made to improve existing techniques by applying massively parallel processing architectures and novel distributed storage systems, and to design advanced mining techniques based on new frameworks with the potential to overcome these critical challenges; ultimately, this will change and reshape the future of data mining technology. In short, performing data mining on BIG data requires new, innovative and more advanced data mining techniques, as mining BIG data is more challenging and provides a wide range of new research directions and opportunities for data mining technology.
There are many problems with traditional data mining algorithms and techniques when they are applied to BIG data, due to BIG data's three Vs: volume, velocity and variety. As BIG data is a large amount of data generated continuously from multiple sources, discovering useful information or hidden knowledge requires innovation and advancement in data mining techniques for handling and mining BIG data. Such techniques are referred to as BIG data mining.
Discovered knowledge is meaningless if the discovery of highly valuable knowledge is delayed. BIG data not only brings new challenges but also provides a wide range of opportunities, as it interconnects complex and heterogeneous content from multiple sources of knowledge and insight. Existing data mining techniques and algorithms are not ready to meet the new challenges of BIG data. Mining BIG data demands highly scalable strategies and algorithms, more effective preprocessing steps such as data filtering and integration, advanced parallel computing environments, and intelligent and effective user interaction.
3.1 Data mining challenges for BIG data
As we know, BIG data has provided many opportunities for business, marketing, education, medicine, etc., but big opportunities always bring big challenges with them. So, when we apply data mining to BIG data, it suffers from many problems and challenges, as shown in Figure 2.
3.1.1 BIG data mining platform challenges: This challenge centers on data access and the actual computing procedures. As BIG data is often stored at different locations and data volumes are constantly increasing, it is essential to have an effective computing platform that can use distributed, large-scale data storage for computing. For example, traditional data mining algorithms need all data to be loaded into main memory for processing, which is very difficult and expensive for BIG data, as the data must be transferred between heterogeneous and multiple sources. Typical data mining requires a computing platform with efficient access to resources such as data and computing processors.
It is not a big deal to perform data mining on small-scale data. For medium-scale data, mining can be performed using parallel computing or collective mining: data is sampled, aggregated from multiple sources using parallel programming, and finally mined. But when it comes to BIG data, the data scale is beyond the capacity of a single personal computer to handle and manage, so a BIG data processing framework must rely on clusters of computers with a high-performance computing platform. This requires a parallel programming tool such as MapReduce [4], in which the data mining task is performed on a large number of nodes: a single data mining task is divided into many small tasks, all running on multiple computing nodes.

Figure 2: BIG data mining challenges: Volume (BIG data mining platform challenges and BIG data mining algorithms challenges), Variety (mining from heterogeneous sources), Velocity (fast accessing and mining from BIG data), Accuracy, Trust, Performance, Interactiveness, Privacy and Garbage mining.
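The division of a single counting task into many small map tasks and a merging reduce task can be sketched in miniature. The following Python fragment is an illustrative sketch only, not code from MapReduce [4] itself; the transaction data and function names are hypothetical:

```python
from collections import defaultdict

# Illustrative sketch of the map/reduce split (not Hadoop code):
# each mapper counts items in its local partition only, and the
# reducer merges the partial counts into a global result.

def map_task(partition):
    counts = defaultdict(int)
    for transaction in partition:
        for item in transaction:
            counts[item] += 1
    return counts

def reduce_task(mapper_outputs):
    total = defaultdict(int)
    for counts in mapper_outputs:
        for item, n in counts.items():
            total[item] += n
    return dict(total)

# Hypothetical transaction data split across two computing nodes.
partitions = [
    [["bread", "milk"], ["bread", "beer"]],
    [["milk", "beer"], ["bread", "milk", "beer"]],
]
result = reduce_task(map_task(p) for p in partitions)
print(result)  # {'bread': 3, 'milk': 3, 'beer': 3}
```

Because each mapper touches only its own partition, no node ever needs the whole dataset in memory, which is exactly the property the platform challenge above calls for.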
3.1.2 Variety: Mining from heterogeneous sources: Traditional data mining techniques are generally used to extract unknown patterns and hidden knowledge from datasets that are structured, homogeneous and small from a BIG data perspective. As we know, variety is one of the defining characteristics of BIG data: data is generated from any number of heterogeneous sources and includes structured, semi-structured and unstructured data. In mining such a dataset, the challenge is perceivable and the degree of complexity is hardly imaginable before we get deeply into it. Today's database systems can handle structured data, and semi-structured data may be partially handled, but unstructured data definitely raises big problems; both semi-structured and unstructured data are typically stored in files. Though it brings greater technical challenges, the heterogeneity of BIG data also means a new opportunity for unveiling previously impossible hidden patterns or knowledge.
This suggests that, for heterogeneous BIG data sets, there is a need to construct specialized, more complex, multi-model systems. Current data mining research addresses mining from heterogeneous information networks. Mining hidden patterns from heterogeneous multimedia streams from multiple sources is also possible, and is an active area of research for data mining.


3.1.3 Velocity: Fast accessing & mining from BIG data: In the case of BIG data, velocity is another characteristic, which specifies that speed really matters for BIG data. As data streams are a common format for BIG data, fast access to and mining of BIG data are required. It is essential to complete a mining task within a certain time period; otherwise, the mining output becomes less valuable or even meaningless. Real-time applications such as earthquake prediction, stock market prediction and agent-based autonomous exchange (buying/selling) systems require fast and efficient mining techniques, and this presents another new challenge for BIG data.
The speed of data mining depends on two major factors: data access time (determined mainly by the underlying data system) and the efficiency of the mining algorithms. Exploitation of advanced indexing schemes is the key to the speed issue; multidimensional index structures are especially useful for BIG data. The design of new and more efficient indexing schemes is much desired, but remains one of the greatest challenges for the research community. An additional approach to boosting the speed of BIG data access and mining is to maximally identify and exploit the potential parallelism in the access and mining algorithms. The elasticity and parallelism support of cloud computing are the most promising facilities for boosting the performance and scalability of BIG data mining systems.
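The need to mine fast streams within bounded time and memory can be illustrated with a classic stream-summary idea. The sketch below is a simplified Misra-Gries frequent-items summary, offered as an assumed example rather than a technique prescribed by the surveyed papers: it keeps at most k-1 counters no matter how long the stream is, so the mining keeps pace with arriving data.

```python
# Bounded-memory sketch for frequent items in a fast stream
# (simplified Misra-Gries): at most k-1 counters are kept,
# however long the stream grows.

def misra_gries(stream, k):
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Stream element has no free counter: decrement every
            # counter and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Hypothetical event stream; "a" dominates it.
stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
summary = misra_gries(stream, k=3)
print(summary)  # {'a': 3}
```

The counts are underestimates (by at most n/k), but the heavy item survives in the summary, which is often all a real-time mining task needs.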
3.1.4 Accuracy, Trust, and Provenance: When the input data for data mining is relatively accurate data from well-known and quite limited sources, analysis is easy and mining results tend to be accurate, so accuracy and trust are not big issues. But with emerging BIG data, the data sources are of many different origins, not all well known and not all verifiable, so the accuracy and trust of the source data quickly become an issue, which further affects the mining results as well. For BIG data, data validation and provenance tracing therefore become essential steps in the overall data mining process. When we want insight into BIG data, data provenance is an integral and very important feature of any system that deals with BIG data, because the multiple data sources and large volumes provide wide sources from which to extract additional evidence for verifying accuracy and building trust in the selected data and the generated mining results. High dynamics and evolution come with the vast volume of BIG data, so an adequate system for BIG data management and analysis must allow dynamic changing and evolution of the hosted data items [5].
3.1.5 Privacy: Whenever data mining is applied to real-world data, data privacy is a critical issue. This issue has become extremely problematic with BIG data mining, which often needs personal information in order to generate accurate results, for example for location-based and personalized services such as targeted and individualized advertisements. Furthermore, with the huge volume of BIG data such as social media, which contains tremendous amounts of highly interconnected personal information, every piece of information about everybody can be mined out, and when all the pieces of information about a person are dug out and put together, any privacy for that individual instantly disappears. This makes it essential to work toward powerful mining tools that remain privacy-safe even when capable of mining a great portion or even the whole Web. Privacy relates to challenges in how data is maintained, accessed and shared. Privacy-preserving data mining comes into the picture where multiple parties holding sensitive data try to perform data mining without sharing any sensitive information inside the data. The foundations of data mining need to be reformulated for BIG data in such a way that privacy protection and discrimination prevention are embedded in the foundations themselves. There are serious research efforts and new directions for findings on privacy concerns, such as measuring and preventing privacy violation during knowledge mining [5].
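One classical building block for privacy-preserving mining is randomized response, in which each record is perturbed before it is shared, yet aggregate statistics remain recoverable. The sketch below is a generic illustration rather than a mechanism taken from the surveyed work; the survey question, truth probability and answer rate are all hypothetical:

```python
import random

# Randomized response: each user reports a possibly flipped answer,
# so no single record reveals the truth, yet the aggregate proportion
# can be estimated by inverting the randomization.

def randomized_response(truth, p_truth=0.75, rng=random):
    # With probability p_truth report the truth, otherwise a coin flip.
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5

def estimate_true_rate(reports, p_truth=0.75):
    # Invert: E[report] = p_truth * true_rate + (1 - p_truth) * 0.5
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth

random.seed(0)
true_answers = [True] * 300 + [False] * 700   # 30% true rate, known only here
reports = [randomized_response(t) for t in true_answers]
estimate = estimate_true_rate(reports)        # close to 0.3
```

The estimate approaches the true 30% rate as the population grows, even though any individual report is deniable.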
3.1.6 Interactiveness: By interactiveness we mean the capability of a data mining system to allow prompt and adequate user interaction, such as feedback, interference or guidance from users. Interactiveness was a relatively underemphasized issue of data mining in the past, but as our society now confronts the challenges of BIG data mining, it becomes a critical issue. Interactiveness relates to all three Vs and can help overcome the challenges coming along with each of them. First, to conquer the volume-related challenge of BIG data mining, prompt user feedback and guidance can help quickly narrow the search down into a much reduced but promising sub-space, accelerating the processing speed (velocity) and increasing system scalability. Second, the heterogeneity caused by the variety of BIG data directly induces correspondingly high complexity in the BIG data itself and in the mining results. Sufficient system interactiveness grants users the ability to visualize, (pre-)evaluate and interpret intermediate and final mining results. Such a facility might not be necessary for mining conventional datasets, but for BIG data it is a must. Great interactiveness boosts the acceptance of a complicated mining system and its mining results by potential users, and sufficient interactiveness is especially important for BIG data mining [5] [6].
3.1.7 Garbage Mining: Garbage has no value; no one wants garbage. In the BIG data era, the volume of data generated and populated on the World Wide Web keeps increasing at an amazingly fast pace. In such an environment, data can quickly become outdated, corrupted and useless; in addition, some data is created as junk (like junk emails). If society does not pay attention and take action now, over time we will be flooded by junk data in cyberspace. For the sake of a relatively clean cyberspace and a clean World Wide Web, we call here for attention and research effort. Cyberspace cleaning is not an easy task for at least two reasons: garbage is hidden, and there is an ownership issue. It is possible to apply data mining approaches to mine garbage and recycle it. Garbage mining is a serious research topic, different from but related to BIG data mining: for the sake of the sustainability of our digital environment, mining for garbage (and cleaning it) is as important as mining for knowledge, especially in the new era of BIG data. Defining garbage remains one of the greatest challenges [5] [6].
3.1.8 BIG data mining algorithms challenges: In the case of BIG data mining algorithms, challenges arise from BIG data volumes, distributed data distributions, and complex and dynamic data characteristics. BIG data mining algorithms generally face three types of challenges: 1) local learning and model fusion for multiple information sources, 2) mining from sparse, uncertain and incomplete data, and 3) mining complex and dynamic data.
i) Local learning and model fusion for multiple information sources: Applying BIG data mining algorithms would require a centralized site for mining, aggregating distributed data from heterogeneous sources with decentralized controls. But this is much tougher because of transmission costs and privacy concerns. In this case, a BIG data mining system has to enable an information exchange and fusion mechanism to ensure that all information sources can work together to achieve a global optimization goal [6].
ii) Mining from sparse, uncertain and incomplete data: BIG data tends to be uncertain, incomplete and sparse. The problem with sparse data is that the number of data points is too few to draw reliable conclusions, because of the high dimensionality of the data; high-dimensional data cannot be analyzed easily. Approaches such as feature selection or dimension reduction are used to reduce the data dimensions. Another issue with BIG data is uncertainty. Uncertain data is a special type of data with random errors, generated by inaccurate data readings and faulty data entries during collection; BIG data mining algorithms cannot be applied directly to it [6].
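As a concrete example of the dimension-reduction step mentioned above, the following sketch applies a simple variance threshold to drop near-constant features. It is an illustrative baseline under assumed toy data, not one of the surveyed algorithms:

```python
# Sparse, high-dimensional data: drop near-constant features before
# mining. Rows are samples, columns are features.

def column_variances(rows):
    n = len(rows)
    variances = []
    for j in range(len(rows[0])):
        col = [row[j] for row in rows]
        mean = sum(col) / n
        variances.append(sum((x - mean) ** 2 for x in col) / n)
    return variances

def select_features(rows, threshold=0.0):
    # Keep only feature indices whose variance exceeds the threshold.
    keep = [j for j, v in enumerate(column_variances(rows)) if v > threshold]
    return keep, [[row[j] for j in keep] for row in rows]

data = [
    [1.0, 0.0, 5.0],
    [2.0, 0.0, 5.0],   # feature 1 is all zeros, feature 2 is constant
    [3.0, 0.0, 5.0],
]
kept, reduced = select_features(data)
print(kept, reduced)  # [0] [[1.0], [2.0], [3.0]]
```

Even this naive filter shrinks the feature space before any expensive mining runs; real pipelines would use stronger feature-selection or projection methods.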
iii) Mining complex and dynamic data: BIG data's main feature is its continuous growth and its inclusion of complex data such as unstructured data. This complex and dynamic data also provides opportunities for finding knowledge with complex relationships in dynamically changing volumes. Data mining algorithms face problems handling this complexity: mining data containing images, audio, text, hypertext and video requires advanced mining algorithms [6].
3.2 Data mining techniques for BIG data
As we have already stated above, traditional data mining techniques are not enough for mining BIG data. But if we apply the above-mentioned data mining techniques together with other methods and technologies, it may be possible to perform successful mining of BIG data. This section introduces such innovative data mining techniques for BIG data mining: parallel mining for BIG data, sampling-based data mining for BIG data, and machine learning techniques for mining BIG data.
Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem, and it is one of the preferred approaches for working on BIG data. It breaks a large problem down into multiple smaller problems and then executes the smaller divisions concurrently. There are different types of parallel computing techniques: bit-level, instruction-level, data-level and task parallelism.
Parallel computing has been in the computing picture for many years, most commonly in high-performance computing. With growing concern about improving the performance and power consumption of computers, parallel computing techniques have become a dominant paradigm in computer architecture in the form of multi-core processors.
When data mining techniques are combined with parallel computing through the MapReduce framework, data mining algorithms can mine BIG data efficiently. In parallel data mining, algorithms are applied to a large dataset in a parallel manner for efficient processing. This section introduces data mining techniques with parallelization in a MapReduce environment.

Figure 3: BIG data mining techniques environment
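Parallel mining in this style can be illustrated with a miniature count-distribution example: each worker counts candidate itemsets against its own data partition, and the partial counts are then summed globally. In this sketch a thread pool stands in for cluster nodes, and the candidate itemsets and transactions are hypothetical:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Count-distribution sketch: every worker sees only its local partition;
# only the small partial-count tables are exchanged and merged.

CANDIDATES = [("bread", "milk"), ("bread", "beer"), ("milk", "beer")]

def local_counts(partition):
    counts = Counter()
    for transaction in partition:
        items = set(transaction)
        for candidate in CANDIDATES:
            if items.issuperset(candidate):
                counts[candidate] += 1
    return counts

partitions = [
    [["bread", "milk", "beer"], ["bread", "milk"]],   # node 1's data
    [["milk", "beer"], ["bread", "milk"]],            # node 2's data
]
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(local_counts, partitions))
global_counts = sum(partials, Counter())
```

Only the candidate list and the small count tables cross node boundaries, which is what keeps communication cheap in the real distributed algorithms listed below.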

3.2.1 Parallel data mining with MapReduce environment for BIG data: Despite the many algorithmic improvements proposed in data mining research, the large size, dimensionality and heterogeneity of BIG data make data mining tasks ineffective. There is therefore a growing need to develop efficient parallel data mining algorithms that can run on a distributed system. In this section, we present several parallel data mining algorithms for BIG data.
i) Parallel frequent pattern mining: Frequent pattern mining is the data mining technique for mining patterns that occur frequently in a data set. There are efficient data mining algorithms for mining frequent patterns, such as Apriori, FP-Growth and Eclat [7]. Mining frequent patterns from BIG data with these algorithms is not easy because of their memory requirements and insufficient computational capabilities when applied to BIG data, so innovative algorithms are a must.
Many parallel data mining algorithms inherit their properties from Apriori, which is why most parallel data mining algorithms are said to be variations of Apriori. Writing parallel data mining algorithms is a non-trivial task. The main challenges associated with parallel data mining include: minimizing I/O; minimizing synchronization and communication; effective load balancing; effective data layout (horizontal vs. vertical); deciding on the best search procedure to use (complete vs. greedy); good data decomposition; and minimizing or avoiding duplication of work. The following parallel algorithms have been used:
a) Count Distribution: parallelizing the task of measuring the frequency of a pattern inside a database, based on the Apriori algorithm.

b) Candidate Distribution: parallelizing the task of generating longer patterns, based on the Apriori algorithm.
c) Hybrid Count and Candidate Distribution: a hybrid algorithm that tries to combine the strengths of the above algorithms, based on the Apriori algorithm.
d) Sampling with Hybrid Count and Candidate Distribution: an algorithm that uses only a sample of the database, based on the Apriori algorithm.
e) MREclat [8]: a parallel algorithm based on the MapReduce framework.
f) Dist-Eclat [9]: a MapReduce implementation of Eclat, optimized for speed when a specific encoding of the data fits into memory.
g) BigFIM [9]: optimized to deal with truly BIG data using a hybrid algorithm that combines principles from both Apriori and Eclat, also on MapReduce.
ii) Parallel classification and classification with MapReduce: Classification is a supervised learning technique of data mining that classifies a data set into classes based on class labels for better analysis. Many existing classification techniques and algorithms are available for mining data sets, but when these algorithms are applied to BIG data without modification, they perform very poorly in terms of speed and time because of the large volume and complexity of the data. It is therefore recommended to use more advanced classification algorithms for mining BIG data. The following algorithms support parallel classification in a MapReduce environment for better efficiency on BIG data:
a) PNB(+) [10]: a parallel computing implementation of the Naive Bayes classifier.
b) SVM with MapReduce [11]: a MapReduce implementation of SVM.
c) C4.5 with MapReduce [11]: a MapReduce implementation of C4.5.
d) KNN with MapReduce [11]: a MapReduce implementation of KNN.
e) Naive Bayes with MapReduce [11]: a MapReduce implementation of the Naive Bayes classifier.
iii) Parallel clustering: Clustering is an unsupervised learning technique of data mining; it groups objects into clusters based on similarity or dissimilarity. Many existing clustering techniques and algorithms are available for mining data sets, but when applied to BIG data without modification they perform very poorly in terms of speed and time because of the large volume and complexity of the data. PIC [12] finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pairwise similarity matrix of the data. Parallel power iteration clustering [13] is a parallel implementation of the power iteration clustering algorithm.

3.2.2 Sampling based data mining for BIG data: One of the most obvious ways to deal with the BIG data problem is simply to sample the big data set to create a smaller one. We can then run our algorithms and analysis on the smaller data set, which completes much more quickly.
Sampling is a powerful data reduction technique that has been applied to a variety of data mining algorithms to reduce computational overhead. In the context of association rules, sampling can be used to gather quick preliminary rules. This may help the user direct the data mining process by refining the criterion for interesting rules. Sampling can speed up the mining process by more than an order of magnitude by reducing I/O costs and drastically shrinking the number of transactions to be considered. The validity of a sample is determined by two characteristics: its size and its quality. Quality, in the context of statistical sampling techniques, refers to whether the sample captures the characteristics of the database [14].
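The sampling-based approach of Section 3.2.2 can be sketched in a few lines: draw a random sample of transactions and estimate item support from the sample instead of scanning the whole database. The synthetic database below and the 1% sample size are arbitrary choices for illustration:

```python
import random

# Sampling-based mining sketch: estimate item support from a random
# sample of transactions instead of scanning the full database.

def support(transactions, item):
    return sum(item in t for t in transactions) / len(transactions)

random.seed(42)
# A synthetic "big" database: "milk" appears in 60% of transactions.
database = [["milk", "bread"] if i % 5 < 3 else ["bread"]
            for i in range(100000)]

sample = random.sample(database, 1000)   # mine only 1% of the data
estimated = support(sample, "milk")      # close to the exact value
exact = support(database, "milk")        # 0.6
```

The estimate is within sampling error of the exact support while touching a hundredth of the data, which is where the order-of-magnitude speedup quoted above comes from.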


3.2.3 Machine learning for BIG data: Machine learning is


a rather new domain of IT and advanced mathematics, based on
new statistical algorithms that could analyze big volume of
diverse data sources (image, sound, video, social network, geolocalization, traditional structured database, etc) in near
real time. Computers, using these new types of programs, could
learn from data for better future use [15].
MapReduce is an arrangement of compute tasks enabling relatively easy scaling. It comprises: hardware, racks of CPUs with direct access to local disk, networked to one another; a file system providing distributed storage across multiple disks, with redundancy; software processes running on the various CPUs in the assembly; a controller that manages mapper and reducer tasks, fault detection and recovery; mappers, identical tasks assigned to multiple CPUs, each running over its local data; and reducers, which aggregate the output from several mappers to form the end product. The programmer only needs to author the mapper and reducer [15].
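The mapper/reducer contract just described can be sketched in a few lines of Python (a single-process simulation of the data flow, not Hadoop itself; the word-count task and all names are our illustration):

```python
from collections import defaultdict
from itertools import groupby

def mapper(record):
    # Map phase: emit (word, 1) for every word in one input record.
    for word in record.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce phase: aggregate all counts emitted for one key.
    return word, sum(counts)

def run_mapreduce(records):
    # Shuffle phase: group intermediate pairs by key, then reduce each group.
    pairs = sorted(kv for rec in records for kv in mapper(rec))
    return dict(reducer(k, (c for _, c in grp))
                for k, grp in groupby(pairs, key=lambda kv: kv[0]))

print(run_mapreduce(["big data", "big data mining"]))
# → {'big': 2, 'data': 2, 'mining': 1}
```

In a real cluster the controller assigns the mapper calls to the nodes holding each data block and routes each key to one reducer; the programmer still writes only the two functions above.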
4. COMPARISON BETWEEN TRADITIONAL DATA MINING ALGORITHMS AND ADVANCED BIG DATA MINING ALGORITHMS FOR MINING BIG DATA:
In this section we present the scope of and new directions for research in BIG data mining, in light of the data mining challenges posed by BIG data. The following tables list the main data mining techniques with their purposes and the problems that arise when dealing with BIG data, and compare traditional data mining algorithms with BIG data mining algorithms.

TABLE I
DATA MINING TECHNIQUES: PROBLEMS WITH BIG DATA

Data mining technique | Purpose | For normal datasets | Problems with BIG data
Frequent pattern mining | Mine patterns that occur frequently in a dataset | Generates efficient results in a short time | Memory requirements, minimum frequency threshold and speed
Classification | Classify the dataset using known class labels for effective data analysis | Classifies the dataset accurately | Preprocessing, feature extraction, training or learning, accuracy, time complexity
Clustering | Group objects based on similarity or dissimilarity | Efficiently partitions and categorizes the dataset | Heterogeneity of data and preprocessing
Outlier analysis | Identify rare and abnormal data in a dataset | Detects outliers in a fast and efficient way | Volume, veracity and value
TABLE II
COMPARISON BETWEEN TRADITIONAL DATA MINING ALGORITHMS AND BIG DATA MINING ALGORITHMS

Data mining algorithm | Problems with BIG data | BIG data mining algorithm | Results
Apriori | Memory requirements | MREclat | Improves performance, sufficient memory
FP-growth | Insufficient computational capabilities | Dist-Eclat | High scalability and good speedup
Eclat | Balanced data distribution, inter-communication cost | BigFIM | Improves performance, high speedup, efficient pattern mining for BIG data
SVM [16] | Memory requirements | SVM with MapReduce | Reduces training and computation time, increases performance
C4.5 [17] | A sub-tree can be replicated several times | C4.5 with MapReduce | Minimizes communication cost, reduces execution time
KNN [18] | High computation cost | KNN with MapReduce | Reduces communication cost, increases performance
Naive Bayes [19] | Strong feature independence assumptions, low performance on large datasets | Naive Bayes with MapReduce, PNB(+) | Reduces time complexity, able to process large data
TABLE III
BIG DATA MINING RESEARCH ISSUES

BIG data mining challenge | BIG data mining research issues
Volume: BIG data mining platform | Improve the performance of MapReduce and enhance the real-time nature of large-scale data processing.
Variety: mining from heterogeneous sources | Mining from heterogeneous information networks; mining hidden patterns from heterogeneous multimedia streams of multiple sources.
Velocity: fast accessing and mining | Design of new and more efficient indexing schemes to boost the speed of BIG data access and mining.
Privacy | Measurement and prevention of privacy violations during mining from BIG data.
Interactiveness | Ability to visualize, pre-evaluate and interpret intermediate and final mining results.
Garbage mining | Cyberspace cleaning; mining hidden garbage; mining knowledge from garbage.
Mining uncertain, incomplete data | Outlier mining techniques for efficient BIG data preprocessing and analysis.
Data mining algorithm challenges | Need to develop efficient parallel mining algorithms, sampling-based algorithms, ensembles and hybrid algorithms.


5 RELATED WORK
BIG data and BIG data mining have been very active areas of research in recent years. As we are in the world of BIG data, mining BIG data has become essential. Some researchers have presented data mining issues and data mining algorithm challenges, and some have proposed data mining algorithms for BIG data.
Zhigang Zhang, Genlin Ji and Mengmeng Tang [8] proposed MREclat, an algorithm for mining frequent itemsets from BIG data. Eclat is a classical algorithm for mining frequent itemsets that works on vertical-layout databases, which makes it greatly different from algorithms based on horizontal-layout databases such as Apriori and FP-growth. To improve the efficiency of mining frequent itemsets from massive datasets, the parallel algorithm MREclat, based on the MapReduce framework, is presented; it also overcomes the memory problem. The paper introduces the idea of MREclat and studies the performance of the algorithm; the experimental results show that MREclat has high scalability and good speedup.
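The vertical layout that distinguishes Eclat from Apriori and FP-growth can be illustrated with a small in-memory sketch (ours, not MREclat itself; MREclat distributes this tid-list intersection over MapReduce). Each item maps to the set of transaction ids containing it, and the support of an itemset is the size of the intersection of its items' tid-lists:

```python
def eclat(tidlists, min_support, prefix=(), out=None):
    """Recursively intersect tid-lists to find all frequent itemsets.

    tidlists maps each item to the set of transaction ids containing it
    (the vertical layout Eclat operates on).
    """
    if out is None:
        out = {}
    items = sorted(tidlists)
    for i, item in enumerate(items):
        tids = tidlists[item]
        if len(tids) < min_support:
            continue                      # prune infrequent branches early
        itemset = prefix + (item,)
        out[itemset] = len(tids)
        # Conditional database: intersect with tid-lists of later items.
        suffix = {j: tids & tidlists[j] for j in items[i + 1:]}
        eclat(suffix, min_support, itemset, out)
    return out

# Vertical layout of 4 transactions: t1={a,b}, t2={a,b,c}, t3={a,c}, t4={b}
vertical = {"a": {1, 2, 3}, "b": {1, 2, 4}, "c": {2, 3}}
print(eclat(vertical, min_support=2))
# → {('a',): 3, ('a', 'b'): 2, ('a', 'c'): 2, ('b',): 3, ('c',): 2}
```

Because each branch of the recursion depends only on its own conditional tid-lists, branches can be mined on different machines, which is the parallelization opportunity MREclat exploits.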
Sandy Moens, Emin Aksehirli and Bart Goethals [9] proposed two algorithms, based on the well-known Eclat, Apriori and FP-growth algorithms, that exploit the MapReduce framework to deal with two aspects of the challenges of frequent itemset mining (FIM) on Big Data: (1) Dist-Eclat is a MapReduce implementation optimized for speed in the case where a specific encoding of the data fits into memory; (2) BigFIM is optimized to deal with truly Big Data by using a hybrid algorithm that combines principles from both Apriori and Eclat, also on MapReduce.
Vijay D. Katkar and Siddhant Vijay Kulkarni [10] proposed a parallel computing implementation of the naive Bayesian classifier, one of the most commonly used classifiers in machine learning applications. In spite of the so-called naive assumptions made by this classifier, it has proven to be highly accurate and efficient when applied over large data sets. The initial efficiency of naive Bayes makes it an ideal candidate for parallel implementation to gain optimal performance.
Seyed Reza Pakize and Abolfazl Gandomi [11] note that one of the most important limitations of classification algorithms is the time they take on very large datasets, where they show poor run-time performance and high computation time and cost. To overcome these limitations, many researchers have based such classification algorithms on MapReduce. The authors studied these MapReduce-based classification algorithms, compared them with the traditional models, and highlighted the advantages of the MapReduce models over the traditional ones.
F. Lin et al. [12] present PIC, which finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pairwise similarity matrix of the data. PIC is very fast on large datasets, running over 1,000 times faster than an NCut implementation based on the state-of-the-art IRAM eigenvector computation technique. In spectral clustering the embedding is formed by the bottom eigenvectors of the Laplacian of a similarity matrix; in PIC the embedding is an approximation to an eigenvalue-weighted linear combination of all the eigenvectors of a normalized similarity matrix. The main focus of PIC is its simplicity and scalability. The authors demonstrate that a basic implementation of this method is able to partition a network dataset of 100 million edges within a few seconds on a single machine, without sampling, grouping or other preprocessing of the data.
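The relation between the two embeddings can be made explicit with the standard power iteration expansion (our sketch, consistent with the description above; here W = D⁻¹A is the normalized similarity matrix with eigenvectors e_i, eigenvalues 1 = λ₁ ≥ |λ₂| ≥ …, and c_i are the coefficients of the start vector in the eigenbasis):

```latex
v^{(t)} = \frac{W\,v^{(t-1)}}{\lVert W\,v^{(t-1)} \rVert_{1}}
        \;\propto\; W^{t} v^{(0)}
        \;=\; c_{1}\,e_{1} \;+\; \sum_{i=2}^{n} c_{i}\,\lambda_{i}^{\,t}\, e_{i}.
```

Since λ₁ = 1 belongs to the constant vector, the remaining eigenvector components decay at rates λ_i^t; stopping the iteration early therefore leaves exactly the eigenvalue-weighted combination of all eigenvectors described above, in which points of the same cluster already share nearly identical values.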
Hui Chen et al. [20] present parallel algorithms, built by extending the MapReduce framework, for mining frequent patterns in big transactional data. Following the MapReduce framework, the big data is first split into many data sub-files, and the sub-files are processed concurrently by a cluster consisting of hundreds or thousands of computers. To improve the performance of the proposed method, the insignificant patterns in each data sub-file are efficiently located and pruned by probability analysis. The simulation results show that the method is efficient and scalable, and can be used to efficiently mine frequent patterns in big data.

6 CONCLUSION
The rapid and continuous generation of large amounts of data has put us in the era of BIG data, so it is fair to say that the BIG data revolution has started. BIG data challenges today's software systems, storage systems, platforms, mining methodologies and many other technologies in terms of volume, velocity, variety, veracity and value. But BIG data has also given us opportunities and new directions for business, marketing and knowledge insights for analysis purposes. BIG data reveals the limitations of existing data mining techniques, which has resulted in a series of new challenges related to BIG data mining. BIG data mining is a promising research area, still in its infancy. In spite of the limited work done on BIG data mining so far, we believe that much work is required to overcome its challenges related to heterogeneity, scalability, speed, accuracy, trust, provenance, privacy and interactiveness. We hope our attempt to introduce BIG data mining research issues and to compare data mining and BIG data mining algorithms will open new directions for research and innovation in BIG data mining.
Looking ahead, BIG data mining is the future of current data mining technology. Because of the limitations of typical data mining techniques, it is essential to create advanced, innovative data mining algorithms for BIG data mining that can deal with BIG data's large volume, speed and heterogeneous origins. BIG data mining is still a largely unexplored field of research, but we believe it is a very promising area, and many more BIG data mining algorithms can be proposed using ensemble, incremental or hybrid approaches to data mining. All this requires is some new and innovative ideas and creative minds.

REFERENCES
[1] Brian R. King, "Teaching Data Mining in the Era of Big Data", Bucknell University, 120th ASEE Annual Conference & Exposition, June 26, 2013, Paper ID #7580.
[2] IDC, "The 2011 Digital Universe Study: Extracting Value from Chaos". [Online]. Available: http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf
[3] Richard K. Lomotey and Ralph Deters, "Towards Knowledge Discovery in Big Data", 2014 IEEE 8th International Symposium on Service Oriented System Engineering.
[4] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters".
[5] Dunren Che, Mejdl Safran and Zhiyong Peng, "From Big Data to Big Data Mining: Challenges, Issues, and Opportunities", in B. Hong et al. (Eds.), DASFAA Workshops 2013, LNCS 7827, pp. 1-15, Springer-Verlag Berlin Heidelberg, 2013.
[6] Xindong Wu, Xingquan Zhu, Gong-Qing Wu and Wei Ding, "Data Mining with Big Data".
[7] Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", 2nd Edition.
[8] Zhigang Zhang, Genlin Ji and Mengmeng Tang, "MREclat: An Algorithm for Parallel Mining Frequent Itemsets", 2013 International Conference on Advanced Cloud and Big Data.
[9] Sandy Moens, Emin Aksehirli and Bart Goethals, "Frequent Itemset Mining for Big Data", 2013 IEEE International Conference on Big Data.
[10] Vijay D. Katkar and Siddhant Vijay Kulkarni, "A Novel Parallel Implementation of Naive Bayesian Classifier for Big Data", IEEE, 2013.
[11] Seyed Reza Pakize and Abolfazl Gandomi, "Comparative Study of Classification Algorithms Based on MapReduce Model", International Journal of Innovative Research in Advanced Engineering (IJIRAE), August 2014.
[12] "POSIX Threads Programming", High Performance Computing tutorial, 30 June 2011. [Online]. Available: https://computing.llnl.gov/tutorials/pthreads/
[13] Ankit Darji, "Parallel Power Iteration Clustering for Big Data using MapReduce in Hadoop", International Journal of Advanced Research in Computer Science and Software Engineering, June 2014.
[14] Shengnan Cong, Jiawei Han, Jay Hoeflinger and David Padua, "A Sampling-based Framework for Parallel Data Mining".
[15] Michael Bowles, "Machine Learning Big Data using MapReduce".
[16] C. Cortes and V. Vapnik, "Support-Vector Networks", Machine Learning, 20(3):273-297, 1995.
[17] J. R. Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1993.
[18] A. Stupar, S. Michel and R. Schenkel, "RankReduce: Processing k-Nearest Neighbor Queries on Top of MapReduce", in LSDS-IR, pp. 13-18, 2010.
[19] A. K. Santra and S. Jayasudha, "Classification of Web Log Data to Identify Interested Users Using Naive Bayesian Classification", International Journal of Computer Science Issues, Vol. 9, Issue 1, January 2012.
[20] Hui Chen, Tsau Young Lin, Zhibing Zhang and Jie Zhong, "Parallel Mining Frequent Patterns over Big Transactional Data in Extended MapReduce", 2013 IEEE Conference on Granular Computing (GrC).
