Abstract: The rapid development and popularization of the internet, together with technological advancement, have produced a massive amount of data that continues to grow daily. The very large amount of data generated, collected, stored, and transferred by applications such as sensors, smart mobile devices, cloud systems, and social networks has put us in the era of BIG data: data of huge size, with complex and unstructured data types, from many origins. Converting this BIG data into useful information is essential, and the technique for discovering hidden, interesting patterns and knowledge in BIG data is known as BIG data mining. BIG data has raised many problems and challenges related to handling, storing, managing, transferring, analyzing, and mining, but it has also provided new directions and a wide range of opportunities for research and information extraction, and for the future of technologies such as data mining in the form of BIG data mining. In this paper, we present the concepts of BIG data and BIG data mining, describe the problems with BIG data mining and with traditional data mining techniques when dealing with BIG data, and list new research directions for BIG data mining. We also compare traditional data mining algorithms with some BIG data mining algorithms, which should be useful for the future of BIG data mining technology.
Keywords BIG data, Data mining, BIG data mining, Challenges for BIG data mining, BIG data mining algorithms,
Parallel mining, MapReduce.
INTRODUCTION
www.ijarea.org
2 BIG DATA
2.1 Why BIG Data?
There is no single agreed definition of BIG data, but the term is used to identify any collection of data sets so large and complex that it becomes difficult to manage and process using typical data management tools or traditional data processing applications. BIG data means data of large size from heterogeneous, autonomous sources with distributed and decentralized control, in which one seeks to explore complex and evolving relationships among the data.
The expansion of data will never stop. As shown in the following figure, according to the 2011 IDC Digital Universe Study, 130 exabytes of data were generated and stored in 2005. This amount reached 1,227 exabytes in 2010 and is projected to grow at 45.2% to 7,910 exabytes in 2015 [2].
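As a rough check on the figures quoted above, the growth rate implied by the two endpoint values can be recomputed. The sketch below assumes that the 45.2% figure is a yearly compound rate over 2010 to 2015; the function name is illustrative and not taken from the cited study.

```python
def compound_annual_growth_rate(start, end, years):
    """Return the constant yearly growth rate that turns `start` into `end`."""
    return (end / start) ** (1.0 / years) - 1.0

# Figures quoted in the text (exabytes). Treating 45.2% as a yearly
# compound rate is an assumption made for this check.
rate = compound_annual_growth_rate(1227, 7910, 2015 - 2010)
print(f"implied annual growth: {rate:.1%}")
```

Under that assumption the implied rate comes out close to the quoted 45.2%, so the three numbers are mutually consistent.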
All of this points to the rise of BIG data: data of massive size, with continuous growth, from heterogeneous, autonomous sources. Handling, managing, processing, computing, analyzing, and extracting useful information or exact knowledge from a BIG data set is therefore a challenging task.
2.2 Definition
According to the above facts about BIG data, BIG means:
B - Bigger in size
I - Increasing continuously
G - Greatly complex
So we can say that data that is Bigger in size, Increasing continuously, and Greatly complex is called BIG data. The best examples of BIG data are social networking sites and Wikipedia. These three attributes make BIG data more challenging than normal data.
As BIG data is bigger in size, storing, handling, and managing BIG data is a challenging task.
As BIG data increasing continuously, mean to say updating
[Figure 2: BIG data mining challenges. Labels recovered from the figure: Volume; BIG data mining platform challenges; BIG data mining algorithms challenges; Accuracy; Trust; Performance; Interactiveness; Privacy; Garbage mining.]
personal computer to handle and manage, so a BIG data processing framework will rely on cluster computers with a high performance computing platform. It requires a parallel programming tool such as MapReduce [4], in which the data mining task is performed on a large number of nodes. In this software component, a single data mining task is divided into many small tasks, all of which run on multiple computing nodes.
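The division of one task into many small tasks can be sketched in a few lines. The example below is a minimal, single-machine imitation of the map, shuffle, and reduce phases for counting item occurrences; it is not the API of any particular MapReduce framework, and all function names and the toy data are hypothetical. On a real cluster the map tasks would run on separate nodes.

```python
from collections import defaultdict

def split_into_chunks(records, n_chunks):
    """Divide one large input into n_chunks smaller inputs (one per node)."""
    return [records[i::n_chunks] for i in range(n_chunks)]

def map_task(chunk):
    """Map phase: emit an (item, 1) pair for every item seen in the chunk."""
    return [(item, 1) for record in chunk for item in record]

def shuffle(mapped_outputs):
    """Group the emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_task(key, values):
    """Reduce phase: combine all values emitted for one key."""
    return key, sum(values)

def run_job(records, n_nodes=3):
    chunks = split_into_chunks(records, n_nodes)
    # Sequential here; on a cluster each map task runs on its own node.
    mapped = [map_task(chunk) for chunk in chunks]
    grouped = shuffle(mapped)
    return dict(reduce_task(k, v) for k, v in grouped.items())

transactions = [["bread", "milk"], ["bread", "butter"],
                ["milk"], ["bread", "milk", "butter"]]
item_counts = run_job(transactions)
print(item_counts)
```

The result does not depend on how many chunks the input is split into, which is what allows the same job to scale from one node to many.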
3.1.2 Variety: Mining from heterogeneous sources:
Traditional data mining techniques are generally used to extract unknown patterns and hidden knowledge from datasets that are structured, homogeneous, and small from BIG data's perspective. Variety is one of the defining characteristics of big data: the data is generated from any number of heterogeneous sources and includes structured, semi-structured, and unstructured data. In mining such a dataset, the great challenge is perceivable, and the degree of complexity is not even imaginable before we get deeply into it. Today's database systems can handle structured data, and semi-structured data may be partially handled, but unstructured data definitely raises big problems. Both semi-structured and unstructured data are typically stored in files. Though it brings greater technical challenges, the heterogeneity of big data also means a new opportunity to unveil hidden patterns or knowledge that were previously impossible to find.
This suggests that a heterogeneous set of big data requires constructing a specialized, more complex, multi-model system. Current data mining research includes mining from heterogeneous information networks. It is also possible to mine hidden patterns from heterogeneous multimedia streams from multiple sources, which is also an active area of research in data mining.
personal information, every piece of information about everybody can be mined out, and when all pieces of information about a person are dug out and put together, any privacy of that individual instantly disappears. It suggests that it is essential to work toward powerful mining tools capable of mining a great portion of, or even the whole, Web. This is related to the challenges of how data are maintained, accessed, and shared. Privacy-preserving data mining comes into the picture when multiple parties holding sensitive data try to accomplish a data mining task without sharing any sensitive information inside the data. The foundations of data mining need to be reformulated for big data in such a way that privacy protection and discrimination prevention are embedded in the foundations themselves. There is serious research, and there are new directions, in privacy concerns such as measuring and preventing privacy violations during knowledge mining [5].
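One simple way to picture several parties computing a joint result without revealing their own data is a secure sum built on additive secret sharing. The sketch below is an illustration chosen for this purpose, not a method described in this paper or in [5]; it assumes honest-but-curious parties that do not collude, and the modulus, function names, and counts are all hypothetical.

```python
import secrets

# Illustrative public modulus; assumed larger than any possible total.
MODULUS = 2**61 - 1

def make_shares(private_value, n_parties):
    """Split a private value into n random shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (private_value - sum(shares)) % MODULUS
    return shares + [last]

def secure_sum(private_values):
    n = len(private_values)
    # Party i sends its j-th share to party j, so nobody sees a raw value.
    all_shares = [make_shares(v, n) for v in private_values]
    # Each party adds the shares it received and publishes only that sum.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS
                    for j in range(n)]
    return sum(partial_sums) % MODULUS

# Hypothetical per-party support counts for one itemset.
local_support_counts = [120, 75, 310]
total = secure_sum(local_support_counts)
print(total)
```

Each individual share is a uniformly random number, so a party learns the global count of a pattern but not the contribution of any other single party.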
3.1.6 Interactiveness: By interactiveness we mean the capability of a data mining system to allow prompt and adequate user interaction, such as feedback, interference, or guidance from users. Interactiveness has been a relatively underemphasized issue in data mining in the past. Now that our society is confronting the challenges of big data mining, interactiveness becomes a critical issue. Interactiveness relates to all three Vs and can help overcome the challenges that come with each of them. First, to conquer the volume-related challenge of big data mining, prompt user feedback and guidance can help quickly narrow the search down to a much reduced but promising sub-space, accelerate the processing speed (or velocity), and increase system scalability. Second, the heterogeneity caused by the variety of big data induces correspondingly high complexity in the big data itself and in the mining results. Sufficient system interactiveness gives users the ability to visualize, (pre-)evaluate, and interpret intermediate and final mining results. Such a facility might not be necessary for mining conventional datasets, but for big data it is a must. Great interactiveness boosts the acceptance of a complicated mining system and its mining results by potential users. Sufficient interactiveness is especially important for big data mining [5] [6].
3.1.7 Garbage Mining: Garbage has no value. No one wants garbage. In the big data era, the volume of data generated and populated on the World Wide Web keeps increasing at an amazingly fast pace. In such an environment, data can (quickly) become outdated, corrupted, and useless; in addition, there is data that is created as junk (like junk emails). If society does not pay attention and take action now, as time goes on we will be flooded by junk data in cyberspace. For the sake of a relatively clean cyberspace and a clean World Wide Web, we herein call for attention and research efforts. Cyberspace cleaning is not an easy task for at least two reasons: garbage is hidden, and there is an ownership issue. It is possible to apply data mining approaches to mine garbage and recycle it. Garbage mining is a serious research topic, different from but related to big data mining. For the sake of the sustainability of our digital environment, mining for garbage
3.2.1 Parallel data mining with MapReduce environment for BIG data: Despite the many algorithmic improvements proposed in data mining research, the large size, high dimensionality, and heterogeneity of BIG data make data mining tasks ineffective. There is therefore a growing need to develop efficient parallel data mining algorithms that can run on a distributed system. In this section, we present several parallel data mining algorithms for BIG data.
i) Parallel frequent pattern mining: Frequent pattern mining is the data mining technique for finding patterns that occur frequently in a data set. There are some efficient algorithms for mining frequent patterns, such as Apriori, FP-Growth, and Eclat [7]. Mining frequent patterns from BIG data with these algorithms is not easy because of their memory requirements and insufficient computational capabilities when applied to BIG data. Innovative algorithms are therefore a must.
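To make the memory and computation concerns concrete, the following is a minimal single-machine sketch of Apriori, offered as an illustration rather than as the implementation evaluated in [7]. The function name and toy transactions are hypothetical. Note that each level holds a full candidate set in memory and makes one complete pass over the data.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # Join step: build size-k candidates from frequent (k-1)-itemsets.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k:
                # Prune step: every (k-1)-subset must itself be frequent.
                if all(frozenset(sub) in frequent
                       for sub in combinations(union, k - 1)):
                    candidates.add(union)
        # Count step: one full pass over the data for this level.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result

transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"],
                ["b", "c"], ["a", "b", "c"]]
frequent_itemsets = apriori(transactions, min_support=3)
for itemset, support in sorted(frequent_itemsets.items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

With many distinct items the candidate set grows combinatorially and the repeated passes over the data dominate the running time, which is the motivation for the parallel variants discussed next.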
Many parallel data mining algorithms inherit this property from Apriori, which is why most parallel data mining algorithms are said to be variations of Apriori. Writing parallel data mining algorithms is a non-trivial task. The main challenges associated with parallel data mining include minimizing I/O; minimizing synchronization and communication; effective load balancing; effective data layout (horizontal vs. vertical); deciding on the best search procedure to use (complete vs. greedy); good data decomposition; and minimizing or avoiding duplication of work. The following parallel algorithms have been used:
a) Count Distribution: parallelizes the task of measuring the frequency of a pattern inside a database, based on the Apriori algorithm.
c) Hybrid Count and Candidate Distribution: a hybrid algorithm that tries to combine the strengths of the above algorithms, based on the Apriori algorithm.
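The idea behind Count Distribution can be sketched as one counting pass: every node holds the same candidate itemsets and its own partition of the database, counts locally, and then the counts are summed globally. The code below is a single-machine illustration under that reading; the function names and the toy partitions are hypothetical, and the loop over partitions stands in for work that would run in parallel.

```python
from collections import Counter

def local_counts(partition, candidates):
    """Runs on one node: count each candidate in that node's share of the data."""
    counts = Counter()
    for transaction in partition:
        items = frozenset(transaction)
        for candidate in candidates:
            if candidate <= items:
                counts[candidate] += 1
    return counts

def count_distribution_pass(partitions, candidates, min_support):
    """One counting pass: local counts per node, then a global sum of counts."""
    # Sequential here; on a cluster each partition is counted on its own node.
    per_node = [local_counts(p, candidates) for p in partitions]
    global_counts = Counter()
    for counts in per_node:
        # Only counts are exchanged between nodes, never raw transactions.
        global_counts.update(counts)
    return {c: n for c, n in global_counts.items() if n >= min_support}

partitions = [
    [["a", "b"], ["a", "c"]],                    # data held by node 1
    [["a", "b", "c"], ["b", "c"], ["a", "b"]],   # data held by node 2
]
candidates = [frozenset(["a", "b"]), frozenset(["a", "c"]), frozenset(["b", "c"])]
frequent_pairs = count_distribution_pass(partitions, candidates, min_support=3)
print(frequent_pairs)
```

The communication cost per pass is proportional to the number of candidates rather than to the size of the data, which is the usual argument for this layout.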
TABLE I
DATA MINING TECHNIQUES PROBLEMS WITH BIG DATA
Data mining technique      Purpose
Frequent pattern mining    Mining patterns that occur frequently in the dataset
Classification             Classify the dataset using known class labels for effective data analysis
Clustering                 To group objects based on similarity or dissimilarity
Outlier analysis           To identify rare and abnormal data in the dataset
TABLE II
COMPARISON BETWEEN TRADITIONAL DATA MINING ALGORITHMS AND BIG DATA MINING ALGORITHMS
Data mining Algorithms
Apriori
FP-Growth
Eclat
SVM [16]
Results
Memory requirements
MREclat
Insufficient computational capabilities
Balanced data distribution, inter-communication cost
Memory requirements
Dist-Eclat
BigFIM
C4.5 [17]
KNN [18]
Naive Bayes [19]
TABLE III
BIG DATA MINING RESEARCH ISSUES
BIG data mining challenges
Volume: BIG data mining platform
Variety: Mining from heterogeneous sources
Velocity: Fast accessing and mining
Privacy
Interactiveness
Garbage mining
Mining uncertain, incomplete data
Data mining algorithm challenges
5 RELATED WORK
BIG data and BIG data mining have been a very active area of research in recent years. As we are in the world of BIG data, mining from BIG data is all the more essential. Some researchers have presented data mining issues and data mining algorithm challenges, and some have proposed data mining algorithms for BIG data.
Zhigang Zhang, Genlin Ji, Mengmeng Tan et al. [8] proposed the MREclat algorithm for mining frequent itemsets from BIG data. Eclat is a classical algorithm for mining frequent itemsets that is based on vertical layout databases. It is greatly different from algorithms based on horizontal layout databases, such as Apriori and FP-Growth. To improve the efficiency of mining frequent itemsets from massive datasets, the parallel algorithm MREclat, based on the Map/Reduce framework, is presented. The algorithm also overcomes the memory problem. In their paper, the idea of MREclat is introduced and the performance of the algorithm is studied. The experimental results show that MREclat has high scalability and good speedup.
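The vertical layout that Eclat relies on can be illustrated with a short single-machine sketch. This is plain Eclat under simplifying assumptions, not the MREclat algorithm of [8]; the function names and toy transactions are hypothetical. Each item maps to the set of transaction ids (a tidset) that contain it, and the support of a larger itemset is the size of the intersection of tidsets.

```python
def to_vertical(transactions):
    """Convert horizontal (tid -> items) data into vertical (item -> set of tids)."""
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets

def eclat(prefix, items, min_support, out):
    """Depth-first search; the support of an itemset is the size of its tidset."""
    while items:
        item, tids = items.pop()
        if len(tids) >= min_support:
            itemset = prefix + [item]
            out[frozenset(itemset)] = len(tids)
            # Extend the itemset by intersecting with the remaining tidsets.
            suffix = [(other, tids & other_tids) for other, other_tids in items]
            eclat(itemset, suffix, min_support, out)
    return out

transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"],
                ["b", "c"], ["a", "b", "c"]]
vertical = to_vertical(transactions)
frequent_itemsets = eclat([], sorted(vertical.items()), 3, {})
print(len(frequent_itemsets))
```

Unlike the horizontal approach, no further pass over the original transactions is needed once the tidsets are built; the cost moves to storing and intersecting the tidsets, which is where memory becomes the limiting factor.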
Sandy Moens, Emin Aksehirli and Bart Goethals et al. [9] proposed two algorithms that, building on the well known Eclat, Apriori, and FP-Growth algorithms, exploit the MapReduce framework to deal with two aspects of the challenges of frequent itemset mining (FIM) on Big Data: (1) Dist-Eclat is a MapReduce implementation of Eclat, optimized for speed in case a specific encoding of the data fits into memory. (2) BigFIM is optimized to deal with truly Big Data by using a hybrid algorithm, combining principles from both Apriori and Eclat, also on MapReduce.
Vijay D. Katkar, Siddhant Vijay Kulkarni et al. [10] proposed a parallel computing implementation of the naive Bayesian classifier. This classifier is one of the most commonly used classifiers in machine learning applications. In spite of the so-called naive assumptions made by this classifier, it has proven to be highly accurate and efficient when applied to large data sets. The initial efficiency of naive Bayes makes it an ideal candidate for parallel implementation to gain optimal performance.
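One reason naive Bayes parallelizes well is that training on categorical features reduces to counting, and counts from separate data partitions can simply be added. The sketch below illustrates that idea on a single machine; it is not the implementation proposed in [10], and the function names, the toy data, and the add-one smoothing with two values per feature are assumptions made for the example.

```python
import math
from collections import Counter

def count_partition(rows):
    """Runs on one node: tally class and (feature, value, class) frequencies."""
    class_counts, feature_counts = Counter(), Counter()
    for features, label in rows:
        class_counts[label] += 1
        for index, value in enumerate(features):
            feature_counts[(index, value, label)] += 1
    return class_counts, feature_counts

def merge(partials):
    """Combine per-node tallies; addition is all that is needed."""
    class_counts, feature_counts = Counter(), Counter()
    for c, f in partials:
        class_counts.update(c)
        feature_counts.update(f)
    return class_counts, feature_counts

def predict(features, class_counts, feature_counts, n_values=2):
    """Pick the class with the highest log-posterior (add-one smoothing)."""
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        score = math.log(n_label / total)
        for index, value in enumerate(features):
            hits = feature_counts[(index, value, label)]
            score += math.log((hits + 1) / (n_label + n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical (outlook, windy) -> decision data, split across two nodes.
part1 = [(("sunny", "no"), "play"), (("sunny", "yes"), "play"),
         (("rainy", "yes"), "stay")]
part2 = [(("rainy", "no"), "stay"), (("sunny", "no"), "play"),
         (("rainy", "yes"), "stay")]
model = merge([count_partition(part1), count_partition(part2)])
print(predict(("sunny", "no"), *model), predict(("rainy", "yes"), *model))
```

Merging the per-partition tallies gives exactly the same model as counting over the whole dataset at once, so the parallel version loses no accuracy relative to the sequential one.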
Seyed Reza Pakize and Abolfazl Gandomi et al. [11] noted that one of the most important limitations arises when this classification algorithm is used on very large datasets, where it has poor run-time performance and high computation time and cost. To overcome this limitation, many researchers use this classification algorithm on top of MapReduce. In their paper, they studied these classification algorithms, compared them with the traditional models, and highlighted the advantages of the MapReduce models over the traditional models.
F. Lin et al. [12] presented PIC (power iteration clustering), which finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data. PIC is very fast on large datasets, running over 1,000 times faster than an NCut implementation based on the state-of-the-art IRAM eigenvector computation technique. In spectral clustering the embedding is formed by the bottom
6 CONCLUSION
The rapid and continuous generation of large amounts of data has put us in the era of BIG data, so it is fair to say that the BIG data revolution has started. BIG data challenges today's software systems, storage systems, platform systems, mining methodologies, and many more technologies in terms of volume, velocity, variety, veracity, and value. But BIG data has also given us opportunities and new directions for business, marketing, and knowledge insights for analysis purposes. BIG data reveals the limitations of existing data mining techniques, resulting in a series of new challenges related to big data mining. BIG data mining is a promising research area, still in its infancy. Given the limited work done on big data mining so far, we believe that much work is required to overcome its challenges related to heterogeneity, scalability, speed, accuracy, trust, provenance, privacy, and interactiveness. Our attempt to introduce BIG data mining research issues and to compare data mining and BIG data mining algorithms will be helpful in finding new directions for research and innovation in BIG data mining.
Looking ahead, BIG data mining is the future of current data mining technology. Because of the limitations of typical data mining techniques, it is essential to create advanced, innovative data mining algorithms for BIG data mining that deal with BIG data's large volume, speed, and heterogeneous origins. BIG data mining is still a largely unexplored field of research, but we believe it is a very promising one, and many more BIG data mining algorithms can be proposed using ensemble, incremental, or hybrid approaches to data mining. All this requires is new and innovative ideas and creative minds.
REFERENCES
[1]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]