Dr. L. Jabasheela
Abstract: The World Wide Web plays an important role in providing knowledge sources to the world, which helps many applications deliver quality service to consumers. As the years go by, the web becomes overloaded with information, and it becomes very hard to extract the relevant information from it. This has given way to the evolution of Big Data, and the volume of data keeps increasing rapidly day by day. Data mining techniques are used to find the hidden information in big data. In this paper we review Big Data, its data classification methods, and the ways it can be mined using various mining methods.
Keywords: Big Data, Data Mining, Data Classification, Mining Techniques
I. INTRODUCTION
The concept of big data has been endemic within computer science since the earliest days of computing. Big Data originally meant the volume of data that could not be processed by traditional database methods and tools. Each time a new storage medium was invented, the amount of accessible data exploded because it could be easily accessed. The original definition focused on structured data, but most researchers and practitioners have come to realize that most of the world's information resides in massive, unstructured form, largely as text and imagery. The explosion of data has not been accompanied by a corresponding new storage medium. The structure of this paper is as follows: Section 2 is about Big Data, Section 3 covers Big Data characteristics, Section 4 covers architecture and classification, Sections 5, 6, and 7 discuss Big Data analytics, the open source revolution, and mining techniques for Big Data, and finally Section 8 concludes the paper.
II. BIG DATA
Big Data is a term assigned to datasets so large that we cannot manage them with the traditional data mining techniques and software tools available. Big Data appears as a concrete, large dataset that hides information in its massive volume, and that information cannot be explored without new algorithms or data mining techniques.
III. BIG DATA CHARACTERISTICS
We have all heard of the 3Vs of big data: Volume, Variety and Velocity. Yet there are other Vs that IT, business, and data scientists need to be concerned with, most notably big data Veracity.
Data Volume: Data volume measures the amount of data available to an organization, which does not
necessarily have to own all of it as long as it can access it. As data volume increases, the value of different data
records will decrease in proportion to age, type, richness, and quantity among other factors.
Data Variety: Data variety is a measure of the richness of the data representation: text, images, video, audio, etc. From an analytic perspective, it is probably the biggest obstacle to effectively using large volumes of data. Incompatible data formats, non-aligned data structures, and inconsistent data semantics represent significant challenges that can lead to analytic sprawl.
Data Velocity: Data velocity measures the speed of data creation, streaming, and aggregation. Ecommerce has
rapidly increased the speed and richness of data used for different business transactions (for example, web-site
clicks). Data velocity management is much more than a bandwidth issue; it is also an ingest issue.
Data Veracity: Data veracity refers to the biases, noise, and abnormality in data. Is the data being stored and mined meaningful to the problem being analyzed? Veracity is the biggest challenge in data analysis when compared with things like volume and velocity.
IV. BIG DATA ARCHITECTURE AND CLASSIFICATION
This "Big data architecture and patterns" series presents a structured and pattern-based approach to simplify the task
of defining an overall big data architecture [8].
_________________________________________________________________________________________________
2014, IJIRIS- All Rights Reserved
Page - 17
Because it is important to assess whether a business scenario is a big data problem, we include pointers to help determine
which business problems are good candidates for big data solutions.
Business problems, with their big data type and description:

Utilities: Predict power consumption (machine-generated data). Utility companies have rolled out smart meters to measure the consumption of water, gas, and electricity at regular intervals of one hour or less. These smart meters generate huge volumes of interval data that need to be analyzed. Utilities also run big, expensive, and complicated systems to generate power. Each grid includes sophisticated sensors that monitor voltage, current, frequency, and other important operating characteristics.

Telecommunications: Customer churn analytics. Telecommunications operators need to build detailed customer churn models that include social media and transaction data, such as CDRs, to keep up with the competition. The value of the churn models depends on the quality of customer attributes (customer master data such as date of birth, gender, location, and income) and the social behaviour of customers. Telecommunications providers who implement a predictive analytics strategy can manage and predict churn by analyzing the calling patterns of subscribers.

Marketing: Sentiment analysis. Marketing departments use Twitter feeds to conduct sentiment analysis to determine what users are saying about the company and its products or services, especially after a new product or release is launched. Customer sentiment must be integrated with customer profile data to derive meaningful results; customer feedback may vary according to customer demographics.

Customer service: Call monitoring (human-generated data). IT departments are turning to big data solutions to analyze application logs to gain insight that can improve system performance. Log files from various application vendors are in different formats; they must be standardized before IT departments can use them.

Retail: Personalized messaging based on facial recognition and social media (machine-generated data, biometrics). Retailers can use facial recognition technology in combination with a photo from social media to make personalized offers to customers based on buying behaviour and location. This capability could have a tremendous impact on retailers' loyalty programs, but it has serious privacy ramifications; retailers would need to make the appropriate privacy disclosures before implementing these applications.

Retail: Location-based promotions (transaction data). Retailers can target customers with specific promotions and coupons based on location data. Solutions are typically designed to detect a user's location upon entry to a store or through GPS. Location data combined with customer preference data from social networks enables retailers to target online and in-store marketing campaigns based on buying history. Notifications are delivered through mobile applications, SMS, and email.
a. Classification of Big Data
It is helpful to look at the characteristics of big data along certain lines; for example, Figure 2 shows how the data is collected, analyzed, and processed. Once the data is classified, it can be matched with the appropriate big data pattern:
Analysis type: whether the data is analyzed in real time or batched for later analysis. Give careful consideration to choosing the analysis type, since it affects several other decisions about products, tools, hardware, data sources, and expected data frequency. A mix of both types may be required by the use case: fraud detection requires analysis in real time or near real time, while trend analysis for strategic business decisions can run in batch mode.
Processing methodology: the type of technique to be applied for processing data (e.g., predictive, analytical, ad-hoc query, and reporting). Business requirements determine the appropriate processing methodology, and a combination of techniques can be used. The choice of processing methodology helps identify the appropriate tools and techniques to be used in your big data solution.
Data frequency and size: how much data is expected and at what frequency it arrives. Knowing frequency and size helps determine the storage mechanism, storage format, and the necessary pre-processing tools. Data frequency and size depend on the data sources: on demand, as with social media data; continuous feed, real-time (weather data, transactional data); or time series (time-based data).
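As a toy illustration of the first dimension, the choice between real-time and batch analysis can be sketched as a small routing rule. The function name, inputs, and the 60-second threshold below are invented for illustration, not taken from any product:

```python
# Hypothetical sketch: routing a workload to a big data pattern based on
# the classification dimensions above (names and thresholds are invented).

def choose_pattern(latency_seconds, arrival):
    """Pick an analysis type from the required latency and arrival mode."""
    if arrival == "continuous" and latency_seconds < 60:
        return "real-time (stream processing)"
    return "batch (scheduled MapReduce jobs)"

print(choose_pattern(1, "continuous"))    # fraud detection -> real-time
print(choose_pattern(86400, "on-demand")) # trend analysis  -> batch
```

The same kind of rule could be extended with the other two dimensions (processing methodology, data frequency and size) to narrow down storage formats and tools.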
Apache Pig [6]: software for analyzing large data sets, consisting of a high-level language similar to SQL for expressing data analysis programs, coupled with infrastructure for evaluating these programs. It contains a compiler that produces sequences of MapReduce programs.
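To make the MapReduce model that Pig compiles down to concrete, here is a minimal word count expressed as a map phase, a shuffle (sort and group by key), and a reduce phase. This is a plain-Python sketch of the programming model, not the Pig or Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # map: emit a (word, 1) pair for every word
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # shuffle: sort so identical keys are adjacent, then group and sum
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (key, sum(v for _, v in group))

lines = ["big data", "big mining"]
print(dict(reduce_phase(map_phase(lines))))  # {'big': 2, 'data': 1, 'mining': 1}
```

A Pig script expresses the same pipeline declaratively; the compiler turns each relational operator into map and reduce stages like these.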
Cascading [10]: software abstraction layer for Hadoop, intended to hide the underlying complexity of MapReduce jobs.
Cascading allows users to create and execute data processing workflows on Hadoop clusters using any JVM-based
language.
Scribe [11]: server software developed by Facebook and released in 2008. It is intended for aggregating log data
streamed in real time from a large number of servers.
Apache HBase [4]: non-relational columnar distributed database designed to run on top of the Hadoop Distributed Filesystem (HDFS). It is written in Java and modeled after Google's BigTable. HBase is an example of a NoSQL data store.
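The column-family data model behind HBase can be illustrated with a toy in-memory sketch: every cell is addressed by a row key, a column family, and a qualifier. This is a conceptual illustration with invented names, not the HBase API:

```python
# Toy sketch of a BigTable-style column-family store (not the HBase API):
# cells are addressed by (row key, column family, qualifier).

table = {}

def put(row, family, qualifier, value):
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

def get(row, family, qualifier):
    return table.get(row, {}).get(family, {}).get(qualifier)

put("user#42", "profile", "name", "Ada")
put("user#42", "activity", "last_login", "2014-01-01")
print(get("user#42", "profile", "name"))  # Ada
```

In the real system, rows are kept sorted by key and partitioned across region servers, which is what lets the model scale horizontally.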
Apache Cassandra [2]: another open source distributed database management system developed by Facebook. Cassandra
is used by Netflix, which uses Cassandra as the back-end database for its streaming services.
Apache S4 [15]: platform for processing continuous data streams. S4 is designed specifically for managing data streams, and S4 apps are designed by combining streams and processing elements in real time.
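The core S4 idea, keyed events dispatched to one processing-element instance per key, can be sketched as follows. The class and field names here are invented for illustration and are not the S4 API:

```python
# Rough sketch of the S4 model (invented names, not the S4 API): events are
# keyed, and one processing-element instance per key consumes its stream.

class CountPE:
    """A processing element that counts the events seen for a single key."""
    def __init__(self, key):
        self.key = key
        self.count = 0

    def on_event(self, event):
        self.count += 1

pes = {}  # the platform's registry of live processing elements

def dispatch(event):
    # route each event to the PE instance owning its key, creating it lazily
    key = event["user"]
    if key not in pes:
        pes[key] = CountPE(key)
    pes[key].on_event(event)

for e in [{"user": "a"}, {"user": "b"}, {"user": "a"}]:
    dispatch(e)
print(pes["a"].count)  # 2
```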
In Big Data Mining, there are many open source initiatives. The most popular are the following:
Apache Mahout [5]: scalable machine learning and data mining open source software based mainly on Hadoop. It has implementations of a wide range of machine learning and data mining algorithms: clustering, classification, collaborative filtering, and frequent pattern mining.
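As a reminder of what one of those algorithm families does, here is a minimal plain-Python k-means clustering sketch on one-dimensional points (Mahout implements the same algorithm at scale on Hadoop; this toy version is only illustrative):

```python
import random

# Minimal k-means sketch: alternate between assigning points to the nearest
# center and recomputing each center as the mean of its cluster.

def kmeans(points, k, iters=10):
    random.seed(0)                      # fixed seed for a reproducible demo
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

print([round(c, 3) for c in kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], k=2)])
# [1.0, 9.0]
```

The scalable version distributes the assignment step as a map phase and the center recomputation as a reduce phase.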
MOA [9]: stream data mining open source software to perform data mining in real time. It has implementations of classification, regression, clustering, frequent item set mining, and frequent graph mining. It started as a project of the Machine Learning group at the University of Waikato, New Zealand, famous for the WEKA software. The streams framework [12] provides an environment for defining and running stream processes using simple XML-based definitions and is able to use MOA.
R [16]: open source programming language and software environment designed for statistical computing and
visualization. R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand
beginning in 1993 and is used for statistical analysis of very large data sets.
Vowpal Wabbit [13]: open source project started at Yahoo! Research and continuing at Microsoft Research to design a fast, scalable, useful learning algorithm. VW is able to learn from terafeature datasets, and via parallel learning it can exceed the throughput of any single machine's network interface when doing linear learning.
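Two of the ideas commonly associated with VW's speed, hashing feature names into a fixed-size weight vector and updating weights online with stochastic gradient descent, can be sketched in a few lines. This is a conceptual toy with invented feature names, not the VW implementation:

```python
import math

# Sketch of hashed online logistic regression: feature names are hashed
# into a fixed weight vector, and weights are updated one example at a time.

D = 2 ** 20                 # fixed size of the weight vector
w = [0.0] * D

def predict(features):
    z = sum(w[hash(f) % D] for f in features)
    return 1.0 / (1.0 + math.exp(-z))    # logistic link

def update(features, label, lr=0.5):
    g = predict(features) - label        # gradient of the log loss
    for f in features:
        w[hash(f) % D] -= lr * g

for _ in range(50):                      # tiny toy training loop
    update(["word:spam", "caps:yes"], 1)
    update(["word:hello", "caps:no"], 0)

print(predict(["word:spam", "caps:yes"]) > 0.9)  # True
```

Because the weight vector has a fixed size regardless of how many distinct feature names appear, memory stays bounded even on terafeature data.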
PEGASUS [12]: big graph mining system built on top of MapReduce. It allows users to find patterns and anomalies in massive real-world graphs.
GraphLab [14]: high-level graph-parallel system built without using MapReduce. GraphLab computes over dependent records which are stored as vertices in a large distributed data-graph. Algorithms in GraphLab are expressed as vertex-programs which are executed in parallel on each vertex and can interact with neighboring vertices.
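The vertex-program style can be illustrated with PageRank on a toy three-vertex graph: each vertex repeatedly recomputes its rank from its in-neighbours. The scheduling here is a simple synchronous sweep in plain Python, not the GraphLab runtime or API:

```python
# Vertex-program sketch of PageRank (damping factor 0.85) on a toy graph.
edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # out-links per vertex
rank = {v: 1.0 / len(edges) for v in edges}

def vertex_program(v):
    """Recompute rank(v) from the vertices that link to it."""
    incoming = [u for u, outs in edges.items() if v in outs]
    return (0.15 / len(edges)
            + 0.85 * sum(rank[u] / len(edges[u]) for u in incoming))

for _ in range(30):                      # synchronous sweeps over all vertices
    rank = {v: vertex_program(v) for v in edges}

print(max(rank, key=rank.get))  # c (it collects links from both a and b)
```

In GraphLab the same per-vertex update runs in parallel across a distributed data-graph, with the runtime handling scheduling and neighbour access.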
VII. MINING TECHNIQUES FOR BIG DATA
There are many different types of analysis that can be done in order to retrieve information from big data, and each type of analysis will have a different impact or result. Which data mining technique you should use depends on the type of business problem that you are trying to solve. Different analyses will deliver different outcomes and thus provide different insights.
CONCLUSION
This paper describes the advent of Big Data, its architecture, and its characteristics. We discussed the classification of Big Data according to business needs and how far it will help us in decision making in the business environment. Our future work focuses on the analysis part of big data classification by applying different data mining techniques to it.
REFERENCES
[1] http://www.pro.techtarget.com
[2] Apache Cassandra, http://cassandra.apache.org.
[3] Apache Hadoop, http://hadoop.apache.org.
[4] Apache HBase, http://hbase.apache.org.
[5] Apache Mahout, http://mahout.apache.org.
[6] Apache Pig, http://www.pig.apache.org/.
[7] http://www.webopedia.com/
[8] http://www.ibm.com/library/
[9] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive Online Analysis, http://moa.cms.waikato.ac.nz/. Journal of Machine Learning Research (JMLR), 2010.
[10] Cascading, http://www.cascading.org/.
[11] Facebook Scribe, https://github.com/facebook/scribe.
[12] U. Kang, D. H. Chau, and C. Faloutsos. PEGASUS: Mining Billion-Scale Graphs in the Cloud. 2012.
[13] J. Langford. Vowpal Wabbit, http://hunch.net/vw/, 2011.
[14] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), Catalina Island, California, July 2010.
[15] L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed Stream Computing Platform. In ICDM Workshops, pages 170-177, 2010.
[16] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2012. ISBN 3-900051-07-0.