HOW TO MANAGE IT
DINESH, CSA Deptt., PDMCE, dineshgomber@gmail.com
JITENDER, CSA Deptt., PDMCE, Jitender_13@rediffmail.com
ABSTRACT
Big Data is a new term used to identify datasets of large size
and complexity. We call them Big Data because we cannot manage
them with our current methodologies or data mining software
tools. Big Data mining is the capability of extracting useful
information from these large datasets or streams of data, which
was not possible before because of their volume, variability,
and velocity. The Big Data challenge is becoming one of the most
exciting opportunities for the coming years. We present a broad
overview of the topic and the current status of Big Data mining.
This paper shows the challenges and the tools used to manage
heterogeneous information at the frontier of Big Data mining
research.

INTRODUCTION
Data Mining is an analytic process designed to explore data in
search of consistent patterns and/or systematic relationships
between variables, and then to validate the findings by applying
the detected patterns to new subsets of data. The ultimate goal
of data mining is prediction, and predictive data mining is the
most common type of data mining and the one that has the most
direct business applications. The process of data mining
consists of three stages: (1) the initial exploration, (2) model
building or pattern identification with validation/verification,
and (3) deployment (i.e., the application of the model to new
data in order to generate predictions).

Big Data applications are those where data collection has grown
tremendously and is beyond the capability of commonly used
software tools to capture, manage, and process within a
tolerable elapsed time. The most fundamental challenge for Big
Data applications is to explore the large volumes of data and
extract useful information or knowledge for future actions. In
many situations, the knowledge extraction process has to be very
efficient and close to real time because storing all observed
data is nearly infeasible.

DATA MINING AND BIG DATA
Big data and data mining are two different things. Both of them
relate to the use of large data sets to handle the collection or
reporting of data that serves businesses or other recipients.
However, the two terms are used for two different elements of
this kind of operation.

Big data is a term for a large data set. Big data sets are those
that outgrow the simple kind of database and data handling
architectures that were used in earlier times, when big data was
more expensive and less feasible. For example, sets of data that
are too large to be easily handled in a Microsoft Excel
spreadsheet could be referred to as big data sets.

Data mining refers to the activity of going through big data
sets to look for relevant or pertinent information. This type of
activity is a good example of the old axiom "looking for a
needle in a haystack." The idea is that businesses collect
massive sets of data that may be homogeneous or automatically
collected. Decision-makers need access to smaller, more specific
pieces of data from those large sets. They use data mining to
uncover the pieces of information that will inform leadership
and help chart the course for a business.

Data mining can involve the use of different kinds of software
packages such as analytics tools. It can be automated, or it can
be largely labor-intensive, where individual workers send
specific queries for information to an archive or database.
Generally, data mining refers to relatively sophisticated search
operations that return targeted and specific results. For
example, a data mining tool may look through dozens of years of
accounting information to find a specific column of expenses or
accounts receivable for a specific operating year.

In short, big data is the asset and data mining is the "handler"
that is used to provide beneficial results.

Data Mining Challenges with Big Data
Data is being produced at an ever increasing rate. There has
also been an acceleration in the proportion of machine-generated
and unstructured data (photos, videos, social media feeds and so
on) compared to structured data, such that 80% or more of all
data holdings are now unstructured and new approaches and
technologies are required to access, link, manage and gain
insight from these data sets. The
origin of the term Big Data is due to the fact that we are
creating a huge amount of data every day.

Volume: there is more data than ever before and its size
continues to increase, but the percentage of data that our tools
can process does not.

Variety: there are many different types of data, such as text,
sensor data, audio, video, graph, and more.

Velocity: data is arriving continuously as streams of data, and
we are interested in obtaining useful information from it in
real time.

Visualization. A main task of Big Data analysis is how to
visualize the results. As the data is so big, it is very
difficult to find user-friendly visualizations. New techniques
and frameworks to tell and show stories will be needed.

Hidden Big Data. Large quantities of useful data are getting
lost since new data is largely untagged, file-based and
unstructured data. The 2012 IDC study on Big Data [10] explains
that in 2012, 23% (643 exabytes) of the digital universe would
be useful for Big Data if tagged and analyzed. However,
currently only 3% of the potentially useful data is tagged, and
even less is analyzed.

Apache Hadoop related projects [2]: Apache Pig, Apache Hive,
Apache HBase, Apache ZooKeeper, Apache Cassandra, Cascading,
Scribe and many others.

Apache S4 [3]: platform for processing continuous data streams.
S4 is designed specifically for managing data streams. S4 apps
are designed by combining streams and processing elements in
real time.

Storm [4]: software for streaming data-intensive distributed
applications, similar to S4, and developed by Nathan Marz at
Twitter.

In Big Data Mining, there are many open source initiatives. The
most popular are the following:

Apache Mahout [5]: Scalable machine learning and data mining
open source software based mainly on Hadoop. It has
implementations of a wide range of machine learning and data
mining algorithms: clustering, classification, collaborative
filtering and frequent pattern mining.
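
To make "frequent pattern mining" more concrete, the following is a minimal sketch of level-wise (Apriori-style) itemset counting in plain Python. It does not use Mahout or Hadoop and is not taken from either project. The function name frequent_itemsets, its parameters, and the shopping-basket data are assumptions chosen only for illustration, and the whole dataset is assumed to fit in memory, which is exactly what Big Data tools cannot assume.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Count itemsets that occur in at least `min_support` transactions.

    Hypothetical helper for illustration only. Returns a dict that maps
    each frequent itemset (a frozenset) to its transaction count.
    """
    result = {}

    # Level 1: count single items, once per transaction.
    counts = Counter(item for t in transactions for item in set(t))
    frequent_items = {i for i, c in counts.items() if c >= min_support}
    for item in frequent_items:
        result[frozenset([item])] = counts[item]

    # Levels 2..max_size: only combine items that are still frequent,
    # because a frequent itemset cannot contain an infrequent item.
    for size in range(2, max_size + 1):
        level = Counter()
        for t in transactions:
            kept = sorted(set(t) & frequent_items)
            for combo in combinations(kept, size):
                level[frozenset(combo)] += 1
        new = {s: c for s, c in level.items() if c >= min_support}
        if not new:
            break
        result.update(new)
        # Items that appear in no frequent itemset of this size cannot
        # appear in a larger frequent itemset, so drop them.
        frequent_items = set().union(*new)

    return result


if __name__ == "__main__":
    # Made-up shopping baskets, not data from the paper.
    baskets = [
        ["bread", "milk"],
        ["bread", "butter", "milk"],
        ["butter", "milk"],
        ["bread", "butter"],
    ]
    for itemset, count in frequent_itemsets(baskets, min_support=2).items():
        print(sorted(itemset), count)
```

A distributed implementation would split the counting step across machines and merge the partial counts, which is the kind of work that Hadoop-based tools are designed to handle.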
Conclusion
REFERENCES
[5] Apache Mahout, http://mahout.apache.org.