
Data Mining and its Techniques: A Review Paper
Maria Shoukat (MS Student)
Department of Computer Systems Engineering
University of Engineering and Technology,
Peshawar, Pakistan
marvi_1708@hotmail.com

Abstract—Data mining is the extraction of useful knowledge from large amounts of data. It involves pattern
recognition, making predictions and discovering hidden patterns in data. Data mining is widely used in many
day-to-day applications. With rapidly advancing technology, progress in search algorithms is also setting
milestones in IT and related fields. One of the most important applications of data mining is in the cloud
computing paradigm. This paper covers the basic definition of data mining, its applications and the
methodologies adopted to implement data mining in various technologies.

Index Terms—Data mining, KDD, techniques, AI, ANN, cloud computing.

I. INTRODUCTION

Data is processed and analyzed to extract valuable knowledge. Data-driven discovery and prediction are used in
almost every field, whether it is commerce, medicine or education. In the database community, the term "Data Mining"
appeared in the 1990s. Data mining is a field whose main goal is to make predictions from large data sets and to
discover the most relevant patterns in them. It involves computation over large data sets [7]. For example, Google
shows the most relevant information when a word or query is typed to look something up on the internet. Data mining
is an analysis process that extracts information in an understandable structure for later use. It is of considerable
importance today because of the abundance of available data, and it is directed towards knowledge extraction from
huge amounts of data [4]. Data mining is a mixture of artificial intelligence, statistics and database research. Its
techniques draw on intelligent systems, including AI, database systems, statistics, machine learning and business
intelligence [5]. However, the field of data mining has faced several issues. The first is the development of a
unified theory of data mining, as no single theory completely describes it. Handling high-speed data streams, time
series data and sequence mining emerged as an issue when dealing with big data. Many problems have been observed in
knowledge extraction from complex data, in network settings, in environmental and biological problems and in
multi-agent data [6]. Research has been carried out on issues related to data security, such as privacy and data
integrity. In cloud computing, data is not stored on a single disk for security reasons, so mining in the cloud
involves searching across many disks. Further problems concern the selection of the mining process and the handling
of unbalanced, non-static and cost-sensitive data. This paper presents different methodologies and techniques that
have been adopted to overcome these issues.

II. DEFINITIONS

The term data mining is often misapplied to any form of large-scale data processing, such as data extraction,
collection, analysis or warehousing, but more broadly it is associated with computer decision support systems. In the
1960s, statisticians used the terms "Data Dredging" or "Data Fishing" for data mining [4]. Properly defined, data
mining is the extraction of previously unknown patterns from a large body of data, either automatically or through
semi-automatic analysis. These patterns may be groups of data records (cluster analysis), unusual records (anomaly
detection) or dependencies (association rule mining). Database techniques are usually involved in these processes [7].
David Bolton described data mining as the processing of huge amounts of data already stored in a database in order to
search for patterns and relationships within that data. Gartner, on the other hand, defines data mining as the
discovery of meaningful correlations, trends and unknown patterns by sorting through large volumes of data already
present in a database [2]. Data mining uses pattern recognition technologies together with statistical and
mathematical techniques. It is a part of Knowledge Discovery in Database and is used to solve problems by analyzing
data that is already present in a database. The terms Knowledge Discovery in Database and Data Mining are now used
interchangeably. Two rising technologies, Cloud Computing and Data Mining, are strongly related [9]. Cloud Computing
is a technology that provides shared resources over the internet. Very large databases can easily be stored over the
internet using cloud computing.

III. KNOWLEDGE DISCOVERY IN DATABASE

Knowledge Discovery in Database (KDD) is a process in which implicit, previously unknown and useful information is
extracted from huge amounts of data [1]. The process is illustrated in the following figure:

Fig.1. Knowledge Discovery in Database.

Knowledge Discovery in Database involves five stages [5]:

• Selection: The data on which mining is to be performed is selected from different sources.
• Preprocessing: Data is cleaned by removing unwanted data. This step is also called data cleaning.
• Transformation: After cleaning, data is transformed into a new format suitable for processing.
• Data mining: The desired patterns are identified at this stage.
• Interpretation / evaluation: The results obtained are translated into meaningful information or a report.
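
The five stages above can be sketched as a small pipeline. This is only an illustrative sketch: the records, the field names (customer, item, qty) and the choice of a frequency count as the "mining" step are assumptions made for the example and do not come from the cited sources.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
SALES = [
    {"customer": "a", "item": "Bread", "qty": 2},
    {"customer": "b", "item": "milk ", "qty": 1},
    {"customer": "c", "item": None, "qty": 3},
    {"customer": "d", "item": "Milk", "qty": 2},
]
RETURNS = [{"customer": "a", "item": "Bread", "qty": -1}]


def select(sources):
    """Selection: gather the records on which mining will be performed."""
    return [record for source in sources for record in source]


def preprocess(records):
    """Preprocessing (data cleaning): drop records with missing or invalid values."""
    return [r for r in records if r["item"] is not None and r["qty"] > 0]


def transform(records):
    """Transformation: normalize item names and expand each record into single units."""
    return [r["item"].strip().lower() for r in records for _ in range(r["qty"])]


def mine(items):
    """Data mining: find a pattern, here simply the frequency of each item."""
    return Counter(items)


def interpret(counts):
    """Interpretation / evaluation: turn the result into a readable report."""
    item, count = counts.most_common(1)[0]
    return f"Most purchased item: {item} ({count} units)"


def kdd(sources):
    """Run the five stages in order."""
    return interpret(mine(transform(preprocess(select(sources)))))


report = kdd([SALES, RETURNS])
print(report)
```

Each stage is a separate function so that any one of them, for instance the mining step, can be swapped for a real technique without touching the others.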

IV. TECHNIQUES APPLIED FOR DATA MINING

Data mining creates either a predictive model or a descriptive model from data [2]. A descriptive model describes
the main characteristics of a data set. It produces an essential summary so that the important aspects of the data set
can be studied. Descriptive modeling usually follows an undirected, bottom-up approach: in undirected data mining no
target is specified in advance, and patterns are simply recognized in the data. In predictive modeling, an unknown or
future value of a specific target variable is predicted from the data. Predictive mining is called classification if
the target variable is a discrete label, and regression if the target variable is a real number. Some of the most
basic data mining techniques apply predictive modeling and others apply descriptive modeling. The techniques usually
employed are predictive (classification) techniques such as neural networks and decision trees, and descriptive
techniques such as clustering and association.
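
The distinction between descriptive modeling, regression and classification can be illustrated with a short sketch. The data (hours studied and exam scores) and the pass mark are hypothetical, and a fixed threshold rule stands in for a learned classifier purely to show a discrete-valued target; it is not a technique taken from the cited sources.

```python
from statistics import mean

# Hypothetical data: hours studied (input) and exam score (real-valued target).
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 58.0, 61.0, 70.0, 74.0]


def describe(values):
    """Descriptive model: summarize the main characteristics of the data."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}


def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


def classify(score, pass_mark=60.0):
    """Classification: the target is a discrete label rather than a number."""
    return "pass" if score >= pass_mark else "fail"


summary = describe(scores)                  # descriptive: no target variable
slope, intercept = fit_line(hours, scores)
predicted = slope * 6.0 + intercept         # regression: predict a real number
label = classify(predicted)                 # classification: predict a label
print(summary, round(predicted, 1), label)
```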
A. Association:

One of the most popular techniques in data mining is association. It discovers patterns based on the relationship
between items in the same transaction [1]. In market basket analysis and business, an association rule identifies
which items customers buy together. It uncovers the hidden relationship between the items, which makes it easy to
market them together in future. Based on this information, businesses can run corresponding marketing campaigns to
sell more products and increase profit [3]. For example, suppose we have a record of the daily purchases of a shop,
as shown below:

Fig. 2. Survey Picture

From the above survey, a general association rule can be derived:

Fig. 3. Association Rule.

The rule derived above shows that there is a strong relationship between the purchase of beer and diapers [3]. This
is how association is applied in business. Such rules rest on the prediction that these items will also be in demand
together in future, which is helpful in making business strategies. Applications of association include market
basket data analysis, loss-leader analysis, cross marketing and catalog design.
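
The strength of a rule such as {Diapers} → {Beer} is commonly quantified by its support (how often the items occur together) and its confidence (how often the consequent appears when the antecedent does). The sketch below computes both; the five transactions are illustrative, assumed data and are not the contents of Fig. 2.

```python
# Illustrative transactions (assumed data; not taken from Fig. 2).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing antecedent, the fraction also containing consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


rule_support = support({"Diapers", "Beer"}, transactions)
rule_confidence = confidence({"Diapers"}, {"Beer"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule is usually kept only if both values exceed thresholds chosen by the analyst; the thresholds themselves are a business decision rather than a property of the data.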

a) Types of association rules: Different types of association rules [4] are created based on:

• Types of values handled
  → Boolean association rules
  → Quantitative association rules

• Levels of abstraction involved
  → Single-level association rules
  → Multilevel association rules

• Dimensions of data involved
  → Single-dimensional association rules
  → Multidimensional association rules

B. Clustering:

Clustering is the grouping of items so that items with similar characteristics fall in the same category, while items
with dissimilar characteristics are assigned to other groups. Many clusters can be formed in this way for a large
number of items or objects. Each cluster differs from the others, but items within the same cluster are similar to
each other. Clustering is also relevant to the cloud [9], for example in a scenario where data is stored in multiple
databases in the cloud. Consider the example of a library with a large collection of books and topics [1]. The task
is to make it easy for readers to find their required topic among a wide range of books and articles. Clustering
techniques organize the books and topics by forming clusters and putting similar content together, so readers can
simply go to the relevant area and find their required topic. The figure below shows the clustering process:

Fig. 4. Clustering
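
As a concrete illustration of grouping similar items, the sketch below uses k-means, one common clustering algorithm. The paper does not name a specific algorithm, so this choice, the sample points and the starting centroids are all assumptions made for the example.

```python
from math import dist
from statistics import mean


def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then move each centroid to its group's mean."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(mean(axis) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

The starting centroids are fixed here so the result is repeatable; in practice they are often chosen at random, and the number of clusters has to be decided in advance.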

C. Neural Networks:

A neural network is originally a biological system that detects patterns and makes predictions; neural networks are
also created artificially [2]. The application of neural networks to real-world problems has made a major difference
in the field of AI. Neural networks are applicable to problems such as customer response prediction, fraud detection
and theft detection. In data mining, neural networks work by predicting the relationships present in huge amounts of
data, which increases business intelligence across a wide range of business applications. The models produced by
neural networks are so complex that they are not easy to understand, even for experts. The Artificial Neural Network
(ANN) is a relatively recent technology in the IT industry and a powerful tool for detecting patterns. ANNs are also
used in decision-making problems, for making predictions based on unknown relationships, and for clustering. An ANN
first adapts itself to the system; this is called the training phase. It takes variable data from the system to learn
how to perform, and once it is ready, it operates with fixed parameters. ANNs are used when there is a huge amount of
data and the problem statement is too complex. In these situations an ANN can give accurate results, and its
non-linearity gives it considerable flexibility in doing so. In a neural network, special nodes called hidden nodes
are present between the input and output nodes. They have no predefined meaning and are invisible to the end user
[4]. These hidden nodes are used for feature extraction and for making predictions. An ANN lets its users specify the
network topology, performance parameters, learning rule and stopping criteria.

Fig. 5. Neural Network with hidden Nodes.
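
The role of hidden nodes and non-linearity can be seen in a very small network. In the sketch below the weights are hand-picked values standing in for the result of a training phase (an assumption for illustration, not a trained model), and the sigmoid activation is likewise an assumed choice. The network computes exclusive-or, which no single linear node can represent.

```python
from math import exp


def sigmoid(x):
    """Non-linear activation that squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-x))


def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum of inputs plus bias, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hand-picked (assumed) parameters standing in for the result of training.
HIDDEN_W = [[20.0, 20.0], [20.0, 20.0]]  # two hidden nodes, two inputs each
HIDDEN_B = [-10.0, -30.0]                # node 1 behaves like OR, node 2 like AND
OUTPUT_W = [[20.0, -20.0]]               # one output node fed by both hidden nodes
OUTPUT_B = [-10.0]


def predict(x1, x2):
    """Forward pass: inputs -> hidden nodes -> output node."""
    hidden = layer([x1, x2], HIDDEN_W, HIDDEN_B)  # hidden values are never shown to the caller
    return layer(hidden, OUTPUT_W, OUTPUT_B)[0]


for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, round(predict(a, b)))
```

Once the parameters are fixed, prediction is only this forward pass; the training phase described above is what would normally produce the weights.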

D. Decision Trees:

A decision tree is a tree structure in which each internal node represents a test, each branch represents a possible
outcome of that test, and each leaf represents a class distribution. It is based on predictive modeling and falls
into the category of classification. Decision trees divide the input space into cells, and each cell represents a
class. On the basis of this partition a sequence of tests is generated. Starting from the root node, the test at each
node is applied to the input and the branch matching the outcome is followed; this continues until a leaf node is
reached, which gives the class. In this way a specific input is classified by performing tests from the root node
down to a leaf node [2].

Decision trees determine the course of action and show the probability of an event statistically. From a business
point of view, decision trees work by segmenting the data and then using these segments to make predictions. The
predictive segments also show characteristics that can be helpful in forming business strategies and in building
understandable models.

Fig. 6. Decision Tree
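
The root-to-leaf classification walk can be sketched as follows. The tree itself (age and income tests predicting a customer's response to a campaign) is hypothetical and hand-built; in practice the tests would be learned from data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Leaf:
    label: str  # class returned when this leaf is reached


@dataclass
class Node:
    test: Callable[[dict], str]                # maps a record to an outcome name
    branches: Dict[str, Union["Node", Leaf]]   # one subtree per outcome


def classify(tree, record):
    """Apply tests from the root downward until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.branches[tree.test(record)]
    return tree.label


# Hypothetical tree predicting whether a customer responds to a campaign.
tree = Node(
    test=lambda r: "young" if r["age"] < 30 else "older",
    branches={
        "young": Leaf("responds"),
        "older": Node(
            test=lambda r: "high" if r["income"] >= 50000 else "low",
            branches={"high": Leaf("responds"), "low": Leaf("ignores")},
        ),
    },
)

print(classify(tree, {"age": 45, "income": 20000}))
```

Each path from the root to a leaf corresponds to one segment of the data, which is why such models are comparatively easy to read as business rules.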

V. APPLICATIONS

Data mining has a wide range of applications, whether in education [5], medicine or science. In the education
sector, for example, data mining is used to categorize students on the basis of certain criteria. In finance,
business and market analysis, the benefits of data mining cannot be ignored. Data mining also has practical
applications in bioinformatics, telecommunication, earthquake prediction, agriculture and cloud computing.

Data Mining in Cloud:

Data mining techniques and applications are a core element of the cloud computing environment. With the expansion of
the cloud in business and scientific research, data mining has come into the limelight for this purpose. In the
cloud, data mining refers to extracting information over the internet from structured or unstructured web sources.
Cloud computing centralizes an organization's resources while promising secure and efficient services for clients.
The cost of implementing data mining techniques in the cloud is relatively low, which benefits its users. The cost of
infrastructure is also reduced by the use of data mining tools in the cloud [8]. Together, data mining techniques and
cloud computing can help businesses maximize profit and reduce cost.

Fig. 7. Transferring data from one server to another server through data mining.

VI. CONCLUSION

This paper gives a basic review of the mining techniques that have been developed so far. It also describes the
problems in data mining and its applications in day-to-day life. The aim of data mining is to extract knowledge and
recognize patterns in huge amounts of active data. The applications of data mining use classification, clustering,
association, genetic algorithms (GA), prediction and similar techniques. AI algorithms have brought a revolution in
the field of data mining, and their wide range of applications is even more striking. An ANN is fast, self-organizing
and adaptive. It enables parallel processing and distributed storage and can give highly accurate results. These
characteristics make it well suited to data mining. Data is currently growing very rapidly, which makes data mining
an important area of research that cannot be neglected. Most commercial, educational and scientific applications
depend on data mining technologies, and it is likely that more advancements and new techniques will be introduced in
this field in future.

REFERENCES

[1] K. M. Raval, Data Mining Techniques, 1st ed. India, 2012.

[2] N. Jain and V. Srivastava, Data Mining Techniques: A Survey Paper, 1st ed. India, 2013.

[3] P.-N. Tan, M. Steinbach and V. Kumar, "Introduction to data mining: Association analysis",
SearchBusinessAnalytics, 2006. [Online]. Available:
http://searchbusinessanalytics.techtarget.com/feature/Introduction-to-data-mining-Association-analysis.
[Accessed: 03-May-2016].

[4] M. V. Joseph, L. Sadath and V. Rajan, Data Mining: A Comparative Study on Various Techniques and Methods,
1st ed. Oman, 2013.

[5] P. Sharma, Use of Data Mining in Various Field: A Survey Paper, 1st ed. India, 2014.

[6] Q. Yang and X. Wu, 10 Challenging Problems in Data Mining Research, 1st ed. World Scientific Publishing
Company, 2006.

[7] L. Rokach and O. Maimon, Data Mining with Decision Trees. Singapore: World Scientific, 2008.

[8] Ruxandra-Ştefania Petre, Data Mining in Cloud Computing, Bucharest Academy of Economic Studies.

[9] A. V. R. K. Harsha Vardhan Varma, A. Srinivas and M. Kalyan Srinivas, "A Study on Cloud Computing Data Mining",
International Journal of Innovative Research in Computer and Communication Engineering, vol. 1, issue 5, July 2013.
