
DATA MINING AND WAREHOUSING

Abstract:
Many software projects accumulate a great deal of data, so we
need effective ways to maintain that data and retrieve it from the
database. Two prominent technologies that address these concerns are data mining and
data warehousing.
Data Mining is the process of automated extraction of predictive
information from large databases. It predicts future trends and finds behavior that the
experts may miss as it lies beyond their expectations. Data Mining is part of a larger
process called knowledge discovery, specifically, the step in which advanced statistical
analysis and modeling techniques are applied to the data to find useful patterns and
relationships.

Data warehousing takes a relatively simple idea and incorporates it into
the technological underpinnings of a company. The idea is that a unified view of all data
that a company collects will help improve operations. If hiring data can be combined with
sales data, the idea is that it might be possible to discover and exploit patterns in the
combined entity.

This paper presents an overview of the different processes and advanced
techniques involved in data mining and data warehousing.

Key words:
Introduction, concepts of data mining and warehousing, process, architecture, techniques,
users and activities, various applications, conclusion.

1. Introduction to Data Mining:


Data mining can be defined as "a decision support process in which we
search for patterns of information in data." This search may be done by the user alone, i.e.
simply by performing queries, but in that case it is quite hard and in most cases not
comprehensive enough to reveal intricate patterns. Data mining uses sophisticated
statistical analysis and modeling techniques to uncover such patterns and relationships
hidden in organizational databases - patterns that ordinary methods might miss. Once
found, the information needs to be presented in a suitable form, with graphs, reports, etc.
1.1 Data Mining Processes
From a process-oriented view, there are three classes of data mining
activity: discovery, predictive modeling and forensic analysis, as shown in the figure below.
Discovery is the process of looking in a database to find hidden patterns without a
predetermined idea or hypothesis about what the patterns may be. In other words, the
program takes the initiative in finding what the interesting patterns are, without the user
thinking of the relevant questions first.

In predictive modeling, patterns discovered from the database are used to predict
the future. Predictive modeling thus allows the user to submit records with some
unknown field values, and the system will guess the unknown values based on previous
patterns discovered from the database. While discovery finds patterns in data, predictive
modeling applies the patterns to guess values for new data items.
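As an illustration of guessing an unknown field from earlier records, the following is a minimal sketch and not a method prescribed by this paper. It assumes that a "pattern" is simply the most common value among past records that agree on the known fields; the function name `predict_field`, the field names and the sample data are hypothetical.

```python
from collections import Counter

def predict_field(history, record, target):
    """Guess record[target] as the most common value among matching past records."""
    # Fields of the new record whose values are known (assumed: None means unknown).
    known = {k: v for k, v in record.items() if k != target and v is not None}
    matches = [r[target] for r in history
               if all(r.get(k) == v for k, v in known.items())]
    if not matches:
        # No past record matches: fall back to the overall majority value.
        matches = [r[target] for r in history]
    return Counter(matches).most_common(1)[0][0]

# Hypothetical campaign history used as the source of patterns.
history = [
    {"region": "north", "segment": "retail", "responded": "yes"},
    {"region": "north", "segment": "retail", "responded": "yes"},
    {"region": "north", "segment": "retail", "responded": "no"},
    {"region": "south", "segment": "retail", "responded": "no"},
]

new_record = {"region": "north", "segment": "retail", "responded": None}
print(predict_field(history, new_record, "responded"))
```

Real predictive models generalize beyond exact matches, but the division of labour is the same: patterns come from past data and are then applied to new records.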
Forensic analysis:
This is the process of applying the extracted patterns to find anomalous or
unusual data elements. To discover the unusual, we first find what the norm is, and then
we detect those items that deviate from the norm by more than a given threshold. Discovery
helps us find "usual knowledge," but forensic analysis looks for unusual and specific
cases.
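The idea of establishing a norm and then flagging deviations can be sketched as follows. This is one possible reading, under the assumption that the norm is the mean of the data and that deviation is measured in standard deviations (a z-score); the function name `find_anomalies`, the default threshold and the sample values are hypothetical.

```python
from statistics import mean, stdev

def find_anomalies(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)        # the "norm"
    sigma = stdev(values)    # typical spread around the norm
    if sigma == 0:
        return []            # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical daily transaction totals with one unusual day.
daily_totals = [102, 98, 101, 97, 103, 99, 100, 250]
print(find_anomalies(daily_totals))  # → [250]
```

A single extreme value also inflates the mean and standard deviation, so robust measures such as the median are often preferred in practice.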
1.2 Data Mining Users and Activities
Data mining activities are usually performed by three different classes of users -
executives, end users and analysts.
• Executives need top-level insights and spend far less time with computers than the
other groups.
• End users are sales people, market researchers, scientists, engineers, physicians,
etc.
• Analysts may be financial analysts, statisticians, consultants, or database
designers.
These users usually perform three types of data mining activity within a
corporate environment: episodic, strategic and continuous data mining.
In episodic mining we look at data from one specific episode such as a specific
direct marketing campaign. We may try to understand this data set, or use it for
prediction on new marketing campaigns. Analysts usually perform episodic mining.
In strategic mining we look at larger sets of corporate data with the intention of gaining
an overall understanding of specific measures such as profitability.
In continuous mining we try to understand how the world has changed within a
given time period and try to gain an understanding of the factors that influence change.

1.3 Data Mining Applications:


Virtually any process can be studied, understood, and improved using data
mining. The top three end uses of data mining are, not surprisingly, in the marketing area.
Data mining can find patterns in a customer database that can be applied to a
prospect database so that customer acquisition can be appropriately targeted. For
example, by identifying good candidates for mail offers or catalogs, direct-mail marketers
can reduce expenses and increase their sales. Targeting specific promotions to existing
and potential customers offers similar benefits.
Market-basket analysis helps retailers understand which products are purchased
together or by an individual over time. With data mining, retailers can determine which
products to stock in which stores, and even how to place them within a store. Data mining
can also help assess the effectiveness of promotions and coupons.
Another common use of data mining in many organizations is to help manage
customer relationships. By determining the characteristics of customers who are likely to
leave for a competitor, a company can take action to retain those customers, which
is usually far less expensive than acquiring new ones.
Fraud detection is of great interest to telecommunications firms, credit-card
companies, insurance companies, stock exchanges, and government agencies. The
aggregate total for fraud losses is enormous. But with data mining, these companies can
identify potentially fraudulent transactions and contain the damage.
Financial companies use data mining to determine market and industry
characteristics as well as predict individual company and stock performance. Another
interesting niche application is in the medical field: Data mining can help predict the
effectiveness of surgical procedures, diagnostic tests, medications, service management,
and process control.
1.4 Data Mining Techniques:
Data mining has three major components: clustering or classification,
association rules, and sequence analysis.

1.4.1 Classification:
Clustering techniques analyze a set of data and generate a set of grouping
rules that can be used to classify future data. The mining tool automatically identifies the
clusters by studying the patterns in the training data. Once the clusters are generated,
classification can be used to identify the cluster to which a given input belongs. For
example, one may classify diseases and provide the symptoms that describe each class
or subclass.
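The two stages described above (find clusters in training data, then assign a new input to one of them) can be sketched with k-means, which is used here only as a familiar stand-in; the paper does not name a specific algorithm. The function names, the starting centroids and the sample points are assumptions.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def k_means(points, centroids, iterations=10):
    """Plain k-means: alternate between assigning points and updating centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster received no points
                centroids[i] = tuple(sum(c) / len(group) for c in zip(*group))
    return centroids

# Hypothetical two-dimensional training data forming two visible groups.
training = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids = k_means(training, centroids=[training[0], training[3]])

# Classification step: assign new inputs to the nearest learned cluster.
print(nearest((2.0, 1.0), centroids))  # → 0
print(nearest((7.5, 8.0), centroids))  # → 1
```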
1.4.2 Association:
An association rule is a rule that implies certain association relationships
among a set of objects in a database. In this process we discover a set of association rules
at multiple levels of abstraction from the relevant set(s) of data in a database. For
example, one may discover a set of symptoms often occurring together with certain kinds
of diseases and further study the reasons behind them.
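Association rules are commonly scored by support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both for the symptom example; the function names and the patient records are hypothetical, and the code assumes the antecedent occurs at least once.

```python
def support(records, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(items <= r for r in records) / len(records)

def confidence(records, antecedent, consequent):
    """Estimated probability of `consequent` given `antecedent`."""
    both = set(antecedent) | set(consequent)
    return support(records, both) / support(records, antecedent)

# Hypothetical patient records: symptoms together with a diagnosis.
records = [
    {"fever", "cough", "flu"},
    {"fever", "flu"},
    {"cough", "cold"},
    {"fever", "cough", "flu"},
    {"fever", "cold"},
]

print(support(records, {"fever", "flu"}))       # share of records with both
print(confidence(records, {"fever"}, {"flu"}))  # strength of the rule fever -> flu
```

Here three of five records contain both fever and flu, and three of the four fever records also contain flu, so the rule has support 3/5 and confidence 3/4.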
1.4.3 Sequential Analysis:
In sequential analysis, we seek to discover patterns that occur in sequence. This
deals with data that appear in separate transactions (as opposed to data that appear in the
same transaction, as in the case of association). For example, a shopper buys item A in the
first week of the month and then buys item B in the second week.
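A minimal sketch of this distinction follows: it counts the customers who bought item A and then item B in a strictly later transaction, so buying both in the same transaction does not count. The function name `sequence_support` and the purchase histories are hypothetical.

```python
def sequence_support(histories, first, then):
    """Fraction of customers who bought `first` and, in a later transaction, `then`."""
    def follows(history):
        seen_first = False
        for basket in history:  # baskets are assumed to be in time order
            if seen_first and then in basket:
                return True
            if first in basket:
                seen_first = True
        return False
    return sum(follows(h) for h in histories.values()) / len(histories)

# Hypothetical purchase histories, one list of transactions per customer.
histories = {
    "c1": [{"A"}, {"B"}],
    "c2": [{"A", "B"}],               # same transaction: not a sequence
    "c3": [{"B"}, {"A"}],             # wrong order
    "c4": [{"A"}, {"C"}, {"B", "C"}],
}

print(sequence_support(histories, "A", "B"))  # → 0.5
```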
1.4.4 Neural Nets and Decision Trees:
For any given problem, the nature of the data will affect the techniques you
choose. Consequently, you'll need a variety of tools and technologies to find the best
possible model. Classification models are among the most common, so the more popular
ways of building them are explained here. Classification typically involves at
least one of two workhorse statistical techniques - logistic regression (a generalization of
linear regression) and discriminant analysis. However, as data mining becomes more
common, neural nets and decision trees are also getting more consideration. Although
complex in their own way, these methods require less statistical sophistication on the part
of the user.
Neural nets use many parameters (the nodes in the hidden layer) to build a model
that takes and combines a set of inputs to predict a continuous or categorical variable.
Source: "Introduction to Data Mining and Knowledge Discovery" by "Two Crows Corporation"

The value from each hidden node is a function of the weighted sum of the
values from all the preceding nodes that feed into it. The process of building a model
involves finding the connection weights that produce the most accurate results by
"training" the neural net with data. The most common training method is
backpropagation, in which the output result is compared with known correct values. After
each comparison, the weights are adjusted and a new result is computed. After enough
passes through the training data, the neural net can become a good predictor.
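To make the weighted-sum computation concrete, here is a minimal sketch of the forward pass only; it does not implement backpropagation. The sigmoid activation, the bias terms, the function names and the weight values are all assumptions chosen for illustration.

```python
import math

def sigmoid(x):
    """Squash a weighted sum into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """Each node outputs sigmoid(weighted sum of all its inputs + bias)."""
    return [sigmoid(sum(w * x for w, x in zip(node_w, inputs)) + b)
            for node_w, b in zip(weights, biases)]

def predict(inputs, hidden_w, hidden_b, out_w, out_b):
    """Feed the inputs through one hidden layer and a single output node."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, out_w, out_b)[0]

# Hypothetical weights; in practice training would adjust these values.
score = predict(
    [1.0, 0.5],
    hidden_w=[[0.4, -0.2], [0.3, 0.8]],
    hidden_b=[0.0, -0.1],
    out_w=[[1.0, -1.0]],
    out_b=[0.0],
)
print(score)
```

Training would repeatedly compare such an output with the known correct value and adjust the weights to reduce the difference.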
Decision trees represent a series of rules that lead to a class or value. For
example, you may wish to classify loan applicants as good or bad credit risks. The figure
below shows a simple decision tree that solves this problem. Armed with this tree and a
loan application, a loan officer could determine whether an applicant is a good or bad
credit risk. An individual with "Income > $40,000" and "High Debt" would be classified
as a "Bad Risk," whereas an individual with "Income < $40,000" and "Job > 5 Years"
would be classified as a "Good Risk."
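The two paths described above can be written as nested rules. The sketch below is illustrative only: the function name `credit_risk` and its parameters are hypothetical, and the two branches the text does not describe (high income with low debt, and lower income with five or fewer years in the job) are assumed to give the opposite outcome. An income of exactly $40,000 is assumed to follow the lower-income branch.

```python
def credit_risk(income, high_debt, years_in_job):
    """Classify a loan applicant by walking a small decision tree."""
    if income > 40_000:
        # Stated in the text: high income with high debt is a bad risk.
        # Assumed: high income without high debt is a good risk.
        return "Bad Risk" if high_debt else "Good Risk"
    # Stated in the text: lower income with more than 5 years in the job is a good risk.
    # Assumed: lower income with 5 or fewer years in the job is a bad risk.
    return "Good Risk" if years_in_job > 5 else "Bad Risk"

print(credit_risk(income=55_000, high_debt=True, years_in_job=2))   # → Bad Risk
print(credit_risk(income=30_000, high_debt=False, years_in_job=8))  # → Good Risk
```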

Decision trees have become very popular because they are reasonably accurate and,
unlike neural nets, easy to understand. Decision trees also take less time to build than
neural nets. Neural nets and decision trees can also be used to perform regressions, and
some types of neural nets can even perform clustering.
2.1 Introduction to Data warehousing:
In the current knowledge economy, information is widely recognized
as the key to gaining competitive advantage. Organizations know very well
that the vital information for decision making lies in their databases. Mountains
of data are accumulating in various databases scattered around the enterprise. But
the key to gaining competitive advantage lies in deriving insight and intelligence from
these data. Data warehousing helps in integrating, categorizing, codifying and arranging
the data from all parts of an enterprise.
According to Bill Inmon, known as the father of data warehousing, the concept
of a data warehouse is as depicted in the figure.

A data warehouse is a:
• Subject-oriented
• Integrated
• Time-variant
• Nonvolatile
collection of data in support of management's decisions.

2.1.1 Subject oriented data:


All relevant data about a subject is gathered and stored as a single set in a useful
format.
2.1.2 Integrated data:
Data is stored in a globally accepted fashion with consistent naming conventions,
measurements, encoding structures and physical attributes, even when the underlying
operational systems store the data differently.
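As a sketch of what integration means in practice, the code below maps records from two hypothetical operational systems onto one warehouse convention. The field names (`sex`, `gender`, `height_in`, `height_cm`), the code table and the chosen target units are assumptions made for illustration.

```python
# Assumed mapping from the encodings used by different sources to one warehouse code.
GENDER_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}

def to_warehouse(record):
    """Return a record using the warehouse's field names, codes and units."""
    # Naming and encoding: sources call the field 'gender' or 'sex' and code it differently.
    raw = str(record.get("gender", record.get("sex", ""))).strip().lower()
    # Measurement: the warehouse stores centimetres; one source reports inches.
    if "height_cm" in record:
        height_cm = float(record["height_cm"])
    else:
        height_cm = float(record["height_in"]) * 2.54
    return {"gender": GENDER_CODES.get(raw), "height_cm": round(height_cm, 2)}

# The same person as stored by two hypothetical operational systems.
from_system_a = {"sex": "male", "height_in": 70}
from_system_b = {"gender": 1, "height_cm": 177.8}

print(to_warehouse(from_system_a))
print(to_warehouse(from_system_b))
```

Both source records produce the same warehouse record, which is the property that lets later queries combine data from the two systems.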

2.1.3 Non-volatile data:


The data warehouse is read-only: data is loaded into the data warehouse and
accessed there, rather than updated in place.
2.1.4 Time-variant data:
Warehouse data covers a long time horizon, typically 5 to 10 years, as opposed to
the 30-60 days typically held as operational data.
2.2 Structure of data warehouse:

The design of the data architecture is probably the most critical part of a data
warehousing project. The key is to plan for growth and change, as opposed to trying to
design the perfect system from the start. The design of the data architecture involves
understanding all of the data and how different pieces are related. For example, payroll
data might be related to sales data by the ID of the sales person, while the sales data
might be related to customers by the customer ID. By connecting these two relationships,
payroll data could be related to customers (e.g., which employees have ties to which
customers).
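The chain of relationships in this example can be sketched with an in-memory SQLite database. The table names, column names and rows below are hypothetical; only the two join keys (salesperson ID and customer ID) come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payroll   (salesperson_id INTEGER, name TEXT, salary REAL);
    CREATE TABLE sales     (sale_id INTEGER, salesperson_id INTEGER,
                            customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, customer_name TEXT);
    INSERT INTO payroll   VALUES (1, 'Ana', 50000), (2, 'Ben', 45000);
    INSERT INTO customers VALUES (10, 'Acme'), (11, 'Globex');
    INSERT INTO sales     VALUES (100, 1, 10, 1200), (101, 1, 11, 300), (102, 2, 10, 800);
""")

# payroll -> sales (by salesperson ID) -> customers (by customer ID)
rows = conn.execute("""
    SELECT DISTINCT p.name, c.customer_name
    FROM payroll p
    JOIN sales s     ON s.salesperson_id = p.salesperson_id
    JOIN customers c ON c.customer_id = s.customer_id
    ORDER BY p.name, c.customer_name
""").fetchall()

print(rows)  # → [('Ana', 'Acme'), ('Ana', 'Globex'), ('Ben', 'Acme')]
```

Payroll and customer data share no key directly; the sales table is what connects them, which is why the relationships must be understood before the warehouse is designed.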

Once the data architecture has been designed, you can then consider the kinds of
reports that you are interested in. You might want to see a breakdown of employees by
region, or a ranked list of customers by revenue. These kinds of reports are fairly simple.
The power of a data warehouse becomes more obvious when you want to look at links
between data associated with disparate parts of an organization (e.g., HR, accounts
payable, and project management).
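One of the simple reports mentioned above, a ranked list of customers by revenue, can be sketched as follows. The function name `rank_customers` and the sales figures are hypothetical.

```python
from collections import defaultdict

def rank_customers(sales):
    """Total the revenue per customer and return (customer, total) pairs, largest first."""
    totals = defaultdict(float)
    for customer, amount in sales:
        totals[customer] += amount
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical sales records: (customer, amount).
sales = [("Acme", 1200.0), ("Globex", 300.0), ("Acme", 800.0), ("Initech", 950.0)]

for customer, total in rank_customers(sales):
    print(customer, total)
```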

2.3 Benefits of Data warehousing:

• Cost avoidance benefits.
• Higher productivity.
• Benefits through better analytical capability.
• Manage business complexity.
• Leverage on their existing investments.
• End user spending.
• Spending on e-business.
• Accessibility and ease of use.
• Real time information and analysis.

2.4 Techniques used by different organizations for efficient data warehousing:

Most decisions to build data warehouses are driven by non-HR
needs. Over the past decade, back office (supply chain) and front office (sales and
marketing) organizations have spearheaded the creation of large corporate data
warehouses. Improving the efficiency of the supply chain and competing for customers
both rely on the tactical uses that a data warehouse can provide. The key for other
organizations, including HR, is to be involved in the creation of the warehouse so that
their needs can be met by any resulting system. A warehouse is usually built because both the
data volume and question complexity have grown beyond what the current systems can
handle. At that point the business becomes limited by the information that users can
reasonably extract from the data system.

3. Conclusion:

Data mining offers great promise in helping organizations uncover hidden
patterns in their data. However, data mining tools must be guided by users who
understand the business, the data, and the general nature of the analytical methods
involved. Realistic expectations can yield rewarding results across a wide range of
applications, from improving revenues to reducing costs.
Building models is only one step in knowledge discovery. It's vital to collect and
prepare the data properly and to check models against the real world. The "best" model is
often found after building models of several different types and by trying out various
technologies or algorithms.
The data mining area is still relatively young, and tools that support the whole of
the data mining process in an easy-to-use fashion are rare. One of the most
important issues facing researchers is the application of these techniques to very large data sets.
Most of the mining techniques are based on Artificial Intelligence, where they are generally
executed against small sets of data that fit in memory. In data mining
applications, however, these techniques must be applied to data held in very large databases.
Approaches to this problem include the use of parallelism and the development of new
database-oriented techniques.
However, much work is required before data mining can be successfully applied to large
data sets. Only then can the true potential of data mining be realized.
Data warehousing is an important concept for software professionals who must
manage large and complex data efficiently. A data warehouse is a
repository (or archive) of information gathered from multiple sources, stored under a
unified schema, at a single site. Once gathered, the data are stored for a long time,
permitting access to historical data. Thus, data warehouses provide the user a single
consolidated interface to data, making decision support actions easier to implement. In
a world of highly interconnected networks, the data obtained or used by many
companies is very large, and its maintenance becomes difficult and costly. An
efficient data warehouse should therefore be implemented to gather data from different branches
(all over the world) and maintain it so that information is available to all other branches
(including those that do not hold the data concerned).

