DATA MINING AND WAREHOUSING
Abstract:
Many software projects accumulate a great deal of data, so we need
effective ways to maintain and retrieve data from the database. Data mining and
data warehousing are two technologies that address these concerns.
Data Mining is the process of automated extraction of predictive
information from large databases. It predicts future trends and finds behavior that the
experts may miss as it lies beyond their expectations. Data Mining is part of a larger
process called knowledge discovery, specifically, the step in which advanced statistical
analysis and modeling techniques are applied to the data to find useful patterns and
relationships.
This paper presents an overview of the processes and advanced
techniques involved in data mining and data warehousing.
Key words:
Introduction, concepts of data mining and warehousing, process, architecture, techniques,
uses and activities, various applications, conclusion.
1.4.1 Classification:
Clustering techniques analyze a set of data and generate a set of grouping
rules that can be used to classify future data. The mining tool automatically identifies the
clusters by studying the patterns in the training data. Once the clusters are generated,
classification can be used to identify the cluster to which an input belongs. For
example, one may classify diseases and provide the symptoms that describe each class
or subclass.
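The two steps above (form clusters from training data, then assign a new input to one of them) can be sketched as follows. This is a minimal illustration that assumes a simple k-means procedure; the text does not name a specific clustering algorithm, and the function names and sample points are hypothetical.

```python
import math

def nearest(point, centroids):
    """Return the index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, centroids, passes=10):
    """Group training points into clusters around the given starting centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(passes):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

# Hypothetical two-dimensional training data with two visible groups.
training = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids = kmeans(training, centroids=[(0.0, 0.0), (10.0, 10.0)])

# Classification step: find the cluster a new input belongs to.
cluster_of_new_input = nearest((7.5, 8.4), centroids)
```

Here `cluster_of_new_input` is the index of the learned cluster closest to the new point; a real mining tool would also choose the number of clusters and the starting centroids itself.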
1.4.2 Association:
An association rule is a rule that implies certain association relationships
among a set of objects in a database. In this process we discover a set of association rules
at multiple levels of abstraction from the relevant set(s) of data in a database. For
example, one may discover a set of symptoms often occurring together with certain kinds
of diseases and further study the reasons behind them.
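How strongly a rule such as "these symptoms occur together with this disease" holds is commonly measured by its support and confidence. The sketch below assumes those two standard measures, which the text itself does not name; the patient records are invented for illustration only.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical records: each set lists symptoms and a diagnosis.
records = [
    {"fever", "cough", "flu"},
    {"fever", "cough", "flu"},
    {"fever", "rash", "measles"},
    {"cough", "cold"},
    {"fever", "cough", "cold"},
]

# Rule under test: {fever, cough} -> {flu}
rule_support = support(records, {"fever", "cough", "flu"})
rule_confidence = confidence(records, {"fever", "cough"}, {"flu"})
```

A mining tool would typically enumerate many candidate rules and keep only those whose support and confidence exceed chosen thresholds.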
1.4.3 Sequential Analysis:
In sequential analysis, we seek to discover patterns that occur in sequence. This
deals with data that appear in separate transactions (as opposed to data that appear in the
same transaction, as in the case of association). For example, a shopper buys item A in the
first week of the month and then buys item B in the second week.
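The shopper example can be sketched as a count of ordered item pairs across each customer's separate transactions. This is a simplified illustration, not a full sequential pattern mining algorithm; the shopper histories and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations

def sequential_pairs(history):
    """For each ordered pair (a, b), count the customers who bought item a
    in an earlier transaction than item b."""
    counts = Counter()
    for transactions in history.values():
        seen = set()
        # combinations() preserves list order, so `earlier` precedes `later`.
        for earlier, later in combinations(transactions, 2):
            for a in earlier:
                for b in later:
                    if a != b:
                        seen.add((a, b))
        counts.update(seen)  # each customer counts once per pattern
    return counts

# Hypothetical purchase histories: one set of items per weekly transaction.
history = {
    "shopper1": [{"A"}, {"B"}],
    "shopper2": [{"A", "C"}, {"B"}],
    "shopper3": [{"B"}, {"A"}],
}
pattern_counts = sequential_pairs(history)
```

In this data the pattern "A, then B" is supported by two shoppers, while "B, then A" is supported by one.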
1.4.4 Neural Nets and Decision Trees:
For any given problem, the nature of the data will affect the techniques you
choose. Consequently, you'll need a variety of tools and technologies to find the best
possible model. Classification models are among the most common, so the more popular
ways for building them have been explained here. Classifications typically involve at
least one of two workhorse statistical techniques - logistic regression (a generalization of
linear regression) and discriminant analysis. However, as data mining becomes more
common, neural nets and decision trees are also getting more consideration. Although
complex in their own way, these methods require less statistical sophistication on the part
of the user.
Neural nets use many parameters (the nodes in the hidden layer) to build a model
that takes and combines a set of inputs to predict a continuous or categorical variable.
Source: "Introduction to Data Mining and Knowledge Discovery" by "Two Crows Corporation"
The value from each hidden node is a function of the weighted sum of the
values from all the preceding nodes that feed into it. The process of building a model
involves finding the connection weights that produce the most accurate results by
"training" the neural net with data. The most common training method is
backpropagation, in which the output result is compared with known correct values. After
each comparison, the weights are adjusted and a new result computed. After enough
passes through the training data, the neural net typically becomes a very good predictor.
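The training loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: one hidden layer, a sigmoid function at each node, a squared-error comparison, and a toy data set; the layer size, learning rate, and all names are hypothetical choices, not taken from the text.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_weights(n_inputs, n_hidden, seed=0):
    """Random starting weights; the extra weight per node acts as a bias."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs + 1)] for _ in range(n_hidden)]
    output = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    return hidden, output

def forward(inputs, hidden_weights, output_weights):
    """Each node's value is a function of the weighted sum of the nodes feeding it."""
    biased = inputs + [1.0]
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, biased))) for ws in hidden_weights]
    output = sigmoid(sum(w * h for w, h in zip(output_weights, hidden + [1.0])))
    return hidden, output

def mean_squared_error(samples, hidden_weights, output_weights):
    return sum((forward(x, hidden_weights, output_weights)[1] - t) ** 2
               for x, t in samples) / len(samples)

def train(samples, hidden_weights, output_weights, passes=2000, rate=0.5):
    """Adjust the weights in place after comparing each output with its target."""
    for _ in range(passes):
        for inputs, target in samples:
            hidden, output = forward(inputs, hidden_weights, output_weights)
            delta_out = (output - target) * output * (1.0 - output)
            # Propagate the error back to the hidden layer before
            # the output weights are changed.
            for j, h in enumerate(hidden):
                delta_hidden = delta_out * output_weights[j] * h * (1.0 - h)
                for i, x in enumerate(inputs + [1.0]):
                    hidden_weights[j][i] -= rate * delta_hidden * x
            for j, h in enumerate(hidden + [1.0]):
                output_weights[j] -= rate * delta_out * h

# Hypothetical toy data: the target is 1 when either input is 1.
samples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
hidden_weights, output_weights = init_weights(n_inputs=2, n_hidden=3)
error_before = mean_squared_error(samples, hidden_weights, output_weights)
train(samples, hidden_weights, output_weights)
error_after = mean_squared_error(samples, hidden_weights, output_weights)
```

Comparing `error_before` with `error_after` shows the effect of the repeated passes: the error on the training data falls as the weights are adjusted.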
Decision trees represent a series of rules that lead to a class or value. For
example, you may wish to classify loan applicants as good or bad credit risks. The figure
below shows a simple decision tree that solves this problem. Armed with this tree and a
loan application, a loan officer could determine whether an applicant is a good or bad
credit risk. An individual with "Income > $40,000" and "High Debt" would be classified
as a "Bad Risk," whereas an individual with "Income < $40,000" and "Job > 5 Years"
would be classified as a "Good Risk."
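The loan example can be written as a short series of rules. Only two paths through the tree are stated in the text; the outcomes for the other two branches, and the handling of an income of exactly $40,000, are assumptions made here to complete the sketch. The function name and parameters are hypothetical.

```python
def classify_applicant(income, high_debt, years_in_job):
    """Walk a small decision tree and return a credit risk label."""
    if income > 40_000:
        # Stated in the text: income > $40,000 and high debt -> "Bad Risk".
        # Assumption: low debt on this branch -> "Good Risk".
        return "Bad Risk" if high_debt else "Good Risk"
    # Stated in the text: income < $40,000 and job > 5 years -> "Good Risk".
    # Assumptions: 5 years or less -> "Bad Risk"; an income of exactly
    # $40,000 follows this lower-income branch.
    return "Good Risk" if years_in_job > 5 else "Bad Risk"

first = classify_applicant(income=55_000, high_debt=True, years_in_job=2)
second = classify_applicant(income=30_000, high_debt=False, years_in_job=8)
```

Each `if` corresponds to one node of the tree, which is why a loan officer can follow the same logic by hand.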
Decision trees have become very popular because they are reasonably accurate and,
unlike neural nets, easy to understand. Decision trees also take less time to build than
neural nets. Neural nets and decision trees can also be used to perform regressions, and
some types of neural nets can even perform clustering.
2.1 Introduction to Data warehousing:
In the current knowledge economy, information is the key to gaining
competitive advantage. Organizations know very well that the vital information
for decision making lies in their databases. Mountains
of data are accumulating in various databases scattered around the enterprise. But
the key to gaining competitive advantage lies in deriving insight and intelligence out of
these data. Data warehousing helps in integrating, categorizing, codifying and arranging
the data from all parts of an enterprise.
According to Bill Inmon, known as the father of data warehousing, the concept
of a data warehouse is as depicted in the figure.
[Figure: Data warehouse]
A data warehouse is a:
• Subject oriented
• Integrated
• Time variant
• Nonvolatile
collection of data in support of management's decisions.
The design of the data architecture is probably the most critical part of a data
warehousing project. The key is to plan for growth and change, as opposed to trying to
design the perfect system from the start. The design of the data architecture involves
understanding all of the data and how different pieces are related. For example, payroll
data might be related to sales data by the ID of the sales person, while the sales data
might be related to customers by the customer ID. By connecting these two relationships,
payroll data could be related to customers (e.g., which employees have ties to which
customers).
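The chain of relationships in this example (payroll to sales by salesperson ID, sales to customers by customer ID) can be sketched as two joins. The snippet below uses an in-memory SQLite database; the table names, column names, and rows are all hypothetical and only illustrate the linkage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payroll (salesperson_id INTEGER PRIMARY KEY, employee TEXT, salary REAL);
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE sales (
        sale_id INTEGER PRIMARY KEY,
        salesperson_id INTEGER,  -- links a sale to payroll
        customer_id INTEGER,     -- links a sale to a customer
        amount REAL
    );
    INSERT INTO payroll VALUES (1, 'Asha', 52000), (2, 'Ben', 48000);
    INSERT INTO customers VALUES (10, 'Acme'), (11, 'Globex');
    INSERT INTO sales VALUES (100, 1, 10, 900), (101, 1, 11, 400), (102, 2, 10, 250);
""")

# Connect the two relationships: payroll -> sales -> customers.
employee_customer_links = conn.execute("""
    SELECT DISTINCT p.employee, c.customer
    FROM payroll AS p
    JOIN sales AS s ON s.salesperson_id = p.salesperson_id
    JOIN customers AS c ON c.customer_id = s.customer_id
    ORDER BY p.employee, c.customer
""").fetchall()
conn.close()
```

The result lists which employees have ties to which customers, even though the payroll and customer tables share no column directly.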
Once the data architecture has been designed, you can then consider the kinds of
reports that you are interested in. You might want to see a breakdown of employees by
region, or a ranked list of customers by revenue. These kinds of reports are fairly simple.
The power of a data warehouse becomes more obvious when you want to look at links
between data associated with disparate parts of an organization (e.g., HR, accounts
payable, and project management).
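The two simple reports mentioned above, a breakdown of employees by region and a ranked list of customers by revenue, can be sketched with a count and a sorted total. The records and names below are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical warehouse extracts.
employees = [("Asha", "North"), ("Ben", "South"), ("Chen", "North")]
sales = [("Acme", 900.0), ("Globex", 400.0), ("Acme", 250.0), ("Initech", 700.0)]

# Report 1: number of employees in each region.
employees_by_region = Counter(region for _, region in employees)

# Report 2: customers ranked by total revenue, highest first.
revenue = defaultdict(float)
for customer, amount in sales:
    revenue[customer] += amount
customers_by_revenue = sorted(revenue.items(), key=lambda item: item[1], reverse=True)
```

Both reports read from a single subject area; reports that link several areas would first need joins across the related tables.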
That being said, most decisions to build data warehouses are driven by non-HR
needs. Over the past decade, back office (supply chain) and front office (sales and
marketing) organizations have spearheaded the creation of large corporate data
warehouses. Improving the efficiency of the supply chain and competition for customers
rely on the tactical uses that a data warehouse can provide. The key for other
organizations, including HR, is to be involved in the creation of the warehouse so that
their needs can be met by any resulting system. This usually happens because both the
data volume and question complexity have grown beyond what the current systems can
handle. At that point the business becomes limited by the information that users can
reasonably extract from the data system.
3. Conclusion: