Data Mining
Compiled by:
Patel Dvipal (422)
Shah Pooja(433)
4th SEM, C.E
GANDHINAGAR
PREFACE
“You have no choice but to operate in a world shaped by globalization and
the information revolution. There are two options: adapt or die.”
-Andy Grove, Chairman, Intel
The last few years have seen a growing recognition of information as a key
business tool. Those who successfully gather, analyze, understand, and act upon the
information are among the winners in this new “information age”.
Data Mining has become very important for users. Though Data Mining is not a magic wand, it can find the “hidden” information in your database, and it can predict the market as well as customer behavior to a certain level of accuracy. Data Mining is also able to draw results from multiple databases, which may reside in different DBMSs or belong to different companies.
We have tried to give a detailed introduction to Data Mining, its main features, and the development of Data Mining systems.
We hope that after reading this report one can better understand Data Mining and can develop applications, that is, Data Mining software, that provide facilities for mining a database in the real sense. In the real sense, such software should also have some Artificial Intelligence, to detect models or methods for Data Mining.
INDEX
CONTENTS
ABSTRACT
INTRODUCTION TO DATA MINING
LEARNING FROM PAST MISTAKES?
INTRODUCTION TO DATA WAREHOUSES
DATA MINING AND DATA WAREHOUSING
ABSTRACT
Data Mining gains its name, and to some degree its popularity, by playing
off the idea that the data you have stored is much like a “mountain”, and that
buried within the mountain (just as buried within your data) are certain “gems” of great
value. The problem is that there are also lots of non-valuable rocks and rubble in the
mountain that need to be mined through and discarded in order to get to what is
valuable. The trick is that for both mountains of rock and mountains of data you need
some power tools to unearth the value. For rock, this means earthmovers and
dynamite; for data, this means powerful computers and data mining software.
The databases involved can be global, and there may be more than one database on
different DBMSs, but Data Mining can draw on all of them and give you the results
you want. This process surfaces information from the database that may not be
directly visible.
Data Mining can give results such as combinations or specific
characteristics of customers, products, or processes that are useful for further work.
In that sense, there is some Artificial Intelligence in Data Mining.
Data Mining is a tool that can give your data intelligence for any
particular model or task. Building Data Mining software becomes easy if you go
through the proper steps.
Databases today can range in size into the terabytes - more than
1,000,000,000,000 bytes of data. Within these masses of data lies hidden information of
strategic importance. But when there are so many trees, how do you draw meaningful
conclusions about the forest?
The newest answer is Data Mining, which is being used both to increase
revenues and to reduce costs. The potential returns are enormous. Innovative
organizations worldwide are already using data mining to locate and appeal to higher-
value customers, to reconfigure their product offerings to increase sales, and to minimize
losses due to error or fraud.
Most companies already collect and refine massive quantities of data. Data
mining techniques can be implemented rapidly on existing software and hardware
platforms to enhance the value of existing information resources, and can be integrated
with new products and systems as they are brought on-line. When implemented on high
performance client/server or parallel processing computers, data mining tools can analyze
massive databases to deliver answers to questions such as, "Which clients are most likely
to respond to my next promotional mailing, and why?"
The first and simplest analytical step in data mining is to describe the data:
summarize its statistical attributes (such as means and standard deviations), visually
review it using charts and graphs, and look for potentially meaningful links among
variables (such as values that often occur together). Collecting, exploring, and selecting
the right data are critically important.
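This descriptive step can be sketched in a few lines of Python. The figures below are invented purely for illustration:

```python
import math
import statistics

# Hypothetical monthly spend ($) and call minutes for five customers
spend = [120.0, 85.5, 240.0, 60.0, 310.0]
minutes = [410, 300, 880, 150, 1020]

# Summarize statistical attributes: mean and standard deviation
print("mean spend:", statistics.mean(spend))
print("std  spend:", statistics.stdev(spend))

# Look for a potentially meaningful link among variables: Pearson correlation
def pearson(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print("spend/minutes correlation:", pearson(spend, minutes))
```

A correlation near 1 here would flag spend and minutes as values that "often occur together", exactly the kind of link this step is meant to surface.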
But data description alone cannot provide an action plan. You must build a
predictive model based on patterns determined from known results, then test the model on
results outside the original sample. A good model should never be confused with reality,
but it can be a useful guide to understanding your business.
The final step is to empirically verify the model. For example, from a
database of customers who have already responded to a particular offer, you may have built a
model predicting which prospects are likeliest to respond to the same offer. Can you rely on
this prediction?
“Those who cannot remember the past are condemned to repeat it.”
-G. Santayana
Data Mining works the same way as a human being does. It uses historical
information (experience) to learn from the past. However, in order for the data mining
technology to pull the “gold” out of your database, you do have to tell it what the gold
looks like (what business problem you would like to solve). It then uses the description of
that “gold” to look for similar examples in the database, and uses these pieces of
information from the past to develop a predictive model of what will happen in the future.
Data mining does not replace skilled business analysts or managers, but
rather gives them a powerful new tool to improve the job they are doing. Any company
that knows its business and its customers is already aware of many important, high-payoff
patterns that its employees have observed over the years. What data mining can do is
confirm such empirical observations and find new, subtle patterns that yield steady
incremental improvement.
The data to be mined is first extracted from the enterprise data warehouse into a data
mining database or data mart. There is a real benefit if your data is already part of a data
warehouse. The problems of cleansing data for a data warehouse and for data mining
are very similar; if the data has already been cleansed for the data warehouse, it most
likely will not need further cleaning in order to be mined.
The data mining database may be a logical rather than a physical subset of your data
warehouse, provided that the data warehouse DBMS can support the additional resource
demands of data mining. If it cannot, then you will be better off with a separate data
mining database.
The goal of a data warehouse is to support decision making with data. Data mining
can be used in conjunction with it to help with certain types of decisions. Data mining can be
applied to operational databases with individual transactions, but to make data mining more
efficient, the data warehouse should have aggregated or summarized the collection of
data. Data mining helps extract meaningful new patterns that cannot necessarily be found
by merely querying or processing data in the data warehouse. Data mining
applications should therefore be considered early, during the design of the data
warehouse. Also, data mining tools should be designed to facilitate their use in
conjunction with data warehouses. In fact, for very large databases running into terabytes
of data, successful use of data mining applications will depend first
on the construction of a data warehouse.
OLAP is part of the spectrum of decision support tools. Traditional query and
report tools describe what is in a database. OLAP goes further; it is used to answer why
certain things are true. The user forms a hypothesis about a relationship and verifies it
with a series of queries against the data. For example, an analyst might want to determine
the factors that lead to loan defaults. He or she might initially hypothesize that people
with low incomes are bad credit risks and analyze the database with OLAP to verify (or
disprove) this assumption. If that hypothesis were not borne out by the data, the analyst
might then look at high debt as the determinant of risk. If the data did not support this
guess either, he or she might then try debt and income together as the best predictor of
bad credit risks.
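The deductive loop above amounts to running a query per hypothesis. A minimal sketch, with invented loan records and field names:

```python
# Hypothetical loan records: income band and whether the loan defaulted
loans = [
    {"income": "low",  "defaulted": True},
    {"income": "low",  "defaulted": False},
    {"income": "low",  "defaulted": True},
    {"income": "high", "defaulted": False},
    {"income": "high", "defaulted": False},
    {"income": "high", "defaulted": True},
]

# The analyst's query: default rate per income band, to test the hypothesis
def default_rate(band):
    rows = [r for r in loans if r["income"] == band]
    return sum(r["defaulted"] for r in rows) / len(rows)

# If low-income borrowers default more often, the hypothesis is supported;
# otherwise the analyst moves to the next guess (debt, debt plus income, ...)
print("low :", default_rate("low"))
print("high:", default_rate("high"))
```

Each new guess means another query of this shape, which is why the approach bogs down as the number of candidate variables grows.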
In other words, the OLAP analyst generates a series of hypothetical patterns and
relationships and uses queries against the database to verify or disprove them.
OLAP analysis is essentially a deductive process. But what happens when the number of
variables being analyzed is in the dozens or even hundreds? It becomes much more difficult
and time-consuming to find a good hypothesis (let alone be confident that there is not a
better explanation than the one found) and to analyze the database with OLAP to verify or
disprove it.
Data mining is different from OLAP because rather than verify hypothetical
patterns, it uses the data itself to uncover such patterns. It is essentially an inductive
process. For example, suppose the analyst who wanted to identify the risk factors for
loan default were to use a data mining tool. The data mining tool might discover that
people with high debt and low incomes were bad credit risks (as above), but it might go
further and also discover a pattern the analyst did not think to try, such as that age is also
a determinant of risk.
Here is where data mining and OLAP can complement each other. Before acting
on the pattern, the analyst needs to know what the financial implications would be of
using the discovered pattern to govern who gets credits. The OLAP tool can allow the
analyst to answer those kinds of questions. Furthermore, OLAP is also complementary in
the early stages of the knowledge discovery process because it can help you explore your
data, for instance by focusing attention on important variables, identifying exceptions, or
finding interactions. This is important because the better you understand your data, the more
effective the knowledge discovery process will be.
Data mining techniques are the result of a long process of research and product
development. This evolution began when business data was first stored on computers,
continued with improvements in data access, and more recently, generated technologies
that allow users to navigate through their data in real time. Data mining takes this
evolutionary process beyond retrospective data access and navigation to prospective and
proactive information delivery. Data mining is ready for application in the business
community because it is supported by three technologies that are now sufficiently mature:
• Massive data collection
• Powerful multiprocessor computers
• Data mining algorithms
Evolutionary step: Data Mining (Emerging Today)
Business question: "What's likely to happen to Boston unit sales next month? Why?"
Enabling technologies: Advanced algorithms, multiprocessor computers, massive databases
Product providers: Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry)
Characteristics: Prospective, proactive information delivery
To best apply these advanced techniques, they must be fully integrated with a data
warehouse as well as flexible interactive business analysis tools. Many data mining tools
currently operate outside of the warehouse, requiring extra steps for extracting,
importing, and analyzing the data. Furthermore, when new insights require operational
implementation, integration with the warehouse simplifies the application of results from
data mining. The resulting analytic data warehouse can be applied to improve business
processes throughout the organization, in areas such as promotional campaign
management, fraud detection, new product rollout, and so on. Figure 1 illustrates an
architecture for advanced analysis in a large data warehouse.
A data warehouse is a central repository for data. There are three basic
architectures for constructing a data warehouse. In the first type there is only a central location
to store data, which we call the data warehouse's physical storage medium. In this type of
construction, data is gathered from heterogeneous data sources, such as different types of
files, local database systems, and other external sources.
As the data is stored in a central place, access to it is very easy and simple. The
disadvantage of this construction is a loss of performance.
In the second type of construction, data is decentralized. The data is not stored
physically together, but logically it is consolidated in the data warehouse environment. In this
construction, department-wise and site-wise data is stored at its local site. Local
applications and other generated data are stored in local databases, but information about the
data, called metadata (data about data), is stored at a central site. The local databases can
also maintain their metadata locally for their local work as well as for the central site. A local
database with its metadata is called a "Data Mart".
An advantage of this architecture is that the logical data warehouse is only virtual.
The central data warehouse does not store any actual data, only information about the data, so any user
who wants to access data can query the central site, and the central site prepares the
resulting data for the user. The entire process of collecting data from the physical databases is
transparent to the user.
The third and last type of construction creates a hierarchical view of data. Here the central
data warehouse also stores actual data, and data marts on the next level store a copy or
summary of the physical central data warehouse. Local data marts store only the data
related to their local site.
The advantages of the distributed and hierarchical constructions are that (1) retrieval time of data
from the data warehouse is lower and (2) the volume of data is also reduced. Since data is
integrated through metadata, anyone from anywhere can access data, and processing is
divided among different physical machines. For better response of data retrieval, scalable data
As discussed above, you can physically design your data warehouse using any of the
three construction types. But integrating data in a data warehouse requires procedures
such as data extraction and data migration, data cleansing / data scrubbing, and data
integration.
To extract data from operational databases, files, and other external sources, extraction
tools are required. This process should be detailed and documented correctly. If it
is not properly documented, it will create problems during integration with
other data and also create difficulties at a later stage. So data extraction should provide a high
level of integration and produce efficient metadata for the data warehouse.
Data migration is the task of converting data from one system to another. It should
provide type checking of integrity constraints in the data warehouse. It should also find
inconsistencies and missing values while converting metadata for the entire process, so one can
easily identify problems in the migration process.
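A toy sketch of such migration-time checks, with invented field names and rows:

```python
# Hypothetical rows arriving from a legacy system during migration
rows = [
    {"cust_id": "1001", "balance": "250.75"},
    {"cust_id": "1002", "balance": ""},       # missing value
    {"cust_id": "abc",  "balance": "99.10"},  # type inconsistency
]

# Type checking and missing-value detection before loading the warehouse
problems = []
for i, row in enumerate(rows):
    if not row["cust_id"].isdigit():
        problems.append((i, "cust_id is not numeric"))
    try:
        float(row["balance"])
    except ValueError:
        problems.append((i, "balance is missing or non-numeric"))

print(problems)
```

Logging problems by row index, as here, is what lets you identify exactly where the migration went wrong instead of discovering bad values later inside the warehouse.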
A data warehouse collects data from heterogeneous sources in the organization. These data are
integrated in such a manner that any end user can access them very easily. To facilitate
end users, the DWA (Data Warehouse Administrator) must know the right approach to the
warehouse. Data must be collected from different operating systems, different networks,
different application files (such as C, COBOL, and FORTRAN files) and different operational
databases. So the first step is to design a platform on which we can access data from
every system and put them together in a warehouse. Before transferring data from one
system to another, the data must be standardized. This standard always relates to the format of
the data, the structure of the data, and the information collected.
How exactly is data mining able to tell you important things that you didn't
know or what is going to happen next? The technique that is used to perform these feats
in data mining is called modeling. Modeling is simply the act of building a model in one
situation where you know the answer and then applying it to another situation that you
don't. For instance, if you were looking for a sunken Spanish galleon on the high seas the
first thing you might do is to research the times when Spanish treasure had been found by
others in the past. You might note that these ships often tend to be found off the coast of
Bermuda and that there are certain characteristics to the ocean currents, and
certain routes that have likely been taken by the ship’s captains in that era. You note these
similarities and build a model that includes the characteristics that are common to the
locations of these sunken treasures. With these models in hand you sail off looking for
treasure where your model indicates it most likely might be given a similar situation in
the past. Hopefully, if you've got a good model, you find your treasure.
This act of model building is thus something that people have been doing for
a long time, certainly before the advent of computers or data mining technology. What
happens on computers, however, is not much different than the way people build models.
Computers are loaded up with lots of information about a variety of situations where an
answer is known and then the data mining software on the computer must run through
that data and distill the characteristics of the data that should go into the model. Once the
model is built it can then be used in similar situations where you don't know the answer.
For example, say that you are the director of marketing for a telecommunications
company and you'd like to acquire some new long distance phone customers. You could
just randomly go out and mail coupons to the general population - just as you could
randomly sail the seas looking for sunken treasure. In neither case would you achieve the
results you desired and of course you have the opportunity to do much better than random
- you could use your business experience stored in your database to build a model.
As the marketing director you have access to a lot of information about all
of your customers: their age, sex, credit history and long distance calling usage. The good
news is that you also have a lot of information about your prospective customers: their
age, sex, credit history etc. Your problem is that you don't know the long distance calling
usage of these prospects (since they are most likely now customers of your competition).
You'd like to concentrate on those prospects who have large amounts of long distance
usage. You can accomplish this by building a model. Table 2 illustrates the data used for
building a model for new customer prospecting in a data warehouse.
                                                     Customers   Prospects
General information (e.g. demographic data)          Known       Known
Proprietary information (e.g. customer transactions) Known       Target
This model could then be applied to the prospect data to try to tell something
about the proprietary information that this telecommunications company does not
currently have access to. With this model in hand new customers can be selectively
targeted.
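One way to sketch this is a tiny nearest-neighbour model: usage is known for customers, and the model transfers it to the most similar prospect. All names and figures are invented:

```python
# Known customers: (age, good_credit) -> monthly long-distance usage ($)
customers = [
    ((25, 1), 55.0),
    ((60, 1), 20.0),
    ((30, 0), 15.0),
    ((28, 1), 60.0),
]

# Prospects: demographics known, long-distance usage unknown (the "Target")
prospects = [(27, 1), (58, 0)]

# A deliberately tiny model: predict the usage of the most similar customer
def predict(p):
    def distance(c):
        (age, credit), _usage = c
        return abs(age - p[0]) + 40 * abs(credit - p[1])
    _profile, usage = min(customers, key=distance)
    return usage

scores = {p: predict(p) for p in prospects}
print(scores)
```

With these scores in hand, the marketer would mail only the high-predicted-usage prospects rather than the whole list.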
Test marketing is an excellent source of data for this kind of modeling. Mining the results
of a test market representing a broad but relatively small sample of prospects can provide
a foundation for identifying good prospects in the overall market. Table 3 shows another
common scenario for building models: predict what is going to happen in the future.
If someone told you that he had a model that could predict customer usage how
would you know if he really had a good model? The first thing you might try would be to
ask him to apply his model to your customer base - where you already knew the answer.
With data mining, the best way to accomplish this is by setting aside some of your data in
a vault to isolate it from the mining process. Once the mining is complete, the results can
be tested against the data held in the vault to confirm the model’s validity. If the model
works, its observations should hold for the vaulted data.
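The vault idea can be sketched in a few lines. The data is synthetic and the "model" is just a threshold search, but the shape of the procedure is the same:

```python
# Synthetic labeled records: (attribute value, known response)
records = [(x, x > 50) for x in range(100)]

# Set aside every third record in a "vault" the mining step never sees
vault = records[::3]
training = [r for i, r in enumerate(records) if i % 3 != 0]

# "Mine" the training data: pick the threshold that best separates responders
best_t = max(range(101), key=lambda t: sum((x > t) == y for x, y in training))

# Validate against the vault: the model's observations should hold there too
vault_accuracy = sum((x > best_t) == y for x, y in vault) / len(vault)
print(best_t, vault_accuracy)
```

Because the vault was isolated from mining, a high vault accuracy is evidence the model found a real pattern rather than memorizing the training sample.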
• Prediction: Data mining can show how certain attributes within the data will
behave in the future. Examples of predictive data mining include analyzing
buying transactions to predict what consumers will buy under a certain discount,
how much sales volume a store would generate in a given period, and whether deleting a
product line would yield more profits; here business logic is coupled with data
mining. In a scientific context, certain seismic wave patterns may predict an
earthquake with high probability.
The solution to these problems is the tight integration of Data Mining and
Campaign Management technologies. Integration is crucial in two areas:
First, the Campaign Management software must share the definition of the defined
campaign segment with the Data Mining application to avoid modeling the entire
database. For example, a marketer may define a campaign segment of high-income males
between the ages of 25 and 35 living in the northeast. Through the integration of the two
applications, the Data Mining application can automatically restrict its analysis to
database records containing just those characteristics.
Second, selected scores from the resulting predictive model must flow
seamlessly into the campaign segment in order to form targets with the highest profit
potential.
This section examines how to apply the integration of Data Mining and Campaign
Management to benefit the organization. The first step creates a model using a Data
Mining tool. The second step takes this model and puts it to use in the production
environment of an automated database marketing campaign.
With dynamic scoring, records are scored in the Campaign
Management tool rather than in the Data Mining tool. Dynamic scoring both avoids
mundane, repetitive manual chores and eliminates the need to score an entire database.
Instead, dynamic scoring scores only the relevant customer subsets, and only when needed.
Scoring only the relevant customer subset and eliminating the manual process
shrinks cycle times. Scoring data only when needed assures "fresh," up-to-date results.
Once a model is in the Campaign Management system, a user (usually someone other
than the person who created the model) can start to build marketing campaigns using the
predictive models. Models are invoked by the Campaign Management System.
In this example:
• Length of service = 9 limits the application of the
model to those customers in the ninth month of
their 12-month contracts, thus targeting customers
only at the most vulnerable time. (In reality, there
is likely a variety of contract lengths; consider
this when formulating the selection criteria.)
• Average balance > 150 selects only customers
spending, on average, more than $150 each month.
The marketer deemed that it would be unprofitable to
send the offer to less valuable customers.
• Promo9 is the name of a logged predictive model
that was created with a Data Mining application.
This criterion includes a threshold score, 0.80,
which a customer must surpass to be considered "in
the model." This third criterion limits the campaign
to just those customers in the model, i.e. those
with scores above the threshold.
Ideally, marketers who build campaigns should be able to apply any model logged in the
Campaign Management system to a defined target segment. For example, a marketing
manager at a cellular telephone company might be interested in high-value customers
likely to switch to another carrier. This segment might be defined as customers who are
nine months into a twelve-month contract, and whose average monthly balance is more
than $150.
The easiest approach to retain these customers is to offer all of them a new
high-tech telephone. However, this is expensive and wasteful since many customers
would remain loyal without any incentive.
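The three selection criteria from the example combine into a simple filter. A sketch over invented customer rows, with "promo9" standing in for the logged model score:

```python
# Hypothetical customer rows with a logged model score ("promo9")
customers = [
    {"id": 1, "months_in_service": 9, "avg_balance": 210.0, "promo9": 0.91},
    {"id": 2, "months_in_service": 9, "avg_balance": 180.0, "promo9": 0.55},
    {"id": 3, "months_in_service": 4, "avg_balance": 300.0, "promo9": 0.88},
    {"id": 4, "months_in_service": 9, "avg_balance": 95.0,  "promo9": 0.97},
]

# The three criteria: month 9 of the contract, average balance over $150,
# and a model score above the 0.80 threshold
target = [c["id"] for c in customers
          if c["months_in_service"] == 9
          and c["avg_balance"] > 150
          and c["promo9"] > 0.80]
print(target)  # → [1]
```

Only customer 1 satisfies all three criteria; the others fail on score, contract month, or balance respectively, so only high-value, at-risk, likely-to-respond customers get the expensive offer.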
For marketers:
• Improved campaign results through the use of model scores that further
refine customer and prospect segments.
Records can be scored when campaigns are ready to run, allowing the use of the
most recent data. "Fresh" data and the selection of "high" scores within defined
market segments improve direct marketing results.
• Accelerated marketing cycle times that reduce costs and increase the
likelihood of reaching customers and prospects before competitors.
Scoring takes place only for records defined by the customer segment, eliminating
the need to score an entire database. This is important to keep pace with
continuously running marketing campaigns with tight cycle times.
Accelerated marketing "velocity" also increases the number of opportunities used
to refine and improve campaigns. The end of each campaign cycle presents
another chance to assess results and improve future campaigns.
• Increased accuracy through the elimination of manually induced errors. The
Campaign Management software determines which records to score and
when.
For statisticians:
• Less time spent on the mundane tasks of extracting and importing files, leaving
more time for creative work: building and interpreting models. Statisticians have a
greater impact on the corporate bottom line.
As a database marketer, you understand that some customers present much greater profit
potential than others. But, how will you find those high-potential customers in a database
that contains hundreds of data items for each of millions of customers?
Data Mining software can help find the "high-profit" gems buried in mountains of
information. However, merely identifying your best prospects is not enough to improve
results. Instead, to reduce costs and improve results, the marketer could use a predictive model to
select only those valuable customers who would likely defect to a competitor unless they
receive the offer.
Here is a process for extracting hidden knowledge from your data warehouse, your
customer information file, or any other company database.
Before you begin, be clear on what you hope to accomplish with your
analysis. Know in advance the business goal of the data mining. Establish whether or
not the goal is measurable. Some possible goals are to:
- Find sales relationships between specific products or services
- Identify specific purchasing patterns over time
- Identify potential types of customers
- Find product sales trends.
Once you have defined your goal, your next step is to select the data to
meet this goal. This may be a subset of your data warehouse or a data mart that
contains specific product information. It may be your customer information file.
Narrow the scope of the data to be mined as much as possible.
Here are some key issues:
- Are the data adequate to describe the phenomena the data mining analysis
is attempting to model?
- Can you enhance internal customer records with external lifestyle and
demographic data?
- Are the data stable? Will the mined attributes be the same after the analysis?
- If you are merging databases, can you find a common field for linking them?
- How current and relevant are the data to the business goal?
Once you’ve assembled the data, you must decide which attributes to
convert into usable formats. Consider the input of domain experts: the creators and
users of the data.
- Establish strategies for handling missing data, extraneous noise, and outliers
- Identify redundant variables in the dataset and decide which fields to exclude
- Decide on a log or square transformation, if necessary
- Visually inspect the dataset to get a feel for the database
- Determine the distribution frequencies of the data
You can postpone some of these decisions until you select a data-
mining tool. For example, if you need a neural network or polynomial
network, you may have to transform some of your fields.
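The preparation decisions above can be sketched on invented purchase data. The missing-value, outlier, and transformation strategies here are examples, not the only reasonable choices:

```python
import math
import statistics

# Hypothetical purchase amounts with a missing value and an extreme outlier
raw = [12.0, None, 15.0, 9.0, 14.0, 2500.0]

# Strategy for missing data: drop it (imputation is another option)
values = [v for v in raw if v is not None]

# Crude outlier handling: cap anything above 10x the median
cap = 10 * statistics.median(values)
clipped = [min(v, cap) for v in values]

# Log transformation to tame the skew before modeling
transformed = [math.log(v) for v in clipped]
print(clipped)
```

Visual inspection and distribution frequencies would normally guide which of these steps are actually needed for a given field.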
At this point the data mining processing begins. Usually the
first step is to use a random number seed to split the data into a training set and a
test set, and to construct and evaluate a model. The generation of classification rules,
decision trees, clustering sub-groups, scores, code, weights, and evaluation data/error
rates takes place at this stage. Resolve these issues:
- Are error rates at an acceptable level? Can you improve them?
- What extraneous attributes did you find? Can you purge them?
- Is additional data or a different methodology necessary?
- Will you have to train and test a new data set?
Share and discuss the results of the analysis with the business
client or domain expert. Ensure that the findings are correct and appropriate to the
business objectives.
- Do the findings make sense?
- Do you have to return to any prior steps to improve results?
- Can you use other data mining tools to replicate the findings?
Although data mining tools automate database analysis, they can lead
to faulty findings and erroneous conclusions if you’re not careful. Bear in mind that data
mining is a business process with a specific goal: to extract a competitive insight from
historical records in a database.
As the table below shows, a random sampling of the full customer/prospect database
produces a loss regardless of the campaign target size. However, by targeting customers
using a Data Mining model, the marketer can select a smaller target that includes a higher
percentage of good prospects. This more focused approach generates a profit until the
target becomes too large and includes too many poor prospects.
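The economics can be sketched with invented numbers: a $1 mailing cost, a $25 margin per responder, 2% good prospects in the full database, and an assumed 5x lift from the model on a small target:

```python
# Invented campaign economics
COST_PER_PIECE = 1.0   # $ to mail one offer
MARGIN = 25.0          # $ earned per responder
BASE_RATE = 0.02       # 2% good prospects in the full database

def profit(target_size, hit_rate):
    responders = target_size * hit_rate
    return responders * MARGIN - target_size * COST_PER_PIECE

# Random sampling responds at the base rate and loses money;
# a model that concentrates good prospects (assumed 5x lift) turns a profit
random_profit = profit(20_000, BASE_RATE)
model_profit = profit(20_000, 5 * BASE_RATE)
print(random_profit, model_profit)
```

As the target grows, the model's hit rate decays toward the base rate, which is why the focused approach eventually stops being profitable.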
The Data Mining Suite works directly on large SQL repositories with no need
for sampling or extract files. It accesses large volumes of multi-table relational
data on the server, incrementally discovers powerful patterns and delivers
automatically generated English text and graphs as explainable documents on
the intranet.
The Data Mining Suite is based on a solid foundation with a total vision for
decision support. The three-tiered, server-based implementation provides
highly scalable discovery on huge SQL databases with well over 90% of the
computations performed directly on the server, in parallel if desired.
The Data Mining Suite pioneered the fusion of OLAP and
data mining. This dramatic new approach forever changed the way
corporations use decision support. No longer are OLAP and data mining
viewed as separate activities; they are fused to deliver maximum benefit. The
patterns discovered by the system include multi-dimensional influences and
contributions, OLAP affinities and associations, comparisons, trends and
variations. The richness of these patterns delivers unparalleled business
benefits to users, allowing them to make better decisions than ever before.
The Data Mining Suite also pioneered the use of incremental pattern-
base population. With incremental data mining, the system automatically
These truly unique products are all designed to work together, and in concert
with the Knowledge Access Suite™.
under which stronger item groupings take place. The Affinity Discovery
System includes a number of useful features that make it a unique industrial
strength product. These features include hierarchy and cluster definitions,
exclusion lists, unknown-value management, among others.
Trend Discovery
a day, a week or a month. Time-windows define how time grains are grouped
together, e.g. we may look at daily trends with weekly windows, or we may
look at weekly grains with monthly windows. Slopes define how quickly a
measure is increasing or decreasing, while shapes give us various categories of
trend behavior, e.g. smoothly increasing vs. erratically changing.
Forensic Discovery
Predictive Modeler
The output from the seven component products of the Data Mining Suite is
stored within the pattern-base and is accessible with PQL, the Pattern Query
Language. Readable English text and graphs are automatically generated in
ASCII and HTML formats for delivery on the inter/intranet.
The products in the Data Mining Suite deliver the most advanced and scalable
technologies within a user-friendly environment. The specific reasons draw on the solid
mathematical foundation which Information Discovery, Inc. pioneered, and on a highly
scalable implementation.
The Data Mining Suite works directly on very large SQL databases and does not
require samples, extracts and/or flat files. This alleviates the problems associated
with flat files which lose the SQL engine's power (e.g. parallel execution) and
which provide marginal results. Another advantage of working on an SQL
database is that the Data Mining Suite has the ability to deal with both numeric
and non-numeric data uniformly. The Data Mining Suite does not fix the ranges in
numerical data beforehand, but finds ranges in the data dynamically by itself.
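The idea of working inside the SQL engine rather than on flat-file extracts can be illustrated with a small Python/sqlite3 sketch; the table name and columns are invented for the example.

```python
import sqlite3

# Hypothetical schema for illustration: sales(region, amount).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 120.0), ("East", 80.0), ("West", 200.0)])

# The aggregation runs inside the SQL engine (where it can be
# parallelized); only the summary rows leave the database, not a
# flat-file extract of every record.
rows = conn.execute("SELECT region, COUNT(*), AVG(amount) "
                    "FROM sales GROUP BY region ORDER BY region").fetchall()
```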
Multi-Table Discovery
The Data Mining Suite discovers patterns in multi-table SQL databases without
having to join and build an extract file. This is a key issue in mining large
databases. The world is full of multi-table databases which cannot be joined and
meshed into a single view. In fact, the theory of normalization came about
because data needs to be in more than one table. Using single tables is an affront
to all the work of E.F. Codd on database design. If you challenge the DBA in a
really large database to put things in a single table you will either get a laugh or a
blank stare -- in many cases the database size will balloon beyond control. In fact,
there are many cases where no single view can correctly represent the semantics
of influence because the ratios will always be off regardless of how you join. The
Data Mining Suite leads the world of discovery with the unique ability to mine
large multi-table databases.
No Sampling or Extracts
Sampling theory was invented because one could not have access to the entire
underlying population being analyzed. But a warehouse exists precisely to
provide such access, so the Data Mining Suite works on the full database rather
than on samples or extracts.
The format of the patterns discovered by the Data Mining Suite is very general
and goes far beyond decision trees or simple affinities. The advantage of this is
that the general rules discovered are far more powerful than decision trees.
Decision trees are very limited in that they cannot find all the information in the
database. Being rule-based keeps the Data Mining Suite from being constrained to
one part of a search space and makes sure that many more clusters and patterns
are found -- allowing the Data Mining Suite to provide more information and
better predictions.
Language of Expression
The Data Mining Suite has a powerful language of expression, going several
times beyond what most other systems can handle. For instance, for logical
statements it can express statements such as "IF Destination State = Departure
State THEN..." or "IF State is not Arizona THEN ...". Surprisingly, most other
data mining systems cannot express these simple patterns. The Data Mining Suite
also pioneered dimensional affinities such as "IF Day = Saturday WHEN
PaintBrush is purchased ALSO Paint is purchased". Again, most other systems
cannot handle this obvious logic.
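A hypothetical sketch of such a language of expression (all names invented for illustration, not the suite's real interface) can represent conditions that compare one column to another, or negate a constant:

```python
def matches(row, condition):
    """condition is (left_column, operator, right); a right-hand value
    beginning with '$' names another column instead of a constant.
    All names here are invented for illustration."""
    left = row[condition[0]]
    right = condition[2]
    if isinstance(right, str) and right.startswith("$"):
        right = row[right[1:]]  # column-to-column comparison
    if condition[1] == "=":
        return left == right
    if condition[1] == "!=":
        return left != right
    raise ValueError(f"unsupported operator: {condition[1]}")

# IF Destination State = Departure State THEN ...
same_state = ("DestinationState", "=", "$DepartureState")
# IF State is not Arizona THEN ...
not_arizona = ("State", "!=", "Arizona")
```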
The Data Mining Suite is unique in its ability to deal with various data types in a
uniform manner. It can smoothly deal with a large number of non-numeric values
and also automatically discovers ranges within numeric data. Moreover, the Data
Mining Suite does not fix the ranges in numerical data but discovers interesting
ranges by itself. For example, given the field Age, the Data Mining Suite does not
expect this to be broken into 3 segments of (1-30), (31-60), (61 and above).
Instead it may find two ranges such as (27-34) and (48-61) as important in the
data set and will use these in addition to the other ranges.
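One simple, illustrative way to discover ranges dynamically (a sketch, not the suite's proprietary method) is to split the sorted values wherever the gap between neighbors is unusually large; on a toy Age column this recovers ranges like (27-34) and (48-61):

```python
from statistics import median

def discover_ranges(values, gap_factor=2.0):
    """Split sorted numeric values into ranges wherever the gap between
    neighbors exceeds gap_factor times the median gap: a sketch of
    data-driven binning rather than fixed, preset segments."""
    vs = sorted(values)
    gaps = [b - a for a, b in zip(vs, vs[1:])]
    if not gaps:
        return [(vs[0], vs[0])] if vs else []
    threshold = gap_factor * median(gaps)
    ranges, start = [], vs[0]
    for a, b in zip(vs, vs[1:]):
        if b - a > threshold:
            ranges.append((start, a))
            start = b
    ranges.append((start, vs[-1]))
    return ranges

ages = [27, 28, 30, 31, 33, 34, 48, 50, 52, 55, 58, 61]
```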
Should a data mining system be aware of the functional (and other) dependencies
that exist in a database? Yes, and very much so. The use of these dependencies
can significantly enhance the power of a discovery system -- in fact ignoring them
can lead to confusion. The Data Mining Suite takes advantage of data
dependencies.
Server-based Architectures
The Data Mining Suite has a three-level client/server architecture whereby the
user interface runs on a thin intranet client and the back-end analysis process
is done on a Unix server. The majority of the processing time is spent on the
server and these computations run both by using parallel SQL and non-SQL calls
managed by the Data Mining Suite itself. Only about 50% of the computations on
the server are SQL-based; the remaining statistical computations are managed
directly by the Data Mining Suite program itself, at times by starting separate
processes on different nodes of the server.
System Initiative
The Data Mining Suite uses system initiative in the data mining process. It
forms hypotheses automatically based on the character of the data and converts
the hypotheses into SQL statements forwarded to the RDBMS for execution. The
Data Mining Suite then selects the significant patterns and filters out the
unimportant trends.
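The hypothesis-to-SQL step can be illustrated with a minimal sketch; the function, table and column names here are hypothetical, not the suite's real interface.

```python
def hypothesis_to_sql(table, dimension, measure):
    """Render the machine-generated hypothesis 'does <dimension>
    influence <measure>?' as a SQL statement for the RDBMS to execute.
    The discovery system would then keep only groups whose average
    deviates significantly from the overall mean."""
    return (f"SELECT {dimension}, COUNT(*) AS n, AVG({measure}) AS avg_m "
            f"FROM {table} GROUP BY {dimension}")

sql = hypothesis_to_sql("sales", "region", "amount")
```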
The Data Mining Suite provides explanations as to how the patterns are being
derived. This is unlike neural nets and other opaque techniques in which the
mining process is a mystery. Also, when performing predictions, the results are
transparent. Many business users insist on understandable and transparent results.
The Data Mining Suite is not sensitive to noise because internally it uses fuzzy
logic analysis. As the data gathers noise, the Data Mining Suite will only reduce
the level of confidence associated with the results provided. However, it will still
produce the most significant findings from the data set.
The Data Mining Suite has been specifically tuned to work on databases with an
extremely large number of rows. It can deal with data sets of 50 to 100 million
records on parallel machines. It derives its capabilities from the fact that it does
not need to write extracts and uses SQL statements to perform its process.
Generally the analyses performed in the Data Mining Suite are performed on
about 50 to 120 variables and 30 to 100 million records directly. The number of
records can, however, be increased further using the Data Mining Suite's
specific optimization options for very large databases.
These unique features and benefits make the Data Mining Suite the ideal solution for
large-scale Data Mining in business and industry.
Data mining is a tool, not a magic wand. It won’t sit in your database watching
what happens and send you e-mails to get your attention when it sees an interesting
pattern. It doesn’t eliminate the need to know your business, to understand your data, or
to understand analytical methods. Data mining assists business analysts with finding
patterns and relationships in the data; it does not tell you the value of those
patterns to your organization. Furthermore, the patterns uncovered by data mining must be verified in the
real world.
Remember that the predictive relationships found via data mining are not
necessarily causes of an action or behavior. For example, data mining might determine
that males with incomes between $50,000 and $65,000 who subscribe to certain
magazines are likely purchasers of a product you want to sell. While you can take
advantage of this pattern, say by aiming your marketing at people who fit the pattern,
you should not assume that any of these factors cause them to buy your product.
To ensure meaningful results, it’s vital that you understand your data. The quality
of your output will often be sensitive to outliers (data values that are very different from
the typical values in your database), irrelevant columns or columns that vary together
(such as age and date of birth), the way you encode your data, and the data you leave in
and the data you exclude. Algorithms vary in their sensitivity to such data issues, but it is
unwise to depend on a data-mining product to make all the right decisions on its own.
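As one illustrative check for such data issues, outliers can be flagged with a robust median-based rule; this is a sketch to support understanding your data, not a substitute for it.

```python
from statistics import median

def flag_outliers(values, k=5.0):
    """Flag values whose distance from the median exceeds k times the
    median absolute deviation (MAD) -- a robust, illustrative check
    that is not thrown off by the outliers themselves."""
    m = median(values)
    mad = median(abs(v - m) for v in values)
    if mad == 0:  # more than half the values are identical; nothing to flag
        return []
    return [v for v in values if abs(v - m) / mad > k]
```

The median-based rule is chosen over a mean/standard-deviation test because a single extreme value inflates the standard deviation enough to mask itself.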
Data mining will not automatically discover solutions without guidance. Rather
than setting the vague goal, “Help improve the response to my direct mail solicitation”,
you might use data mining to find the characteristics of people who (1) respond to your
solicitation, or (2) respond AND make a large purchase. The patterns data mining finds for
those two goals may be very different.
Although a good data mining tool shelters you from the intricacies of statistical
techniques, it requires you to understand the working of the tools you choose and the
algorithms on which they are based. The choices you make in setting up your data mining
tool and the optimizations you choose will affect the accuracy and speed of your models.
A significant issue in data warehousing is the quality control of data. Both quality
and consistency of data are major concerns. Although the data passes through a cleaning
function during acquisition, quality and consistency remain significant issues for the
database administrator. Melding data from heterogeneous and disparate sources is a major
challenge given differences in naming, domain definitions, identification numbers, and
the like. Every time a source database changes, the warehouse administrator must
consider the possible interactions with other elements of the warehouse.
The warehouse must be continually tuned to remain optimized for support of the
organization's use of its warehouse. This activity should continue throughout
the life of the warehouse in order to remain ahead of demand. The warehouse
should also be designed to accommodate addition and attrition of data sources
without major redesign.
Administration of a data warehouse will require far broader skills than are needed
for traditional database administration. A team of highly skilled technical experts with
overlapping areas of expertise will likely be needed, rather than a single individual. Like
database administration, data warehouse administration is only partly technical; a large
part of the responsibility requires working efficiently with all the members of the
organization with an interest in the data warehouse. However difficult that can be at times
for database administrators, it is that much more challenging for data warehouse
administrators, as the scope of their responsibilities is considerably broader.
Design of the management function and selection of the management team for a
data warehouse are crucial. Managing the data warehouse in a large organization will
surely be a major task. Many commercial tools are already available to support
management functions. Effective data warehouse management will certainly be a team
function, requiring a wide set of technical skills, careful coordination, and effective
leadership. Just as we must prepare for the evolution of the warehouse, we must also
recognize that the skills of the management team will, of necessity, evolve with it.
Data mining derives its name from the similarities between searching for valuable
business information in a large database — for example, finding linked products in
gigabytes of store scanner data — and mining a mountain for a vein of valuable ore. Both
processes require either sifting through an immense amount of material, or intelligently
probing it to find exactly where the value resides. Given databases of sufficient size and
quality, data mining technology can generate new business opportunities by providing
these capabilities:
• Automated prediction of trends and behaviors. Data mining automates the process
of finding predictive information in large databases. Questions that traditionally
required extensive hands-on analysis can now be answered directly from the data
— quickly. A typical example of a predictive problem is targeted marketing. Data
mining uses data on past promotional mailings to identify the targets most likely
to maximize return on investment in future mailings. Other predictive problems
include forecasting bankruptcy and other forms of default, and identifying
segments of a population likely to respond similarly to given events.
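A minimal, illustrative stand-in for such predictive targeting (segment names invented, no real modeling library) ranks customer segments by their historical response rates from past mailings:

```python
from collections import defaultdict

def segment_response_rates(history):
    """history: iterable of (segment, responded) pairs from past
    promotional mailings; returns the response rate per segment."""
    sent = defaultdict(int)
    hits = defaultdict(int)
    for segment, responded in history:
        sent[segment] += 1
        hits[segment] += bool(responded)
    return {s: hits[s] / sent[s] for s in sent}

past_mailings = [("young_urban", True), ("young_urban", False),
                 ("retired", False), ("retired", False),
                 ("retired", True), ("retired", False)]
rates = segment_response_rates(past_mailings)
best_target = max(rates, key=rates.get)  # segment to mail first
```

Real systems would build a statistical model over many attributes, but the principle is the same: past behavior scores future targets.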
Data mining techniques can yield the benefits of automation on existing software and
hardware platforms, and can be implemented on new systems as existing platforms are
upgraded and new products developed. When data mining tools are implemented on high
performance parallel processing systems, they can analyze massive databases in minutes.
Faster processing means that users can automatically experiment with more models to
understand complex data. High speed makes it practical for users to analyze huge
quantities of data. Larger databases, in turn, yield improved predictions.
A recent Gartner Group Advanced Technology Research Note listed data mining
and artificial intelligence at the top of the five key technology areas that "will clearly
have a major impact across a wide range of industries within the next 3 to 5 years."
Gartner also listed parallel architectures and data mining as two of the top 10 new
technologies in which companies will invest during the next 5 years. According to a
recent Gartner HPC Research Note, "With the rapid advance in data capture, transmission
and storage, large-systems users will increasingly need to implement new and innovative
ways to mine the after-market value of their vast stores of detail data, employing MPP
[massively parallel processing] systems to create new sources of business advantage (0.9
probability)."
Profitable Applications
• A pharmaceutical company can analyze its recent sales force activity and their
results to improve targeting of high-value physicians and determine which
marketing activities will have the greatest impact in the next few months. The
data needs to include competitor market activity as well as information about the
local health care systems. The results can be distributed to the sales force via a
wide-area network that enables the representatives to review the
recommendations from the perspective of the key attributes in the decision
process. The ongoing, dynamic analysis of the data warehouse allows best
practices from throughout the organization to be applied in specific sales
situations.
• A credit card company can leverage its vast warehouse of customer transaction
data to identify customers most likely to be interested in a new credit product.
Using a small test mailing, the attributes of customers with an affinity for the
product can be identified. Recent projects have indicated more than a 20-fold
decrease in costs for targeted mailing campaigns over conventional approaches.
• A diversified transportation company with a large direct sales force can apply data
mining to identify the best prospects for its services. Using data mining to analyze
its own customer experience, this company can build a unique segmentation
identifying the attributes of high-value prospects. Applying this segmentation to a
general business database such as those provided by Dun & Bradstreet can yield a
prioritized list of prospects by region.
• A large consumer package goods company can apply data mining to improve its
sales process to retailers. Data from consumer panels, shipments, and competitor
activity can be applied to understand the reasons for brand and store switching.
Through this analysis, the manufacturer can select promotional strategies that best
reach their target customer segments.
Each of these examples has a clear common ground: they leverage the
knowledge about customers implicit in a data warehouse to reduce costs and improve the
value of customer relationships. These organizations can now focus their efforts on the
most important (profitable) customers and prospects, and design targeted marketing
strategies to best reach them.
Data warehouses exist to facilitate complex, data-intensive, and frequent ad hoc queries.
Accordingly, data warehouses must provide far greater and more efficient query support
than is demanded of traditional databases. The data warehouse access component
supports enhanced spreadsheet functionality, efficient query processing, structured
queries, ad hoc queries, data mining, and materialized views. In particular, enhanced
spreadsheet functionality includes support for state-of-the-art spreadsheet
applications (e.g., MS Excel) as well as for OLAP application programs. These offer
preprogrammed functionalities such as the following:
• Roll-up: Data is summarized with increasing generalization (weekly to quarterly
to annually).
• Drill-down: Increasing levels of detail are revealed (the complement of roll-up).
• Pivot: Cross tabulation (also referred to as rotation) is performed.
• Slice and dice: Projection operations are performed on the dimensions.
• Sorting: Data is sorted by ordinal value.
• Selection: Data is available by value or range.
• Derived (computed) attributes: Attributes are computed by operations on stored
and derived values.
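Roll-up and drill-down can be illustrated with a toy fact table (region, week and amount are invented for the example):

```python
from collections import defaultdict

# Toy fact table: (region, week, amount); names invented for the example.
facts = [("East", 1, 10), ("East", 2, 20), ("West", 1, 5), ("West", 2, 15)]

def roll_up(facts):
    """Roll-up: summarize weekly detail into region-level totals."""
    totals = defaultdict(int)
    for region, _week, amount in facts:
        totals[region] += amount
    return dict(totals)

def drill_down(facts, region):
    """Drill-down: reveal the weekly detail behind one region's total
    (the complement of roll-up)."""
    return [(week, amount) for r, week, amount in facts if r == region]
```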
Analytical model:-
A structure and process for analyzing a dataset. For example, a decision tree is a model
for the classification of a dataset.
Data cleansing:-
The process of ensuring that all values in a dataset are consistent and correctly recorded.
Anomalous data:-
Data that result from errors (for example, data entry keying errors) or that represent
unusual events. Anomalous data should be examined carefully because it may carry
important information.
Data visualization:-
The visual interpretation of complex relationships in multidimensional data.
CHAID:-
Chi Square Automatic Interaction Detection. A decision tree technique used for
classification of a dataset. Provides a set of rules that you can apply to a new
(unclassified) dataset to predict which records will have a given outcome. Segments a
dataset by using chi-square tests to create multi-way splits. CHAID preceded
CART and requires more data preparation.
Data mining:-
The extraction of hidden predictive information from large databases.
CART:-
Classification and Regression Trees. A decision tree technique used for classification of a
dataset. Provides a set of rules that you can apply to a new (unclassified) dataset to
predict which records will have a given outcome. Segments a dataset by creating 2-way
splits. Requires less data preparation than CHAID.
Data warehouse:-
A system for storing and delivering massive quantities of data.
Multidimensional database:-
A database designed for on-line analytical processing. Structured as a multidimensional
hypercube with one axis per dimension.
Nearest neighbor:-
A technique that classifies each record in a dataset based on a combination of the classes
of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called
a k-nearest neighbor technique.
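An illustrative sketch of the technique (assuming numeric feature tuples and Euclidean distance; the sample data are invented):

```python
from math import dist  # Euclidean distance, Python 3.8+

def knn_classify(query, labeled_points, k=3):
    """Classify `query` by majority vote among the k most similar
    records in a historical dataset, per the definition above."""
    neighbors = sorted(labeled_points, key=lambda p: dist(query, p[0]))[:k]
    votes = {}
    for _point, label in neighbors:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

history = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
           ((5, 5), "b"), ((5, 6), "b")]
```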
OLAP:-
On-line analytical processing. Refers to array-oriented database applications that allow
users to view, navigate through, manipulate, and analyze multidimensional databases.
Predictive model:-
A structure and process for predicting the values of specified variables in a dataset.
RAID :-
Redundant Array of Inexpensive Disks. A technology for the efficient parallel storage of
data for high-performance computer systems.
Conclusion:
Data mining offers great promise in helping organizations uncover patterns hidden
in their data that can be used to predict the behavior of customers, products and
processes. However, data mining tools need to be guided by users who understand the
business, the data, and the general nature of the analytical methods involved. Realistic
expectations can yield rewarding results across a wide range of applications, from
improving revenues to reducing costs.
Building models is only one step in knowledge discovery. It’s vital to properly
collect and prepare the data, and to check the models against the real world. The “best”
model is often found after building models of several different types, or by trying
different technologies or algorithms.
Choosing the right data mining products means finding a tool with good basic
capabilities, an interface that matches the skill level of the people who’ll be using it, and
features relevant to your specific business problems. After you have narrowed down the
list of potential solutions, get a hands-on trial of the likeliest ones.
BIBLIOGRAPHY