
PAPERS

DATA MINING

Group 3
Created by:
 Adnan Khawarizwi
 Amin Saputra
 Candra Bayu Permana
 Dimas Febry
 M. Rizky Thariq
 Maruf Abdullah
 Noval Yazid
 Rival Rinofiansyah
 Ryan Hidayatullah
 Taufrino Cahyadi
 Yano Yahya
 Zufar Attaqi

SEKOLAH MENENGAH KEJURUAN (SMK) KIMIA


PGRI KOTA SERANG

2018
PREFACE

First of all, thanks to Allah SWT, because with the help of Allah the writer
finished writing the paper entitled “Data Mining” right on time.

The purpose of writing this paper is to fulfill the assignment given by Mr.
Bambang Irfani as lecturer in the semantics major.

In arranging this paper, the writer truly faced many challenges and
obstructions, but with the help of many individuals, those obstructions were
overcome. The writer also realizes that there are still many mistakes in the
process of writing this paper.

Because of that, the writer thanks all the individuals who helped in the
process of writing this paper; hopefully Allah repays all their help and blesses
them all. The writer realizes that this paper is still imperfect in arrangement
and content, so the writer hopes that criticism from the readers can help the
writer in perfecting the next paper. Last but not least, hopefully this paper can
help the readers gain more knowledge about the semantics major.

Serang, January 25th, 2018

author

TABLE of CONTENTS

PREFACE ................................................................................................................

TABLE of CONTENTS .........................................................................................

LIST OF TABLES ..................................................................................................

LIST OF FIGURES ................................................................................................

ABSTRACT .............................................................................................................

CHAPTER I INTRODUCTION

1.1 Background ...............................................................................................

1.2 Formulation of the Problem ......................................................................

1.3 The Purpose of The Problem ....................................................................

1.4 The Benefits of Writing ............................................................................

CHAPTER II DISCUSSION

2.1 Definition ................................................................................................


2.2 Architecture For Data Mining .................................................................
2.3 Physical structure of data warehouse .....................................................

2.4 Issues in Integration of data in data warehouse.......................................


2.5 Data extraction and data migration ........................................................
2.6 Data cleansing / Data scrubbing ..............................................................
2.7 Characteristics of Data warehousing ......................................................

2.8 Types of Data Mining .............................................................................


2.9 How Data Mining Works ........................................................................

2.10 Goals of Data Mining ..............................................................................

2.11 Integrating Data Mining and Campaign Management ...........................


2.12 The integrated Data Mining and Campaign Management process .........

2.13 Data Mining and Campaign Management in the real world ...................
2.14 The Benefits of integrating Data Mining and Campaign Management ..
2.15 The ten Steps of Data Mining .................................................................
2.16 Evaluating the Benefits of a Data Mining Model ...................................
2.17 The data mining suite ..............................................................................
2.18 The Data Mining Suite is Unique ............................................................

CHAPTER III CLOSING

3.1 Conclusion ................................................................................................

REFERENCES

LIST of TABLES

Table 1 - Steps in the Evolution of Data Mining ...................................................


Table 2 - Data Mining for Prospecting ...................................................................
Table 3 - Data Mining for Predictions ...................................................................

Table 4 - Prospect Database Produces ...................................................................

LIST of FIGURES

Figure 1 - Integrated Data Mining Architecture ...........................................

Figure 2 – The Data Mining Suite .......................................................

ABSTRACT

Organizations are getting larger and amassing ever-increasing amounts of
data. With the increased and widespread use of technology, interest in data
warehousing and data mining has grown rapidly. Data is a collection of
entities; a database is a collection of data; and a data warehouse is a group of
databases: a centralized location where information gathered from various
sources is placed together. Data mining is the process of analyzing data to
find useful patterns, and it works together with the data warehouse: data
warehousing provides the enterprise with memory, and data mining provides
the enterprise with intelligence. Data mining is becoming an increasingly
important tool for transforming data into information; it is the extraction of
hidden prognostic information from large databases. OLAP (On-Line
Analytical Processing) tools, query languages, and data mining algorithms
help in the extraction of information from the data. The size of a data
warehouse ranges from gigabytes to terabytes.

CHAPTER I

INTRODUCTION

1.1 Background
Data mining is the process of extracting useful information. Basically, it is
the process of discovering hidden patterns and information from existing
data. In data mining, one needs to concentrate primarily on cleansing the data
so as to make it feasible for further processing. The process of cleansing the
data is also called noise elimination, noise reduction, or feature
elimination [1]. This can be done using the various tools available, supporting
various techniques. An important consideration in data mining is whether the
data to be handled is static or dynamic.

1.2 Formulation of The Problem


1. What is data mining?
2. What are the function and purpose of data mining?
3. What is a data warehouse?
4. How does data mining work?

1.3 The Purpose Of The Problem


1. So that readers can understand data mining
2. To know the function and purpose of data mining
3. To know about the data warehouse
4. To know how data mining works

1.4 The Benefits Of Writing


1. As a source of information for readers who want to know about data
mining.
2. As motivation for readers to research data mining further.

CHAPTER II
DISCUSSION

2.1 Definition
Data mining techniques are the result of a long process of research and
product development. This evolution began when business data was first
stored on computers, continued with improvements in data access, and more
recently, generated technologies that allow users to navigate through their
data in real time. Data mining takes this evolutionary process beyond
retrospective data access and navigation to prospective and proactive
information delivery. Data mining is ready for application in the business
community because it is supported by three technologies that are now
sufficiently mature:
• Massive data collection
• Powerful multiprocessor computers
• Data mining algorithms

Commercial databases are growing at unprecedented rates. A recent META
Group survey of data warehouse projects found that 19% of respondents are
beyond the 50-gigabyte level, while 59% expected to be there by the second
quarter of 1996. In some industries, such as retail, these numbers can be much
larger. The accompanying need for improved computational engines can now
be met in a cost-effective manner with parallel multiprocessor computer
technology. Data mining algorithms embody techniques that have existed for
at least 10 years, but have only recently been implemented as mature, reliable,
understandable tools that consistently outperform older statistical methods.

In the evolution from business data to business information, each new step
has built upon the previous one. For example, dynamic data access is critical
for drill-through in data navigation applications, and the ability to store large
databases is critical to data mining. From the user’s point of view, the four


steps listed in Table 1 were revolutionary because they allowed new business
questions to be answered accurately and quickly.

Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics

Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery

Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level

Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Pilot, Comshare, Arbor, Cognos, Microstrategy | Retrospective, dynamic data delivery at multiple levels

Data Mining (Emerging Today) | "What’s likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry) | Prospective, proactive information delivery

Table 1 - Steps in the Evolution of Data Mining

The core components of data mining technology have been under
development for decades, in research areas such as statistics, artificial
intelligence, and machine learning. Today, the maturity of these techniques,
coupled with high-performance relational database engines and broad data
integration efforts, makes these technologies practical for current data
warehouse environments.

2.2 An Architecture for Data Mining

To best apply these advanced techniques, they must be fully integrated


with a data warehouse as well as flexible interactive business analysis tools.
Many data mining tools currently operate outside of the warehouse, requiring
extra steps for extracting, importing, and analyzing the data. Furthermore,
when new insights require operational implementation, integration with the
warehouse simplifies the application of results from data mining. The
resulting analytic data warehouse can be applied to improve business
processes throughout the organization, in areas such as promotional campaign
management, fraud detection, new product rollout, and so on. Figure 1
illustrates an architecture for advanced analysis in a large data warehouse.

Figure 1 - Integrated Data Mining Architecture

The ideal starting point is a data warehouse containing a combination of


internal data tracking all customer contact coupled with external market data
about competitor activity. Background information on potential customers
also provides an excellent basis for prospecting. This warehouse can be


implemented in a variety of relational database systems: Sybase, Oracle,
Redbrick, and so on, and should be optimized for flexible and fast data
access.

An OLAP (On-Line Analytical Processing) server enables a more


sophisticated end-user business model to be applied when navigating the data
warehouse. The multidimensional structures allow the user to analyze the data
as they want to view their business – summarizing by product line, region,
and other key perspectives of their business. The Data Mining Server must be
integrated with the data warehouse and the OLAP server to embed ROI-
focused business analysis directly into this infrastructure.

An advanced, process-centric metadata template defines the data mining


objectives for specific business issues like campaign management,
prospecting, and promotion optimization. Integration with the data warehouse
enables operational decisions to be directly implemented and tracked. As the
warehouse grows with new decisions and results, the organization can
continually mine the best practices and apply them to future decisions.

This design represents a fundamental shift from conventional decision


support systems. Rather than simply delivering data to the end user through
query and reporting software, the Advanced Analysis Server applies users’
business models directly to the warehouse and returns a proactive analysis of
the most relevant information. These results enhance the metadata in the
OLAP Server by providing a dynamic metadata layer that represents a
distilled view of the data. Reporting, visualization, and other analysis tools
can then be applied to plan future actions and confirm the impact of those
plans.

2.3 Physical Structure of the Data Warehouse

A data warehouse is a central repository for data. There are three
basic architectures for constructing a data warehouse. In the first type there is
only one central location to store data, which we call the data warehouse's
physical storage media. In this type of construction, data is gathered from
heterogeneous data sources, such as different types of files, local database
systems, and other external sources.

As the data is stored in one central place, access to it is very easy and
simple; the disadvantage of this construction, however, is a loss of
performance.

In the second type of construction, data is decentralized. The data cannot
be stored physically together, but logically it is consolidated in the data
warehouse environment. In this construction, department-wise and site-wise
data is stored at its local place. Local applications and other generated data
are stored in local databases, but information about the data, called metadata
(data about data), is stored at a central site. These local databases can also
maintain their metadata locally for their local work as well as for the central
site. A local database with its metadata is called a "data mart".

An advantage of this architecture is that the logical data warehouse is only
virtual. The central data warehouse does not store any actual data, only
information about the data, so any user who wants to access data can send a
query to the central site, and the central site prepares the resulting data for the
user. This entire process of collecting data from the physical databases is
transparent to the user.

The third and last type of construction creates a hierarchical view of data.
Here the central data warehouse also stores actual data, and the data marts on
the next level store copies or summaries of the physical central data
warehouse. Local data marts store only the data that is related to their local
site.

The advantages of the distributed and hierarchical constructions are that (1)
the retrieval time of data from the data warehouse is lower, and (2) the
volume of data is also reduced. Since the data is integrated through metadata,
anyone from anywhere can access the data, and processing is divided among
different physical machines. For a better data retrieval response, a scalable
data warehouse architecture is very important. The warehouse's response also
depends on the metadata, so the design of the metadata is very important for
every data warehouse.
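The decentralized ("virtual") construction above can be sketched in a few lines: the central site holds only metadata, and a user's query is routed to whichever data mart actually owns the requested table. All names and records below are invented for illustration.

```python
# Each data mart holds its own local data (hypothetical records).
data_marts = {
    "sales_mart":   [{"region": "East", "revenue": 120}],
    "finance_mart": [{"region": "East", "cost": 80}],
}

# The central site stores only metadata: which mart owns which table.
central_metadata = {"sales": "sales_mart", "finance": "finance_mart"}

def query(table):
    """Route a query through the central metadata to the owning mart;
    the physical location stays transparent to the user."""
    mart = central_metadata[table]
    return data_marts[mart]

print(query("sales"))  # the user never names a data mart directly
```

The routing step is exactly the transparency the text describes: the user asks the central site for "sales" and never needs to know which physical machine answers.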

2.4 Issues in Integration of Data in the Data Warehouse


As discussed above, you can physically design your data warehouse using
any of the three construction types. But integrating data in a data warehouse
requires procedures such as data extraction and data migration, data
cleansing / data scrubbing, and data integration.

2.5 Data Extraction and Data Migration


To extract data from operational databases, files, and other external
sources, extraction tools are required. This process should be detailed and
documented correctly. If it is not properly documented, it will create
problems during integration with other data and also create difficulties at a
later stage. So data extraction should provide a high level of integration and
produce efficient metadata for the data warehouse.
Data migration is the task of converting data from one system to another. It
should provide type checking of the integrity constraints in the data
warehouse. It should also detect inconsistencies and missing values while
converting metadata for the entire process, so that problems in the migration
process can be easily identified.

2.6 Data Cleansing / Data Scrubbing

A data warehouse collects data from heterogeneous sources in the
organization. These data are integrated in such a manner that any end user
can access them very easily. To facilitate end users, the DWA (Data
Warehouse Administrator) must know the right approach to the warehouse.
Data must be collected from different operating systems, over different
networks, from different application files (such as C, COBOL, and
FORTRAN programs), and from different operational databases. So the first
step is to design a platform on which we can access data from every system
and put them together in a warehouse. Before transferring data from one
system to another, the data must be standardized. This standard always
relates to the format of the data, the structure of the data, and the information
collected.
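The standardization step described above can be sketched as follows, assuming two hypothetical source systems that store the same customer in different formats; each record is normalized to one standard format before loading.

```python
# A minimal sketch of data standardization during cleansing. The two
# source record layouts and field names are invented for illustration.
from datetime import datetime

def standardize(record, date_format):
    """Normalize one source record to the warehouse standard."""
    return {
        "name": record["name"].strip().title(),              # unify casing
        "joined": datetime.strptime(record["joined"], date_format)
                          .strftime("%Y-%m-%d"),             # unify dates
    }

# Source A uses "DD/MM/YYYY"; source B uses "YYYY.MM.DD".
a = standardize({"name": " JOHN DOE ", "joined": "25/01/2018"}, "%d/%m/%Y")
b = standardize({"name": "john doe", "joined": "2018.01.25"}, "%Y.%m.%d")
assert a == b  # both sources now agree on one format
```

Once every source passes through the same standardization, records from different operating systems and files can be compared and merged in the warehouse.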

2.7 Characteristics of Data Warehousing


 multidimensional conceptual view
 generic dimensionality
 unlimited dimensions and aggregation levels
 unrestricted cross- dimensional operations
 dynamic sparse matrix handling
 client-server architecture
 multi-user support
 accessibility
 transparency
 intuitive data manipulation
 consistent reporting performance
 flexible reporting
Because they encompass large volumes of data, data warehouses are
generally an order of magnitude larger than their source databases. The sheer
volume of data is an issue that has been dealt with through enterprise-wide
data warehouses, virtual data warehouses, and data marts:

 Enterprise-wide data warehouses
are huge projects requiring massive investments of time and resources.
 Virtual data warehouses
provide views of operational databases that are materialized for efficient
access.
 Data marts
are targeted to a subset of the organization, such as a department, and are
more tightly focused.

2.8 Types of Data Mining


The term “knowledge” is very broadly interpreted as involving some
degree of intelligence. Knowledge is often classified as inductive or
deductive. Knowledge can be represented in many forms: in an unstructured
sense, it can be represented by rules or propositional logic; in a structured
form, it may be represented in decision trees, semantic networks, neural
networks, or hierarchical classes or frames. The knowledge discovered
during data mining can be described in five ways, as follows.

1. Association rules - These rules correlate the presence of a set of items
with a range of values for another set of variables. Examples:
(1) When a female retail shopper buys a handbag, she is likely to buy
shoes.
(2) An X-ray image containing characteristics a and b is likely to also
exhibit characteristic c.

2. Classification hierarchies - The goal is to work from an existing set of
events or transactions to create a hierarchy of classes. Examples:
(1) A population may be divided into five ranges of credit worthiness
based on a history of previous credit transactions.
(2) A model may be developed for the factors that determine the
desirability of a store's location on a 1-10 scale.
(3) Mutual funds may be classified based on performance data using
characteristics such as growth, income, and stability.
3. Sequential patterns - A sequence of actions or events is sought. Example:
if a patient underwent cardiac bypass surgery for blocked arteries and an
aneurysm, and later developed high blood urea within a year of surgery, he
is likely to suffer from kidney failure within the next 18 months. Detection
of sequential patterns is equivalent to detecting associations among events
with certain temporal relationships.
4. Patterns within time series - Similarities can be detected within positions
of a time series. Three examples follow, with stock market price data as a
time series:
(1) Stocks of a utility company, ABC Power, and a financial company,
XYZ Securities, show the same pattern during 1998 in terms of closing
stock price.
(2) Two products show the same selling pattern in summer but a different
one in winter.
(3) A pattern in solar magnetic wind may be used to predict changes in
Earth's atmospheric conditions.
5. Categorization and segmentation - A given population of events or items
can be partitioned (segmented) into sets of “similar” elements. Examples:
(1) An entire population of treatment data on a disease may be divided
into groups based on similarities of the side effects produced.
(2) The adult population may be categorized into five groups, from “most
likely to buy” to “least likely to buy” a new product.
(3) The web accesses made by a collection of users against a set of
documents may be analyzed in terms of the keywords of the documents
to reveal clusters or categories of users.
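The association-rule idea in example (1) above can be sketched as a simple confidence calculation over a set of transactions; the transactions here are invented for illustration.

```python
# A minimal sketch of measuring one association rule of the kind
# "a shopper who buys a handbag is likely to buy shoes".
transactions = [
    {"handbag", "shoes"},
    {"handbag", "shoes", "scarf"},
    {"handbag", "belt"},
    {"shoes"},
]

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent: the fraction of
    transactions containing the antecedent that also contain the
    consequent."""
    has_a = [t for t in transactions if antecedent in t]
    has_both = [t for t in has_a if consequent in t]
    return len(has_both) / len(has_a)

# 2 of the 3 handbag transactions also contain shoes.
print(confidence("handbag", "shoes", transactions))
```

Real association-rule miners (e.g. Apriori-style algorithms) search over many candidate item sets, but each candidate rule is judged by exactly this kind of support and confidence computation.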

2.9 How Data Mining Works


How exactly is data mining able to tell you important things that you didn't
know or what is going to happen next? The technique that is used to perform
these feats in data mining is called modeling. Modeling is simply the act of
building a model in one situation where you know the answer and then
applying it to another situation that you don't. For instance, if you were
looking for a sunken Spanish galleon on the high seas the first thing you
might do is to research the times when Spanish treasure had been found by
others in the past. You might note that these ships often tend to be found off
the coast of Bermuda and that there are certain characteristics to the ocean
currents, and certain routes that have likely been taken by the ship’s captains
in that era. You note these similarities and build a model that includes the
characteristics that are common to the locations of these sunken treasures.
With these models in hand you sail off looking for treasure where your model
indicates it most likely might be given a similar situation in the past.
Hopefully, if you've got a good model, you find your treasure.
This act of model building is thus something that people have been doing
for a long time, certainly before the advent of computers or data mining
technology. What happens on computers, however, is not much different than
the way people build models. Computers are loaded up with lots of
information about a variety of situations where an answer is known and then
the data mining software on the computer must run through that data and
distill the characteristics of the data that should go into the model. Once the
model is built it can then be used in similar situations where you don't know
the answer. For example, say that you are the director of marketing for a
telecommunications company and you'd like to acquire some new long
distance phone customers. You could just randomly go out and mail coupons
to the general population - just as you could randomly sail the seas looking
for sunken treasure. In neither case would you achieve the results you desired
and of course you have the opportunity to do much better than random - you
could use your business experience stored in your database to build a model.

As the marketing director you have access to a lot of information about all
of your customers: their age, sex, credit history and long distance calling
usage. The good news is that you also have a lot of information about your
prospective customers: their age, sex, credit history etc. Your problem is that
you don't know the long distance calling usage of these prospects (since they
are most likely now customers of your competition). You'd like to concentrate
on those prospects who have large amounts of long distance usage. You can
accomplish this by building a model. Table 2 illustrates the data used for
building a model for new customer prospecting in a data warehouse.

 | Customers | Prospects
General information (e.g. demographic data) | Known | Known
Proprietary information (e.g. customer transactions) | Known | Target

Table 2 - Data Mining for Prospecting

The goal in prospecting is to make some calculated guesses about the


information in the lower right hand quadrant based on the model that we
build going from Customer General Information to Customer Proprietary
Information. For instance, a simple model for a telecommunications company
might be: 98% of my customers who make more than $60,000/year spend
more than $80/month on long distance.

This model could then be applied to the prospect data to try to tell
something about the proprietary information that this telecommunications
company does not currently have access to. With this model in hand new
customers can be selectively targeted.
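The simple income-based model above can be sketched as follows; the prospect records and field names are invented for illustration.

```python
# A minimal sketch of applying the model from the text ("customers
# earning more than $60,000/year tend to spend more than $80/month on
# long distance") to prospect records whose usage is unknown.
prospects = [
    {"name": "A", "income": 75000},
    {"name": "B", "income": 40000},
    {"name": "C", "income": 62000},
]

def likely_high_usage(prospect):
    """Predict the unknown (proprietary) usage field from the known
    (general) income field, per the simple model in the text."""
    return prospect["income"] > 60000

targets = [p["name"] for p in prospects if likely_high_usage(p)]
print(targets)  # only the high-income prospects are targeted
```

This is the lower-right quadrant of Table 2 being filled in: general information we do have is used to guess proprietary information we do not.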

Test marketing is an excellent source of data for this kind of modeling.


Mining the results of a test market representing a broad but relatively small
sample of prospects can provide a foundation for identifying good prospects
in the overall market. Table 3 shows another common scenario for building
models: predict what is going to happen in the future.

 | Yesterday | Today | Tomorrow
Static information and current plans (e.g. demographic data, marketing plans) | Known | Known | Known
Dynamic information (e.g. customer transactions) | Known | Known | Target

Table 3 - Data Mining for Predictions

If someone told you that he had a model that could predict customer usage
how would you know if he really had a good model? The first thing you
might try would be to ask him to apply his model to your customer base -
where you already knew the answer. With data mining, the best way to
accomplish this is by setting aside some of your data in a vault to isolate it
from the mining process. Once the mining is complete, the results can be
tested against the data held in the vault to confirm the model’s validity. If the
model works, its observations should hold for the vaulted data.
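The "vault" procedure can be sketched as a holdout check; the data, the $60,000 cutoff, and the assumption that the mining step recovered that cutoff are all invented for illustration.

```python
# A minimal sketch of the vault idea: part of the data is set aside
# before mining, and the finished model is tested against it.
import random

random.seed(0)
# (income, high_usage) pairs; usage follows income in this toy data.
data = [(i, i > 60000) for i in range(20000, 100001, 5000)]

random.shuffle(data)
mine, vault = data[:12], data[12:]   # the vault is isolated from mining

# "Mining": suppose the mining step recovered this income cutoff
# from the mined portion of the data.
threshold = 60000

# Validation: the model's observations should hold on the vaulted data.
correct = sum((income > threshold) == usage for income, usage in vault)
print(correct / len(vault))
```

Because the vaulted records played no part in building the model, agreement on them is evidence the model generalizes rather than merely memorizing the mined data.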

2.10 Goals of Data Mining


 Prediction:- Data mining can show how certain attributes within the data
will behave in the future. Examples of predictive data mining include the
analysis of buying transactions to predict what consumers will buy under a
certain discount, how much sales volume a store would generate in a given
period, and whether deleting a product line would yield more profits;
business logic is used coupled with data mining. In a scientific context,
certain seismic wave patterns may predict an earthquake with high
probability.
 Identification:- Data patterns can be used to identify the existence of an
item, an event, or an activity. For example, intruders trying to break into a
system may be identified by the programs executed, files accessed, and
CPU time per session. In biological applications, the existence of a gene
may be identified by certain sequences of nucleotide symbols in the DNA
sequence. The area known as authentication is a form of identification: it
ascertains whether a user is indeed a specific user or one from an
authorized class, and it involves a comparison of parameters, images, or
signals against a database.
 Classification:- Data mining can partition data so that different classes or
categories can be identified based on combinations of parameters. For
example, customers in a supermarket can be categorized into
discount-seeking shoppers, shoppers in a rush, loyal regular shoppers, and
infrequent shoppers. This classification is used in the analysis of customer
buying transactions as a post-mining activity. Classification based on
common domain knowledge is used as input to decompose the mining
problem and make it simpler. For instance, health foods, party foods, and
school lunch foods are distinct categories in the supermarket business; it
makes sense to analyze relationships within and across categories as
separate problems. Such categorization is used to encode the data
appropriately before subjecting it to further data mining.
 Optimization:- One eventual goal of data mining may be to optimize the
use of limited resources such as time, space, money, or materials, and to
maximize output variables such as sales or profits under a given set of
constraints. This goal of data mining resembles the objective function used
in operations research problems that deal with optimization under
constraints.

2.11 Integrating Data Mining and Campaign Management


The closer Data Mining and Campaign Management work together, the
better the business results. Today, Campaign Management software uses
the scores generated by the Data Mining model to sharpen the focus of
targeted customers or prospects, thereby increasing response rates and
campaign effectiveness.

Unfortunately, the use of a model within Campaign Management today is


often a manual, time-intensive process. When someone in marketing wants to
run a campaign that uses model scores, he or she usually calls someone in the
modeling group to get a file containing the database scores. With the file in
hand, the marketer must then solicit the help of someone in the information
technology group to merge the scores with the marketing database.

This disjointed process is fraught with problems:

 The large numbers of campaigns that run on a daily or weekly basis can
be difficult to schedule and can swamp the available resources.
 The process is error prone; it is easy to score the wrong database or the
wrong fields in a database.
 Scoring is typically very inefficient. Entire databases are usually scored,
not just the segments defined for the campaign. Not only is effort wasted,
but the manual process may also be too slow to keep up with campaigns
run weekly or daily.

The solution to these problems is the tight integration of Data Mining and
Campaign Management technologies. Integration is crucial in two areas:

First, the Campaign Management software must share the definition of the
defined campaign segment with the Data Mining application to avoid
modeling the entire database. For example, a marketer may define a campaign
segment of high-income males between the ages of 25 and 35 living in the
northeast. Through the integration of the two applications, the Data Mining
application can automatically restrict its analysis to database records
containing just those characteristics.

Second, selected scores from the resulting predictive model must flow
seamlessly into the campaign segment in order to form targets with the
highest profit potential.

2.12 The Integrated Data Mining and Campaign Management Process


This section examines how to apply the integration of Data Mining and
Campaign Management to benefit the organization. The first step creates a
model using a Data Mining tool. The second step takes this model and puts it
to use in the production environment of an automated database marketing
campaign.

 Step 1: Creating the model


An analyst or user with a background in modeling creates a predictive
model using the Data Mining application. This modeling is usually
completely separate from campaign creation. The complexity of the model
creation typically depends on many factors, including database size, the
number of variables known about each customer, the kind of Data Mining
algorithms used and the modeler’s experience.
Interaction with the Campaign Management software begins when a
model of sufficient quality has been found. At this point, the Data Mining
user exports his or her model to a Campaign Management application,
which can be as simple as dragging and dropping the data from one
application to the other.
This process of exporting a model tells the Campaign Management
software that the model exists and is available for later use.

 Step 2: Dynamically scoring the data

Dynamic scoring allows you to score an already-defined customer
segment within your Campaign Management tool rather than in the Data
Mining tool. Dynamic scoring both avoids mundane, repetitive manual
chores and eliminates the need to score an entire database. Instead,
dynamic scoring marks only relevant customer subsets, and only when
needed.

Scoring only the relevant customer subset and eliminating the manual
process shrinks cycle times. Scoring data only when needed assures
"fresh," up-to-date results.

Once a model is in the Campaign Management system, a user (usually
someone other than the person who created the model) can start to build
marketing campaigns using the predictive models. Models are invoked by
the Campaign Management system.

When a marketing campaign invokes a specific predictive model to
perform dynamic scoring, the output is usually stored as a temporary score
table. When the score table is available in the data warehouse, the Data
Mining engine notifies the Campaign Management system and the
marketing campaign execution continues.
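The dynamic-scoring flow described above can be sketched as follows. Everything here is illustrative: the field names, the segment flag, and especially `score_customer`, which is a toy stand-in for whatever predictive model was exported from the Data Mining tool.

```python
# Illustrative dynamic scoring: only the campaign's customer subset is
# scored, and the scores land in a temporary score table.

def score_customer(c):
    # Toy stand-in model: scale monthly balance into a 0..1 score.
    return min(c["balance"] / 200.0, 1.0)

customers = [
    {"id": 1, "balance": 180.0, "in_segment": True},
    {"id": 2, "balance": 40.0, "in_segment": False},
    {"id": 3, "balance": 160.0, "in_segment": True},
]

# Score only the records the campaign segment defines, only when needed.
score_table = {c["id"]: score_customer(c) for c in customers if c["in_segment"]}
print(score_table)
```

Because scoring runs at campaign time over the segment alone, the results stay "fresh" and the full database is never scored.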

2.31 Data Mining and Campaign Management in the real world

Ideally, marketers who build campaigns should be able to apply any model
logged in the Campaign Management system to a defined target segment. For
example, a marketing manager at a cellular telephone company might be
interested in high-value customers likely to switch to another carrier. This
segment might be defined as customers who are nine months into a twelve-
month contract, and whose average monthly balance is more than $150.

The easiest approach to retain these customers is to offer all of them a new
high-tech telephone. However, this is expensive and wasteful since many
customers would remain loyal without any incentive.

2.32 The Benefits of integrating Data Mining and Campaign Management

For marketers:

 Improved campaign results through the use of model scores that further
refine customer and prospect segments.

Records can be scored when campaigns are ready to run, allowing the use
of the most recent data. "Fresh" data and the selection of "high" scores
within defined market segments improve direct marketing results.

 Accelerated marketing cycle times that reduce costs and increase the
likelihood of reaching customers and prospects before competitors.

Scoring takes place only for records defined by the customer segment,
eliminating the need to score an entire database. This is important to keep
pace with continuously running marketing campaigns with tight cycle
times.

Accelerated marketing "velocity" also increases the number of
opportunities used to refine and improve campaigns. The end of each
campaign cycle presents another chance to assess results and improve
future campaigns.

 Increased accuracy through the elimination of manually induced errors.

The Campaign Management software determines which records to score,
and when.

For statisticians:

 Less time spent on the mundane tasks of extracting and importing files,
leaving more time for creative work: building and interpreting models.
Statisticians have a greater impact on the corporate bottom line.
As a database marketer, you understand that some customers present much
greater profit potential than others. But how will you find those high-
potential customers in a database that contains hundreds of data items for
each of millions of customers? Data Mining software can help find the
"high-profit" gems buried in mountains of information. However, merely
identifying your best prospects is not enough to improve results. Instead,
to reduce costs and improve results, the marketer could use a predictive
model to select only those valuable customers who would likely defect to a
competitor unless they receive the offer.

2.33 The Ten Steps of Data Mining


Here is a process for extracting hidden knowledge from your data warehouse,
your customer information file, or any other company database.

1. Identify The Objective


Before you begin, be clear on what you hope to accomplish with your
analysis. Know in advance the business goal of the data mining, and
establish whether or not the goal is measurable. Some possible goals are to:
- Find sales relationships between specific products or services
- Identify specific purchasing patterns over time
- Identify potential types of customers
- Find product sales trends.
2. Select The Data

Once you have defined your goal, your next step is to select the data to
meet this goal. This may be a subset of your data warehouse or a data mart
that contains specific product information, or it may be your customer
information file. Narrow the scope of the data to be mined as much as
possible.
Here are some key issues:

- Are the data adequate to describe the phenomena the data mining
analysis is attempting to model?
- Can you enhance internal customer records with external lifestyle and
demographic data?
- Are the data stable? Will the mined attributes be the same after the
analysis?
- If you are merging databases, can you find a common field for linking
them?
- How current and relevant are the data to the business goal?
3. Prepare The Data
Once you’ve assembled the data, you must decide which attributes to
convert into usable formats. Consider the input of domain experts, the
creators and users of the data.
- Establish strategies for handling missing data, extraneous noise, and
outliers
- Identify redundant variables in the dataset and decide which fields to
exclude
- Decide on a log or square transformation, if necessary
- Visually inspect the dataset to get a feel for the database
- Determine the distribution frequencies of the data
You can postpone some of these decisions until you select a data mining
tool. For example, if you need a neural network or polynomial network,
you may have to transform some of your fields.
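The preparation tasks above can be sketched with the standard library alone. This is a minimal example under stated assumptions: the income values are made up, the median fill, the 3-MAD outlier rule, and the log transform are just one reasonable set of choices, not a prescription.

```python
import math
import statistics

# Toy income column: one missing value, one gross outlier.
incomes = [42000, 55000, None, 61000, 48000, 900000]

# 1. Fill missing values with the median of the observed data.
observed = [x for x in incomes if x is not None]
median = statistics.median(observed)
filled = [median if x is None else x for x in incomes]

# 2. Drop outliers more than 3 median-absolute-deviations from the median.
mad = statistics.median(abs(x - median) for x in observed)
cleaned = [x for x in filled if abs(x - median) <= 3 * mad]

# 3. Log transform, e.g. for a tool that prefers less skewed inputs.
transformed = [round(math.log(x), 2) for x in cleaned]
print(len(cleaned), transformed)
```

A median-based rule is used here rather than mean/standard deviation because a single extreme value (the 900,000) would inflate the standard deviation enough to hide itself.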

4. Audit The Data


Evaluate the structure of your data in order to determine the
appropriate tools.
- What is the ratio of categorical/binary attributes in the database?
- What is the nature and structure of the database?
- What is the overall condition of the dataset?
- What is the distribution of the dataset?
Balance the objective assessment of the structure of your data against
your users' need to understand the findings. Neural nets, for example, don't
explain their results.
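The first audit question, the ratio of categorical attributes, is easy to compute mechanically. A small sketch (the column names and the string-based test for "categorical" are illustrative assumptions):

```python
# Quick structural audit: what fraction of the columns are categorical?
rows = [
    {"age": 34, "gender": "M", "region": "northeast", "balance": 150.0},
    {"age": 29, "gender": "F", "region": "south", "balance": 80.5},
]

def is_categorical(values):
    # Toy rule: treat any string-valued column as categorical.
    return any(isinstance(v, str) for v in values)

columns = list(rows[0])
categorical = [c for c in columns if is_categorical([r[c] for r in rows])]
ratio = len(categorical) / len(columns)
print(categorical, ratio)  # ['gender', 'region'] 0.5
```

A heavily categorical dataset would push the tool choice toward rule or tree learners; a heavily numeric one opens up regression-style and neural approaches.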

5. Select The Tools


Two concerns drive the selection of the appropriate data mining tool:
your business objectives and your data structure. Both should guide you
to the same tool. Consider these questions when evaluating a set of
potential tools:
- Is the data set heavily categorical?
- What platforms do your candidate tools support?
- Are the candidate tools ODBC-compliant?
- What data formats can the tools import?
No single tool is likely to provide the complete answer to your data mining
project. Some tools integrate several technologies into a suite, such as
statistical analysis programs, a neural network, and a symbolic classifier.

6. Format The Solution


In conjunction with your data audit, your business objective and the
selection of your tool determine the format of your solution. The key
questions are:
- What is the optimum format of the solution: decision tree, rules, C
code, or SQL syntax?
- What are the available format options?
- What is the goal of the solution?
- What do the end-users need: graphs, reports, code?
7. Construct The Model
At this point the data mining processing begins. Usually the first
step is to use a random number seed to split the data into a training set
and a test set, then construct and evaluate a model. The generation of the
classification rules, decision trees, clustering sub-groups, scores, code,
weights and evaluation data/error rates takes place at this stage. Resolve
these issues:

- Are error rates at an acceptable level? Can you improve them?
- What extraneous attributes did you find? Can you purge them?
- Is additional data or a different methodology necessary?
- Will you have to train and test a new data set?
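The seeded train/test split at the heart of this step can be sketched as follows. This is a deliberately minimal stand-in: a toy one-feature dataset and a single-threshold "model" in place of a real tree or network, with seed 42 chosen arbitrarily for reproducibility.

```python
import random

# Reproducible 70/30 split driven by a random number seed.
random.seed(42)
data = [(x, 1 if x > 50 else 0) for x in range(100)]  # (feature, label) pairs
random.shuffle(data)

split = int(0.7 * len(data))
train, test = data[:split], data[split:]

def error(threshold, rows):
    # Count misclassifications for the rule "predict 1 when x > threshold".
    return sum((x > threshold) != bool(y) for x, y in rows)

# "Train": pick the threshold with the lowest training error.
best = min(range(101), key=lambda t: error(t, train))

# Evaluate on the held-out test set.
error_rate = error(best, test) / len(test)
print("threshold:", best, "test error rate:", error_rate)
```

The same pattern, fit on the training set, measure error on the test set, is what keeps the model's reported error rate honest, whatever the real algorithm is.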

8. Validate The Findings


Share and discuss the results of the analysis with the business client or
domain expert. Ensure that the findings are correct and appropriate to the
business objectives.
- Do the findings make sense?
- Do you have to return to any prior steps to improve results?
- Can you use other data mining tools to replicate the findings?

9. Deliver The Findings


Provide a final report to the business unit or client. The report should
document the findings, including any source code and rules. Some of the
issues are:
- Will additional data improve the analysis?
- What strategic insight did you discover, and how is it applicable?
- What proposals can result from the data mining analysis?
- Do your findings meet the business objective?
10. Integrate The Solution
Share the findings with all interested end-users in the appropriate business
units. You might wind up incorporating the results of the analysis into the
company's business procedures. Some of the data mining solutions may
involve:
- SQL syntax for distribution to end-users
- C code incorporated into a production system
- Rules integrated into a decision support system.
Although data mining tools automate database analysis, they can lead to
faulty findings and erroneous conclusions if you're not careful. Bear in mind
that data mining is a business process with a specific goal: to extract a
competitive insight from historical records in a database.

2.34 Evaluating the Benefits of a Data Mining Model


Other representations of the model often incorporate expected costs and
expected revenues to provide the most important measure of model quality:
profitability. A profitability graph like the one shown below can help
determine how many prospects to include in a campaign. In this example, it is
easy to see that contacting all customers will result in a net loss. However,
selecting a threshold score of approximately 0.8 will maximize profitability.

For a closer look at how the use of model scores can improve
profitability, consider an example campaign with the following
assumptions:

* Database size: 2,000,000
* Maximum possible response: 40,000
* Cost to reach one customer: $1.00
* Profit margin from a positive response: $40.00
As the table below shows, a random sampling of the full
customer/prospect database produces a loss regardless of the campaign target
size. However, by targeting customers using a Data Mining model, the
marketer can select a smaller target that includes a higher percentage of good
prospects. This more focused approach generates a profit until the target
becomes too large and includes too many poor prospects.

Campaign                 Random Selection                      Targeted Selection
Size        Cost         Response  Revenue     Net            Response  Revenue     Net

100,000     $100,000       2,000   $80,000     ($20,000)        4,000   $160,000    $60,000
400,000     $400,000       8,000   $320,000    ($80,000)       30,000   $1,200,000  $800,000
1,000,000   $1,000,000    20,000   $800,000    ($200,000)      35,000   $1,400,000  $400,000
2,000,000   $2,000,000    40,000   $1,600,000  ($400,000)      40,000   $1,600,000  ($400,000)

Table 4 - Campaign results: random vs. targeted selection
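The table's figures follow directly from the stated assumptions ($1.00 per contact, $40.00 margin per response), which a short calculation confirms:

```python
# Verify the campaign arithmetic behind the table.
COST_PER_CONTACT = 1.00  # $ per customer reached
MARGIN = 40.00           # $ profit per positive response

def net(campaign_size, responses):
    cost = campaign_size * COST_PER_CONTACT
    revenue = responses * MARGIN
    return revenue - cost

# Random selection at 100,000 contacts: 2% base rate -> 2,000 responders.
print(net(100_000, 2_000))    # -20000.0, a loss
# Targeted selection at the same size: the model concentrates 4,000 responders.
print(net(100_000, 4_000))    # 60000.0, a profit
# At the full database, both approaches converge on the same loss.
print(net(2_000_000, 40_000)) # -400000.0
```

The break-even response rate is 1/40, or 2.5%, which is exactly why the random campaign (a flat 2% rate) loses money at every size while the targeted one profits until its marginal response rate falls below that line.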

2.35 The data mining suite

The Data Mining Suite™ is truly unique, providing the most powerful, complete and
comprehensive solution for enterprise-wide, large-scale decision support. It leads the
world of discovery with the exceptional ability to directly mine large multi-table
SQL databases.

The Data Mining Suite works directly on large SQL repositories with no need for
sampling or extract files. It accesses large volumes of multi-table relational data on
the server, incrementally discovers powerful patterns and delivers automatically
generated English text and graphs as explainable documents on the intranet.

The Data Mining Suite is based on a solid foundation with a total vision for decision
support. The three-tiered, server-based implementation provides highly scalable
discovery on huge SQL databases with well over 90% of the computations
performed directly on the server, in parallel if desired.

Figure 2 – The Data Mining Suite

The Data Mining Suite relies on the genuinely unique mathematical foundation we
pioneered to usher in a new level of functionality for decision support. This
mathematical foundation has given rise to novel algorithms that work directly on
very large datasets, delivering unprecedented power and functionality. The power of
these algorithms allows us to discover rich patterns of knowledge in huge databases
that could never have been found before.

With server-based discovery, the Data Mining Suite performs over 90% of the
analyses on the server, with SQL, C programs and Java. Discovery takes place
simultaneously along multiple dimensions on the server, and is not limited by the
power of the client. The system analyzes both relational and multi-dimensional data,
discovering highly refined patterns that reveal the real nature of the dataset. Using
built-in advanced mathematical techniques, these findings are carefully merged by
the system and the results are delivered to the user in plain English, accompanied by
tables and graphs that highlight the key patterns.

The Data Mining Suite pioneered multi-dimensional data mining. Before this,
OLAP had usually been a multi-dimensional manual endeavor, while data mining
had been a single-dimensional automated activity. The Rule-based Influence
Discovery System™ bridged the gap between OLAP and data mining. This dramatic
new approach forever changed the way corporations use decision support. No longer
are OLAP and data mining viewed as separate activities; they are fused to deliver
maximum benefit. The patterns discovered by the system include multi-dimensional
influences and contributions, OLAP affinities and associations, comparisons, trends
and variations. The richness of these patterns delivers unparalleled business benefits
to users, allowing them to make better decisions than ever before.

The Data Mining Suite also pioneered the use of incremental pattern-base
population. With incremental data mining, the system automatically discovers
changes in patterns as well as the patterns of change. For instance, each month sales
data is mined and the changes in the sales trends as well as the trends of change in
how products sell together are added to the pattern-base. Over time, this knowledge
becomes a key strategic asset to the corporation.

The Data Mining Suite currently consists of these modules:

 Rule-based Influence Discovery


 Dimensional Affinity Discovery
 Trend Discovery Module
 Incremental Pattern Discovery
 Forensic Discovery
 The Predictive Modeler

These truly unique products are all designed to work together, and in concert with
the Knowledge Access Suite™.

 Rule-based Influence Discovery

The Rule-based Influence Discovery System is aware of both influences and
contributions along multiple dimensions and merges them in an intelligent
manner to produce very rich and powerful patterns that cannot be obtained by
either OLAP or data mining alone. The system performs multi-table, dimensional
data mining at the server level, providing the best possible results. The Rule-
based Influence Discovery System is not a multi-dimensional repository, but a
data mining system. It accesses granular data in a large database via standard
SQL and reaches for multi-dimensional data via a ROLAP approach of the user's
choosing.

 Dimensional Affinity Discovery


The Affinity Discovery System automatically analyzes large datasets and finds
association patterns that describe how various items "group together" or "happen
together". Flat affinity just tells us how items group together, without providing
logical conditions for the association. Dimensional (OLAP) affinity is more
powerful and describes the dimensional conditions under which stronger item
groupings take place. The Affinity Discovery System includes a number of useful
features that make it a unique industrial strength product. These features include
hierarchy and cluster definitions, exclusion lists, unknown-value management,
among others.
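The flat-affinity idea, counting how often items "happen together", can be illustrated with simple pair counting. This sketch is not the product's algorithm; the transactions are invented, and real affinity systems add support/confidence thresholds on top of counts like these.

```python
from collections import Counter
from itertools import combinations

# Toy transactions: each basket is a set of purchased items.
transactions = [
    {"paint", "paintbrush", "tape"},
    {"paint", "paintbrush"},
    {"tape", "rope"},
    {"paint", "paintbrush", "rope"},
]

# Count co-occurrences of every item pair across baskets.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The strongest flat affinity: paint and paintbrush group together 3 times.
print(pair_counts.most_common(1))
```

Dimensional affinity would go one step further, attaching conditions to the grouping (e.g. only on Saturdays), which is what distinguishes it from the flat counts shown here.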
 The OLAP Discovery System
The OLAP Discovery System is aware of both influences and contributions along
multiple dimensions and merges them in an intelligent manner to produce very
rich and powerful patterns that can not be obtained by either OLAP or data
mining alone. The system merges OLAP and data mining at the server level,
providing the best possible results. The OLAP Discovery System is not an OLAP
engine or a multi-dimensional repository, but a data mining system. It accesses
granular data in a large database via standard SQL and reaches for multi-
dimensional data via an OLAP/ROLAP engine of the user's choosing.
 Incremental Pattern Discovery
Incremental Pattern Discovery deals with temporal data segments that gradually
become available over time, e.g. once a week, once a month, etc. Data is
periodically supplied to the Incremental Discovery System in terms of a "data
snap-shot" which corresponds to a given time-segment, e.g. monthly sales
figures. Patterns in the data snap-shot are found on a monthly basis and are added
to the pattern-base. As new data becomes available (say once a month) the
system automatically finds new patterns, merges them with the previous patterns,
stores them in the pattern-base and notes the differences from the previous time-
periods.
 Trend Discovery
Trend Discovery with the Data Mining Suite uncovers time-related patterns that
deal with change and variation of quantities and measures. The system expresses
trends in terms of time-grains, time-windows, slopes and shapes. The time-grain
defines the smallest grain of time to be considered, e.g. a day, a week or a month.
Time-windows define how time grains are grouped together, e.g. we may look at
daily trends with weekly windows, or we may look at weekly grains with
monthly windows. Slopes define how quickly a measure is increasing or
decreasing, while shapes give us various categories of trend behavior, e.g.
smoothly increasing vs. erratically changing.
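The grain/window vocabulary can be made concrete with a small sketch: daily sales are the grain, weekly windows group the grains, and a crude slope over the window averages stands in for the system's trend measures. The figures are invented for illustration.

```python
# Daily grains: two weeks of toy sales figures.
daily_sales = [10, 12, 11, 13, 14, 15, 16,   # week 1
               18, 17, 19, 20, 22, 21, 23]   # week 2

window = 7  # weekly windows over daily grains
weekly_avgs = [sum(daily_sales[i:i + window]) / window
               for i in range(0, len(daily_sales), window)]

slope = weekly_avgs[1] - weekly_avgs[0]  # change per window
shape = "increasing" if slope > 0 else "flat or decreasing"
print(weekly_avgs, slope, shape)
```

Changing the window (say, to monthly) or the grain (say, to weekly) would change both the averages and the slope, which is exactly why the system lets the user pick both.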
 Forensic Discovery
Forensic Discovery with the Data Mining Suite relies on automatic anomaly
detection. The system first identifies what is usual and establishes a set of norms
through pattern discovery. The transactions or activities that deviate from the
norm are then identified as unusual. Business users can discover where unusual
activities may be originating and the proper steps can be taken to remedy and
control the problems. The automatic discovery of anomalies is essential because
the ingenious tactics used to spread activities across multiple transactions can
usually not be guessed beforehand.
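The establish-a-norm-then-flag-deviations idea can be sketched with a basic statistical rule. This is an illustration only: the transaction amounts are invented, and the 3-standard-deviation threshold is one common convention, not the product's actual method.

```python
import statistics

# Transaction amounts: a stable history plus one suspicious entry.
amounts = [120, 115, 130, 125, 118, 122, 980]

# Establish the norm from the known-good history (all but the last entry).
history = amounts[:-1]
mean = statistics.mean(history)
stdev = statistics.stdev(history)

# Flag anything that deviates from the norm by more than 3 standard deviations.
anomalies = [a for a in amounts if abs(a - mean) > 3 * stdev]
print(anomalies)  # [980]
```

The key property is that the norm is learned from the data itself, so the analyst never has to guess in advance what "unusual" will look like.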
 Predictive Modeler
The Data Mining Suite Predictive Modeler makes predictions and forecasts by
using the rules and patterns which the data mining process generates. While
induction performs pattern discovery to generate rules, the Predictive Modeler
performs pattern matching to make predictions based on the application of these
rules. The predictive models produced by the system have higher accuracy
because the discovery process works on the entire dataset and need not rely on
sampling.

The output from the seven component products of the Data Mining Suite is stored
within the pattern-base and is accessible with PQL, the Pattern Query
Language. Readable English text and graphs are automatically generated in
ASCII and HTML formats for delivery on the inter/intranet.

2.36 The Data Mining Suite is Unique

A. The Reasons for the Multi-faceted Power

The products in the Data Mining Suite™ deliver the most advanced and
scalable technologies within a user-friendly environment. These strengths
draw on the solid mathematical foundation which Information Discovery,
Inc. pioneered, and on a highly scalable implementation.

The Data Mining Suite is distinguished by the following unique
capabilities:

 Direct Access to Very Large SQL Databases


The Data Mining Suite works directly on very large SQL databases and
does not require samples, extracts and/or flat files. This alleviates the
problems associated with flat files which lose the SQL engine's power
(e.g. parallel execution) and which provide marginal results. Another
advantage of working on an SQL database is that the Data Mining Suite
has the ability to deal with both numeric and non-numeric data
uniformly. The Data Mining Suite does not fix the ranges in numerical
data beforehand, but finds ranges in the data dynamically by itself.

 Multi-Table Discovery
The Data Mining Suite discovers patterns in multi-table SQL databases
without having to join and build an extract file. This is a key issue in
mining large databases. The world is full of multi-table databases which
can not be joined and meshed into a single view. In fact, the theory of
normalization came about because data needs to be in more than one
table. Using single tables is an affront to all the work of E.F. Codd on
database design. If you challenge the DBA in a really large database to
put things in a single table you will either get a laugh or a blank stare --
in many cases the database size will balloon beyond control. In fact,
there are many cases where no single view can correctly represent the
semantics of influence because the ratios will always be off regardless
of how you join. The Data Mining Suite leads the world of discovery
with the unique ability to mine large multi-table databases.
 No Sampling or Extracts
Sampling theory was invented because one could not have access to the
underlying population being analyzed. But a warehouse is there to
provide such access.
 General and Powerful Patterns
The format of the patterns discovered by the Data Mining Suite is very
general and goes far beyond decision trees or simple affinities. The
advantage to this is that the general rules discovered are far more
powerful than decision trees. Decision trees are very limited in that they
cannot find all the information in the database. Being rule-based keeps
the Data Mining Suite from being constrained to one part of a search
space and makes sure that many more clusters and patterns are found --
allowing the Data Mining Suite to provide more information and better
predictions.
 Language of Expression
The Data Mining Suite has a powerful language of expression, going
far beyond what most other systems can handle. For instance, for logical
statements it can express statements such as "IF Destination State =
Departure State THEN..." or "IF State is not Arizona THEN ...".
Surprisingly, most other data mining systems cannot express these
simple patterns. And the Data Mining Suite pioneered dimensional
affinities such as "IF Day = Saturday WHEN PaintBrush is purchased
ALSO Paint is purchased". Again, most other systems cannot handle this
obvious logic.
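The kinds of logical patterns quoted above can be expressed as plain predicates over records, which is one way to see why they are harder than simple attribute-equals-constant rules. The field names here are illustrative assumptions, not the product's schema.

```python
# Toy trip records for testing the two quoted rule forms.
trips = [
    {"departure_state": "AZ", "destination_state": "AZ"},
    {"departure_state": "CA", "destination_state": "NY"},
]

# "IF Destination State = Departure State THEN ..." compares two fields,
# not a field against a constant.
def same_state(t):
    return t["destination_state"] == t["departure_state"]

# "IF State is not Arizona THEN ..." is a negated condition.
def not_arizona(t):
    return t["departure_state"] != "AZ"

print([same_state(t) for t in trips])   # [True, False]
print([not_arizona(t) for t in trips])  # [False, True]
```

Field-to-field comparisons and negations enlarge the hypothesis space considerably, which is the claimed advantage of a richer language of expression.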
 Uniform Treatment of Numeric and Non-numeric Data
The Data Mining Suite is unique in its ability to deal with various data
types in a uniform manner. It can smoothly deal with a large number of
non-numeric values and also automatically discovers ranges within
numeric data. Moreover, the Data Mining Suite does not fix the ranges
in numerical data but discovers interesting ranges by itself. For
example, given the field Age, the Data Mining Suite does not expect
this to be broken into 3 segments of (1-30), (31-60), (61 and above).
Instead it may find two ranges such as (27-34) and (48-61) as important
in the data set and will use these in addition to the other ranges.
 Use of Data Dependencies
Should a data mining system be aware of the functional (and other
dependencies) that exist in a database? "Yes" and very much so. The
use of these dependencies can significantly enhance the power of a
discovery system -- in fact ignoring them can lead to confusion. The
Data Mining Suite takes advantage of data dependencies.
 Server-based Architectures
The Data Mining Suite has a three-level client/server architecture
whereby the user interface runs on a thin intranet client and the back-
end analysis is done on a Unix server. The majority of the processing
time is spent on the server, and these computations run both via
parallel SQL and via non-SQL calls managed by the Data Mining Suite
itself. Only about 50% of the computations on the server are SQL-
based; the other statistical computations are managed by the Data
Mining Suite program itself, at times by starting separate processes on
different nodes of the server.
 System Initiative
The Data Mining Suite uses system initiative in the data mining
process. It forms hypotheses automatically based on the character of the
data and converts the hypotheses into SQL statements forwarded to the
RDBMS for execution. The Data Mining Suite then selects the
significant patterns and filters out the unimportant trends.
 Transparent Discovery and Predictions
The Data Mining Suite provides explanations as to how the patterns are
being derived. This is unlike neural nets and other opaque techniques in
which the mining process is a mystery. Also, when performing
predictions, the results are transparent. Many business users insist on
understandable and transparent results.
 Not Noise Sensitive
The Data Mining Suite is not sensitive to noise because internally it
uses fuzzy logic analysis. As the data gathers noise, the Data Mining
Suite will only reduce the level of confidence associated with the
results provided. However, it will still produce the most significant
findings from the data set.
 Analysis of Large Databases
The Data Mining Suite has been specifically tuned to work on databases
with an extremely large number of rows. It can deal with data sets of 50
to 100 million records on parallel machines. It derives its capabilities
from the fact that it does not need to write extracts and uses SQL
statements to perform its process. Generally the analyses performed in
the Data Mining Suite are performed on about 50 to 120 variables and
30 to 100 million records directly. It is, however, easier to increase the
number of records based on the specific optimization options with the
Data Mining Suite to deal with very large databases.

These unique features and benefits make the Data Mining Suite the ideal
solution for large-scale Data Mining in business and industry.

B. What Data Mining Can't Do


Data mining is a tool, not a magic wand. It won’t sit in your
database watching what happens and send you e-mails to get your
attention when it sees an interesting pattern. It doesn’t eliminate the need
to know your business, to understand your data, or to understand analytical
methods. Data mining assists business analysts with finding patterns and
relationships in the data; it does not tell you the value of the patterns to
the organization. Furthermore, the patterns uncovered by data mining must be
verified in the real world.
Remember that the predictive relationships found via data mining
are not necessarily causes of an action or behavior. For example, data
mining might determine that males with incomes between $50,000 and
$65,000 who subscribe to certain magazines are likely purchasers of a
product you want to sell. While you can take advantage of this pattern,
say by aiming your marketing at people who fit it, you should not
assume that any of these factors cause them to buy your product.
To ensure meaningful results, it’s vital that you understand your
data. The quality of your output will often be sensitive to outliers (data
values that are very different from the typical values in your database),
irrelevant columns or columns that vary together (such as age and date of
birth), the way you encode your data, and the data you leave in and the
data you exclude. Algorithms vary in their sensitivity to such data issues,
but it is unwise to depend on a data-mining product to make all the right
decisions on its own.
Data mining will not automatically discover solutions without
guidance. Rather than setting the vague goal "Help improve the response
to my direct mail solicitation", you might use data mining to find the
characteristics of people who (1) respond to your solicitation, or (2)
respond AND make a large purchase. The patterns data mining finds for
those two goals may be very different.
Although a good data mining tool shelters you from the intricacies
of statistical techniques, it requires you to understand the working of the
tools you choose and the algorithms on which they are based. The choices
you make in setting up your data mining tool and the optimizations you
choose will affect the accuracy and speed of your models.
CHAPTER III
CLOSING

3.1 CONCLUSION

Data warehouse systems enable us to store large volumes of data from a
variety of interrelated databases and process them together. A data warehouse
thus answers the complex OLAP queries made by the analyst and gives the
required information. Hence, a data warehousing system provides the right way
to access large amounts of data in a fraction of the time. Data Mining is the
extraction of hidden predictive information from large databases. This is a
powerful new technology with great potential to help companies focus on the
most important information in their data warehouses. Data mining tools predict
future trends and behaviors, allowing businesses to make proactive,
knowledge-driven decisions.

REFERENCES

Ani. (2010, January 13). Data Mining. Scribd. Retrieved from
https://www.scribd.com/doc/25152759/Data-Mining

Dalal, Hiren. (2009, August 23). Data Mining. Scribd. Retrieved from
https://www.scribd.com/document/19018763/Data-Mining

Smith, Bridget. (2009, November 16). Data Mining and Data Warehousing.
Scribd. Retrieved from https://www.scribd.com/doc/22594630/DATA-MINING-AND-DATA-WAREHOUSING
