
DATA WAREHOUSING AND DATA MINING

Overview
Introduction
Data Warehousing
Data Warehousing V/S OLAP
Data Mining

Motivation: Necessity is the Mother of Invention

Data explosion problem

Solution: Data warehousing and data mining

Data warehousing and on-line analytical processing

Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

What is a Data Warehouse?


A single, complete, and consistent store of data obtained from a variety of different sources and made available to end users in a way they can understand and use in a business context.

Warehouses are Very Large Databases


[Figure: bar chart of the percentage of respondents (0% to 35%) by data warehouse size (5GB, 5-9GB, 10-19GB, 20-49GB, 50-99GB, 100-249GB, 250-499GB, 500GB-1TB), comparing initial size with projected size for 2Q96. Source: META Group, Inc.]

Very Large Data Bases


Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
Petabytes -- 10^15 bytes: Geographic Information Systems
Exabytes -- 10^18 bytes: National Medical Records
Zettabytes -- 10^21 bytes: Weather images
Yottabytes -- 10^24 bytes: Intelligence Agency Videos

Data Warehousing - It is a Process


A technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible.
A decision support database maintained separately from the organization's operational database.

Characteristics of Data Warehouse


A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.

Data Warehouse Architecture


[Figure: data warehouse architecture. Components shown: source systems (Relational Databases, ERP Systems), Extraction and Cleansing, Optimized Loader, Data Warehouse Engine, Analyze and Query tools, Metadata Repository.]
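The extraction, cleansing, and loading stages named in the architecture can be sketched in a few lines. This is only an illustration under stated assumptions: the source rows, the field names, the `COUNTRY_CODES` lookup, and the `sales` table are all hypothetical, and an in-memory SQLite database stands in for the warehouse engine.

```python
import sqlite3

# Hypothetical rows from two source systems that name and format fields differently.
CRM_ROWS = [{"cust": " Alice ", "country": "us", "amount": "120.50"}]
ERP_ROWS = [{"customer_name": "BOB", "country_code": "USA", "total": 80}]

# Assumed lookup used to bring country values onto one convention.
COUNTRY_CODES = {"us": "US", "usa": "US"}


def extract():
    """Yield records from every source in one common shape."""
    for row in CRM_ROWS:
        yield {"customer": row["cust"], "country": row["country"], "amount": row["amount"]}
    for row in ERP_ROWS:
        yield {"customer": row["customer_name"], "country": row["country_code"], "amount": row["total"]}


def cleanse(record):
    """Resolve formatting inconsistencies before the record is loaded."""
    return {
        "customer": record["customer"].strip().title(),
        "country": COUNTRY_CODES.get(record["country"].strip().lower(), "UNKNOWN"),
        "amount": float(record["amount"]),
    }


def load(conn, records):
    """Append cleansed records to the (hypothetical) warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, country TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (:customer, :country, :amount)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the warehouse engine
load(conn, [cleanse(r) for r in extract()])
rows = conn.execute("SELECT customer, country, amount FROM sales ORDER BY customer").fetchall()
print(rows)
```

Both sources end up in one table with one naming and formatting convention, which is the point of cleansing before the load.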

Application Areas
Industry                  Application
Finance                   Credit card analysis
Insurance                 Claims, fraud analysis
Telecommunication         Call record analysis
Transport                 Logistics management
Consumer goods            Promotion analysis
Data service providers    Value added data
Utilities                 Power usage analysis

What makes data mining possible?


Advances in the following areas are making data mining deployable:
data warehousing
better and more data (i.e., operational, behavioral, and demographic)
the emergence of easily deployed data mining tools
the advent of new data mining techniques

Benefits of Data warehouse


A data warehouse provides a common data model for all data of interest regardless of the data's source.
Prior to loading data into the data warehouse, inconsistencies are identified and resolved.
Because they are separate from operational systems, data warehouses provide retrieval of data without slowing down operational systems.
Data warehouses can work in conjunction with and, hence, enhance the value of operational business applications, notably customer relationship management (CRM) systems.

DISADVANTAGES OF DATA WAREHOUSES


Data warehouses are not the optimal environment for unstructured data.
There is an element of latency in data warehouse data.
Data warehouses can have high costs.
Maintenance costs are high.
Data warehouses can get outdated relatively quickly.

So, what's different between OLTP and DW?

OLTP vs Data Warehouse


OLTP                      Warehouse
Application oriented      Subject oriented
Used to run business      Used to analyze business
Detailed data             Summarized and refined
Current, up to date       Snapshot data
Isolated data             Integrated data

OLTP V/S Data Warehouse


OLTP                                     Data Warehouse
Performance sensitive                    Performance relaxed
Few records accessed at a time (tens)    Large volumes accessed at a time (millions)
Read/update access                       Mostly read (batch update)
No data redundancy                       Redundancy present
Database size 100 MB - 100 GB            Database size 100 GB - few terabytes
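The difference in access patterns can be illustrated with two queries against the same data. This is a rough sketch, assuming a hypothetical `orders` table in an in-memory SQLite database; a real OLTP system and a real warehouse would be separate databases with different schemas.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, status TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (region, status, amount) VALUES (?, ?, ?)",
    [
        ("north", "open", 100.0),
        ("north", "shipped", 250.0),
        ("south", "open", 75.0),
        ("south", "shipped", 125.0),
    ],
)

# OLTP-style access: read and update a single, current record by its key.
conn.execute("UPDATE orders SET status = 'shipped' WHERE id = ?", (1,))
conn.commit()
status = conn.execute("SELECT status FROM orders WHERE id = 1").fetchone()[0]

# Warehouse-style access: a read-only scan that summarizes many records.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(status)
print(totals)
```

The first query touches one row and changes it; the second touches every row and changes nothing, which is why the two workloads are usually tuned and hosted separately.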

To summarize DW & OLTP...


OLTP Systems are used to run a business

The Data Warehouse helps to optimize the business



What Is Data Mining?


The objective of data mining is to extract valuable information from your data, to discover the "hidden gold".
Note that you do not need a data warehouse to use data mining successfully; all you need is data.
On-Line Analytical Processing (OLAP): a DM tool.

DATA MINING MODELS


According to IBM:
Verification Model
Discovery Model

Steps for Data Mining


Identify
Find sales relationships between specific products or services.
Identify specific purchasing patterns over time.
Identify potential types of customers.
Find product sales trends.
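One simple way to look for sales relationships between products is to count how often pairs of products appear in the same transaction. The sketch below assumes a tiny, made-up list of baskets; the product names and the `support` helper are illustrative only.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set is the products bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "milk"},
]

# Count every unordered product pair that occurs within a basket.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1


def support(product_a, product_b):
    """Fraction of all transactions that contain both products."""
    pair = tuple(sorted((product_a, product_b)))
    return pair_counts[pair] / len(transactions)


top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count, support("bread", "milk"))
```

Pairs with high support are candidates for a closer look, for example as inputs to an association-rule analysis.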

Select
Are the data adequate to describe the phenomena the data mining analysis is attempting to model?
Can you enhance internal customer records with external lifestyle and demographic data?
Are the data stable: will the mined attributes be the same after the analysis?
If you are merging databases, can you find a common field for linking them?
How current and relevant are the data to the business goal?
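The question about a common linking field can be checked before any mining starts. The sketch below assumes two hypothetical datasets keyed by a shared customer ID and reports how many internal records can actually be enriched with the external data.

```python
# Hypothetical internal customer records, keyed by customer ID.
customers = {
    101: {"name": "Alice"},
    102: {"name": "Bob"},
    103: {"name": "Chen"},
}

# Hypothetical external demographic data using the same ID as the common field.
demographics = {
    101: {"age_band": "25-34"},
    103: {"age_band": "45-54"},
    999: {"age_band": "35-44"},  # no matching internal customer
}

# Link the two sources on the IDs they have in common.
common_ids = customers.keys() & demographics.keys()
linked = {cid: {**customers[cid], **demographics[cid]} for cid in common_ids}

# Internal customers that could not be enriched.
unmatched = sorted(customers.keys() - demographics.keys())
coverage = len(linked) / len(customers)

print(linked)
print(unmatched, round(coverage, 2))
```

A low coverage figure suggests the external data may not be worth merging, or that a different linking field is needed.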

Prepare
Establish strategies for handling missing data, extraneous noise, and outliers.
Identify redundant variables in the dataset and decide which fields to exclude.
Decide on a log or square transformation, if necessary.
Visually inspect the dataset to get a feel for the database.
Determine the distribution frequencies of the data.
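Several of these preparation steps can be tried on a single column. The sketch below uses made-up income values and makes its own choices, none of which come from the slides: missing values are filled with the median, outliers are flagged with the common 1.5 x IQR rule, and values are grouped into bands of 10,000 for the frequency count.

```python
import math
import statistics
from collections import Counter

# Hypothetical income column with one missing value and one extreme value.
incomes = [32000, 41000, None, 38000, 45000, 1_200_000, 36000]

# Missing data: fill gaps with the median of the observed values (one possible strategy).
observed = [x for x in incomes if x is not None]
median = statistics.median(observed)
filled = [median if x is None else x for x in incomes]

# Outliers: flag values outside 1.5 * IQR of the quartiles (an assumed rule of thumb).
q1, _, q3 = statistics.quantiles(filled, n=4)
iqr = q3 - q1
outliers = [x for x in filled if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

# Transformation: a log scale compresses the long right tail.
log_incomes = [math.log10(x) for x in filled]

# Distribution frequencies: count values per band of 10,000.
bands = Counter(int(x) // 10_000 * 10_000 for x in filled)

print(median, outliers)
print(sorted(bands.items()))
```

Whether to drop, cap, or keep the flagged outlier is a judgment call that depends on the business goal.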

Audit the data


What is the ratio of categorical/binary attributes in the database?
What is the nature and structure of the database?
What is the overall condition of the dataset?
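The ratio of categorical and binary attributes can be estimated with a quick pass over a sample of rows. This is a minimal sketch: the sample rows are invented, and the `attribute_kind` heuristic (0/1 values are binary, all-string columns are categorical, everything else is numeric) is an assumption rather than a standard definition.

```python
# Hypothetical sample of rows from the database being audited.
rows = [
    {"age": 34, "gender": "F", "is_member": 1, "city": "Pune", "spend": 120.5},
    {"age": 51, "gender": "M", "is_member": 0, "city": "Delhi", "spend": 80.0},
    {"age": 27, "gender": "F", "is_member": 1, "city": "Pune", "spend": 42.25},
]


def attribute_kind(values):
    """Classify a column with a simple, assumed heuristic."""
    if set(values) <= {0, 1}:
        return "binary"
    if all(isinstance(v, str) for v in values):
        return "categorical"
    return "numeric"


kinds = {col: attribute_kind([row[col] for row in rows]) for col in rows[0]}
non_numeric = [col for col, kind in kinds.items() if kind in ("categorical", "binary")]
ratio = len(non_numeric) / len(kinds)

print(kinds)
print(ratio)
```

A high ratio points toward tools and algorithms that handle categorical inputs well, which feeds directly into tool selection.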

Select the Tool


Is the data set heavily categorical?
What platforms do your candidate tools support?
Are the candidate tools ODBC-compliant?
What data format can the tools import?

Format the Solution


What is the optimum format of the solution: decision tree, rules, C code, or SQL syntax?
What are the available format options?
What is the goal of the solution?

Construct the Model


Are error rates at acceptable levels? Can you improve them?
What extraneous attributes did you find? Can you purge them?
Is additional data or a different methodology necessary?
Will you have to train and test a new data set?
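Error rates are usually measured on data the model did not see during training. The sketch below assumes a deliberately simple model, a single threshold on one made-up numeric feature, and compares its error rate on the training set with its error rate on a held-out test set.

```python
# Hypothetical (feature value, class label) pairs.
train = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
test = [(1.5, 0), (2.5, 0), (6.5, 1), (5.0, 1)]


def predict(threshold, x):
    """Predict class 1 when the feature reaches the threshold."""
    return 1 if x >= threshold else 0


def error_rate(threshold, data):
    """Fraction of examples the threshold model gets wrong."""
    wrong = sum(predict(threshold, x) != y for x, y in data)
    return wrong / len(data)


# "Training": pick the candidate threshold with the lowest training error.
candidates = sorted(x for x, _ in train)
best_threshold = min(candidates, key=lambda t: error_rate(t, train))

train_error = error_rate(best_threshold, train)
test_error = error_rate(best_threshold, test)
print(best_threshold, train_error, test_error)
```

A test error noticeably above the training error is a hint that more data or a different methodology may be needed.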

Validate the Findings


Do the findings make sense?
Do you have to return to any prior steps to improve results?
Can you use other data mining tools to replicate the findings?

Deliver The Findings


Will additional data improve the analysis?
What strategic insight did you discover and how is it applicable?
What proposals can result from the data mining analysis?
Do the findings meet the business objective?

Integrate The Solution


SQL syntax for distribution to end-users
C code incorporated into a production system
Rules integrated into a decision support system
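As an illustration of delivering mined rules as SQL, the sketch below turns a small rule list into a parameterized query. Everything here is hypothetical: the rule format, the `customers` table, and the allow-lists that restrict which columns and operators may appear in the generated SQL.

```python
import sqlite3

# Hypothetical mined rule: a conjunction of (column, operator, value) conditions.
rules = [("age", ">=", 30), ("income", ">", 50000)]

# Only known columns and operators may be placed into the SQL text.
ALLOWED_COLUMNS = {"age", "income"}
ALLOWED_OPERATORS = {"<", "<=", "=", ">=", ">"}


def rules_to_sql(rule_list):
    """Build a SELECT statement and its parameter list from the rules."""
    clauses, params = [], []
    for column, operator, value in rule_list:
        if column not in ALLOWED_COLUMNS or operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported rule: {column} {operator}")
        clauses.append(f"{column} {operator} ?")
        params.append(value)
    return "SELECT name FROM customers WHERE " + " AND ".join(clauses), params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, age INTEGER, income REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Alice", 35, 60000), ("Bob", 25, 70000), ("Chen", 40, 45000)],
)

sql, params = rules_to_sql(rules)
matches = [row[0] for row in conn.execute(sql, params)]
print(sql)
print(matches)
```

Passing values as parameters rather than formatting them into the string keeps the generated query safe to hand to end users.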

Data Mining Algorithms


Some of the DM algorithms are:
Neural Networks
Decision Trees

Neural Network

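The slide content for this topic did not survive, so the following is only a generic illustration and not the presenter's example. It sketches the simplest neural unit, a single perceptron, learning the logical AND function; the training data, the learning rate of 1, and the epoch limit are all assumptions.

```python
# Training data for logical AND: ((input 1, input 2), target output).
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights = [0, 0]
bias = 0


def predict(x):
    """Fire (output 1) when the weighted sum of the inputs is positive."""
    activation = weights[0] * x[0] + weights[1] * x[1] + bias
    return 1 if activation > 0 else 0


# Perceptron learning rule: nudge weights toward the target after each mistake.
for epoch in range(20):
    mistakes = 0
    for x, target in data:
        error = target - predict(x)
        if error != 0:
            weights[0] += error * x[0]
            weights[1] += error * x[1]
            bias += error
            mistakes += 1
    if mistakes == 0:  # every example classified correctly
        break

print([predict(x) for x, _ in data])
```

Practical neural networks stack many such units in layers and train them with gradient-based methods, but the idea of adjusting weights from errors is the same.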

Decision Trees

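The slide content for this topic did not survive either, so this is again a generic illustration. It sketches a one-level decision tree (a "stump") that chooses the attribute whose split gives the lowest weighted Gini impurity; the records, attribute names, and the choice of Gini as the criterion are assumptions.

```python
from collections import Counter

# Hypothetical customer records with a class label in "buys".
rows = [
    {"income": "high", "student": "no", "buys": "no"},
    {"income": "high", "student": "yes", "buys": "yes"},
    {"income": "low", "student": "no", "buys": "no"},
    {"income": "low", "student": "yes", "buys": "yes"},
    {"income": "high", "student": "no", "buys": "no"},
    {"income": "low", "student": "yes", "buys": "yes"},
]
LABEL = "buys"
FEATURES = ["income", "student"]


def gini(labels):
    """Gini impurity: 0.0 for a pure group, larger for mixed groups."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def group_labels(feature):
    """Collect the class labels that fall under each value of the feature."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[LABEL])
    return groups


def split_impurity(feature):
    """Weighted average impurity of the groups produced by the split."""
    return sum(len(g) / len(rows) * gini(g) for g in group_labels(feature).values())


# Root of the tree: the feature with the purest split.
best_feature = min(FEATURES, key=split_impurity)

# Leaves: each branch predicts its majority class.
leaves = {
    value: Counter(labels).most_common(1)[0][0]
    for value, labels in group_labels(best_feature).items()
}


def predict(row):
    return leaves[row[best_feature]]


print(best_feature, leaves)
```

A full decision tree repeats the same split selection inside each branch until the groups are pure enough or too small to split further.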

Thank you !!!

