
DATA WAREHOUSE & QUALITY ISSUES

Information Search and Analysis Skills

Venue: NIIT Ltd, Agra.
Date: 15 Dec 2008
Semester: 4

Credits:
Amol Shrivastav
Mohit Bhaduria
Harsha Rajwanshi

Guidance & support:
Gunjan Verma
Contents

 Introduction
 Measuring Data Quality
 Tools for Data Quality
 Data Quality Methodology
 ETL
Section 1

By
Amol Shrivastav
A producer wants to know….

 Which are our lowest/highest margin customers?
 Who are my customers and what products are they buying?
 What is the most effective distribution channel?
 What product promotions have the biggest impact on revenue?
 Which customers are most likely to go to the competition?
 What impact will new products/services have on revenue and margins?
Data, Data everywhere, yet ...

 I can’t find the data I need
 data is scattered over the network
 many versions, subtle differences

 I can’t get the data I need
 need an expert to get the data

 I can’t understand the data I found
 available data poorly documented

 I can’t use the data I found
 results are unexpected
 data needs to be transformed from one form to another
What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way that they can understand and use in a business context.

[Barry Devlin]
Data Flow
Section 2

By
Mohit Bhaduria
Measuring Data Quality
Attributes for measuring Data Quality

 Interpretability – do I know what the fields mean? Do I know when the data I’m using was last updated?

 Usefulness – is the data relevant for my needs? Is the data current?

 Believability – am I missing too much data? Are there strong biases? Is the data quality consistent?

 Accessibility – do the people who need to use the data have the proper access? Is the system crashing or too slow?
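The questions above can be turned into simple, measurable checks. The following is a minimal Python sketch (not from the slides) computing two illustrative proxies – completeness of a field (a believability signal) and staleness of the newest record (a currency/usefulness signal); the field names "amount" and "last_updated" are assumptions for illustration.

from datetime import datetime, timezone

def completeness(records, field):
    """Share of records where `field` is present and non-empty (believability proxy)."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def staleness_days(records, ts_field="last_updated"):
    """Age in days of the most recently updated record (currency proxy)."""
    stamps = [datetime.fromisoformat(r[ts_field]) for r in records if r.get(ts_field)]
    if not stamps:
        return None
    return (datetime.now(timezone.utc) - max(stamps)).days

if __name__ == "__main__":
    batch = [
        {"customer": "A-101", "amount": 250.0, "last_updated": "2008-12-01T10:00:00+00:00"},
        {"customer": "A-102", "amount": None,  "last_updated": "2008-11-20T09:30:00+00:00"},
    ]
    print("completeness(amount):", completeness(batch, "amount"))
    print("staleness (days):", staleness_days(batch))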
Linking Quality Factors to DW quality

[Diagram: data warehouse quality broken down into Accessibility, Interpretability, Usefulness, Believability and Validation, each linked to the DW aspects it affects – DW design, update policy, data sources, models and languages, DW evolution, query processing, and the DW data and processes.]
Quality metamodel
The quality metamodel can be used for both design and analysis purposes. The DWQ quality metamodel is based on the Goal-Question-Metric approach.

DWQM is a continuous process throughout the life of the DW.
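As a rough illustration of the Goal-Question-Metric idea, here is a tiny Python sketch; the specific goal, questions and metrics are invented for the example and are not taken from the DWQ project.

# Illustrative Goal-Question-Metric structure for one DW quality goal.
gqm = {
    "goal": "Improve believability of the customer dimension",
    "questions": [
        {
            "question": "How complete is the customer data?",
            "metrics": ["null rate per column", "share of records failing mandatory-field rules"],
        },
        {
            "question": "How consistent is it across sources?",
            "metrics": ["share of customers with conflicting addresses between sources"],
        },
    ],
}

for q in gqm["questions"]:
    print(q["question"], "->", ", ".join(q["metrics"]))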
Section 3

By
Harsha Rajwanshi
Tools for Data Warehouse Quality
Tools for Data quality
 The tools that may be used to extract/transform/clean the source data, or to measure/control the quality of the inserted data, can be grouped into the following categories:

 Data auditing tools
 Data cleansing tools
 Data migration tools
 Data quality analysis tools
Tools for Data quality
 Data auditing tools enhance the accuracy and correctness of the data at the source.

 Data cleansing tools are used in the intermediate staging area. They contain features which perform the following functions:

 Data parsing (elementising)
 Data standardization
 Data correction and verification
 Record matching
 Data transformation
 Householding
 Documenting

 Data migration tools are responsible for converting the data from one platform to another.

 The SQL Loader module of Oracle, Carleton’s Pure Integrate (formerly known as Enterprise/Integrator), ETI Data Cleanse, the EDD Data Cleanser tool and Integrity from Vality can be used to apply the rules that govern data cleaning, typical integration tasks, etc.
Data Quality Methodology
Data Quality Methodology
 Profiling and Assessment
 Cleansing
 Data integration/consolidation
 Data Augmentation
Profiling & Assessment
 There are many different techniques and processes for data profiling. They can be grouped into three major categories (a small profiling sketch follows the list):

 Pattern Analysis – expected patterns, pattern distribution, pattern frequency and drill-down analysis
 Column Analysis – cardinality, null values, ranges, minimum/maximum values, frequency distribution and various statistics
 Domain Analysis – expected or accepted data values and ranges
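A minimal column-analysis sketch in Python, assuming the table is available as a list of dicts; the column names are illustrative.

from collections import Counter

def profile_column(rows, column):
    """Basic column analysis: row count, nulls, cardinality, min/max, top values."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    freq = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "cardinality": len(freq),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "top_values": freq.most_common(3),
    }

if __name__ == "__main__":
    rows = [
        {"city": "Agra", "qty": 3},
        {"city": "Agra", "qty": None},
        {"city": "Delhi", "qty": 7},
    ]
    print(profile_column(rows, "city"))
    print(profile_column(rows, "qty"))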
Cleansing
 Data cleansing focuses on 3 main categories (a parsing/standardization sketch follows):
 Business Rule Creation,
 Standardizing and
 Parsing.
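A small sketch of the parsing and standardizing steps with one business rule check; the record layout and the rule itself are assumptions for illustration.

import re

# Illustrative business rule: every customer record must carry a 6-digit postal code.
POSTAL_CODE_RULE = re.compile(r"^\d{6}$")

def parse_name(raw):
    """Parse a raw 'LAST, first' string into its elements."""
    last, _, first = raw.partition(",")
    return {"first_name": first.strip(), "last_name": last.strip()}

def standardize(record):
    """Standardize case and common street abbreviations."""
    record["first_name"] = record["first_name"].title()
    record["last_name"] = record["last_name"].title()
    record["street"] = record["street"].replace(" rd", " Road").replace(" st", " Street")
    return record

def check_rules(record):
    """Return a list of business-rule violations for the record."""
    errors = []
    if not POSTAL_CODE_RULE.match(record.get("postal_code", "")):
        errors.append("postal_code must be 6 digits")
    return errors

raw = {"name": "SMITH, john", "street": "12 park rd", "postal_code": "2820"}
record = {**parse_name(raw["name"]), "street": raw["street"], "postal_code": raw["postal_code"]}
record = standardize(record)
print(record, check_rules(record))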
Data Integration and Consolidation
 The data can also be linked implicitly by defining join criteria on similar values, using a generated unique value or match codes based on fuzzy-logic algorithms (a match-code sketch follows).
 Determine what process to follow to consolidate/combine or remove redundant data.
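A minimal match-code sketch; the normalization choices are illustrative, and real tools use far more elaborate fuzzy-logic algorithms.

import hashlib
import re

def match_code(name, city):
    """Generate a crude match code: uppercase, strip punctuation/whitespace, then hash."""
    key = re.sub(r"[^A-Z]", "", (name + city).upper())
    return hashlib.md5(key.encode()).hexdigest()[:8]

customers = [
    {"id": 1, "name": "J. Smith", "city": "Agra"},
    {"id": 2, "name": "J Smith",  "city": "agra"},
    {"id": 3, "name": "R. Verma", "city": "Delhi"},
]

# Records sharing a match code are candidates for consolidation.
groups = {}
for c in customers:
    groups.setdefault(match_code(c["name"], c["city"]), []).append(c["id"])

print({code: ids for code, ids in groups.items() if len(ids) > 1})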
Extraction, Transformation & Loading in DW

Capture = extract… obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse

 Static extract = capturing a snapshot of the source data at a point in time
 Incremental extract = capturing changes that have occurred since the last static extract
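A sketch of the two capture modes against a source table, using sqlite3 purely as a stand-in source; the table and column names are assumptions.

import sqlite3

def static_extract(conn):
    """Static extract: full snapshot of the chosen subset of source data."""
    return conn.execute("SELECT id, amount, last_modified FROM orders").fetchall()

def incremental_extract(conn, last_extract_time):
    """Incremental extract: only rows changed since the last extract."""
    return conn.execute(
        "SELECT id, amount, last_modified FROM orders WHERE last_modified > ?",
        (last_extract_time,),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, last_modified TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 100.0, "2008-12-14 08:00:00"),
    (2, 250.0, "2008-12-15 09:30:00"),
])
print(static_extract(conn))
print(incremental_extract(conn, "2008-12-15 00:00:00"))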
Scrub = cleanse… uses pattern recognition and other techniques to upgrade data quality

 Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
 Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
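A small scrub-step sketch showing error detection/logging, a simple correction, time stamping and surrogate-key generation; the rules and field names are illustrative.

import itertools
from datetime import datetime, timezone

surrogate_keys = itertools.count(1)
error_log = []

# Illustrative lookup of known misspellings.
CITY_FIXES = {"Dehli": "Delhi", "Agre": "Agra"}

def scrub(record):
    """Detect/log errors, fix known misspellings, stamp and key the record."""
    if not record.get("order_date"):
        error_log.append((record.get("id"), "missing order_date"))
    record["city"] = CITY_FIXES.get(record.get("city"), record.get("city"))
    record["load_ts"] = datetime.now(timezone.utc).isoformat()   # time stamping
    record["dw_key"] = next(surrogate_keys)                      # key generation
    return record

rows = [
    {"id": "A1", "city": "Dehli", "order_date": "2008-12-10"},
    {"id": "A2", "city": "Agra",  "order_date": None},
]
print([scrub(r) for r in rows])
print("errors:", error_log)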
Load/Index = place transformed data into the warehouse and create indexes

 Refresh mode: bulk rewriting of target data at periodic intervals
 Update mode: only changes in source data are written to the data warehouse
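A sketch of the two load modes against a warehouse table, with sqlite3 used as a stand-in target; the table name is an assumption.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dw_sales (id INTEGER PRIMARY KEY, amount REAL)")

def refresh_load(rows):
    """Refresh mode: bulk rewrite of the target table at periodic intervals."""
    conn.execute("DELETE FROM dw_sales")
    conn.executemany("INSERT INTO dw_sales VALUES (?, ?)", rows)

def update_load(rows):
    """Update mode: write only the changed source rows (insert-or-replace)."""
    conn.executemany("INSERT OR REPLACE INTO dw_sales VALUES (?, ?)", rows)

refresh_load([(1, 100.0), (2, 250.0)])
update_load([(2, 275.0), (3, 90.0)])        # only changed/new rows arrive
print(conn.execute("SELECT * FROM dw_sales ORDER BY id").fetchall())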
Queries
Thank You!!!
