
Data Cleansing Techniques

Why is Data Dirty?

Incorrect data - When the value entered does not comply with the field's valid values.

Inaccurate data - When the data is entered inaccurately.

For instance, a date of birth (DOB) entered inaccurately.

Business rule violations - Data that violates a business rule is another type of dirty data.

For instance, the value entered in a month field should range from 1 to 12.

For instance, an effective date must always come before an expiry date.

Inconsistent data - Unchecked data redundancy leads to data inconsistencies.

Why is Data Dirty?

Incomplete data - Data with missing values is the main type of incomplete data.

Duplicate data - Duplicate data may occur due to repeated submissions, improper data joining or user error.

Non-integrated data - Most organizations store data redundantly and inconsistently across many systems that were never designed with integration in mind.

What is Data Cleansing?


Data cleansing or data scrubbing is the act of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty data.

Why is Data Cleansing Required?

Cleansing activities prevent money being wasted on old or obsolete data.

The improved accuracy can result in a greater return on investment for marketing activities.

This greater accuracy in turn assists in your target profiling, meaning that your marketing campaigns will be more focused, relevant and consequently more likely to be successful.

In modern business the Data Protection Act should always be a consideration, particularly with regard to the storage of personal information, where failure to cleanse can actually result in non-compliance.

The data on which you base your data marketing strategy must be
accurate, up-to-date, as complete as possible, and should not contain
duplicate entries.

Data Cleansing
1. When there is missing data
2. When there are outliers

Missing data

Missing data arises due to non-response or non-relevancy.

Missing data - values, attributes, entire records, entire sections.

Missing values and defaults are indistinguishable.

Truncation/censoring - not aware, mechanisms not known.

Misleading results, bias.

Identify Missing Data

Overtly missing data:
Match data specifications against data - are all the attributes present?
Scan individual records - are there gaps?
Rough checks: number of files, file sizes, number of records, number of duplicates.
Compare estimates (averages, frequencies, medians) with expected values and bounds; check at various levels of granularity since aggregates can be misleading.
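A minimal sketch of these rough checks using pandas; the file name, column list, and DataFrame name are illustrative assumptions, not part of the original material.

import pandas as pd

# Illustrative sketch of the rough checks for overtly missing data.
df = pd.read_csv("records.csv")                        # hypothetical input file
expected_columns = ["id", "dob", "month", "amount"]    # taken from the data specification

# Are all the attributes present?
print("Missing attributes:", set(expected_columns) - set(df.columns))

# Scan individual records - are there gaps?
print("Null values per column:\n", df.isna().sum())

# Rough checks: number of records, number of duplicates
print("Record count:", len(df))
print("Duplicate records:", df.duplicated().sum())

# Compare estimates (averages, medians) with expected values and bounds
print(df.describe())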

Trimming / Removing Non-Entries

If the number of non-entries (blank or null values) is low (less than 5%), we can simply remove those records.

Trimming non-entries is only appropriate when the number of blanks is small, as shown in the sketch below.
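A minimal pandas sketch of this rule; the DataFrame and file name are illustrative, and the 5% threshold comes from the slide above.

import pandas as pd

df = pd.read_csv("records.csv")                 # hypothetical input file

incomplete = df.isna().any(axis=1)              # records with at least one blank field
fraction_incomplete = incomplete.mean()

if fraction_incomplete < 0.05:                  # less than 5% -> safe to trim
    df = df.dropna()
else:
    print("Too many incomplete records to trim; consider imputation instead.")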

Imputing Values to Missing Data

In federated data, between 30% and 70% of the data points will have at least one missing attribute - data wastage.

If we ignore all records with a missing value, the remaining data is seriously biased.

Lack of confidence in results.

Understanding the pattern of missing data unearths data integrity issues.

Missing Value Imputation - 1

Standalone imputation: mean, median, or other point estimates.

Assumes the distribution of the missing values is the same as that of the non-missing values.

Does not take inter-relationships into account.

Introduces bias.

Convenient, easy to implement.
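A minimal sketch of standalone imputation with point estimates; the column names and file are illustrative assumptions.

import pandas as pd

df = pd.read_csv("records.csv")                            # hypothetical input file

# Mean imputation for a roughly symmetric numeric column
df["amount"] = df["amount"].fillna(df["amount"].mean())

# Median imputation is more robust when the column is skewed
df["age"] = df["age"].fillna(df["age"].median())

# Note: this ignores relationships between attributes and shrinks the
# variance of the imputed columns, which is the bias mentioned above.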

Missing Value Imputation - 2

Better imputation - use attribute relationships.

Assume: all prior attributes are populated; that is, monotonicity in missing values.

X1  | X2 | X3  | X4 | X5
1.0 | 20 | 3.5 | 4  | .
1.1 | 18 | 4.0 | 2  | .
1.9 | 22 | 2.2 | .  | .
0.9 | 15 | .   | .  | .

Two techniques:
Regression (parametric)
Propensity score (nonparametric)

Missing Value Imputation - 3

Regression method: use linear regression, sweeping left-to-right.

X3 = a + b*X2 + c*X1
X4 = d + e*X3 + f*X2 + g*X1, and so on

X3 in the second equation is estimated from the first equation if it is missing.
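A sketch of the left-to-right regression sweep under the monotone-missingness assumption, using scikit-learn; the column names follow the table above and the data values are illustrative.

import pandas as pd
from sklearn.linear_model import LinearRegression

def regression_sweep(df, columns):
    # Sweep left-to-right: impute each column from all columns to its left.
    for j in range(2, len(columns)):                 # start at X3
        target = columns[j]
        predictors = columns[:j]                     # X1 .. X(j-1)
        known = df[target].notna()
        if known.all():
            continue                                 # nothing to impute
        if known.sum() < 2:
            continue                                 # too few observed values to fit
        model = LinearRegression()
        model.fit(df.loc[known, predictors], df.loc[known, target])
        df.loc[~known, target] = model.predict(df.loc[~known, predictors])
    return df

df = pd.DataFrame({
    "X1": [1.0, 1.1, 1.9, 0.9],
    "X2": [20, 18, 22, 15],
    "X3": [3.5, 4.0, 2.2, None],
    "X4": [4, 2, None, None],
})
df = regression_sweep(df, ["X1", "X2", "X3", "X4"])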

Missing Value Imputation - 4

Propensity Scores (nonparametric):

Let Yj = 1 if Xj is missing, 0 otherwise.

Estimate P(Yj = 1) based on X1 through X(j-1) using logistic regression.

Group by propensity score P(Yj = 1).

Within each group, estimate missing Xj's from known Xj's using the approximate Bayesian bootstrap.

Repeat until all attributes are populated.
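A rough sketch of this procedure for a single attribute, assuming the earlier columns are fully populated and that the attribute has both observed and missing values; the approximate Bayesian bootstrap step is simplified to drawing replacements at random from observed values in the same propensity group.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def propensity_impute(df, target, predictors, n_groups=5):
    # Yj = 1 if Xj is missing, 0 otherwise
    y = df[target].isna().astype(int)
    model = LogisticRegression().fit(df[predictors], y)
    score = model.predict_proba(df[predictors])[:, 1]        # P(Yj = 1)
    groups = pd.Series(
        pd.qcut(score, q=n_groups, labels=False, duplicates="drop"),
        index=df.index,
    )
    for g in groups.unique():
        in_group = groups == g
        observed = df.loc[in_group & df[target].notna(), target]
        missing_idx = df.index[in_group & df[target].isna()]
        if len(observed) and len(missing_idx):
            # Simplified approximate Bayesian bootstrap: sample replacements
            # (with replacement) from observed values in the same group.
            df.loc[missing_idx, target] = rng.choice(observed.to_numpy(),
                                                     size=len(missing_idx))
    return df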

When outliers are present, four techniques can be used to detect them:

1. Statistical
2. Clustering
3. Pattern Based
4. Association Rules

1. Statistical
In this method, outlier fields and records are identified using values such as the mean, standard deviation and range, and by considering confidence intervals for each field. While this method may generate many false positives, it is simple and fast.
A field f in a record r is considered an outlier if f > μ + εσ or f < μ - εσ, where μ is the mean for the field f, σ is the standard deviation, and ε is a user-defined factor.
Several values of ε can be tried before settling on the value that gives the best results (i.e., the fewest false positives and false negatives).
A visualization tool can be used to analyze the results, because trying to analyze the entire data set to identify outliers by hand would be impossible.
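A minimal sketch of this check on one numeric field; the column name, file, and value of ε are illustrative.

import pandas as pd

df = pd.read_csv("records.csv")               # hypothetical input file
epsilon = 3                                   # user-defined factor

mu, sigma = df["amount"].mean(), df["amount"].std()
outliers = df[(df["amount"] > mu + epsilon * sigma) |
              (df["amount"] < mu - epsilon * sigma)]
print(f"{len(outliers)} candidate outliers flagged in 'amount'")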

2. Clustering
In this method, outlier records are identified using clustering techniques based on Euclidean (or other) distance. The main drawback of this method is its high computational complexity.
Several clustering algorithms can be used. Let us, for example, consider the k-means clustering algorithm.
We use a measure called LDOF (Local Distance-based Outlier Factor), which tells how much a point deviates from its neighbours. A high LDOF value indicates that the point deviates strongly from its neighbours and is probably an outlier.
The LDOF of a point P is defined as:
LDOF(P) = d / D
where d is the average distance from P to its k nearest points (denoted by the set Np) and D is the average distance between any two points in Np.
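A small NumPy sketch of the LDOF formula above, using Euclidean distance; the sample points and the choice of k are illustrative.

import numpy as np

def ldof(points, p_index, k):
    P = points[p_index]
    others = np.delete(points, p_index, axis=0)
    dists = np.linalg.norm(others - P, axis=1)
    Np = others[np.argsort(dists)[:k]]                # k nearest neighbours of P
    d = np.linalg.norm(Np - P, axis=1).mean()         # average distance from P to Np
    pairwise = np.linalg.norm(Np[:, None, :] - Np[None, :, :], axis=2)
    D = pairwise.sum() / (k * (k - 1))                # average pairwise distance within Np
    return d / D

points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
print(ldof(points, 4, k=3))   # the isolated point gets a high LDOF value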

3. Pattern Based
A pattern is a group of records that have similar behaviour or characteristics.
P% of the fields show a similar behaviour, where P is decided by the user; the remaining (100-P)% are the outliers.
Multiple techniques are used to find the pattern (for example, classification and clustering).

Classification

Classify an instance based on user-defined models.

A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
A marketing manager at a company needs to analyze a customer with a given profile to predict whether they will buy a new computer.
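A minimal classification sketch for the loan-risk example using a decision tree; the features, labels, and thresholds are invented purely for illustration.

from sklearn.tree import DecisionTreeClassifier

# Illustrative training data: [income in k$, existing debt in k$] -> label
X_train = [[60, 5], [20, 30], [80, 10], [15, 25], [45, 40]]
y_train = ["safe", "risky", "safe", "risky", "risky"]

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
print(model.predict([[50, 8]]))   # classify a new loan applicant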

Clustering

The records are clustered using Euclidean distance and the k-means algorithm.

Each cluster is classified according to the number of records it contains.

Pattern Based Clustering - K-means
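A sketch of pattern-based clustering with k-means, where records are clustered on Euclidean distance and very small clusters are treated as outliers; the synthetic data, number of clusters, and "small cluster" threshold are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
records = np.vstack([rng.normal(0, 1, (95, 2)),      # the dominant pattern
                     rng.normal(8, 1, (5, 2))])      # a handful of deviants

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(records)
sizes = np.bincount(labels)
outlier_clusters = np.where(sizes < 0.1 * len(records))[0]   # clusters with few records
outliers = records[np.isin(labels, outlier_clusters)]
print(f"{len(outliers)} records fall in small (outlier) clusters")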

4. Association Rules
Association rules were first introduced in the context of Market Basket Analysis and can be used to detect outliers based on a specific rule set. Two measures are used: support and confidence.

Support: the rule X → Y holds with support s if s% of the transactions in the dataset contain X ∪ Y.

Confidence: the rule X → Y holds with confidence c if c% of the transactions in D that contain X also contain Y.
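A minimal sketch of computing support and confidence for one candidate rule X → Y over a toy transaction set; the items are illustrative.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
X, Y = {"bread"}, {"milk"}

contains_x = [t for t in transactions if X <= t]
contains_xy = [t for t in transactions if (X | Y) <= t]

support = len(contains_xy) / len(transactions)    # fraction of transactions containing X ∪ Y
confidence = len(contains_xy) / len(contains_x)   # of those containing X, fraction also containing Y
print(f"support = {support:.2f}, confidence = {confidence:.2f}")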

Data Cleansing Problems

Error correction and conflict resolution - The most challenging problem within data cleansing remains the correction of values to eliminate domain format errors, constraint violations, duplicates and invalid tuples.

Maintenance of cleansed data - After having performed data cleansing and achieved a data collection free of errors, one does not want to repeat the whole cleansing process in its entirety whenever some of the values in the data collection change. Only the part of the cleansing process that is affected by the changed value should be re-performed. Which parts are affected can be determined by analysing the cleansing lineage.

Data cleansing in virtually integrated environments - In these environments it is often impossible to propagate corrections to the sources because of their autonomy. Therefore, cleansing of the data has to be performed every time the data is accessed, which considerably increases the response time.

Data cleansing framework - The whole data cleansing process is often the result of a flexible workflow execution. Process specification, execution and documentation should be done within a data cleansing framework which, in turn, is closely coupled with other data processing activities like transformation, integration and maintenance. The framework is a collection of methods for error detection and elimination, as well as methods for auditing data and specifying the cleansing task using appropriate user interfaces.

Data Cleansing Tools

AJAX

AJAX is an extensible and flexible framework attempting to separate the logical and physical levels of data cleansing. AJAX's major concern is transforming existing data from one or more data collections into a target schema and eliminating duplicates within this process.

FraQL

FraQL is an extension to SQL based on an object-relational data model. It supports the specification of schema transformations as well as data transformations at the instance level, i.e., standardization and normalization of values.

Potter's Wheel

Potter's Wheel is an interactive data cleansing system that integrates data transformation and error detection using a spreadsheet-like interface. Potter's Wheel allows users to define custom domains and corresponding algorithms to enforce domain format constraints.

ARKTOS

ARKTOS is a framework capable of modelling and executing the Extraction-Transformation-Load (ETL) process, which consists of single steps that extract relevant data from the sources, transform it to the target format, cleanse it, and then load it into the data warehouse.

IntelliClean

IntelliClean is a rule-based approach to data cleansing with the main focus on duplicate elimination. The proposed framework consists of three stages: the Pre-processing Stage, the Processing Stage, and Human Verification and Validation. During the first two stages, the actions taken are logged, providing documentation of the performed operations. In the third stage these logs are investigated to verify and possibly correct the performed actions.

Business Example - Clover ETL Solutions

Client: Czech subsidiary of an international publishing company with 200 permanent employees and $17 million in annual earnings.

Client's problems:

Undeliverable packages and e-mails

Failure to reach some clients by phone

Duplicate mail deliveries

Impossible to use householding techniques to identify members of a household or employees of a department/company

Business Example - Clover ETL Solutions

Overall cost savings: 17% = 750,000 Euro

Conclusion
