
Data Cleansing Techniques

Why is Data Dirty?

Incorrect data - When the value entered does not comply with the field's valid values.

Inaccurate data - When the data is entered inaccurately.

For instance, a date of birth (DOB) entered inaccurately.

Business rule violations - Data that violates a business rule is another type of dirty data.

For instance, the value entered in a month field should range from 1 to 12.

For instance, an effective date must always come before an expiry date.

Inconsistent data - Unchecked data redundancy leads to data inconsistencies.

Why is Data Dirty?

Incomplete data - Data with missing values is the main type of incomplete data.

Duplicate data - Duplicate data may occur due to repeated submissions, improper data joining or user error.

Non-integrated data - Most organizations store data redundantly and inconsistently across many systems that were never designed with integration in mind.

What is Data Cleansing?


Data cleansing or data scrubbing is the act of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Used mainly in databases, the term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty data.

Why is Data Cleansing Required?

Cleansing activities prevent money being wasted on old or obsolete data.

The improved accuracy can result in a greater return on investment for marketing activities.

This greater accuracy in turn assists in your target profiling, meaning that your marketing campaigns will be more focused, relevant and consequently more likely to be successful.

In modern business the Data Protection Act should always be a consideration, particularly with regard to the storage of personal information, where failure to cleanse can actually result in non-compliance.

The data on which you base your data marketing strategy must be
accurate, up-to-date, as complete as possible, and should not contain
duplicate entries.

Data Cleansing
1. When there is missing data
2. When there are outliers

Missing data

Missing data arises due to non-response or non-relevancy.

Missing data - values, attributes, entire records, entire sections.

Missing values and defaults are indistinguishable.

Truncation/censoring - not aware, mechanisms not known.

Misleading results, bias.

Identify Missing Data

Overtly missing data:
Match data specifications against data - are all the attributes present?
Scan individual records - are there gaps?
Rough checks: number of files, file sizes, number of records, number of duplicates.
Compare estimates (averages, frequencies, medians) with expected values and bounds; check at various levels of granularity since aggregates can be misleading.
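A minimal sketch of these rough checks using pandas; the file name, column list, and DataFrame name are illustrative assumptions, not part of the original material.

import pandas as pd

# Illustrative sketch of the rough checks for overtly missing data.
df = pd.read_csv("records.csv")                        # hypothetical input file
expected_columns = ["id", "dob", "month", "amount"]    # taken from the data specification

# Are all the attributes present?
print("Missing attributes:", set(expected_columns) - set(df.columns))

# Scan individual records - are there gaps?
print("Null values per column:\n", df.isna().sum())

# Rough checks: number of records, number of duplicates
print("Record count:", len(df))
print("Duplicate records:", df.duplicated().sum())

# Compare estimates (averages, medians) with expected values and bounds
print(df.describe())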

Trimming / Removing Non-Entries

If the number of non-entries (blank or null values) is low (less than 5%), we can simply remove those records.

Trimming non-entries is only appropriate when the number of blanks is small, as shown in the sketch below.
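A minimal pandas sketch of this rule; the DataFrame and file name are illustrative, and the 5% threshold comes from the slide above.

import pandas as pd

df = pd.read_csv("records.csv")                 # hypothetical input file

incomplete = df.isna().any(axis=1)              # records with at least one blank field
fraction_incomplete = incomplete.mean()

if fraction_incomplete < 0.05:                  # less than 5% -> safe to trim
    df = df.dropna()
else:
    print("Too many incomplete records to trim; consider imputation instead.")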

Imputing Values to Missing Data

In federated data, between 30% and 70% of the data points will have at least one missing attribute - data wastage.

If we ignore all records with a missing value, the remaining data is seriously biased.

Lack of confidence in results.

Understanding the pattern of missing data unearths data integrity issues.

Missing Value Imputation - 1

Standalone imputation: mean, median, or other point estimates.

Assumes the distribution of the missing values is the same as that of the non-missing values.

Does not take inter-relationships into account.

Introduces bias.

Convenient, easy to implement.
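A minimal sketch of standalone imputation with point estimates; the column names and file are illustrative assumptions.

import pandas as pd

df = pd.read_csv("records.csv")                            # hypothetical input file

# Mean imputation for a roughly symmetric numeric column
df["amount"] = df["amount"].fillna(df["amount"].mean())

# Median imputation is more robust when the column is skewed
df["age"] = df["age"].fillna(df["age"].median())

# Note: this ignores relationships between attributes and shrinks the
# variance of the imputed columns, which is the bias mentioned above.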

Missing Value Imputation - 2

Better imputation - use attribute relationships.

Assume: all prior attributes are populated; that is, monotonicity in missing values.

X1  | X2 | X3  | X4 | X5
1.0 | 20 | 3.5 | 4  | .
1.1 | 18 | 4.0 | 2  | .
1.9 | 22 | 2.2 | .  | .
0.9 | 15 | .   | .  | .

Two techniques:
Regression (parametric)
Propensity score (nonparametric)

Missing Value Imputation - 3

Regression method: use linear regression, sweeping left-to-right.

X3 = a + b*X2 + c*X1
X4 = d + e*X3 + f*X2 + g*X1, and so on

X3 in the second equation is estimated from the first equation if it is missing.
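A sketch of the left-to-right regression sweep under the monotone-missingness assumption, using scikit-learn; the column names follow the table above and the data values are illustrative.

import pandas as pd
from sklearn.linear_model import LinearRegression

def regression_sweep(df, columns):
    # Sweep left-to-right: impute each column from all columns to its left.
    for j in range(2, len(columns)):                 # start at X3
        target = columns[j]
        predictors = columns[:j]                     # X1 .. X(j-1)
        known = df[target].notna()
        if known.all():
            continue                                 # nothing to impute
        if known.sum() < 2:
            continue                                 # too few observed values to fit
        model = LinearRegression()
        model.fit(df.loc[known, predictors], df.loc[known, target])
        df.loc[~known, target] = model.predict(df.loc[~known, predictors])
    return df

df = pd.DataFrame({
    "X1": [1.0, 1.1, 1.9, 0.9],
    "X2": [20, 18, 22, 15],
    "X3": [3.5, 4.0, 2.2, None],
    "X4": [4, 2, None, None],
})
df = regression_sweep(df, ["X1", "X2", "X3", "X4"])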

Missing Value Imputation - 4

Propensity Scores (nonparametric):

Let Yj = 1 if Xj is missing, 0 otherwise.

Estimate P(Yj = 1) based on X1 through X(j-1) using logistic regression.

Group by propensity score P(Yj = 1).

Within each group, estimate missing Xj's from known Xj's using the approximate Bayesian bootstrap.

Repeat until all attributes are populated.
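A rough sketch of this procedure for a single attribute, assuming the earlier columns are fully populated and that the attribute has both observed and missing values; the approximate Bayesian bootstrap step is simplified to drawing replacements at random from observed values in the same propensity group.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def propensity_impute(df, target, predictors, n_groups=5):
    # Yj = 1 if Xj is missing, 0 otherwise
    y = df[target].isna().astype(int)
    model = LogisticRegression().fit(df[predictors], y)
    score = model.predict_proba(df[predictors])[:, 1]        # P(Yj = 1)
    groups = pd.Series(
        pd.qcut(score, q=n_groups, labels=False, duplicates="drop"),
        index=df.index,
    )
    for g in groups.unique():
        in_group = groups == g
        observed = df.loc[in_group & df[target].notna(), target]
        missing_idx = df.index[in_group & df[target].isna()]
        if len(observed) and len(missing_idx):
            # Simplified approximate Bayesian bootstrap: sample replacements
            # (with replacement) from observed values in the same group.
            df.loc[missing_idx, target] = rng.choice(observed.to_numpy(),
                                                     size=len(missing_idx))
    return df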

When outliers are present, four techniques can be used to detect them:

1. Statistical
2. Clustering
3. Pattern Based
4. Association Rules

1. Statistical
In this method, outlier fields and records are identified using values such as the mean, standard deviation and range, and by considering confidence intervals for each field. While this method may generate many false positives, it is simple and fast.
A field f in a record r is considered an outlier if f > μ + εσ or f < μ - εσ, where μ is the mean for the field f, σ is the standard deviation, and ε is a user-defined factor.
Several values of ε can be tried before settling on the value that gives the best results (i.e., the fewest false positives and false negatives).
A visualization tool can be used to analyze the results, because trying to analyze the entire data set to identify outliers by hand would be impossible.
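A minimal sketch of this check on one numeric field; the column name, file, and value of ε are illustrative.

import pandas as pd

df = pd.read_csv("records.csv")               # hypothetical input file
epsilon = 3                                   # user-defined factor

mu, sigma = df["amount"].mean(), df["amount"].std()
outliers = df[(df["amount"] > mu + epsilon * sigma) |
              (df["amount"] < mu - epsilon * sigma)]
print(f"{len(outliers)} candidate outliers flagged in 'amount'")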

2. Clustering
In this method, outlier records are identified using clustering techniques based on Euclidean (or other) distance. The main drawback of this method is its high computational complexity.
Several clustering algorithms can be used. Let us, for example, consider the k-means clustering algorithm.
We use a measure called LDOF (Local Distance-based Outlier Factor), which tells how much a point deviates from its neighbours. A high LDOF value indicates that the point deviates strongly from its neighbours and is probably an outlier.
The LDOF of a point P is defined as:
LDOF(P) = d / D
where d is the average distance from P to its k nearest points (denoted by the set Np) and D is the average distance between any two points in Np.
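A small NumPy sketch of the LDOF formula above, using Euclidean distance; the sample points and the choice of k are illustrative.

import numpy as np

def ldof(points, p_index, k):
    P = points[p_index]
    others = np.delete(points, p_index, axis=0)
    dists = np.linalg.norm(others - P, axis=1)
    Np = others[np.argsort(dists)[:k]]                # k nearest neighbours of P
    d = np.linalg.norm(Np - P, axis=1).mean()         # average distance from P to Np
    pairwise = np.linalg.norm(Np[:, None, :] - Np[None, :, :], axis=2)
    D = pairwise.sum() / (k * (k - 1))                # average pairwise distance within Np
    return d / D

points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
print(ldof(points, 4, k=3))   # the isolated point gets a high LDOF value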

3. Pattern Based
A pattern is a group of records that have similar behaviour or characteristics.
P% of the fields show a similar behaviour, where P is decided by the user; the remaining (100-P)% are the outliers.
Multiple techniques are used to find the pattern (for example, classification and clustering).

Classification

Classify an instance based on user-defined models.

A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
A marketing manager at a company needs to analyze a customer with a given profile to predict whether they will buy a new computer.
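A minimal classification sketch for the loan-risk example using a decision tree; the features, labels, and thresholds are invented purely for illustration.

from sklearn.tree import DecisionTreeClassifier

# Illustrative training data: [income in k$, existing debt in k$] -> label
X_train = [[60, 5], [20, 30], [80, 10], [15, 25], [45, 40]]
y_train = ["safe", "risky", "safe", "risky", "risky"]

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
print(model.predict([[50, 8]]))   # classify a new loan applicant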

Clustering

The records are clustered using Euclidean distance and the k-means algorithm.

Each cluster is classified according to the number of records it contains.

Pattern Based Clustering - K-means
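A sketch of pattern-based clustering with k-means, where records are clustered on Euclidean distance and very small clusters are treated as outliers; the synthetic data, number of clusters, and "small cluster" threshold are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
records = np.vstack([rng.normal(0, 1, (95, 2)),      # the dominant pattern
                     rng.normal(8, 1, (5, 2))])      # a handful of deviants

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(records)
sizes = np.bincount(labels)
outlier_clusters = np.where(sizes < 0.1 * len(records))[0]   # clusters with few records
outliers = records[np.isin(labels, outlier_clusters)]
print(f"{len(outliers)} records fall in small (outlier) clusters")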

4. Association Rules
Association rules were first introduced in the context of Market Basket Analysis and can be used to detect outliers based on a specific rule set. Two measures are used: support and confidence.

Support: the rule X → Y holds with support s if s% of the transactions in the dataset contain X ∪ Y.

Confidence: the rule X → Y holds with confidence c if c% of the transactions in D that contain X also contain Y.
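A minimal sketch of computing support and confidence for one candidate rule X → Y over a toy transaction set; the items are illustrative.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
X, Y = {"bread"}, {"milk"}

contains_x = [t for t in transactions if X <= t]
contains_xy = [t for t in transactions if (X | Y) <= t]

support = len(contains_xy) / len(transactions)    # fraction of transactions containing X ∪ Y
confidence = len(contains_xy) / len(contains_x)   # of those containing X, fraction also containing Y
print(f"support = {support:.2f}, confidence = {confidence:.2f}")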

Data Cleansing Problems

Error correction and conflict resolution - The most challenging problem within data cleansing remains the correction of values to eliminate domain format errors, constraint violations, duplicates and invalid tuples.

Maintenance of cleansed data - After having performed data cleansing and achieved a data collection free of errors, one does not want to repeat the whole cleansing process in its entirety whenever some of the values in the data collection change. Only the part of the cleansing process that is affected by the changed value should be re-performed. Which parts are affected can be determined by analysing the cleansing lineage.

Data cleansing in virtually integrated environments - In these environments it is often impossible to propagate corrections to the sources because of their autonomy. Therefore, cleansing of the data has to be performed every time the data is accessed, which considerably increases the response time.

Data cleansing framework - The whole data cleansing process is often the result of a flexible workflow execution. Process specification, execution and documentation should be done within a data cleansing framework which, in turn, is closely coupled with other data processing activities like transformation, integration and maintenance. The framework is a collection of methods for error detection and elimination, as well as methods for auditing data and specifying the cleansing task using appropriate user interfaces.

Data Cleansing Tools

AJAX

AJAX is an extensible and flexible framework attempting to separate the logical and physical levels of data cleansing. AJAX's major concern is transforming existing data from one or more data collections into a target schema and eliminating duplicates within this process.

FraQL

FraQL is an extension to SQL based on an object-relational data model. It supports the specification of schema transformations as well as data transformations at the instance level, i.e., standardization and normalization of values.

Potter's Wheel

Potter's Wheel is an interactive data cleansing system that integrates data transformation and error detection using a spreadsheet-like interface. Potter's Wheel allows users to define custom domains and corresponding algorithms to enforce domain format constraints.

ARKTOS

ARKTOS is a framework capable of modelling and executing the Extraction-Transformation-Load (ETL) process, which consists of single steps that extract relevant data from the sources, transform it to the target format, cleanse it, and then load it into the data warehouse.

IntelliClean

IntelliClean is a rule-based approach to data cleansing with the main focus on duplicate elimination. The proposed framework consists of three stages: the Pre-processing Stage, the Processing Stage, and Human Verification and Validation. During the first two stages, the actions taken are logged, providing documentation of the performed operations. In the third stage these logs are investigated to verify and possibly correct the performed actions.

Business Example - Clover ETL Solutions

Client: Czech subsidiary of an international publishing company with 200 permanent employees and $17 million in annual earnings.

Client's problems:

Undeliverable packages and e-mails

Failure to reach some clients by phone

Duplicate mail deliveries

Impossible to use householding techniques to identify members of a household or employees of a department/company

Business Example - Clover ETL Solutions

Overall cost savings: 17% = 750,000 Euro

Conclusion
