Techniques
Example
For instance, the value entered in the month field should range from 1 to 12.
For instance, an effective date must always come before an expiry date.
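As a hedged sketch, such field-level constraints could be checked in Python as follows (the function and field names are hypothetical, not from the original material):

from datetime import date

def validate_record(month: int, effective: date, expiry: date) -> list:
    # Collect any constraint violations for one record.
    errors = []
    if not 1 <= month <= 12:
        errors.append("month %d is outside the range 1-12" % month)
    if effective >= expiry:
        errors.append("effective date must come before expiry date")
    return errors

print(validate_record(13, date(2024, 5, 1), date(2023, 1, 1)))
# -> both constraint violations are reported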
The data on which you base your data marketing strategy must be accurate, up to date, as complete as possible, and free of duplicate entries.
Data Cleansing
1. When there is missing data
2. When there are outliers
Missing data
Truncation/censoring: we may not even be aware that values are missing, and the mechanism behind the missingness may be unknown.
If the number of missing entries is low (less than 5% of the records), we can simply remove the records that contain them; this trimming is appropriate only when the number of blank values is small.
Deletion assumes that the data are missing completely at random; it can introduce bias when that assumption fails, but it is convenient and easy to implement.
[Illustrative table of records with missing entries, denoted by '.']
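A minimal sketch of this trimming rule using pandas (the column names and data are invented; the 5% threshold follows the rule of thumb above):

import numpy as np
import pandas as pd

df = pd.DataFrame({"X1": [1.0, 2.0, np.nan, 4.0],
                   "X2": [0.9, np.nan, 15.0, 2.0]})

# Fraction of records that contain at least one missing entry.
missing_fraction = df.isna().any(axis=1).mean()
if missing_fraction < 0.05:
    df = df.dropna()   # listwise deletion of the incomplete records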
Two imputation techniques: regression (parametric) and propensity score (nonparametric).
In regression imputation, each variable with missing values is regressed on the others, for example:
X3 = a + b*X2 + c*X1; X4 = d + e*X3 + f*X2 + g*X1, and so on.
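A sketch of regression imputation along the lines of these equations, assuming X3 is the variable with missing values and X1, X2 are fully observed (the data are invented for illustration):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"X1": [1, 2, 3, 4, 5],
                   "X2": [2.0, 4.1, 5.9, 8.2, 10.1],
                   "X3": [3.1, 5.0, np.nan, 9.1, np.nan]})

observed = df["X3"].notna()
# Fit X3 = a + b*X2 + c*X1 on the complete rows ...
model = LinearRegression().fit(df.loc[observed, ["X1", "X2"]],
                               df.loc[observed, "X3"])
# ... and fill the missing X3 values with the model's predictions.
df.loc[~observed, "X3"] = model.predict(df.loc[~observed, ["X1", "X2"]])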
For a variable with missing values such as X3, the steps are: estimate a propensity score (the probability that the value is missing) for each record; group the records by their propensity scores; impute within each group from the observed values; and repeat for every variable with missing values.
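A sketch of this propensity-score approach, under the assumption that a simple hot-deck draw is used within each group (the full method typically uses an approximate Bayesian bootstrap; the data here are synthetic):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
df = pd.DataFrame({"X1": rng.normal(size=100), "X2": rng.normal(size=100)})
df["X3"] = np.where(rng.random(100) < 0.2, np.nan,
                    df["X1"] + rng.normal(size=100))

# 1. Estimate each record's propensity (probability) of X3 being missing.
miss = df["X3"].isna()
prop = LogisticRegression().fit(df[["X1", "X2"]], miss)
score = prop.predict_proba(df[["X1", "X2"]])[:, 1]

# 2. Group the records by propensity score (here, quintiles).
df["group"] = pd.qcut(score, q=5, labels=False, duplicates="drop")

# 3. Within each group, impute by drawing from the observed donors.
for _, part in df.groupby("group"):
    donors = part.loc[part["X3"].notna(), "X3"].to_numpy()
    need = part.index[part["X3"].isna()]
    if len(donors) and len(need):
        df.loc[need, "X3"] = rng.choice(donors, size=len(need))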
When outliers are present, they can be detected with four techniques: 1. Statistical, 2. Clustering, 3. Pattern Based, 4. Association Rules.
1. Statistical
In this method, outlier fields and records are identified using values such as the mean, standard deviation, and range, and by considering confidence intervals for each field. While this method may generate many false positives, it is simple and fast.
A field f in a record r is considered an outlier if f > μ + kσ or f < μ − kσ, where μ is the mean of field f, σ is its standard deviation, and k is a user-defined factor.
Several values of k can be tried before settling on the one that gives the best results (i.e., the fewest false positives and false negatives).
A visualization tool can be used to analyze the results, because identifying outliers by inspecting the entire data set by hand would be impractical.
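A brief sketch of this statistical check (the data and the choice k = 2 are purely illustrative; as noted above, several values of k would be tried in practice):

import numpy as np

f = np.array([10.2, 9.8, 10.5, 9.9, 10.1, 54.0, 10.3])
mu, sigma, k = f.mean(), f.std(), 2.0

# Flag values outside the interval [mu - k*sigma, mu + k*sigma].
outliers = f[(f > mu + k * sigma) | (f < mu - k * sigma)]
print(outliers)   # -> [54.] stands out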
2. Clustering
In this method, outlier records are identified using clustering techniques based on Euclidean (or another) distance. The main drawback of this method is its high computational complexity.
Several clustering algorithms can be used. Let us, for example, consider the k-means clustering algorithm.
We use a measure called LDOF (Local Distance-based Outlier Factor), which tells how much a point deviates from its neighbors. A high LDOF value indicates that the point deviates strongly from its neighbors and may well be an outlier.
The LDOF of a point P is defined as:
LDOF(P) = d / D
where d is the average distance from P to its k nearest points (denoted by the set Np) and D is the average distance between any two points in Np.
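A small NumPy sketch of this definition (the point set is invented; k = 3):

import numpy as np
from itertools import combinations

def ldof(points, p_index, k):
    p = points[p_index]
    others = np.delete(points, p_index, axis=0)
    dists = np.linalg.norm(others - p, axis=1)
    Np = others[np.argsort(dists)[:k]]          # the k nearest neighbours of P
    d = np.linalg.norm(Np - p, axis=1).mean()   # average distance from P to Np
    D = np.mean([np.linalg.norm(a - b) for a, b in combinations(Np, 2)])
    return d / D

pts = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]], dtype=float)
print(ldof(pts, 4, k=3))   # about 11.6: deviates strongly, likely an outlier
print(ldof(pts, 0, k=3))   # about 1.0: consistent with its neighbours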
3. Pattern Based
In this method, patterns are learned from the data, and records that do not conform to the discovered patterns are treated as outliers. Patterns can be found using classification and clustering techniques.
Classification
A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
A marketing manager at a company needs to analyze customers with a given profile to predict who will buy a new computer.
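As a hedged sketch of the first use case, a decision tree classifier could label applicants as risky or safe (the features and training data are invented):

from sklearn.tree import DecisionTreeClassifier

# Each row: [income, existing_debt] for one loan applicant.
X = [[50, 5], [20, 15], [80, 10], [25, 20], [60, 2], [15, 12]]
y = ["safe", "risky", "safe", "risky", "safe", "risky"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[30, 18]]))   # e.g. ['risky']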
Clustering
The records are clustered using Euclidean distance and the k-means algorithm.
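A sketch of that step, flagging the record farthest from its cluster centroid as the one least consistent with the pattern (the data are synthetic and two clusters are assumed):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
records = np.vstack([rng.normal(0, 1, (50, 2)),
                     rng.normal(8, 1, (50, 2)),
                     [[20.0, 20.0]]])   # one record far from both groups

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(records)
dist = np.linalg.norm(records - km.cluster_centers_[km.labels_], axis=1)
print(records[np.argmax(dist)])   # -> [20. 20.], the non-conforming record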
4. Association Rules
Association rules were first introduced in the context of Market Basket Analysis.
Support and Confidence
The rule X → Y holds with support s if s% of the transactions in the dataset D contain X ∪ Y.
The rule X → Y holds with confidence c if c% of the transactions in D that contain X also contain Y.
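These two definitions translate directly into code; a minimal sketch over an invented transaction dataset D:

transactions = [{"bread", "milk"},
                {"bread", "butter"},
                {"bread", "milk", "butter"},
                {"milk"}]
X, Y = {"bread"}, {"milk"}          # the rule X -> Y

with_XY = [t for t in transactions if X | Y <= t]
with_X = [t for t in transactions if X <= t]

support = len(with_XY) / len(transactions)   # 2/4 = 50%
confidence = len(with_XY) / len(with_X)      # 2/3 ~ 67%
print(support, confidence)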
Error correction and conflict resolution: the most challenging problem within data cleansing remains the correction of values to eliminate domain format errors, constraint violations, duplicates, and invalid tuples.
Having once achieved a data collection free of errors, one does not want to perform the whole data cleansing process in its entirety after some of the values in the collection change. Only the part of the cleansing process affected by the changed values should be performed again, and that part can be determined by analysing the cleansing lineage.
AJAX
FraQL
Potter's Wheel
Potter's Wheel is an interactive data cleansing system that integrates data transformation and error detection using a spreadsheet-like interface. It allows users to define custom domains and corresponding algorithms to enforce domain format constraints.
ARKTOS
IntelliClean
IntelliClean is a rule-based approach to data cleansing with the main focus on duplicate elimination. The proposed framework consists of three stages: a Pre-processing Stage, a Processing Stage, and Human Verification and Validation. During the first two stages, the actions taken are logged, providing documentation of the performed operations. In the third stage, these logs are investigated to verify and possibly correct the performed actions.
Client's Problem:
or employees of a department/company
Conclusion