You are on page 1of 42

Information Integration Lecture 13

Data Cleaning

Michael Kassoff Spring 2003


Big Picture
1) Problem Statement – What is the desired
end result? (“Business requirements”)
2) Data Location / Collection – Find some
data to integrate, and/or make some yourself.
Put the data in staging tables if desired and
possible.
3) Data Cleaning - Get rid of bad data
4) Data Integration – See first 12 lectures of
CS 246
Staging Tables
• Local tables in a database in the schema
of imported data and with ‘dirty data’

• May have more staging tables as


intermediates between imported schema
staging tables and local schema

• Especially useful if importing data


infrequently
Data Cleaning
• That’s what this lecture is about!

• Why is data cleaning needed?

• How do we clean it?


Data is often of low quality
• Why?
– You didn’t collect it yourself!

– It probably was created for some other use,


and then you came along wanting to integrate
it

– People make mistakes (typos)

– People are busy (“this is good enough”)


Problems with data
• Some data have problems on their own

• Other data is problematic only when you


want to integrate it
Data with problems on their
own
Problems due to lack of structure
– Nowadays, we have databases and XML
– Before, everyone had their own way of
structuring data, based on what was
convenient for them
– Even today, not everyone uses structured
data
• Convenience
• Lack of time
• Efficiency of representation
Case study: government agency
data
• What we want:

id name city state

1 Dept. of Transportation New York NY

2 Dept. of Finance New York NY

3 Office of Veteran's Affairs New York NY


First problem
What’s wrong here?

1'Dept. of Transportation'New York'NY


2'Dept. of Finance'New York'NY
3'Office of Veteran's Affairs'New York'NY

• The separator is used in the data.

• Easy to miss if you don’t check the # of


columns when parsing each row.
Second problem
What’s wrong here?

1,Dept. of Transportation,New York City,NY


2,Dept. of Finance,City of New York,NY
3,Office of Veteran's Affairs,New York,NY

• We need standardization / naming conventions

• Could we enforce this in a database? In XML?


Third problem
What’s wrong here?

1,Dept. of Transportation,New York,NY


,Dept. of Finance,New York,NY
3,Office of Veteran's Affairs,New York,NY

• A missing required field


• Couldn’t occur in a database because of primary
key constraint
• Could fix this in XML with a REQUIRED tag
Fourth problem
What’s wrong here?

1,Dept. of Transportation,New York,NY


Two,Dept. of Finance,New York,NY
Office of Veteran's Affairs,3,New York,NY

• No data type constraints


• Can’t fix in XML w/ DTDs.
• XML Schema fixes this problem
Fifth Problem
What’s potentially a problem here?

1,Dept. of Transportation,New York,NY


2,Dept. of Finance,New York,NY
3,Commission for the United Nations Consular
Corps and Protocol,New York,NY

• Field longer than documentation tells us.


• XML DTDs cannot help, XML Schema can help
Sixth Problem
What’s wrong here?

1,Dept. of Transportation,New York,NY


2,Dept. of Finance,New York,NY
2,Office of Veteran's Affairs,New York,NY

• Primary key constraint violation

• XML can help us if we use ID attribute


Seventh Problem
What’s wrong here?

1,Dept. of Transportation,New York,NY


2,Dept. of Finance,New York,NY
3,Dept. of Finance,New York,NY

• Redundancy!
• XML can’t help here
• Databases can help (use constraints)
• Solution is to normalize data
Eighth problem
What’s wrong here?
Contract

1,Dept. of Transportation,New York,NY


2,Dept. of Finance,New York,NY
3,Office of Veteran's Affairs,New York,NY

• Contractual obligations disallow you from


using the data as you like
Problems not due to lack of
structure (it’s in a database)
Flags
• More subtle if they are in a database

• Most frequently used flags: 0s and 9s


– Signify ‘no data’ or ‘out of range’
– Be careful: sometimes 0 is a better value for
‘out of range’ than is null (Ex: trace elements)
Typos
• How to catch them?
– Pay someone to manually check all data
– Sort data by frequency
• ‘green’ is more frequent than ‘rgeen’
• Works well for categorical data
– Use constraints
– Use statistical techniques
Using Constraints to Catch Corrupt
Data
• Foreign key constraints work well if data
possibilities can be pre-enumerated

• Numerical constraints
– Weight can’t be negative
– People can’t have more than 2 parents
– Women can’t bear 80 children
Using Statistical Techniques to
Catch Corrupt Data
• Check for outliers (the case of the 60 foot
man)

• Check for correlated outliers (“pregnant


males”)
– People can be male
– People can be pregnant
– People can’t be male AND pregnant
Hidden Semantics
• Even in a structured document, not all
data is explicit

• Order may be important, even if there is


no explicit ‘order tag’

• Whitespace may be important


Hidden Semantics Example:
MYCIN rules
• MYCIN is a program that uses a
knowledge base of rules to diagnose a
patient using backward chaining
• The order of the rules is important – using
the same rules in a different order
produces different results
• There is nothing in the rules themselves
that suggests the importance of order
Hidden Semantics Example: Gene
Ontology (GO)

• Above is some data from the GO


• The whitespace is important
• What does the whitespace mean? I have no
idea. I’d need to look at the documentation.
Misleading data
• The table on the top building name stories
is data as it is given to Guildford Plaza 9
us Hartford Apts. 35
Braun Hotel 6
• The table on the
bottom is what the
data “means” building name stories
Guildford Plaza 9.0

• Whitespace is used in Hartford Apts. 3.5


a misleading way Braun Hotel 6.0
Data that that is fine on its own,
but becomes problematic when
you want to integrate it
Formats
• Not everyone uses the same format as you

• Dates are especially problematic:


– 12/19/77
– 12/19/1977
– 12-19-77
– 19/12/77
– Dec 19, 1977
– 19 December 1977
– 9 in Tevet, 5738
Data that Moves
• Be careful of taking snapshots of a moving
target
• Let’s say you want to store the price of a skillet
in France, and the price of a pot in Germany
• You can’t store it all in the same currency (say,
US$) because the exchange rate changes
• Price in foreign currency stays the same
• Must keep the data in foreign currency and use
the current exchange rate to convert
Data at a different level of detail
than you need
• If it is at a finer level of detail, you can
sometimes bin it
• Example
– I need age ranges of 20-30, 30-40, 40-50, etc.
– Imported data contains birth date
– No problem! Divide data into appropriate
categories
Data at a different level of detail
than you need (cont’d)
• Sometimes you cannot bin it
• Example
– I need age ranges 20-30, 30-40, 40-50 etc.
– Data is of age ranges 25-35, 35-45, etc.
– What to do?
• Ignore age ranges because you aren’t sure
• Make educated guess based on imported data
(e.g., assume that # people of age 25-35 are
average # of people of age 20-30 & 30-40)
Conflicting Data
• Information source #1 says that George lives in
Texas
• Information source #2 says that George lives in
Washington, DC
• What to do?
– Use both (He lives in both places)
– Use the most recently updated piece of info
– Use the “most trusted” info
– Flag row to be investigated further by hand
– Use neither (We’d rather be incomplete than wrong)
Data Cleaning seems like a big
pain. Can I avoid it somehow?
One Possibility: Do Nothing
• Maybe you don’t want to clean up the data at all
• Example: RiboWeb
• RiboWeb consists of integrated data from scientific
papers about ribosomal structure
• Even if data is known to be incorrect, it is kept in the
RiboWeb database
• The source of the data is contained in the schema,
allowing you to judge for yourself the likely quality of the
data
– Author
– Journal name
– Whether paper was reviewed by peers
5 Steps to Cleaner Data
5 Steps of Data Cleaning

1. Semi-structure
2. Standardize
3. Local consistency check
4. Global consistency check
5. Document
Example
example adapted from Dealing with Dirty Data
By Ralph Kimball DBMS, September 1996

Address entry from unstructured file:

Ralph B and Julianne Kimball


Ste. 116
13150 Hiway 9
Box 1234 Boulder Crk
Colo 95006
Semi-structuring
Also known as parsing:

Addressee First Name(1): Ralph


Addressee Middle Initial(1): B
Addressee Last Name(1): Kimball
Addressee First Name(2): Julianne
Addressee Last Name(2): Kimball
Street Address Number: 13150
Street Name: Hiway 9
Suite Number: 116
Post Office Box Number: 1234
City: Boulder Crk
State: Colo
Five Digit Zip: 95006
Standardize
• We should replace synonyms with one
standard term

• Hiway 9 ‡ Highway 9
• Boulder Crk ‡ Boulder Creek
• Colo ‡ Colorado
Local consistency check
Does each piece of data make sense on its
own?

• Boulder Creek and Zip Code 95006 are in


California

• State was listed as Colorado

• 2/3 attributes point to California as the correct


state. Change state to California.
Global consistency check
• Find either Ralph Kimball or Julianne
Kimball in other customer records and
ensure that all of the elements of all the
addresses are identical.

• This differs from the sanity check because


it looks at other pieces of data, not just the
current one
Documenting
Document the results of semi-structuring,
standardizing, and consistency checking in
metadata

• Important for users of the integrated


database

• Important for doing future updates of the


database

You might also like