
UNIT 2: Data Preprocessing

Lecture Topic
**********************************************
Lecture 13   Why preprocess the data?
Lecture 14   Data cleaning
Lecture 15   Data integration and transformation
Lecture 16   Data reduction
Lecture 17   Discretization and concept hierarchy generation

1
Lecture 13
Why preprocess the data?

2
Lecture 13: Why Data Preprocessing?
- Data in the real world is:
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - A data warehouse needs consistent integration of quality data

3
Multi-Dimensional Measure of Data Quality
- A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
- Broad categories:
  - intrinsic, contextual, representational, and accessibility

4
Major Tasks in Data Preprocessing
- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data transformation
  - Normalization and aggregation
- Data reduction
  - Obtains a reduced representation, much smaller in volume, that produces the same or similar analytical results
- Data discretization
  - Part of data reduction, of particular importance for numerical data

5
Forms of data preprocessing (figure)

6
Lecture 14
Data cleaning

7
Data Cleaning
Data cleaning tasks:
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data

8
Missing Data
- Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to:
  - equipment malfunction
  - inconsistency with other recorded data, leading to deletion
  - data not entered due to misunderstanding
  - certain data not considered important at the time of entry
  - failure to register history or changes of the data
- Missing data may need to be inferred

9
How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing
- Fill in the missing value manually
- Use a global constant to fill in the missing value, e.g., "unknown"

10
How to Handle Missing Data?
- Use the attribute mean to fill in the missing value
- Use the attribute mean for all samples belonging to the same class to fill in the missing value
- Use the most probable value to fill in the missing value: inference-based methods such as a Bayesian formula or decision tree
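The three fill-in strategies above can be sketched in a few lines of Python; the records and income values here are invented purely for illustration:

```python
# Toy records: (class label, income); None marks a missing value.
rows = [("A", 50.0), ("A", None), ("B", 30.0), ("B", None), ("B", 40.0)]

def mean(xs):
    return sum(xs) / len(xs)

# (1) Fill with a global constant (a sentinel standing in for "unknown").
const_filled = [v if v is not None else -1.0 for _, v in rows]

# (2) Fill with the overall attribute mean.
overall = mean([v for _, v in rows if v is not None])
mean_filled = [v if v is not None else overall for _, v in rows]

# (3) Fill with the mean of samples belonging to the same class.
class_values = {}
for c, v in rows:
    if v is not None:
        class_values.setdefault(c, []).append(v)
class_means = {c: mean(vs) for c, vs in class_values.items()}
class_filled = [v if v is not None else class_means[c] for c, v in rows]
```

Note how the class-conditional fill gives the class-A tuple a different value (the class-A mean) than the global mean would.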

11
Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to:
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
- Other data problems that require data cleaning:
  - duplicate records
  - incomplete data
  - inconsistent data

12
How to Handle Noisy Data?
- Binning method:
  - first sort the data and partition it into (equal-frequency) bins
  - then smooth by bin means, bin medians, or bin boundaries
- Clustering:
  - detect and remove outliers
- Regression:
  - smooth by fitting the data to a regression function, e.g., linear regression

13
Simple Discretization Methods: Binning
- Equal-width (distance) partitioning:
  - Divides the range into N intervals of equal size: a uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N
  - The most straightforward method
  - But outliers may dominate the presentation
  - Skewed data is not handled well
- Equal-depth (frequency) partitioning:
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky

14
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
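The worked example above can be reproduced with a short sketch of equi-depth binning and the two smoothing schemes:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

# Equi-depth (equal-frequency) partitioning: 3 bins of 4 values each.
n_bins = 3
depth = len(prices) // n_bins
bins = [prices[i * depth:(i + 1) * depth] for i in range(n_bins)]

# Smoothing by bin means: every value is replaced by its bin's mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value snaps to the nearer bin boundary.
def snap(v, lo, hi):
    return lo if v - lo <= hi - v else hi

by_bounds = [[snap(v, b[0], b[-1]) for v in b] for b in bins]
```

Running this yields exactly the smoothed bins listed on the slide.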

15
Cluster Analysis (figure)

16
Regression (figure: data points in the x-y plane fitted by the line y = x + 1)

17
Lecture 15
Data integration and transformation

18
Data Integration
- Data integration: combines data from multiple sources into a coherent store
- Schema integration:
  - integrate metadata from different sources
  - Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust#
- Detecting and resolving data value conflicts:
  - for the same real-world entity, attribute values from different sources differ
  - possible reasons: different representations, different scales, e.g., metric vs. British units

19
Handling Redundant Data in Data Integration
- Redundant data occurs often when integrating multiple databases:
  - The same attribute may have different names in different databases
  - One attribute may be derived from an attribute in another table, e.g., annual revenue

20
Handling Redundant Data in Data Integration
- Redundant data may be detected by correlation analysis
- Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
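Correlation analysis for redundancy detection can be sketched with a plain Pearson coefficient; the revenue figures below are invented for illustration (an annual_revenue column derived from monthly revenue correlates perfectly and is therefore redundant):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical columns: monthly revenue and a derived annual revenue.
monthly = [10, 12, 9, 15, 11]
annual = [12 * m for m in monthly]

r = pearson(monthly, annual)  # near 1.0 -> one attribute is redundant
```

A coefficient near +1 or -1 flags one of the two attributes as a candidate for removal.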

21
Data Transformation
- Smoothing: remove noise from the data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing

22
Data Transformation
- Normalization: values scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction:
  - New attributes constructed from the given ones

23
Data Transformation: Normalization
- min-max normalization:

    v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

- z-score normalization:

    v' = (v - mean_A) / stand_dev_A

- normalization by decimal scaling:

    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
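The three normalization schemes can be sketched directly from the formulas; the income range, mean, and standard deviation used below are illustrative numbers, not values from this lecture:

```python
def min_max(v, mn, mx, new_mn=0.0, new_mx=1.0):
    """Min-max normalization onto [new_mn, new_mx]."""
    return (v - mn) / (mx - mn) * (new_mx - new_mn) + new_mn

def z_score(v, mean_a, stand_dev_a):
    """Z-score normalization."""
    return (v - mean_a) / stand_dev_a

def decimal_scale(v, j):
    """Decimal scaling; j is the smallest integer with max(|v'|) < 1."""
    return v / 10 ** j

# Illustrative: an income of 73,600 in the range [12,000, 98,000]
mm = min_max(73600, 12000, 98000)          # about 0.716
zs = z_score(73600, 54000, 16000)          # about 1.225
ds = decimal_scale(986, 3)                 # 0.986, since max |v| = 986
```

Each function maps one raw value v to its normalized v'; applying it column-wise normalizes the whole attribute.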

24
Lecture 16
Data reduction

25
Data Reduction
- A warehouse may store terabytes of data:
  - Complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction:
  - Obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

26
Data Reduction Strategies
- Data cube aggregation
- Attribute subset selection
- Dimensionality reduction
- Numerosity reduction
- Discretization and concept hierarchy generation

27
Data Cube Aggregation
- The lowest level of a data cube:
  - the aggregated data for an individual entity of interest
  - e.g., a customer in a phone-calling data warehouse
- Multiple levels of aggregation in data cubes:
  - Further reduce the size of the data to deal with
- Reference appropriate levels:
  - Use the smallest representation sufficient to solve the task
- Queries regarding aggregated information should be answered using the data cube, when possible

28
Dimensionality Reduction
- Feature selection (attribute subset selection):
  - Select a minimum set of features such that the probability distribution of the different classes, given the values of those features, is as close as possible to the original distribution given the values of all features
  - Reduces the number of attributes in the discovered patterns, making them easier to understand
- Heuristic methods:
  - stepwise forward selection
  - stepwise backward elimination
  - combining forward selection and backward elimination
  - decision tree induction

29
Wavelet Transforms
(figure: Haar-2 and Daubechies-4 wavelets)
- Discrete wavelet transform (DWT): linear signal processing
- Compressed approximation: store only a small fraction of the strongest wavelet coefficients
- Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space
- Method:
  - The length, L, must be an integer power of 2 (pad with 0s when necessary)
  - Each transform has 2 functions: smoothing and difference
  - They are applied to pairs of data points, resulting in two sets of data of length L/2
  - The two functions are applied recursively until the desired length is reached

30
Principal Component Analysis
- Given N data vectors from k dimensions, find c <= k orthogonal vectors that can best be used to represent the data
  - The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
- Each data vector is a linear combination of the c principal component vectors
- Works for numeric data only
- Used when the number of dimensions is large
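A minimal PCA sketch via the eigendecomposition of the covariance matrix (one standard way to compute it; the synthetic data below is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# N = 100 samples in k = 3 dimensions, with most variance along one axis.
X = rng.normal(size=(100, 3)) @ np.diag([5.0, 1.0, 0.1])

# Center the data, then eigendecompose the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]           # sort descending by variance

c = 2                                        # keep c principal components
components = eigvecs[:, order[:c]]           # shape (3, c)

# Each data vector becomes a linear combination of the c components.
X_reduced = Xc @ components                  # shape (100, c)
```

The N vectors now live in c = 2 reduced dimensions, with the retained components capturing the directions of greatest variance.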

31
Principal Component Analysis (figure: axes X1, X2 with principal components Y1, Y2)

32
Attribute Subset Selection
- Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes
- The goal is to find a minimum set of attributes
- Uses basic heuristic methods of attribute selection

33
Heuristic Selection Methods
- There are 2^d possible sub-features of d features
- Several heuristic selection methods:
  - Stepwise forward selection
  - Stepwise backward elimination
  - Combination of forward selection and backward elimination
  - Decision tree induction

34
Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

(figure: decision tree with root A4?, children A1? and A6?, and leaves Class 1 / Class 2)

=> Reduced attribute set: {A1, A4, A6}

35
Numerosity Reduction
- Parametric methods:
  - Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  - Log-linear models: obtain the value at a point in m-D space as a product over appropriate marginal subspaces
- Non-parametric methods:
  - Do not assume models
  - Major families: histograms, clustering, sampling

36
Regression and Log-Linear Models
- Linear regression: data are modeled to fit a straight line
  - Often uses the least-squares method to fit the line
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: approximates discrete multidimensional probability distributions

37
Regression Analysis and Log-Linear Models
- Linear regression: Y = α + βX
  - The two parameters, α and β, specify the line and are estimated from the data at hand, using the least-squares criterion on the known values of Y1, Y2, ..., X1, X2, ....
- Multiple regression: Y = b0 + b1X1 + b2X2 + ...
  - Many nonlinear functions can be transformed into the above.
- Log-linear models:
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables.
  - Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd
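The least-squares fit of Y = α + βX has a closed form; a short sketch, with toy points chosen to lie exactly on y = x + 1 so the recovered parameters are easy to check:

```python
def fit_line(xs, ys):
    """Least-squares estimates of alpha and beta for Y = alpha + beta*X."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    alpha = my - beta * mx
    return alpha, beta

# Points on y = x + 1 recover alpha = 1, beta = 1.
alpha, beta = fit_line([0, 1, 2, 3], [1, 2, 3, 4])
```

With noisy data the same formulas give the line minimizing the sum of squared residuals.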

38
Histograms
- A popular data reduction technique
- Divide data into buckets and store the average (sum) for each bucket
- Can be constructed optimally in one dimension using dynamic programming
- Related to quantization problems
(figure: example histogram of counts over the value range 10,000-90,000)
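Bucketing with stored summaries can be sketched as follows; the price values and bucket width are invented for illustration:

```python
# Equal-width buckets: store only (range, (count, average)) per bucket
# instead of the raw values.
data = [12000, 15000, 31000, 33000, 52000, 55000, 58000, 71000, 90000]
lo, width = 10000, 20000

buckets = {}
for v in data:
    b = (v - lo) // width               # bucket index for value v
    buckets.setdefault(b, []).append(v)

summary = {
    (lo + b * width, lo + (b + 1) * width): (len(vs), sum(vs) / len(vs))
    for b, vs in sorted(buckets.items())
}
```

The reduced representation keeps one (count, average) pair per bucket rather than every tuple.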

39
Clustering
- Partition the data set into clusters, and store only the cluster representation
- Can be very effective if the data is clustered, but not if the data is smeared
- Can use hierarchical clustering, stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms

40
Sampling
- Allows a large data set to be represented by a much smaller random sample of the data
- Let D be a large data set containing N tuples
- Methods to reduce data set D:
  - Simple random sample without replacement (SRSWOR)
  - Simple random sample with replacement (SRSWR)
  - Cluster sample
  - Stratified sample
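The sampling methods above map directly onto the standard library; the data set and strata here are synthetic stand-ins:

```python
import random

random.seed(42)
D = list(range(1000))   # a stand-in "large" data set of N = 1000 tuples
n = 10                  # desired sample size

# SRSWOR: simple random sample without replacement.
srswor = random.sample(D, n)

# SRSWR: simple random sample with replacement (duplicates possible).
srswr = [random.choice(D) for _ in range(n)]

# Stratified sample: draw proportionally within each stratum.
strata = {"young": list(range(300)), "senior": list(range(300, 1000))}
stratified = {
    name: random.sample(members, max(1, n * len(members) // len(D)))
    for name, members in strata.items()
}
```

The stratified draw keeps each stratum's share of the sample proportional to its share of D (3 young, 7 senior here).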

41
Sampling (figure: SRSWOR/SRSWR samples drawn from the raw data)

42
Sampling (figure: raw data vs. a cluster/stratified sample)

43
Lecture 17
Discretization and concept hierarchy generation

44
Discretization
- Three types of attributes:
  - Nominal: values from an unordered set
  - Ordinal: values from an ordered set
  - Continuous: real numbers
- Discretization: divide the range of a continuous attribute into intervals
  - Some classification algorithms only accept categorical attributes
  - Reduces data size
  - Prepares for further analysis

45
Discretization and Concept Hierarchy
- Discretization:
  - reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
- Concept hierarchies:
  - reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).

46
Discretization and concept hierarchy generation for numeric data
- Binning
- Histogram analysis
- Clustering analysis
- Entropy-based discretization
- Discretization by intuitive partitioning

47
Entropy-Based Discretization
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

    E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)

- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
- The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g.,

    Ent(S) - E(T, S) < δ

- Experiments show that it may reduce data size and improve classification accuracy.
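Boundary selection by minimizing E(S, T) can be sketched over a toy sample set; the attribute values and class labels below are invented, with a clean class boundary between 30 and 40:

```python
from math import log2

def entropy(labels):
    """Ent(S): Shannon entropy of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def split_entropy(values, labels, t):
    """E(S, T): weighted entropy after splitting at boundary t."""
    left = [c for v, c in zip(values, labels) if v <= t]
    right = [c for v, c in zip(values, labels) if v > t]
    n = len(labels)
    return (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)

vals = [10, 20, 30, 40, 50, 60]
cls = ["a", "a", "a", "b", "b", "b"]

# Pick the boundary minimizing E(S, T) over all candidate boundaries.
best_t = min(vals[:-1], key=lambda t: split_entropy(vals, cls, t))
```

Splitting at t = 30 separates the classes perfectly, so E(S, T) drops to 0 there.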
48
Discretization by intuitive partitioning
- The 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals:
  - If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equal-width intervals
  - If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
  - If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
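The interval-count rule above translates into a small lookup function:

```python
def intervals_345(distinct_msd_values):
    """Number of equal-width intervals suggested by the 3-4-5 rule, given
    how many distinct values the range covers at its most significant
    digit."""
    if distinct_msd_values in (3, 6, 7, 9):
        return 3
    if distinct_msd_values in (2, 4, 8):
        return 4
    if distinct_msd_values in (1, 5, 10):
        return 5
    raise ValueError("3-4-5 rule does not cover this count")
```

For instance, a range from -$1,000 to $2,000 covers 3 distinct values at the thousands digit, so the rule partitions it into 3 intervals, as in the example that follows.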

49
Example of the 3-4-5 rule

Step 1: profit values range from Min = -$351 to Max = $4,700, with Low (5th percentile) = -$159 and High (95th percentile) = $1,838.
Step 2: msd = 1,000, so round to Low' = -$1,000 and High' = $2,000.
Step 3: (-$1,000, $2,000) covers 3 distinct values at the msd, so partition into 3 equal-width intervals: (-$1,000, 0], (0, $1,000], ($1,000, $2,000].
Step 4: adjust for the actual extremes: the first interval shrinks to (-$400, 0] (since Min = -$351), and a new interval ($2,000, $5,000] is added (since Max = $4,700).
Step 5: recursively partition each interval by the same rule:
- (-$400, 0] into (-$400, -$300], (-$300, -$200], (-$200, -$100], (-$100, 0]
- (0, $1,000] into (0, $200], ($200, $400], ($400, $600], ($600, $800], ($800, $1,000]
- ($1,000, $2,000] into ($1,000, $1,200], ($1,200, $1,400], ($1,400, $1,600], ($1,600, $1,800], ($1,800, $2,000]
- ($2,000, $5,000] into ($2,000, $3,000], ($3,000, $4,000], ($4,000, $5,000]
50
Concept hierarchy generation for categorical data
- Specification of a partial ordering of attributes explicitly at the schema level by users or experts
- Specification of a portion of a hierarchy by explicit data grouping
- Specification of a set of attributes, but not of their partial ordering
- Specification of only a partial set of attributes

51
Specification of a set of attributes
A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.

country             15 distinct values
province_or_state   65 distinct values
city                3,567 distinct values
street              674,339 distinct values
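This ordering heuristic is a one-liner once the distinct-value counts are known (the counts below are the ones from the slide):

```python
# Distinct-value count per attribute, as in the location example above.
counts = {
    "country": 15,
    "province_or_state": 65,
    "city": 3567,
    "street": 674339,
}

# Fewest distinct values -> top of the hierarchy; most -> lowest level.
hierarchy = sorted(counts, key=counts.get)
```

The result places country at the top and street at the lowest level of the generated hierarchy.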


52
