
UNIT 2: Data Preprocessing

Lecture Topic
**********************************************
Lecture 13   Why preprocess the data?
Lecture 14   Data cleaning
Lecture 15   Data integration and transformation
Lecture 16   Data reduction
Lecture 17   Discretization and concept hierarchy generation

1
Lecture 13
Why preprocess the data?

2
Lecture 13: Why Data Preprocessing?
- Data in the real world is:
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - A data warehouse needs consistent integration of quality data

3
Multi-Dimensional Measure of Data Quality
- A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
- Broad categories:
  - intrinsic, contextual, representational, and accessibility

4
Major Tasks in Data Preprocessing
- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data transformation
  - Normalization and aggregation
- Data reduction
  - Obtains a reduced representation, much smaller in volume, that produces the same or similar analytical results
- Data discretization
  - Part of data reduction, of particular importance for numerical data

5
Forms of data preprocessing (figure)

6
Lecture 14
Data cleaning

7
Data Cleaning
Data cleaning tasks:
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data

8
Missing Data
- Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to:
  - equipment malfunction
  - inconsistency with other recorded data, leading to deletion
  - data not entered due to misunderstanding
  - certain data not considered important at the time of entry
  - failure to register history or changes of the data
- Missing data may need to be inferred

9
How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing
- Fill in the missing value manually
- Use a global constant to fill in the missing value, e.g., "unknown"

10
How to Handle Missing Data?
- Use the attribute mean to fill in the missing value
- Use the attribute mean for all samples belonging to the same class to fill in the missing value
- Use the most probable value to fill in the missing value: inference-based methods such as a Bayesian formula or decision tree
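The three fill-in strategies above can be sketched in a few lines of Python; the records and income values here are invented purely for illustration:

```python
# Toy records: (class label, income); None marks a missing value.
rows = [("A", 50.0), ("A", None), ("B", 30.0), ("B", None), ("B", 40.0)]

def mean(xs):
    return sum(xs) / len(xs)

# (1) Fill with a global constant (a sentinel standing in for "unknown").
const_filled = [v if v is not None else -1.0 for _, v in rows]

# (2) Fill with the overall attribute mean.
overall = mean([v for _, v in rows if v is not None])
mean_filled = [v if v is not None else overall for _, v in rows]

# (3) Fill with the mean of samples belonging to the same class.
class_values = {}
for c, v in rows:
    if v is not None:
        class_values.setdefault(c, []).append(v)
class_means = {c: mean(vs) for c, vs in class_values.items()}
class_filled = [v if v is not None else class_means[c] for c, v in rows]
```

Note how the class-conditional fill gives the class-A tuple a different value (the class-A mean) than the global mean would.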

11
Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to:
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
- Other data problems that require data cleaning:
  - duplicate records
  - incomplete data
  - inconsistent data

12
How to Handle Noisy Data?
- Binning method:
  - first sort the data and partition it into (equal-frequency) bins
  - then smooth by bin means, bin medians, or bin boundaries
- Clustering:
  - detect and remove outliers
- Regression:
  - smooth by fitting the data to a regression function, e.g., linear regression

13
Simple Discretization Methods: Binning
- Equal-width (distance) partitioning:
  - Divides the range into N intervals of equal size: a uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N
  - The most straightforward method
  - But outliers may dominate the presentation
  - Skewed data is not handled well
- Equal-depth (frequency) partitioning:
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky

14
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
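The worked example above can be reproduced with a short sketch of equi-depth binning and the two smoothing schemes:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

# Equi-depth (equal-frequency) partitioning: 3 bins of 4 values each.
n_bins = 3
depth = len(prices) // n_bins
bins = [prices[i * depth:(i + 1) * depth] for i in range(n_bins)]

# Smoothing by bin means: every value is replaced by its bin's mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value snaps to the nearer bin boundary.
def snap(v, lo, hi):
    return lo if v - lo <= hi - v else hi

by_bounds = [[snap(v, b[0], b[-1]) for v in b] for b in bins]
```

Running this yields exactly the smoothed bins listed on the slide.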

15
Cluster Analysis (figure)

16
Regression (figure: data points in the x-y plane fitted by the line y = x + 1)

17
Lecture 15
Data integration and transformation

18
Data Integration
- Data integration: combines data from multiple sources into a coherent store
- Schema integration:
  - integrate metadata from different sources
  - Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust#
- Detecting and resolving data value conflicts:
  - for the same real-world entity, attribute values from different sources differ
  - possible reasons: different representations, different scales, e.g., metric vs. British units

19
Handling Redundant Data in Data Integration
- Redundant data occurs often when integrating multiple databases:
  - The same attribute may have different names in different databases
  - One attribute may be derived from an attribute in another table, e.g., annual revenue

20
Handling Redundant Data in Data Integration
- Redundant data may be detected by correlation analysis
- Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
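Correlation analysis for redundancy detection can be sketched with a plain Pearson coefficient; the revenue figures below are invented for illustration (an annual_revenue column derived from monthly revenue correlates perfectly and is therefore redundant):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical columns: monthly revenue and a derived annual revenue.
monthly = [10, 12, 9, 15, 11]
annual = [12 * m for m in monthly]

r = pearson(monthly, annual)  # near 1.0 -> one attribute is redundant
```

A coefficient near +1 or -1 flags one of the two attributes as a candidate for removal.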

21
Data Transformation
- Smoothing: remove noise from the data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing

22
Data Transformation
- Normalization: values scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction:
  - New attributes constructed from the given ones

23
Data Transformation: Normalization
- min-max normalization:

    v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

- z-score normalization:

    v' = (v - mean_A) / stand_dev_A

- normalization by decimal scaling:

    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
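The three normalization schemes can be sketched directly from the formulas; the income range, mean, and standard deviation used below are illustrative numbers, not values from this lecture:

```python
def min_max(v, mn, mx, new_mn=0.0, new_mx=1.0):
    """Min-max normalization onto [new_mn, new_mx]."""
    return (v - mn) / (mx - mn) * (new_mx - new_mn) + new_mn

def z_score(v, mean_a, stand_dev_a):
    """Z-score normalization."""
    return (v - mean_a) / stand_dev_a

def decimal_scale(v, j):
    """Decimal scaling; j is the smallest integer with max(|v'|) < 1."""
    return v / 10 ** j

# Illustrative: an income of 73,600 in the range [12,000, 98,000]
mm = min_max(73600, 12000, 98000)          # about 0.716
zs = z_score(73600, 54000, 16000)          # about 1.225
ds = decimal_scale(986, 3)                 # 0.986, since max |v| = 986
```

Each function maps one raw value v to its normalized v'; applying it column-wise normalizes the whole attribute.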

24
Lecture 16
Data reduction

25
Data Reduction
- A warehouse may store terabytes of data:
  - Complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction:
  - Obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

26
Data Reduction Strategies
- Data cube aggregation
- Attribute subset selection
- Dimensionality reduction
- Numerosity reduction
- Discretization and concept hierarchy generation

27
Data Cube Aggregation
- The lowest level of a data cube:
  - the aggregated data for an individual entity of interest
  - e.g., a customer in a phone-calling data warehouse
- Multiple levels of aggregation in data cubes:
  - Further reduce the size of the data to deal with
- Reference appropriate levels:
  - Use the smallest representation sufficient to solve the task
- Queries regarding aggregated information should be answered using the data cube, when possible

28
Dimensionality Reduction
- Feature selection (attribute subset selection):
  - Select a minimum set of features such that the probability distribution of the different classes, given the values of those features, is as close as possible to the original distribution given the values of all features
  - Reduces the number of attributes in the discovered patterns, making them easier to understand
- Heuristic methods:
  - stepwise forward selection
  - stepwise backward elimination
  - combining forward selection and backward elimination
  - decision tree induction

29
Wavelet Transforms
(figure: Haar-2 and Daubechies-4 wavelets)
- Discrete wavelet transform (DWT): linear signal processing
- Compressed approximation: store only a small fraction of the strongest wavelet coefficients
- Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space
- Method:
  - The length, L, must be an integer power of 2 (pad with 0s when necessary)
  - Each transform has 2 functions: smoothing and difference
  - They are applied to pairs of data points, resulting in two sets of data of length L/2
  - The two functions are applied recursively until the desired length is reached

30
Principal Component Analysis
- Given N data vectors from k dimensions, find c <= k orthogonal vectors that can best be used to represent the data
  - The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
- Each data vector is a linear combination of the c principal component vectors
- Works for numeric data only
- Used when the number of dimensions is large
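A minimal PCA sketch via the eigendecomposition of the covariance matrix (one standard way to compute it; the synthetic data below is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# N = 100 samples in k = 3 dimensions, with most variance along one axis.
X = rng.normal(size=(100, 3)) @ np.diag([5.0, 1.0, 0.1])

# Center the data, then eigendecompose the covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]           # sort descending by variance

c = 2                                        # keep c principal components
components = eigvecs[:, order[:c]]           # shape (3, c)

# Each data vector becomes a linear combination of the c components.
X_reduced = Xc @ components                  # shape (100, c)
```

The N vectors now live in c = 2 reduced dimensions, with the retained components capturing the directions of greatest variance.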

31
Principal Component Analysis (figure: axes X1, X2 with principal components Y1, Y2)

32
Attribute Subset Selection
- Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes
- The goal is to find a minimum set of attributes
- Uses basic heuristic methods of attribute selection

33
Heuristic Selection Methods
- There are 2^d possible sub-features of d features
- Several heuristic selection methods:
  - Stepwise forward selection
  - Stepwise backward elimination
  - Combination of forward selection and backward elimination
  - Decision tree induction

34
Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

(figure: decision tree with root A4?, children A1? and A6?, and leaves Class 1 / Class 2)

=> Reduced attribute set: {A1, A4, A6}

35
Numerosity Reduction
- Parametric methods:
  - Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  - Log-linear models: obtain the value at a point in m-D space as a product over appropriate marginal subspaces
- Non-parametric methods:
  - Do not assume models
  - Major families: histograms, clustering, sampling

36
Regression and Log-Linear Models
- Linear regression: data are modeled to fit a straight line
  - Often uses the least-squares method to fit the line
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: approximates discrete multidimensional probability distributions

37
Regression Analysis and Log-Linear Models
- Linear regression: Y = α + βX
  - The two parameters, α and β, specify the line and are estimated from the data at hand, using the least-squares criterion on the known values of Y1, Y2, ..., X1, X2, ....
- Multiple regression: Y = b0 + b1X1 + b2X2 + ...
  - Many nonlinear functions can be transformed into the above.
- Log-linear models:
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables.
  - Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd
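The least-squares fit of Y = α + βX has a closed form; a short sketch, with toy points chosen to lie exactly on y = x + 1 so the recovered parameters are easy to check:

```python
def fit_line(xs, ys):
    """Least-squares estimates of alpha and beta for Y = alpha + beta*X."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    alpha = my - beta * mx
    return alpha, beta

# Points on y = x + 1 recover alpha = 1, beta = 1.
alpha, beta = fit_line([0, 1, 2, 3], [1, 2, 3, 4])
```

With noisy data the same formulas give the line minimizing the sum of squared residuals.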

38
Histograms
- A popular data reduction technique
- Divide data into buckets and store the average (sum) for each bucket
- Can be constructed optimally in one dimension using dynamic programming
- Related to quantization problems
(figure: example histogram of counts over the value range 10,000-90,000)
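Bucketing with stored summaries can be sketched as follows; the price values and bucket width are invented for illustration:

```python
# Equal-width buckets: store only (range, (count, average)) per bucket
# instead of the raw values.
data = [12000, 15000, 31000, 33000, 52000, 55000, 58000, 71000, 90000]
lo, width = 10000, 20000

buckets = {}
for v in data:
    b = (v - lo) // width               # bucket index for value v
    buckets.setdefault(b, []).append(v)

summary = {
    (lo + b * width, lo + (b + 1) * width): (len(vs), sum(vs) / len(vs))
    for b, vs in sorted(buckets.items())
}
```

The reduced representation keeps one (count, average) pair per bucket rather than every tuple.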

39
Clustering
- Partition the data set into clusters, and store only the cluster representation
- Can be very effective if the data is clustered, but not if the data is smeared
- Can use hierarchical clustering, stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms

40
Sampling
- Allows a large data set to be represented by a much smaller random sample of the data
- Let D be a large data set containing N tuples
- Methods to reduce data set D:
  - Simple random sample without replacement (SRSWOR)
  - Simple random sample with replacement (SRSWR)
  - Cluster sample
  - Stratified sample
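The sampling methods above map directly onto the standard library; the data set and strata here are synthetic stand-ins:

```python
import random

random.seed(42)
D = list(range(1000))   # a stand-in "large" data set of N = 1000 tuples
n = 10                  # desired sample size

# SRSWOR: simple random sample without replacement.
srswor = random.sample(D, n)

# SRSWR: simple random sample with replacement (duplicates possible).
srswr = [random.choice(D) for _ in range(n)]

# Stratified sample: draw proportionally within each stratum.
strata = {"young": list(range(300)), "senior": list(range(300, 1000))}
stratified = {
    name: random.sample(members, max(1, n * len(members) // len(D)))
    for name, members in strata.items()
}
```

The stratified draw keeps each stratum's share of the sample proportional to its share of D (3 young, 7 senior here).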

41
Sampling (figure: SRSWOR/SRSWR samples drawn from the raw data)

42
Sampling (figure: raw data vs. a cluster/stratified sample)

43
Lecture 17
Discretization and concept hierarchy generation

44
Discretization
- Three types of attributes:
  - Nominal: values from an unordered set
  - Ordinal: values from an ordered set
  - Continuous: real numbers
- Discretization: divide the range of a continuous attribute into intervals
  - Some classification algorithms only accept categorical attributes
  - Reduces data size
  - Prepares for further analysis

45
Discretization and Concept Hierarchy
- Discretization:
  - reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
- Concept hierarchies:
  - reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).

46
Discretization and concept hierarchy generation for numeric data
- Binning
- Histogram analysis
- Clustering analysis
- Entropy-based discretization
- Discretization by intuitive partitioning

47
Entropy-Based Discretization
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

    E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)

- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
- The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g.,

    Ent(S) - E(T, S) < δ

- Experiments show that it may reduce data size and improve classification accuracy.
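Boundary selection by minimizing E(S, T) can be sketched over a toy sample set; the attribute values and class labels below are invented, with a clean class boundary between 30 and 40:

```python
from math import log2

def entropy(labels):
    """Ent(S): Shannon entropy of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def split_entropy(values, labels, t):
    """E(S, T): weighted entropy after splitting at boundary t."""
    left = [c for v, c in zip(values, labels) if v <= t]
    right = [c for v, c in zip(values, labels) if v > t]
    n = len(labels)
    return (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)

vals = [10, 20, 30, 40, 50, 60]
cls = ["a", "a", "a", "b", "b", "b"]

# Pick the boundary minimizing E(S, T) over all candidate boundaries.
best_t = min(vals[:-1], key=lambda t: split_entropy(vals, cls, t))
```

Splitting at t = 30 separates the classes perfectly, so E(S, T) drops to 0 there.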
48
Discretization by intuitive partitioning
- The 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals:
  - If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equal-width intervals
  - If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
  - If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
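The interval-count rule above translates into a small lookup function:

```python
def intervals_345(distinct_msd_values):
    """Number of equal-width intervals suggested by the 3-4-5 rule, given
    how many distinct values the range covers at its most significant
    digit."""
    if distinct_msd_values in (3, 6, 7, 9):
        return 3
    if distinct_msd_values in (2, 4, 8):
        return 4
    if distinct_msd_values in (1, 5, 10):
        return 5
    raise ValueError("3-4-5 rule does not cover this count")
```

For instance, a range from -$1,000 to $2,000 covers 3 distinct values at the thousands digit, so the rule partitions it into 3 intervals, as in the example that follows.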

49
Example of the 3-4-5 rule

Step 1: profit values range from Min = -$351 to Max = $4,700, with Low (5th percentile) = -$159 and High (95th percentile) = $1,838.
Step 2: msd = 1,000, so round to Low' = -$1,000 and High' = $2,000.
Step 3: (-$1,000, $2,000) covers 3 distinct values at the msd, so partition into 3 equal-width intervals: (-$1,000, 0], (0, $1,000], ($1,000, $2,000].
Step 4: adjust for the actual extremes: the first interval shrinks to (-$400, 0] (since Min = -$351), and a new interval ($2,000, $5,000] is added (since Max = $4,700).
Step 5: recursively partition each interval by the same rule:
- (-$400, 0] into (-$400, -$300], (-$300, -$200], (-$200, -$100], (-$100, 0]
- (0, $1,000] into (0, $200], ($200, $400], ($400, $600], ($600, $800], ($800, $1,000]
- ($1,000, $2,000] into ($1,000, $1,200], ($1,200, $1,400], ($1,400, $1,600], ($1,600, $1,800], ($1,800, $2,000]
- ($2,000, $5,000] into ($2,000, $3,000], ($3,000, $4,000], ($4,000, $5,000]
50
Concept hierarchy generation for categorical data
- Specification of a partial ordering of attributes explicitly at the schema level by users or experts
- Specification of a portion of a hierarchy by explicit data grouping
- Specification of a set of attributes, but not of their partial ordering
- Specification of only a partial set of attributes

51
Specification of a set of attributes
A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.

country             15 distinct values
province_or_state   65 distinct values
city                3,567 distinct values
street              674,339 distinct values
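This ordering heuristic is a one-liner once the distinct-value counts are known (the counts below are the ones from the slide):

```python
# Distinct-value count per attribute, as in the location example above.
counts = {
    "country": 15,
    "province_or_state": 65,
    "city": 3567,
    "street": 674339,
}

# Fewest distinct values -> top of the hierarchy; most -> lowest level.
hierarchy = sorted(counts, key=counts.get)
```

The result places country at the top and street at the lowest level of the generated hierarchy.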


52
