
7/19/2010

Intro: ML Learning Process
Data mining belongs to a wider family of machine learning applications.
Specifics of the process may differ greatly due to several factors:
  Application type
  Tools and techniques in use
  Source and type of data
  Target application
  Experience of practitioners/organization
  Methodology in use, etc.

03. Data Mining Process

Data Mining as an Example ML Process
Data Mining Definitions:
  The extraction of implicit, previously unknown and potentially useful information from data in databases.
  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data.
  Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.

Data Mining and Business Intelligence
[Figure: layered pyramid; the potential to support business decisions increases toward the top. Layers, top to bottom:]
  Decision Making
  Data Presentation: Visualization Techniques
  Data Mining: Information Discovery
  Data Exploration: Statistical Summary, Querying, and Reporting
  Data Preprocessing/Integration, Data Warehouses
  Data Sources: Paper, Files, Web Documents, Scientific Experiments, Database Systems
[Roles shown beside the pyramid, top to bottom: End User, Business Analyst, Data Analyst, DBA]

Alternative Names:
  Knowledge discovery (mining) in databases (KDD)
  Knowledge extraction
  Data/pattern analysis
  Data archeology
  Data dredging
  Information harvesting
  Business intelligence, etc.

Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the center, drawing on machine learning, pattern recognition, statistics, applications, visualization, algorithms, database technology, and high-performance computing]

Why Not Traditional Data Analysis?
Tremendous amount of data
  Algorithms must be highly scalable to handle volumes such as terabytes of data
High dimensionality of data
  A microarray may have tens of thousands of dimensions
High complexity of data
  Data streams and sensor data
  Time-series data, temporal data, sequence data
  Structure data, graphs, social networks and multi-linked data
  Heterogeneous databases and legacy databases
  Spatial, spatiotemporal, multimedia, text and Web data
  Software programs, scientific simulations
New and sophisticated applications


What Data to Mine?
  Business transactions
  Scientific data
  Medical and personal data
  Surveillance video and pictures
  Satellite sensing data
  Games
  Digital media
  CAD and software engineering data
  Virtual worlds data
  Text reports, memos, email, blogs
  WWW repositories

Where Is This Data?
  Flat files
  Relational databases
  Data warehouses
  Transaction databases
  Multimedia databases
  Spatial databases
  Time-series databases
  World Wide Web
  Hard-copy documents

What Can Be Discovered?
  Characterization rules
  Discrimination rules
  Association rules
  Classification rules
  Prediction
  Clustering
  Outlier analysis
  Evolution and deviation analysis

Classification of Data Mining Systems
By type of data source mined:
  Spatial data, multimedia data, time-series data, text data, WWW data, etc.
By data model drawn on:
  Relational database, data warehouse, transactional database, object-oriented database, etc.
By kind of knowledge discovered:
  Characterization, discrimination, association, classification, clustering, prediction, etc.
By mining technique used:
  Neural network, genetic algorithm, statistics, visualization, database-oriented, etc.
  Also takes into account the degree of user interaction involved in the process: query-driven, interactive exploratory systems, autonomous systems

Some Challenges for Data Mining Practitioners
  Data source issues: access; quantities; formats; quality
  Business/domain understanding and related challenges: incorporation of constraints, expert knowledge and background knowledge in data mining
  Social issues: resistance, politics, misinterpretation of role
  Protection of security, integrity and privacy in data mining
  High data dimensionality; user interface issues
  Handling noise, uncertainty and incompleteness of data
  Pattern evaluation and knowledge integration
  Mining diverse and heterogeneous kinds of data: e.g., bioinformatics, Web, software/system engineering, information networks
  Performance issues: efficiency and scalability of data mining algorithms

Ex. 1: Market Analysis and Management
Where does the data come from? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
  Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.
  Determine customer purchasing patterns over time
Cross-market analysis: find associations/correlations between product sales, and predict based on such associations
Customer profiling: what types of customers buy what products (clustering or classification)
Customer requirement analysis
  Identify the best products for different groups of customers
  Predict what factors will attract new customers
Provision of summary information
  Multidimensional summary reports
  Statistical summary information (data central tendency and variation)
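
The "statistical summary information" item can be made concrete with a small sketch. The figures and the name `monthly_spend` are invented for illustration; only Python's standard `statistics` module is used. Mean, median and mode describe central tendency, while standard deviation and range describe variation.

```python
import statistics

# Hypothetical monthly spend (in dollars) for six customers; illustrative only.
monthly_spend = [120.0, 85.5, 240.0, 85.5, 310.25, 95.0]

summary = {
    # Central tendency
    "mean": statistics.mean(monthly_spend),
    "median": statistics.median(monthly_spend),
    "mode": statistics.mode(monthly_spend),
    # Variation
    "stdev": statistics.stdev(monthly_spend),  # sample standard deviation
    "range": max(monthly_spend) - min(monthly_spend),
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

A mean well above the median, as in this made-up sample, hints at a few high spenders pulling the average up, which is the kind of observation a summary report might surface.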


Ex. 2: Corporate Analysis & Risk Management
Finance planning and asset evaluation
  Cash flow analysis and prediction
  Contingent claim analysis to evaluate assets
  Cross-sectional and time-series analysis (financial ratio, trend analysis, etc.)
Resource planning
  Summarize and compare the resources and spending
Competition
  Monitor competitors and market directions
  Group customers into classes and apply a class-based pricing procedure
  Set pricing strategy in a highly competitive market

Ex. 3: Fraud Detection & Mining Unusual Patterns
Approaches: clustering and model construction for frauds, outlier analysis
Applications: health care, retail, credit card service, telecomm.
  Auto insurance: ring of collisions
  Money laundering: suspicious monetary transactions
  Medical insurance
    Professional patients, ring of doctors, and ring of references
    Unnecessary or correlated screening tests
  Telecommunications: phone-call fraud
    Phone call model: destination of the call, duration, time of day or week
    Analyze patterns that deviate from an expected norm
  Retail industry
    Analysts estimate that 38% of retail shrink is due to dishonest employees
  Anti-terrorism
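
One simple reading of "analyze patterns that deviate from an expected norm" is a z-score check on call duration. The sketch below is illustrative only: the durations and the 2.5 cut-off are assumptions, and a z-score is just one of many outlier-analysis techniques, not a method prescribed by these slides.

```python
import statistics

# Hypothetical call durations in minutes for one subscriber; illustrative only.
durations = [3, 5, 4, 6, 5, 4, 95, 5, 3, 4]

mean = statistics.mean(durations)
stdev = statistics.stdev(durations)  # sample standard deviation

Z_THRESHOLD = 2.5  # assumed cut-off; a real system would tune this

# Flag (index, duration) pairs that lie far from the subscriber's norm.
flagged = [
    (i, d)
    for i, d in enumerate(durations)
    if abs(d - mean) / stdev > Z_THRESHOLD
]

print(flagged)
```

A fuller phone call model would also use destination and time of day or week, as the slide lists, rather than duration alone.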

Are All the Discovered Patterns Interesting?
Data mining may generate thousands of patterns: not all of them are interesting
  Suggested approach: human-centered, query-based, focused mining
Interestingness measures
  A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
  Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Find All and Only Interesting Patterns?
Find all the interesting patterns: completeness
  Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
  Heuristic vs. exhaustive search
  Association vs. classification vs. clustering
Search for only interesting patterns: an optimization problem
  Can a data mining system find only the interesting patterns?
  Approaches
    First generate all the patterns and then filter out the uninteresting ones
    Generate only the interesting patterns: mining query optimization
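
Support and confidence, named above as objective interestingness measures, can be computed directly from transaction data. This is a minimal sketch on an invented market-basket sample; the function names `support` and `confidence` are chosen here for illustration and use the standard definitions for a rule A -> B: support is the fraction of transactions containing A and B together, and confidence is support(A and B) divided by support(A).

```python
# Hypothetical market-basket transactions; illustrative only.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    base = support(antecedent, transactions)
    return both / base if base else 0.0

# Rule under test: {diapers} -> {beer}
rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print(rule_support, rule_confidence)
```

Filtering rules by minimum support and confidence thresholds is one way to realize the "generate all the patterns, then filter out the uninteresting ones" approach.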

Other Pattern Mining Issues
Precise patterns vs. approximate patterns
  Association and correlation mining: possible to find sets of precise patterns
    But approximate patterns can be more compact and sufficient
    How to find high-quality approximate patterns?
  Gene sequence mining: approximate patterns are inherent
    How to derive efficient approximate pattern mining algorithms?
Constrained vs. non-constrained patterns
  Why constraint-based mining?
  What are the possible kinds of constraints? How to push constraints into the mining process?

CRISP-DM
Expanded: CRoss-Industry Standard Process for Data Mining
Why should there be a standard process?
  The data mining process must be reliable and repeatable by people with little data mining background.


Why Should There Be a Standard Process?
  Framework for recording experience
  Allows projects to be replicated
  Aid to project planning and management
  Comfort factor for new adopters
  Demonstrates maturity of data mining
  Reduces dependency on stars

Process Standardization
Initiative launched in late 1996 by three veterans of the data mining market:
  DaimlerChrysler (then Daimler-Benz), SPSS (then ISL), NCR
Developed and refined through a series of workshops (from 1997 to 1999)
Over 300 organizations contributed to the process model
Published CRISP-DM 1.0 (1999)
Over 200 members of the CRISP-DM SIG worldwide
  DM vendors: SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, etc.
  System suppliers/consultants: Cap Gemini, ICL Retail, Deloitte & Touche, etc.
  End users: BT, ABB, Lloyds Bank, AirTouch, Experian, etc.

CRISP-DM: Overview
  Data mining methodology
  Process model
  For anyone
  Provides a complete blueprint
  Life cycle: 6 phases

CRISP-DM
  Non-proprietary
  Application/industry neutral
  Tool neutral
  Focus on business issues as well as technical analysis
  Framework for guidance
  Experience base
  Templates for analysis

CRISP-DM: Phases
Business Understanding
  Understanding of project objectives and requirements; data mining problem definition
Data Understanding
  Initial data collection and familiarization; identification of data quality problems
Data Preparation
  Table, record and attribute selection; data transformation and cleaning
Modeling
  Selection and application of modeling techniques; calibration of parameters
Evaluation
  Evaluation of the achievement of business objectives and issues
Deployment
  Deployment of the resulting model; implementation of a repeatable data mining process

Phases and Tasks
Business Understanding: Determine Business Objectives; Assess Situation; Determine Data Mining Goals; Produce Project Plan
Data Understanding: Collect Initial Data; Describe Data; Explore Data; Verify Data Quality
Data Preparation: Select Data; Clean Data; Construct Data; Integrate Data; Format Data
Modeling: Select Modeling Technique; Generate Test Design; Build Model; Assess Model
Evaluation: Evaluate Results; Review Process; Determine Next Steps
Deployment: Plan Deployment; Plan Monitoring & Maintenance; Produce Final Report; Review Project


Phase 1. Business Understanding
  Statement of Business Objective
  Statement of Data Mining Objective
  Statement of Success Criteria

Focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives.

Phase 1. Business Understanding
Determine business objectives
  Thoroughly understand, from a business perspective, what the client really wants to accomplish
  Uncover important factors, at the beginning, that can influence the outcome of the project
  Neglecting this step risks expending a great deal of effort producing the right answers to the wrong questions
Assess situation
  More detailed fact-finding about all of the resources, constraints, assumptions and other factors that should be considered
  Flesh out the details

Phase 1. Business Understanding
Determine data mining goals
  A business goal states objectives in business terminology
  A data mining goal states project objectives in technical terms
  Example:
    Business goal: Increase catalog sales to existing customers.
    Data mining goal: Predict how many music tracks a customer will buy, given their purchases over the past three years, demographic information (age, salary, city) and the price of the item.
Produce project plan
  Describe the intended plan for achieving the data mining goals and the business goals
  The plan should specify the anticipated set of steps to be performed during the rest of the project, including an initial selection of tools and techniques

Phase 2. Data Understanding
  Explore the Data
  Verify the Quality
  Find Outliers

Starts with an initial data collection and proceeds with activities to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

Phase 2. Data Understanding
Collect initial data
  Acquire within the project the data listed in the project resources
  Includes data loading if necessary for data understanding
  Possibly leads to initial data preparation steps
  If acquiring multiple data sources, integration is an additional issue, either here or in the later data preparation phase
Describe data
  Examine the gross or surface properties of the acquired data
  Report on the results

Phase 2. Data Understanding
Explore data
  Tackles the data mining questions that can be addressed using querying, visualization and reporting, including:
    Distribution of key attributes, results of simple aggregations
    Relations between pairs or small numbers of attributes
    Properties of significant sub-populations, simple statistical analyses
  May address directly the data mining goals
  May contribute to or refine the data description and quality reports
  May feed into the transformation and other data preparation needed
Verify data quality
  Examine the quality of the data, addressing questions such as:
    Is the data complete?
    Are there missing values in the data?
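
The two data quality questions above can be answered with a short completeness check. This is a sketch under stated assumptions: the records, column names and the helpers `is_missing` and `missing_report` are hypothetical, and "missing" is taken here to mean `None` or a blank string.

```python
# Hypothetical customer records; None and "" stand in for missing values.
records = [
    {"customer_id": 1, "age": 34, "city": "Springfield", "salary": 52000},
    {"customer_id": 2, "age": None, "city": "Riverton", "salary": 48000},
    {"customer_id": 3, "age": 29, "city": "", "salary": None},
    {"customer_id": 4, "age": 41, "city": "Springfield", "salary": 61000},
]

def is_missing(value):
    """Treat None and blank strings as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, str) and value.strip() == "")

def missing_report(rows):
    """Return the fraction of missing values per column."""
    columns = rows[0].keys()
    return {
        col: sum(is_missing(r.get(col)) for r in rows) / len(rows)
        for col in columns
    }

report = missing_report(records)
complete_rows = [r for r in records if not any(is_missing(v) for v in r.values())]

print(report)
print(len(complete_rows), "of", len(records), "records are complete")
```

A report like this would typically feed the data quality report and inform the cleaning decisions made in the data preparation phase.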


Phase 3. Data Preparation
Usually takes over 90% of the time
  Collection
  Assessment
  Consolidation and cleaning
  Data selection
  Transformations

Covers all activities to construct the final dataset from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record and attribute selection as well as transformation and cleaning of data for modeling tools.

Phase 3. Data Preparation
Select data
  Decide on the data to be used for analysis
  Criteria include relevance to the data mining goals, quality and technical constraints such as limits on data volume or data types
  Covers selection of attributes as well as selection of records in a table
Clean data
  Raise the data quality to the level required by the selected analysis techniques
  May involve selection of clean subsets of the data, the insertion of suitable defaults or more ambitious techniques such as the estimation of missing data by modeling

Phase 3. Data Preparation
Construct data
  Constructive data preparation operations such as the production of derived attributes, entire new records or transformed values for existing attributes
Integrate data
  Methods whereby information is combined from multiple tables or records to create new records or values
Format data
  Formatting transformations refer to primarily syntactic modifications made to the data that do not change its meaning, but might be required by the modeling tool
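
Two of the tasks above, "insertion of suitable defaults" (clean data) and "production of derived attributes" (construct data), can be sketched in a few lines. The records, the choice of the median as the default, and the derived attribute `avg_order_value` are all assumptions made for illustration; the slides do not prescribe any particular default or attribute.

```python
import statistics

# Hypothetical raw customer records; None marks a missing age.
raw = [
    {"customer_id": 1, "age": 34, "purchases": 12, "total_spent": 600.0},
    {"customer_id": 2, "age": None, "purchases": 3, "total_spent": 90.0},
    {"customer_id": 3, "age": 29, "purchases": 0, "total_spent": 0.0},
]

# Clean data: use the median of the known ages as the default (an assumed policy).
known_ages = [r["age"] for r in raw if r["age"] is not None]
default_age = statistics.median(known_ages)

prepared = []
for record in raw:
    row = dict(record)  # copy so the raw data stays untouched
    if row["age"] is None:
        row["age"] = default_age
    # Construct data: derived attribute, guarding against division by zero.
    row["avg_order_value"] = (
        row["total_spent"] / row["purchases"] if row["purchases"] else 0.0
    )
    prepared.append(row)

for row in prepared:
    print(row)
```

Copying each record rather than editing it in place keeps the raw data available, which helps when preparation steps have to be repeated, as the phase description anticipates.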

Phase 4. Modeling
  Select the modeling technique (based upon the data mining objective)
  Build model (parameter settings)
  Assess model (rank the models)

Various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often necessary.

Phase 4. Modeling
Select modeling technique
  Select the actual modeling technique that is to be used
    Example: decision tree, neural network
  If multiple techniques are applied, perform this task for each technique separately
Generate test design
  Before actually building a model, generate a procedure or mechanism to test the model's quality and validity
    Example: In classification, it is common to use error rates as quality measures for data mining models. Therefore, typically separate the dataset into a train set and a test set, build the model on the train set and estimate its quality on the separate test set.

Phase 4. Modeling
Build model
  Run the modeling tool on the prepared dataset to create one or more models
Assess model
  Interpret the models according to domain knowledge, the data mining success criteria and the desired test design
  Judge the success of the application of modeling and discovery techniques more technically
  Contact business analysts and domain experts later in order to discuss the data mining results in the business context
  Consider only models at this stage, whereas the evaluation phase also takes into account all other results that were produced in the course of the project
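
The test design described in this phase (separate the dataset into train and test sets, build on the train set, estimate the error rate on the test set) can be sketched as follows. Everything here is a toy assumption: the dataset, the 70/30 split, the fixed seed, and the threshold "model", which merely stands in for a real technique such as a decision tree.

```python
import random

# Hypothetical labelled dataset: one numeric attribute x, class 1 when x >= 50.
data = [(x, int(x >= 50)) for x in range(0, 100, 5)]

# Test design: shuffle with a fixed seed, then hold out 30% as the test set.
rng = random.Random(0)
shuffled = data[:]
rng.shuffle(shuffled)
split = len(shuffled) * 7 // 10
train, test = shuffled[:split], shuffled[split:]

def fit_threshold(rows):
    """Toy model: the midpoint between the mean x of each class."""
    xs0 = [x for x, y in rows if y == 0]
    xs1 = [x for x, y in rows if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def error_rate(threshold, rows):
    """Fraction of rows whose predicted class differs from the true class."""
    wrong = sum(1 for x, y in rows if int(x >= threshold) != y)
    return wrong / len(rows)

threshold = fit_threshold(train)          # built on the train set only
test_error = error_rate(threshold, test)  # estimated on the held-out test set

print(f"threshold={threshold:.1f}, test error rate={test_error:.2f}")
```

The point of the design is that `test` is never seen by `fit_threshold`, so the reported error rate estimates quality on new data rather than on the data the model was built from.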


Phase 5. Evaluation
Evaluation of model
  How well it performed on test data
Methods and criteria
  Depend on model type
Interpretation of model
  Important or not, easy or hard, depends on algorithm

Thoroughly evaluate the model and review the steps executed to construct the model to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

Phase 5. Evaluation
Evaluate results
  Assess the degree to which the model meets the business objectives
  Seek to determine if there is some business reason why this model is deficient
  Test the model(s) on test applications in the real application if time and budget constraints permit
  Also assess other data mining results generated
  Unveil additional challenges, information or hints for future directions

Phase 5. Evaluation
Review process
  Do a more thorough review of the data mining engagement in order to determine if there is any important factor or task that has somehow been overlooked
  Review the quality assurance issues
    Example: Did we correctly build the model?
Determine next steps
  Decide how to proceed at this stage
  Decide whether to finish the project and move on to deployment, if appropriate, or whether to initiate further iterations or set up new data mining projects
  Include analyses of remaining resources and budget that influence the decisions

Phase 6. Deployment
Determine how the results need to be utilized
  Who needs to use them?
  How often do they need to be used?
Deploy data mining results by
  Scoring a database, utilizing results as business rules, interactive scoring online

The knowledge gained will need to be organized and presented in a way that the customer can use it. However, depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.
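
"Scoring a database" can be illustrated with a small sketch that uses Python's built-in `sqlite3` module and an in-memory database. The table layout, the column names and the `score` function are hypothetical; in a real deployment `score` would wrap the model produced in the modeling phase rather than the simple rule assumed here.

```python
import sqlite3

# In-memory database standing in for the operational customer table (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, purchases INTEGER, score REAL)"
)
conn.executemany(
    "INSERT INTO customers (id, purchases) VALUES (?, ?)",
    [(1, 12), (2, 3), (3, 0)],
)

def score(purchases):
    """Hypothetical stand-in for a trained model: propensity capped at 1.0."""
    return min(1.0, purchases / 10)

# Score every record and write the result back to the database.
rows = conn.execute("SELECT id, purchases FROM customers").fetchall()
conn.executemany(
    "UPDATE customers SET score = ? WHERE id = ?",
    [(score(purchases), cust_id) for cust_id, purchases in rows],
)
conn.commit()

scored = conn.execute("SELECT id, score FROM customers ORDER BY id").fetchall()
print(scored)
```

Batch scoring like this suits results that are used periodically; interactive online scoring would instead call the model per request.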

Phase 6. Deployment
Plan deployment
  To deploy the data mining result(s) into the business, take the evaluation results and determine a strategy for deployment
  Document the procedure for later deployment
Plan monitoring and maintenance
  Important if the data mining results become part of the day-to-day business and its environment
  Helps to avoid unnecessarily long periods of incorrect usage of data mining results
  Needs a detailed plan for the monitoring process
  Takes into account the specific type of deployment

Phase 6. Deployment
Produce final report
  The project leader and team write up a final report
  May be only a summary of the project and its experiences
  May be a final and comprehensive presentation of the data mining result(s)
Review project
  Assess what went right and what went wrong, what was done well and what needs to be improved
