03. Data Mining Process
Data Mining as an example ML Process
Intro: ML Learning Process
  Data Mining belongs to a wider family of machine learning applications
  Specifics of the process may differ greatly due to several factors:
    Application type
    Tools and techniques in use
    Source and type of data
    Target application
    Experience of practitioners/organization
    Methodology in use
    etc.
Data Mining Definitions:
  The extraction of implicit, previously unknown and potentially useful information from data in databases.
  Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Data Mining and Business Intelligence
(Pyramid figure; levels listed from top to bottom, with increasing potential to support business decisions toward the top. Roles shown beside the pyramid: End User near the top, DBA near the bottom.)
  Decision Making
  Data Presentation: Visualization Techniques
  Data Mining: Information Discovery
  Data Exploration: Statistical Summary, Querying, and Reporting
  Data Preprocessing/Integration, Data Warehouses
  Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Alternative Names:
  Knowledge discovery (mining) in databases (KDD)
  Knowledge extraction
  Data/pattern analysis
  Data archeology
  Data dredging
  Information harvesting
  Business Intelligence, etc.
Data Mining: Confluence of Multiple Disciplines
(Figure: Data Mining at the center, drawing on Machine Learning, Pattern Recognition, Statistics, Applications, Visualization, Algorithm, Database Technology, High-Performance Computing)
Why Not Traditional Data Analysis?
  Tremendous amount of data
    Algorithms must be highly scalable to handle volumes such as terabytes of data
  High dimensionality of data
    Microarray may have tens of thousands of dimensions
  High complexity of data
    Data streams and sensor data
    Time-series data, temporal data, sequence data
    Structure data, graphs, social networks and multi-linked data
    Heterogeneous databases and legacy databases
    Spatial, spatiotemporal, multimedia, text and Web data
    Software programs, scientific simulations
  New and sophisticated applications
7/19/2010
What Data to Mine?
  Business transactions
  Scientific data
  Medical and personal data
  Surveillance video and pictures
  Satellite sensing data
  Games
  Digital media
  CAD and software engineering data
  Virtual worlds data
  Text reports, memos, email, blogs
  WWW repositories
Where is this data?
  Flat files
  Relational databases
  Data warehouses
  Transaction databases
  Multimedia databases
  Spatial databases
  Time-series databases
  World Wide Web
  Hard copy documents
What can be discovered?
  Characterization rules
  Discrimination rules
  Association rules
  Classification rules
  Prediction
  Clustering
  Outlier analysis
  Evolution and deviation analysis
Classification of Data Mining Systems
  By type of data source mined:
    Spatial data, multimedia data, time-series data, text data, WWW data, etc.
  By data model drawn on:
    Relational database, data warehouse, transactional database, object-oriented database, etc.
  By mining technique used:
    Neural network, genetic algorithm, statistics, visualization, database-oriented, etc.
    Also takes into account the degree of user interaction involved in the process: query-driven, interactive exploratory systems, autonomous systems
Some Challenges for Data Mining Practitioners
  Data source issues: access; quantities; formats; quality
  Business/domain understanding and related challenges: incorporation of constraints, expert knowledge and background knowledge in data mining
  Social issues: resistance, politics, misinterpretation of role
  Protection of security, integrity and privacy in data mining
  High data dimensionality; user interface issues
  Handling noise, uncertainty and incompleteness of data
  Pattern evaluation and knowledge integration
  Mining diverse and heterogeneous kinds of data: e.g., bioinformatics, Web, software/system engineering, information networks
  Performance issues: efficiency and scalability of data mining algorithms
Ex. 1: Market Analysis and Management
  Where does the data come from? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
  Target marketing
    Find clusters of "model" customers who share the same characteristics: interest, income level, spending habits, etc.
    Determine customer purchasing patterns over time
  Provision of summary information
    Multidimensional summary reports
    Statistical summary information (data central tendency and variation)
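As a minimal sketch of "statistical summary information", the snippet below computes measures of central tendency and variation with Python's standard library. The data values and the name monthly_spend are hypothetical and only illustrate the idea.

```python
import statistics

# Hypothetical monthly spend per customer (illustrative values only).
monthly_spend = [120.0, 80.0, 100.0, 95.0, 105.0]

summary = {
    "mean": statistics.mean(monthly_spend),      # central tendency
    "median": statistics.median(monthly_spend),  # central tendency, robust to outliers
    "stdev": statistics.stdev(monthly_spend),    # variation (sample standard deviation)
    "range": max(monthly_spend) - min(monthly_spend),  # variation (spread)
}
print(summary)
```

In a multidimensional summary report, the same measures would typically be computed per group (for example per region or per product line) rather than over the whole data set.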
Ex. 2: Corporate Analysis & Risk Management
  Finance planning and asset evaluation
    Cash flow analysis and prediction
    Contingent claim analysis to evaluate assets
    Cross-sectional and time-series analysis (financial ratio, trend analysis, etc.)
  Resource planning
    Summarize and compare the resources and spending
  Competition
    Monitor competitors and market directions
    Group customers into classes and a class-based pricing procedure
    Set pricing strategy in a highly competitive market

  Telecommunications: phone-call fraud
    Phone call model: destination of the call, duration, time of day or week.
    Analyze patterns that deviate from an expected norm
  Retail industry
    Analysts estimate that 38% of retail shrink is due to dishonest employees
  Anti-terrorism
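The telecommunications item above ("analyze patterns that deviate from an expected norm") can be sketched with a simple z-score test. This is an illustration under stated assumptions, not the method used by any particular operator: only call duration is modeled (the slide also lists destination and time of day or week), and the function name, data and threshold of 3 standard deviations are hypothetical.

```python
import statistics

def flag_unusual_calls(history, new_calls, threshold=3.0):
    """Return the new call durations that deviate strongly from the customer's norm.

    Assumes the history has some variation (standard deviation > 0).
    """
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    flagged = []
    for duration in new_calls:
        z = (duration - mean) / stdev  # distance from the norm in standard deviations
        if abs(z) > threshold:
            flagged.append(duration)
    return flagged

# Hypothetical past call durations in minutes for one customer.
history = [3.0, 4.0, 5.0, 4.0, 3.0, 5.0, 4.0]
print(flag_unusual_calls(history, [4.5, 62.0]))
```

With this history, a 4.5-minute call lies within the customer's usual range, while a 62-minute call is flagged for review.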
Are All the Discovered Patterns Interesting?
  Data mining may generate thousands of patterns: not all of them are interesting
    Suggested approach: human-centered, query-based, focused mining
Find All and Only Interesting Patterns?
  Find all the interesting patterns: completeness
    Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
    Heuristic vs. exhaustive search
    Association vs. classification vs. clustering
  Search for only interesting patterns: an optimization problem
    Can a data mining system find only the interesting patterns?
    Approaches
      First generate all the patterns and then filter out the uninteresting ones
      Generate only the interesting patterns: mining query optimization
Interestingness measures
  A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
  Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.
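The two objective measures named above, support and confidence, can be sketched for association rules as follows. The transactions and item names are hypothetical; the definitions are the usual ones (support as the fraction of transactions containing an itemset, confidence as an estimate of the conditional probability of the consequent given the antecedent).

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(transactions, {"bread", "milk"}))       # 2 of 4 transactions
print(confidence(transactions, {"bread"}, {"milk"}))  # 2 of the 3 bread transactions
```

A mining system would typically keep only rules whose support and confidence exceed user-chosen minimum thresholds, which is one way to filter out uninteresting patterns.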
Other Pattern Mining Issues
  Precise patterns vs. approximate patterns
    Association and correlation mining: possible to find sets of precise patterns
      But approximate patterns can be more compact and sufficient
      How to find high-quality approximate patterns?
    Gene sequence mining: approximate patterns are inherent
      How to derive efficient approximate pattern mining algorithms?
CRISP-DM
  Expanded: CRoss-Industry Standard Process for Data Mining
  Why Should There be a Standard Process?
    The data mining process must be reliable and repeatable by people with little data mining background.
Why Should There be a Standard Process?
  Framework for recording experience
    Allows projects to be replicated
  Aid to project planning and management
  "Comfort factor" for new adopters
    Demonstrates maturity of Data Mining
    Reduces dependency on "stars"
Process Standardization
  Initiative launched in late 1996 by three "veterans" of the data mining market:
    DaimlerChrysler (then Daimler-Benz), SPSS (then ISL), NCR
  Developed and refined through a series of workshops (from 1997 to 1999)
  Over 300 organizations contributed to the process model
  Published CRISP-DM 1.0 (1999)
  Over 200 members of the CRISP-DM SIG worldwide
    DM vendors: SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, etc.
    System suppliers/consultants: Cap Gemini, ICL Retail, Deloitte & Touche, etc.
    End users: BT, ABB, Lloyds Bank, AirTouch, Experian, etc.
CRISP-DM: Overview
CRISP-DM
  Non-proprietary
  Application/industry neutral
  Tool neutral
  Focus on business issues
    As well as technical analysis
  Framework for guidance
  Experience base
    Templates for analysis
CRISP-DM: Phases
  Business Understanding
    Project objectives and requirements understanding, data mining problem definition
  Data Understanding
    Initial data collection and familiarization, data quality problems identification
  Data Preparation
    Table, record and attribute selection, data transformation and cleaning
  Modeling
    Modeling techniques selection and application, parameters calibration
  Evaluation
    Business objectives and issues achievement evaluation
  Deployment
    Result model deployment, repeatable data mining process implementation
Phases and Tasks
  Business Understanding: Determine Business Objectives, Assess Situation
  Data Understanding: Collect Initial Data, Describe Data, Explore Data
  Data Preparation: Select Data, Clean Data, Construct Data, Integrate Data, Format Data
  Modeling: Build Model, Assess Model
  Evaluation: Evaluate Results, Review Process
  Deployment: Plan Deployment, Review Project
Phase 1. Business Understanding
  Statement of Business Objective
  Statement of Data Mining Objective
  Statement of Success Criteria
Phase 1. Business Understanding
  Determine business objectives
    Thoroughly understand, from a business perspective, what the client really wants to accomplish
    Uncover important factors, at the beginning, that can influence the outcome of the project
    Neglecting this step risks expending a great deal of effort producing the right answers to the wrong questions
  Assess situation
    More detailed fact-finding about all of the resources, constraints, assumptions and other factors that should be considered
    Flesh out the details
Phase 1. Business Understanding
  Determine data mining goals
    A business goal states objectives in business terminology
    A data mining goal states project objectives in technical terms
    Example:
      The business goal: Increase catalog sales to existing customers.
      A data mining goal: Predict how many music tracks a customer will buy, given their purchases over the past three years, demographic information (age, salary, city) and the price of the item.
Phase 2. Data Understanding
  Explore the Data
  Verify the Quality
  Find Outliers
Phase 2. Data Understanding
  Collect initial data
    Acquire within the project the data listed in the project resources
    Includes data loading if necessary for data understanding
    Possibly leads to initial data preparation steps
    If acquiring multiple data sources, integration is an additional issue, either here or in the later data preparation phase
  Describe data
    Examine the "gross" or "surface" properties of the acquired data
    Report on the results
Phase 2. Data Understanding
  Explore data
    Tackles the data mining questions that can be addressed using querying, visualization and reporting, including:
      Distribution of key attributes, results of simple aggregations
      Relations between pairs or small numbers of attributes
      Properties of significant sub-populations, simple statistical analyses
    May address directly the data mining goals
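The "explore data" questions (distribution of key attributes, simple aggregations, relations between pairs of attributes) can be illustrated with a short standard-library sketch. The records, field names and the choice of Pearson correlation as the pairwise measure are assumptions made for this example only.

```python
from collections import Counter, defaultdict

# Hypothetical customer records; field names are illustrative.
records = [
    {"city": "Oslo", "age": 34, "purchases": 5},
    {"city": "Oslo", "age": 51, "purchases": 2},
    {"city": "Bergen", "age": 28, "purchases": 7},
    {"city": "Bergen", "age": 45, "purchases": 3},
    {"city": "Oslo", "age": 39, "purchases": 4},
]

# Distribution of a key attribute.
city_counts = Counter(rec["city"] for rec in records)

# Simple aggregation: average purchases per city.
by_city = defaultdict(list)
for rec in records:
    by_city[rec["city"]].append(rec["purchases"])
avg_purchases = {city: sum(v) / len(v) for city, v in by_city.items()}

def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Relation between a pair of attributes: age vs. purchases.
r_age_purchases = pearson([rec["age"] for rec in records],
                          [rec["purchases"] for rec in records])
print(city_counts, avg_purchases, round(r_age_purchases, 3))
```

In this toy data the correlation is strongly negative (older customers buy less), the kind of observation that may feed directly into the data mining goals.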
Phase 3. Data Preparation
  Usually takes over 90% of the time
    Collection
    Assessment
    Consolidation and cleaning
    Data selection
    Transformations
Phase 3. Data Preparation
  Select data
    Decide on the data to be used for analysis
    Criteria include relevance to the data mining goals, quality, and technical constraints such as limits on data volume or data types
    Covers selection of attributes as well as selection of records in a table
  Clean data
    Raise the data quality to the level required by the selected analysis techniques
    May involve selection of clean subsets of the data, the insertion of suitable defaults, or more ambitious techniques such as the estimation of missing data by modeling
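The three cleaning options named above (clean subsets, suitable defaults, estimating missing data) can be sketched as follows. The records and field names are hypothetical, and the mean is used as the simplest possible stand-in for "estimation by modeling"; a real project might fit a regression or similar model instead.

```python
import statistics

# Hypothetical raw records; None marks a missing value.
raw = [
    {"age": 34, "city": "Oslo"},
    {"age": None, "city": "Bergen"},
    {"age": 46, "city": None},
    {"age": 40, "city": "Oslo"},
]

def clean(records, default_city="unknown"):
    """Fill missing ages with an estimate and missing cities with a default."""
    known_ages = [rec["age"] for rec in records if rec["age"] is not None]
    estimated_age = statistics.mean(known_ages)  # crude estimate of the missing value
    cleaned = []
    for rec in records:
        cleaned.append({
            "age": rec["age"] if rec["age"] is not None else estimated_age,
            "city": rec["city"] if rec["city"] is not None else default_city,
        })
    return cleaned

# Option 1: keep only the clean subset of the data.
complete_only = [rec for rec in raw if None not in rec.values()]
# Options 2 and 3: insert defaults and estimate missing values.
cleaned = clean(raw)
print(complete_only)
print(cleaned)
```

Which option is appropriate depends on the level of quality the selected analysis technique requires and on how much data can be discarded.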
Phase 3. Data Preparation
  Construct data
    Constructive data preparation operations such as the production of derived attributes, entire new records or transformed values for existing attributes
  Integrate data
    Methods whereby information is combined from multiple tables or records to create new records or values
  Format data
    Formatting transformations refer to primarily syntactic modifications made to the data that do not change its meaning, but might be required by the modeling tool
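A minimal sketch of the three tasks above, assuming two hypothetical tables (customers and orders): a derived attribute is constructed, the tables are integrated into one record per customer, and the result is formatted as CSV lines for a modeling tool. All names, values and the reference year are illustrative.

```python
# Hypothetical tables; all field names are illustrative.
customers = {
    1: {"name": "Ann", "birth_year": 1980},
    2: {"name": "Bo", "birth_year": 1995},
}
orders = [
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 2, "amount": 15.5},
]

def prepare(customers, orders, reference_year=2010):
    """Build one modeling record per customer."""
    rows = []
    for cid, cust in sorted(customers.items()):
        amounts = [o["amount"] for o in orders if o["customer_id"] == cid]
        rows.append({
            "customer_id": cid,
            "age": reference_year - cust["birth_year"],  # construct: derived attribute
            "total_spend": sum(amounts),                  # integrate: combine two tables
            "n_orders": len(amounts),
        })
    return rows

def to_csv_lines(rows):
    """Format: a purely syntactic change (fixed column order, two decimals)."""
    header = "customer_id,age,total_spend,n_orders"
    body = [
        f'{r["customer_id"]},{r["age"]},{r["total_spend"]:.2f},{r["n_orders"]}'
        for r in rows
    ]
    return [header] + body

lines = to_csv_lines(prepare(customers, orders))
print("\n".join(lines))
```

The formatting step does not change the meaning of the data; it only adapts the layout to what a (hypothetical) modeling tool expects.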
Phase 4. Modeling
  Select the modeling technique
    (based upon the data mining objective)
  Build model
    (parameter settings)
  Assess model (rank the models)
  Various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often necessary.
Phase 4. Modeling
  Select modeling technique
    Select the actual modeling technique that is to be used
      Example: decision tree, neural network
Phase 4. Modeling
  Build model
    Run the modeling tool on the prepared data set to create one or more models
  Assess model
    Interpret the models according to domain knowledge, the data mining success criteria and the desired test design
    Judge the success of the application of modeling and discovery techniques more technically
    Contact business analysts and domain experts later in order to discuss the data mining results in the business context
    This task considers only models, whereas the evaluation phase also takes into account all other results that were produced in the course of the project
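The build and assess steps above can be sketched with two deliberately tiny models: a majority-class baseline and a one-split decision tree (a "stump"). Both are hand-written stand-ins chosen to keep the example self-contained; the data, function names and the use of held-out accuracy as the ranking criterion are assumptions, not a prescribed test design.

```python
def fit_majority(train):
    """Baseline model: always predict the most common label."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def fit_stump(train):
    """One-split decision tree: pick the threshold with the best training accuracy."""
    best_t, best_acc = None, -1.0
    for t in sorted({x for x, _ in train}):
        acc = sum((x >= t) == y for x, y in train) / len(train)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return lambda x: x >= best_t

def accuracy(model, data):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

# Hypothetical data: (purchases last year, bought again this year?)
train = [(1, False), (2, False), (3, False), (6, True), (7, True), (9, True), (2, False)]
test = [(1, False), (8, True), (5, True), (3, False)]

# Build the models on the prepared training data, then rank them on held-out data.
models = {"majority": fit_majority(train), "stump": fit_stump(train)}
ranking = sorted(models, key=lambda name: accuracy(models[name], test), reverse=True)
print(ranking)
```

Ranking on data not used for fitting is one common test design; the ranking itself is a technical judgment, and the business context is considered later, in the evaluation phase.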
Phase 5. Evaluation
  Evaluation of model
    How well it performed on test data
  Methods and criteria
    Depend on model type
  Interpretation of model
    Important or not, easy or hard, depends on algorithm
Phase 5. Evaluation
  Evaluate results
    Assess the degree to which the model meets the business objectives
    Seek to determine if there is some business reason why this model is deficient
    Test the model(s) on test applications in the real application if time and budget constraints permit
    Also assess other data mining results generated
    Unveil additional challenges, information or hints for future directions
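"How well it performed on test data" can be quantified in several ways, depending on the model type. For a yes/no classifier, one common choice is a confusion matrix with accuracy, precision and recall, sketched below. The labels, predictions and the 0.7 precision threshold are hypothetical; in a real project the success criterion would come from the business understanding phase.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for boolean labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a and p for a, p in pairs)
    fp = sum((not a) and p for a, p in pairs)
    fn = sum(a and (not p) for a, p in pairs)
    tn = sum((not a) and (not p) for a, p in pairs)
    return tp, fp, fn, tn

# Hypothetical test-set labels and model predictions.
actual = [True, True, False, False, True, False, True, False]
predicted = [True, False, False, True, True, False, True, False]

tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

# Hypothetical success criterion fixed during business understanding.
meets_criterion = precision >= 0.7
print(accuracy, precision, recall, meets_criterion)
```

Comparing the measured value against a criterion stated up front ties the technical result back to the business objective.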
Phase 5. Evaluation
  Review process
    Do a more thorough review of the data mining engagement in order to determine if there is any important factor or task that has somehow been overlooked
    Review the quality assurance issues
      Example: Did we correctly build the model?
  Determine next steps
    Decide how to proceed at this stage
    Decide whether to finish the project and move on to deployment if appropriate, or whether to initiate further iterations or set up new data mining projects
    Include analyses of remaining resources and budget that influence the decisions
Phase 6. Deployment
  Determine how the results need to be utilized
    Who needs to use them?
    How often do they need to be used?
  Deploy Data Mining results by
    Scoring a database, utilizing results as business rules, interactive scoring on-line
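Scoring a database, one of the deployment options named for Phase 6, can be sketched with Python's built-in sqlite3 module. The table layout, the in-memory database and the threshold rule standing in for a mined model are all hypothetical.

```python
import sqlite3

def score(purchases_last_year):
    """Stand-in for a mined model, expressed as a business rule (hypothetical threshold)."""
    return 1.0 if purchases_last_year >= 6 else 0.0

# In-memory database with a hypothetical customers table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, purchases INTEGER, score REAL)"
)
conn.executemany(
    "INSERT INTO customers (id, purchases) VALUES (?, ?)",
    [(1, 2), (2, 8), (3, 6)],
)

# Batch scoring: compute a score for every row and write it back.
rows = conn.execute("SELECT id, purchases FROM customers").fetchall()
conn.executemany(
    "UPDATE customers SET score = ? WHERE id = ?",
    [(score(purchases), cid) for cid, purchases in rows],
)
conn.commit()

scored = conn.execute("SELECT id, score FROM customers ORDER BY id").fetchall()
print(scored)
```

The same score function could instead be called per request for interactive on-line scoring; which form fits depends on who needs the results and how often.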
Phase 6. Deployment
  Plan deployment
    In order to deploy the data mining result(s) into the business, this task takes the evaluation results and determines a strategy for deployment
    Document the procedure for later deployment
Phase 6. Deployment
  Produce final report
    The project leader and the team write up a final report
    May be only a summary of the project and its experiences
    May be a final and comprehensive presentation of the data mining result(s)