of Data Mining
Week 1
Topics
Introduction
Syllabus
Data Mining Concepts
Team Organization
Introduction Session
Your name and major
The definition of data mining
Your expectation from this course
Course Syllabus
Data Mining Applications
Source: www.kdnuggets.com
Application                     Percentage
Banking                         13
Bioinformatics/biotech          10
Direct marketing/fundraising    10
Fraud detection
Scientific data
Insurance
Telecommunication
Medical/pharmaceuticals
Retail
eCommerce/Web
Other
Investment/stocks
Manufacturing
Security
Supply chain analysis
Travel
Entertainment
Newsweek, May 22, 2006
Figure 9.14 A chemical database.
Chemistry Informatics
Source: Cover page of Advances in Knowledge Discovery and Data Mining, edited by U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, MIT Press
http://www.sims.berkeley.edu/research/projects/how-much-info-2003/
DATA MINING and related disciplines:
STATISTICS & MATH
MACHINE LEARNING
INFORMATION RETRIEVAL
INFORMATION THEORY
OTHER DISCIPLINES
Data Mining is a Process of Knowledge Discovery
Figure 1.4 Data mining as a step in the process of knowledge discovery
Figure labels: Knowledge Base; Database or Data Warehouse Server; data cleaning, integration, and selection; Database; Data Warehouse; World-Wide Web; Other Info Repositories
Data Mining and Stakeholders
Increasing potential to support business decisions
Levels (top to bottom): Making Decisions; Data Presentation (Visualization Techniques); Data Mining (Knowledge Discovery); Data Exploration (Statistical Analysis, Querying and Reporting)
Stakeholders (top to bottom): End User; Business Analyst; Data Analyst; DBA
Structured
Semi-structured
Unstructured
Structured Data
Data is organized in semantic entities
Similar entities are grouped together (relations or classes)
Entities in the same group have the same descriptions (attributes, features)
Semi-structured Data (1)
Semi-structured data are organized in semantic entities
Similar entities are grouped together
Entities in the same group may not have the same attributes
Semi-structured Data (2)
Attributes
Order of attributes is not necessarily important
Not all attributes may be required
Size of the same attributes in a group may differ
Type of the same attributes in a group may differ
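The contrast between structured and semi-structured groups can be sketched in a few lines of Python. This is only an illustration: the records, the `phone` attribute, and the helper names `shared_schema` and `attribute_types` are invented here and are not part of the course material.

```python
# Hypothetical records, invented for illustration only.
structured = [
    {"name": "Hayes", "city": "Harrison"},
    {"name": "Jones", "city": "Brooklyn"},
]
semi_structured = [
    {"name": "Hayes", "city": "Harrison"},
    # Different attribute order and an extra attribute.
    {"city": "Brooklyn", "name": "Jones", "phone": 5551234},
    # Missing "city"; "phone" stored as text instead of a number.
    {"name": "Smith", "phone": "555-9876"},
]


def shared_schema(records):
    """Return True when every record has exactly the same attribute names."""
    key_sets = {frozenset(record) for record in records}
    return len(key_sets) == 1


def attribute_types(records, attribute):
    """Collect the type names under which an attribute appears in a group."""
    return {type(r[attribute]).__name__ for r in records if attribute in r}


print(shared_schema(structured))
print(shared_schema(semi_structured))
print(attribute_types(semi_structured, "phone"))
```

In the structured group every entity carries the same attributes, so a single schema describes the group. In the semi-structured group, attributes may be missing, reordered, or stored under different types.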
XML
<bank1>
  <customer>
    <customer_name>Hayes</customer_name>
    <customer_street>Main</customer_street>
    <customer_city>Harrison</customer_city>
    <account>
      <account_number>A102</account_number>
      <branch_name>Perryridge</branch_name>
      <balance>400</balance>
    </account>
    <account>
    </account>
  </customer>
  .
  .
</bank1>
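A minimal sketch of reading this XML with Python's standard `xml.etree.ElementTree` module. The string below copies the slide's example with the elision dots left out so that it parses. The second, empty `<account>` shows the semi-structured property that an entity may lack attributes its siblings have. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The slide's example, without the elision dots, so that it is well-formed.
BANK_XML = """
<bank1>
  <customer>
    <customer_name>Hayes</customer_name>
    <customer_street>Main</customer_street>
    <customer_city>Harrison</customer_city>
    <account>
      <account_number>A102</account_number>
      <branch_name>Perryridge</branch_name>
      <balance>400</balance>
    </account>
    <account>
    </account>
  </customer>
</bank1>
"""

root = ET.fromstring(BANK_XML)
accounts = []
for customer in root.findall("customer"):
    name = customer.findtext("customer_name")
    for account in customer.findall("account"):
        # findtext returns None when the child element is absent,
        # which is how the empty second account shows up here.
        accounts.append({
            "customer": name,
            "account_number": account.findtext("account_number"),
            "balance": account.findtext("balance"),
        })

for row in accounts:
    print(row)
```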
Unstructured Data
Masses of computerized data which do not have a data structure which is easily readable by a machine
Merrill Lynch estimates that more than 85 percent of all business information exists as unstructured data, commonly appearing in emails, memos, notes from call centers and support operations, news, user groups, chats, reports, letters, surveys, white papers, marketing material, research, presentations and Web pages.
DM Review Magazine, February 2003 Issue
Numeric and categorical
Quantitative and qualitative
Nominal and ordinal
Static and dynamic (temporal)
Numeric data
Real number data, integer number data
Properties
Order relation (2 < 5)
Distance relation (d(2.3, 4.2) = 1.9)
Equality relation (2 = 2)
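The three properties of numeric data can be checked directly in Python. This sketch assumes the distance d is the absolute difference, which matches the slide's example d(2.3, 4.2) = 1.9. The comparison uses `math.isclose` because floating-point subtraction is not exact.

```python
import math


def distance(a, b):
    """Absolute difference, assumed here as the distance d on numbers."""
    return abs(a - b)


order_holds = 2 < 5                       # order relation
equality_holds = (2 == 2)                 # equality relation
d = distance(2.3, 4.2)                    # distance relation
distance_matches = math.isclose(d, 1.9)   # compare with a tolerance

print(order_holds, equality_holds, distance_matches)
```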
Categorical (symbolic) values
Equality relation
Blue = Blue or Red <> Blue
Categorical values can be converted to numeric values
Gender (male, female) -> (0, 1)
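A small sketch of the conversion from categorical to numeric values, using the slide's mapping of (male, female) to (0, 1). The helper name `encode` and the sample list are hypothetical. The codes are labels only, so arithmetic on them carries no meaning.

```python
def encode(values, categories):
    """Map each categorical value to the position of its category."""
    codes = {category: code for code, category in enumerate(categories)}
    return [codes[value] for value in values]


# Hypothetical sample column, invented for illustration.
genders = ["male", "female", "female", "male"]
gender_codes = encode(genders, ["male", "female"])

# Only the equality relation is defined on categorical values.
same_colour = ("Blue" == "Blue")
different_colour = ("Red" != "Blue")

print(gender_codes, same_colour, different_colour)
```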
Qualitative data
Nominal
Ordinal
Nominal Data
Utility customer type (residential, commercial, industrial, governmental)
Use different symbols, characters, and numbers
These values can be coded alphabetically as A, B, and C, or numerically as 1, 2, and 3
Orderless
Ordinal Data
The rank of the student in a class
An ordinal variable is a categorical variable for which an order relation is defined but not a distance relation
The ordered scale need not be linear; the difference between the 4th and 5th students can differ from that between the 14th and 15th students
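The point about ranks can be illustrated with a sketch. The student names and scores below are hypothetical. Ranks preserve the order of the underlying scores, but equal steps in rank do not correspond to equal differences in score.

```python
# Hypothetical scores, invented for illustration only.
scores = {"Ann": 91.0, "Bob": 90.5, "Cy": 84.0, "Dee": 83.0, "Eli": 70.0}

# Rank 1 is the highest score.
ranked = sorted(scores, key=scores.get, reverse=True)
rank = {student: position for position, student in enumerate(ranked, start=1)}


def score_gap(higher_rank):
    """Score difference between the student at higher_rank and the next one."""
    return scores[ranked[higher_rank - 1]] - scores[ranked[higher_rank]]


# Order relation: comparing ranks is meaningful.
ann_ahead_of_bob = rank["Ann"] < rank["Bob"]

# No distance relation: one step in rank hides very different gaps.
print(score_gap(1), score_gap(4))
```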
Static data
Attribute values do not change with time
Dynamic data
Attribute values change with time
Data Repositories
Transactional database
Relational database
Data warehouse
Advanced database
Data stream
The World Wide Web
Transactional Database
TID     List of item_IDs
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800
T900    I1, I2, I3
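One way to hold such a transactional table in memory is a mapping from TID to a set of item IDs. The sketch below counts how often items and item combinations occur. T800 is left out because its item list is not shown in the table above. The helper name `support_count` is an illustrative choice, not something defined in the course material.

```python
from collections import Counter

# Transactions as listed in the table; T800 is omitted because its
# item list is missing there.
transactions = {
    "T100": {"I1", "I2", "I5"},
    "T200": {"I2", "I4"},
    "T300": {"I2", "I3"},
    "T400": {"I1", "I2", "I4"},
    "T500": {"I1", "I3"},
    "T600": {"I2", "I3"},
    "T700": {"I1", "I3"},
    "T900": {"I1", "I2", "I3"},
}

# Number of transactions that contain each single item.
item_counts = Counter(item for items in transactions.values() for item in items)


def support_count(itemset):
    """Number of transactions that contain every item in itemset."""
    wanted = set(itemset)
    return sum(1 for items in transactions.values() if wanted <= items)


print(sorted(item_counts.items()))
print(support_count({"I1", "I2"}))
```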
Figure 1.6. Fragments of Relations From a Relational Database for AllElectronics
Table 3.3 A 3-D view of sales data for AllElectronics, according to the dimensions time, item, and location. The measure displayed is dollar_sold (in thousands).
Figure 3.1 A 3-D data cube representation of the data in Table 3.3,
according to the dimensions time, item, and location. The measure
displayed is dollar_sold (in thousands).
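A data cube of this kind can be sketched as a mapping from a (time, item, location) key to the measure. All cell values, item names, and city names below are hypothetical placeholders, not the figures from Table 3.3. The helper `aggregate` is an illustrative name. Summing over the dimensions that are not kept gives a coarser view of the same data.

```python
from collections import defaultdict

DIMENSIONS = ("time", "item", "location")

# (time, item, location) -> dollar_sold in thousands.
# Hypothetical values, NOT the numbers from Table 3.3.
cube = {
    ("Q1", "computer", "Vancouver"): 10,
    ("Q1", "phone", "Vancouver"): 4,
    ("Q1", "computer", "Toronto"): 7,
    ("Q2", "computer", "Vancouver"): 12,
    ("Q2", "phone", "Toronto"): 5,
}


def aggregate(cells, keep):
    """Sum the measure over every dimension not listed in keep."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[tuple(key[p] for p in positions)] += value
    return dict(totals)


print(aggregate(cube, ("time",)))             # totals per quarter
print(aggregate(cube, ("item", "location")))  # totals per item and city
print(aggregate(cube, ()))                    # grand total
```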
Figure 3.10. Examples of Typical OLAP operations on multidimensional data cube, commonly used for data warehousing
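Three common OLAP operations (slice, dice, and roll-up) can be sketched on a small cube. The figure itself is not reproduced here, so the choice of these three operations, the cell values, the city-to-country hierarchy, and the function names are assumptions made for illustration.

```python
from collections import defaultdict

DIMENSIONS = ("time", "item", "location")

# Hypothetical cells: (time, item, location) -> dollar_sold in thousands.
cube = {
    ("Q1", "computer", "Vancouver"): 10,
    ("Q1", "phone", "Chicago"): 4,
    ("Q2", "computer", "Toronto"): 7,
    ("Q2", "computer", "Vancouver"): 5,
}
# Hypothetical concept hierarchy for the location dimension.
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}


def slice_cube(cells, dimension, value):
    """Slice: fix one dimension to a single value and drop it from the key."""
    p = DIMENSIONS.index(dimension)
    return {k[:p] + k[p + 1:]: v for k, v in cells.items() if k[p] == value}


def dice(cells, **allowed):
    """Dice: keep cells whose value on each named dimension is allowed."""
    checks = [(DIMENSIONS.index(d), set(vals)) for d, vals in allowed.items()]
    return {k: v for k, v in cells.items()
            if all(k[p] in vals for p, vals in checks)}


def roll_up_location(cells, hierarchy):
    """Roll-up: climb from city to country, summing the measure."""
    totals = defaultdict(int)
    for (time, item, city), value in cells.items():
        totals[(time, item, hierarchy[city])] += value
    return dict(totals)


print(slice_cube(cube, "time", "Q1"))
print(dice(cube, location=["Vancouver"]))
print(roll_up_location(cube, CITY_TO_COUNTRY))
```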
Advanced Databases
Object-relational databases
Temporal databases
Sequence databases
Time-series databases
Spatial databases
Spatiotemporal databases
Text databases
Heterogeneous databases
Data Streams
The features of data stream: huge or possibly infinite volume, dynamically changing, flowing in and out in a fixed order, allowing only one or a small number of scans, and demanding fast (often real-time) response time
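The constraint of a single scan can be illustrated with a one-pass running mean and variance. This sketch uses Welford's update, a standard technique chosen here for illustration and not named in the course material. The stream values and the class name `RunningStats` are hypothetical.

```python
class RunningStats:
    """Mean and population variance maintained in a single pass."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new value into the statistics without storing it."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / self.count if self.count else 0.0


def readings():
    """Hypothetical stream; a generator can be consumed only once."""
    for value in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
        yield value


stats = RunningStats()
for x in readings():
    stats.update(x)  # constant memory per element, no second scan

print(stats.count, stats.mean, stats.variance)
```

Each value is seen once and then discarded, so memory use stays constant however long the stream runs.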
The World Wide Web
The WWW serves as a huge, distributed, global information service center for news, advertisements, consumer information, financial management, education, government, e-commerce, and many other information services
The challenges for KD
Size
Complexity
Dynamic
Diversity
Relevance
Lab Activities
Introduction to R
Organize your team
Each team consists of three (four) students
Email your team information (names and email addresses) to the instructor by the end of today's lab session
Read chapter 2 of the lecture textbook and do team homework assignment #1
Read chapters 1, 2 and 3 of the lab textbook
Brainstorm on the topic of your group project
Data types and data repositories (Section 1.3)
Data preprocessing (Ch. 2)