
Chapter 1: Overview of Data Mining
Semester 1, 2011-2012
Faculty of Computer Science & Engineering
Ho Chi Minh City University of Technology
Graduate Program in Computer Science
Electronic course notes
Compiled by: Dr. Võ Thị Ngọc Châu (chauvtn@cse.hcmut.edu.vn)

References
[1] Jiawei Han, Micheline Kamber, Data Mining: Concepts and
Techniques, Second Edition, Morgan Kaufmann Publishers, 2006.
[2] David Hand, Heikki Mannila, Padhraic Smyth, Principles of Data
Mining, MIT Press, 2001.
[3] David L. Olson, Dursun Delen, Advanced Data Mining
Techniques, Springer-Verlag, 2008.
[4] Graham J. Williams, Simeon J. Simoff, Data Mining: Theory,
Methodology, Techniques, and Applications, Springer-Verlag, 2006.
[5] Hillol Kargupta, Jiawei Han, Philip S. Yu, Rajeev Motwani, and
Vipin Kumar, Next Generation of Data Mining, Taylor & Francis
Group, LLC, 2009.
[6] Daniel T. Larose, Data Mining Methods and Models, John Wiley
& Sons, Inc, 2006.
[7] Ian H. Witten, Eibe Frank, Data Mining: Practical Machine
Learning Tools and Techniques, Second Edition, Elsevier Inc, 2005.
[8] Florent Masseglia, Pascal Poncelet & Maguelonne Teisseire,
Successes and New Directions in Data Mining, IGI Global, 2008.
[9] Oded Maimon, Lior Rokach, Data Mining and Knowledge
Discovery Handbook, Second Edition, Springer Science + Business
Media, LLC 2005, 2010.

Contents
Chapter 1: Overview of data mining
Chapter 2: Data preprocessing issues
Chapter 3: Data regression
Chapter 4: Data classification
Chapter 5: Data clustering
Chapter 6: Association rules
Chapter 7: Data mining and database technology
Chapter 8: Applications of data mining
Chapter 9: Research topics in data mining
Chapter 10: Review

Chapter 1: Overview of data mining
1.0. Motivating scenarios
1.1. The knowledge discovery process
1.2. Concepts
1.3. Significance and role of data mining
1.4. Applications of data mining
1.5. Summary

1.0. Scenario 1
Is the person using card ID = 1234 really the card's owner, or a thief?
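
1.0. Scenario 2

| Tid | Refund | Marital Status | Taxable Income | Evade |
|-----|--------|----------------|----------------|-------|
| 1   | Yes    | Single         | 125K           | No    |
| 2   | No     | Married        | 100K           | No    |
| 3   | No     | Single         | 70K            | No    |
| 4   | Yes    | Married        | 120K           | No    |
| 5   | No     | Divorced       | 95K            | Yes   |
| 6   | No     | Married        | 60K            | No    |
| 7   | Yes    | Divorced       | 220K           | No    |
| 8   | No     | Single         | 85K            | Yes   |
| 9   | No     | Married        | 75K            | No    |
| 10  | No     | Single         | 90K            | Yes   |

Is Mr. A (Tid = 100) likely to evade tax?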
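
As a first, purely descriptive look at this scenario, the sketch below counts how often Evade = Yes occurs for each value of Refund and of Marital Status in the ten records above. It is an illustration only, not the method the course goes on to use: the names `RECORDS` and `evade_rate_by` are hypothetical, and Mr. A's own attribute values are not given on the slide, so no prediction is made for him.

```python
from collections import defaultdict

# Rows transcribed from the Scenario 2 table:
# (Tid, Refund, Marital Status, Taxable Income in thousands, Evade)
RECORDS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]


def evade_rate_by(column_index):
    """Return {attribute value: (number of evaders, number of records)}."""
    counts = defaultdict(lambda: [0, 0])
    for row in RECORDS:
        key = row[column_index]
        counts[key][1] += 1          # one more record with this value
        if row[4] == "Yes":
            counts[key][0] += 1      # one more evader with this value
    return {key: tuple(value) for key, value in counts.items()}


by_refund = evade_rate_by(1)   # group by the Refund column
by_status = evade_rate_by(2)   # group by the Marital Status column
print(by_refund)
print(by_status)
```

In this small sample nobody who received a refund evaded, while 3 of the 7 people without a refund did. Ten records are far too few to draw conclusions from, which is exactly why the later chapters introduce proper classification models.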

1.0. Scenario 3
Will STB stock rise tomorrow?
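
1.0. Scenario 4

| Student ID (MSV) | Cohort | Course 1 | Course 2 | Graduated |
|------------------|--------|----------|----------|-----------|
| 1                | 2004   | 9.0      | 8.5      | Yes       |
| 2                | 2004   | 6.5      | 8.0      | Yes       |
| 3                | 2004   | 4.0      | 2.5      | No        |
| 14               | 2004   | 5.0      | 5.5      | Yes       |
| 8                | 2004   | 5.5      | 3.5      | No        |
| 90               | 2005   | 7.0      | 6.0      | Yes (80%) |
| 24               | 2006   | 9.5      | 7.5      | Yes (90%) |
| 82               | 2007   | 5.5      | 4.5      | No (45%)  |
| 47               | 2008   | 2.0      | 3.0      | No (97%)  |

How can we determine the likelihood that a current student will graduate?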

1.0. Motivating scenarios
We are data rich, but information poor.
Necessity is the mother of invention. - Plato

1.1. The knowledge discovery process
[Figure: the knowledge discovery process. Data Sources -> Data Cleaning and Data Integration -> Data Warehouse -> Selection/Transformation -> Task-relevant Data -> Data Mining -> Patterns -> Pattern Evaluation/Presentation]

1.1. The knowledge discovery process
Knowledge discovery in databases is the nontrivial
process of identifying valid, novel, potentially useful,
and ultimately understandable patterns in data.
Frawley, W. J et al. (1991). Knowledge discovery in
databases: an overview.
Knowledge discovery from databases is the
process of using the database along with any
required selection, preprocessing, sub-sampling, and
transformations of it; to apply data mining methods
(algorithms) to enumerate patterns from it; and to
evaluate the products of data mining to identify the
subset of the enumerated patterns deemed
knowledge.
Fayyad, U.M et al. (1996). Advances in Knowledge Discovery
and Data Mining. MIT Press.

1.1. The knowledge discovery process
The knowledge discovery process is an iterative sequence of the following steps:
- Data cleaning
- Data integration
- Data selection
- Data transformation
- Data mining
- Pattern evaluation
- Knowledge presentation
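
The seven steps can be sketched as a chain of small functions. This is a toy illustration under stated assumptions, not a real KDD system: the two in-memory sources, the field names, the income cut-off of 100 and the `min_count` threshold are all invented for the example, and the "mining" step is a plain frequency count.

```python
from collections import Counter


def clean(records):
    # Data cleaning: drop records that contain missing values.
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    # Data integration: merge several sources into one collection.
    return [r for source in sources for r in source]


def select(records, fields):
    # Data selection: keep only the task-relevant fields.
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    # Data transformation: discretize income into two bands.
    return [
        {"status": r["status"],
         "income_band": "high" if r["income"] >= 100 else "low"}
        for r in records
    ]


def mine(records):
    # Data mining: count how often each (status, band) combination occurs.
    return Counter((r["status"], r["income_band"]) for r in records)


def evaluate(patterns, min_count):
    # Pattern evaluation: keep patterns that reach the threshold.
    return {p: c for p, c in patterns.items() if c >= min_count}


def present(patterns):
    # Knowledge presentation: render the surviving patterns as sentences.
    return [f"{status} & {band} income: {count} records"
            for (status, band), count in sorted(patterns.items())]


source_a = [
    {"id": 1, "status": "Single", "income": 125},
    {"id": 2, "status": "Married", "income": 100},
    {"id": 3, "status": "Single", "income": None},   # removed by cleaning
]
source_b = [
    {"id": 4, "status": "Married", "income": 120},
    {"id": 5, "status": "Single", "income": 70},
    {"id": 6, "status": "Married", "income": 60},
]

data = integrate(clean(source_a), clean(source_b))
data = transform(select(data, ["status", "income"]))
knowledge = present(evaluate(mine(data), min_count=2))
print(knowledge)
```

The process is iterative: if the evaluation step leaves nothing of interest, one would typically go back and change the selection, the transformation, or the threshold, then run the chain again.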

1.1. The knowledge discovery process
The knowledge discovery process is an iterative sequence of steps carried out on:
- Data sources
- Data warehouse
- Task-relevant data (the specific data to be mined)
- Patterns (the results of data mining)
- Knowledge (the knowledge obtained)

1.1. The knowledge discovery process
[Figure: layered pyramid; the potential to support business decisions increases toward the top. Roles alongside, from top to bottom: End User, Business Analyst, Data Analyst, DBA. Layers from top to bottom:]
- Making Decisions
- Data Presentation: Visualization Techniques
- Data Mining: Information Discovery
- Data Exploration: Statistical Analysis, Querying and Reporting
- Data Warehouses / Data Marts: OLAP, MDA
- Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

1.2. Concepts
1.2.1. Data mining
1.2.2. Data mining tasks/functions
1.2.3. Data mining processes
1.2.4. Data mining systems

1.2.1. Data mining
Data mining is:
- A process of extracting knowledge from large amounts of data
  - "extracting or mining knowledge from large amounts of data"
  - "knowledge mining from data"
- A nontrivial process of extracting implicit, useful, previously unknown information from data
  - "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data"
Terms often used interchangeably: knowledge discovery/mining in data/databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence

1.2.1. Data mining
Large amounts of data available for mining
- Any kind of data, stored or transient, structured, semi-structured, or unstructured
- Stored data
  - Traditional files (flat files)
  - Relational databases or object-relational databases
  - Transactional databases or data warehouses
  - Application-oriented databases: spatial databases, temporal databases, spatio-temporal databases, time series databases, text databases, multimedia databases, ...
  - Information repositories: the World Wide Web, ...
- Transient data: data streams

1.2.1. Data mining
Knowledge obtained from the mining process
- Class/concept descriptions (characterization and discrimination)
- Frequent patterns, associations/correlations
- Classification and prediction models
- Clustering models
- Outliers
- Trends or regularities of objects whose behavior changes over time
- ...

1.2.1. Data mining
Knowledge obtained from the mining process
- The knowledge obtained may be descriptive or predictive, depending on the specific mining process.
  - Descriptive: characterizes the general properties of the mined data (Scenario 1)
  - Predictive: performs inference on the current data in order to make predictions (Scenarios 2, 3, and 4)
- The knowledge obtained may be structured, semi-structured, or unstructured.
- The knowledge obtained may or may not be of interest to the user, hence the need for measures that evaluate it.
- The knowledge obtained can be used for decision support, process control, information management, query processing, ...

1.2.1. Data mining
[Figure: kinds of knowledge obtained from mining; surviving labels: "characterization and discrimination" and "trends, regularities, ..."]

1.2.1. Data mining
Data mining is an interdisciplinary field, a confluence of many theories and technologies.
[Figure: data mining as a confluence of multiple disciplines. Data Mining at the center, surrounded by Statistics, Machine Learning, Database Technology, Visualization, and Other Disciplines]

1.2.1. Data mining
Data mining and database technology
- What database technology can contribute
  - Database technology manages the data being mined.
    - The data is very large and may exceed the capacity of main memory.
    - The data is collected over time.
  - Database systems process large amounts of data efficiently, using paging and swapping to move data into and out of main memory.
  - Modern database systems can handle many complex data types (spatial, temporal, spatiotemporal, multimedia, text, Web, ...).
  - Other capabilities of database systems (concurrency control, security, performance, optimization, ...) are already well developed.

1.2.1. Data mining
Data mining and database technology
- What database technology contributes today
  - Database management systems (DBMS) that support data mining
    - Oracle Data Mining (Oracle 9i, 10g, 11g)
    - Microsoft's data mining tools (MS SQL Server 2000, 2005, 2008)
    - Intelligent Miner (IBM)
  - Inductive database systems that support knowledge discovery
  - The SQL/MM Part 6: Data Mining standard, ISO/IEC 13249-6:2006, which supports data mining
    - It specifies an SQL interface for data mining applications and services over relational databases.

1.2.1. Data mining
Data mining and statistical theory
[Figure: Statistics splits into two branches]
- Descriptive Statistics: describing the data
- Inductive Statistics: prediction and inference, for example: do two sample data sets follow the same distribution?

1.2.1. Data mining
Data mining and machine learning
[Figure: Machine Learning divided into Supervised, Unsupervised, and Reinforcement learning; additional label: "Natural groupings"]

1.2.1. Data mining
Data mining and visualization
- Data: 3D cubes, distribution charts, curves, surfaces, link graphs, image frames and movies, parallel coordinates
- Results (knowledge): pie charts, scatter plots, box plots, association rules, parallel coordinates, dendrograms, temporal evolution
[Figure panels: Pie chart, Parallel coordinates, Temporal evolution]

1.2.1. Data mining
Data mining and visualization
[Figure: Feature Selection produces a Mean Feature Image]
[Figure: class labeling. Isodata (K-means) Clustering turns the Mean Feature Image into a Label Image]

1.2.2. Data mining tasks
- Mining class/concept descriptions (data characterization and discrimination)
- Mining association/correlation rules
- Classification
- Prediction
- Clustering
- Trend analysis
- Deviation and outlier analysis
- Similarity analysis
- ...
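
1.2.2. Data mining tasks

| Tid | Refund | Marital Status | Taxable Income | Cheat |
|-----|--------|----------------|----------------|-------|
| 1   | Yes    | Single         | 125K           | No    |
| 2   | No     | Married        | 100K           | No    |
| 3   | No     | Single         | 70K            | No    |
| 4   | Yes    | Married        | 120K           | No    |
| 5   | No     | Divorced       | 95K            | Yes   |
| 6   | No     | Married        | 60K            | No    |
| 7   | Yes    | Divorced       | 220K           | No    |
| 8   | No     | Single         | 85K            | Yes   |
| 9   | No     | Married        | 75K            | No    |
| 10  | No     | Single         | 90K            | Yes   |
| 11  | No     | Married        | 60K            | No    |
| 12  | Yes    | Divorced       | 220K           | No    |
| 13  | No     | Single         | 85K            | Yes   |
| 14  | No     | Married        | 75K            | No    |
| 15  | No     | Single         | 90K            | Yes   |

[Figure: the data table at the center, surrounded by the task types Classification, Clustering, Association Rules, Anomaly Detection, and others; additional surviving labels: Milk, Data]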

1.2.2. Data mining tasks
Five basic primitives for specifying a data mining task:
- Task-relevant data
- Kind of knowledge to be mined
- Background knowledge
- Interestingness measures
- Pattern visualization and knowledge presentation techniques

1.2.2. Data mining tasks
Task-relevant data
- The portion of the source data that is of interest
- Corresponds to the attributes or dimensions of interest
- Includes: the name of the data warehouse/database, the data tables or data cubes, the data selection conditions, the attributes or dimensions of interest, and the data grouping criteria

1.2.2. Data mining tasks
Kind of knowledge to be mined
- Includes: data characterization, data discrimination, association or correlation analysis models, classification models, prediction models, clustering models, outlier analysis models, evolution analysis models
- Corresponds to the specific data mining task to be performed

1.2.2. Data mining tasks
Background knowledge
- Corresponds to the specific domain to be mined
- Guides the knowledge discovery process
  - Supports mining at multiple levels of abstraction
  - Helps evaluate the patterns found
- Includes: concept hierarchies and user beliefs about the relationships in the data

1.2.2. Data mining tasks
Interestingness measures
- Usually accompanied by thresholds
- Guide the mining process or evaluate the patterns found
- Correspond to the kind of knowledge to be mined and therefore to the specific data mining task to be performed
- Check for: simplicity, certainty, utility, novelty
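
To make the pairing of measures and thresholds concrete, the sketch below uses two measures commonly applied to association rules: support (one way to quantify utility) and confidence (one way to quantify certainty). The slide does not name these measures, so treat them, the toy transactions, and the threshold values as assumptions chosen for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions.

    Assumes the antecedent occurs at least once, otherwise this divides by zero.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market-basket data, one set of items per transaction.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Hypothetical thresholds attached to the two measures.
MIN_SUPPORT = 0.5
MIN_CONFIDENCE = 0.6

rule_support = support(TRANSACTIONS, {"bread", "milk"})
rule_confidence = confidence(TRANSACTIONS, {"bread"}, {"milk"})
is_interesting = rule_support >= MIN_SUPPORT and rule_confidence >= MIN_CONFIDENCE
print(rule_support, rule_confidence, is_interesting)
```

Here the rule bread -> milk appears in 2 of 4 transactions (support 0.5) and holds in 2 of the 3 transactions containing bread (confidence about 0.67), so it passes both thresholds. Raising either threshold would prune it.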

1.2.2. Data mining tasks
Pattern visualization and knowledge presentation techniques
- Determine the form in which the discovered patterns/knowledge are presented to the user
- Include: rules, tables, reports, charts, graphs, trees, and cubes

1.2.2. Data mining tasks
Data mining
- Classification
  - Classification with decision trees
  - Classification with Bayesian networks
  - ...
- Clustering
  - k-means clustering
  - Agglomerative hierarchical clustering
  - ...
- Association rule mining
  - The Apriori algorithm
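
As a small taste of one of the algorithms named above, here is a minimal k-means sketch for one-dimensional points. It is a teaching sketch under simplifying assumptions (1-D data, caller-supplied initial centers, absolute distance), not a production implementation; the function names `assign` and `kmeans_1d` are made up for this example.

```python
def assign(points, centers):
    """Put each point into the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans_1d(points, initial_centers, max_iterations=100):
    """Alternate assignment and center update until the centers stop moving."""
    centers = list(initial_centers)
    for _ in range(max_iterations):
        clusters = assign(points, centers)
        # New center = mean of the cluster; an empty cluster keeps its old center.
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


centers, clusters = kmeans_1d([1, 2, 3, 10, 11, 12], initial_centers=[1, 12])
print(centers, clusters)
```

On this toy input the two centers settle at 2.0 and 11.0 after one update. The result of k-means generally depends on the initial centers, which is one reason the algorithm is usually run several times in practice.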



1.2.2. Data mining tasks
[Figure: Data Mining contains Data Mining Tasks, each carried out by one or more Algorithms; Task-relevant Data goes in and Interesting Patterns (Knowledge) come out]

1.2.2. Data mining tasks
Four basic components of a data mining algorithm:
- Model or pattern structure
- Score function
- Optimization and search method
- Data management strategy

1.2.2. Data mining tasks
Model or pattern structure
- A model is a high-level, global description of the data set.
- A pattern is a local feature of the data that holds only for a few records/objects or a few variables.
- A structure expresses a general functional form whose parameter values are not yet determined.
- A model structure is a global summary of the data.
  - Example: Y = aX + b is a model structure, and Y = 3X + 2 is a specific model defined from this structure.
- A pattern structure involves only a relatively small part of the data or of the data space.
  - Example: p(Y>y1|X>x1) = p1 is a pattern structure, and p(Y>5|X>10) = 0.5 is a pattern determined from this structure.
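
The distinction between a structure and an instance of it can be shown in code: the structure is a function with free parameters, and fitting fills them in from data. The sketch below fits the model structure Y = aX + b by ordinary least squares and estimates the pattern structure p(Y>y1|X>x1) by counting. The data points and the thresholds x1 = 1, y1 = 10 are invented for the example.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a and b in the model structure Y = aX + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def conditional_probability(xs, ys, x1, y1):
    """Empirical estimate of p(Y > y1 | X > x1); None if no record has X > x1."""
    selected = [y for x, y in zip(xs, ys) if x > x1]
    if not selected:
        return None
    return sum(y > y1 for y in selected) / len(selected)


# Hypothetical data lying exactly on the line Y = 3X + 2.
xs = [0, 1, 2, 3, 4]
ys = [2, 5, 8, 11, 14]

a, b = fit_line(xs, ys)                              # a specific model
p = conditional_probability(xs, ys, x1=1, y1=10)     # a specific pattern
print(a, b, p)
```

The fitted model describes every record (global), while the estimated probability speaks only about the records with X > 1 (local), which mirrors the model/pattern distinction on the slide.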

1.2.2. Data mining tasks
Score function
- A score function measures how well a model/pattern structure fits the given data set.
- It tells us whether one model is better than the others.
- It should not depend heavily on the particular data set and should not be expensive to compute.
- Some common score functions: likelihood, sum of squared errors, misclassification rate, ...
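
Two of the score functions listed above are simple enough to write out directly. The sketch below computes the sum of squared errors for two candidate linear models and the misclassification rate for a set of predicted labels. The candidate models, data points, and labels are hypothetical values chosen so the arithmetic is easy to follow.

```python
def sum_of_squared_errors(actual, predicted):
    """Score for numeric predictions: lower is better."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def misclassification_rate(actual, predicted):
    """Score for class predictions: fraction of wrong labels, lower is better."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


xs = [0, 1, 2, 3]
ys = [2, 5, 8, 11]


def model_a(x):
    return 3 * x + 2    # candidate model Y = 3X + 2


def model_b(x):
    return 2 * x + 3    # candidate model Y = 2X + 3


sse_a = sum_of_squared_errors(ys, [model_a(x) for x in xs])
sse_b = sum_of_squared_errors(ys, [model_b(x) for x in xs])

actual_labels = ["No", "No", "Yes", "Yes"]
predicted_labels = ["No", "Yes", "Yes", "Yes"]
error_rate = misclassification_rate(actual_labels, predicted_labels)
print(sse_a, sse_b, error_rate)
```

Because model_a scores 0 and model_b scores 6 on the same data, the score function ranks model_a as the better fit, which is exactly the comparison role described on the slide.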

1.2.2. Data mining tasks
Optimization and search method
- The goal of the search and optimization method is to determine the structure and the parameter values that best satisfy the score function, given the available data.
- Searching for patterns and models
  - State space: a discrete set of states
  - Search problem: start at a specific node (state) and move through the state space to find the node whose state best satisfies the score function.
  - Search methods: greedy strategies, heuristic strategies, branch-and-bound strategies
- Parameter optimization
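
The search formulation above can be illustrated with a greedy strategy over a small discrete state space. In this sketch each state is an integer pair (a, b) for the line Y = aX + b, the neighbors of a state differ by 1 in one coordinate, and the score is the sum of squared errors. The state space, the neighbor definition, and the data are assumptions made for the example; a greedy search like this can stop at a local optimum on less well-behaved problems.

```python
def sse(params, xs, ys):
    """Score of state (a, b): sum of squared errors of Y = aX + b."""
    a, b = params
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))


def neighbors(state):
    """States reachable in one move: change a or b by one unit."""
    a, b = state
    return [(a + 1, b), (a - 1, b), (a, b + 1), (a, b - 1)]


def greedy_search(start, score, max_steps=1000):
    """Move to the best-scoring neighbor until no neighbor improves the score."""
    current = start
    for _ in range(max_steps):
        best = min(neighbors(current), key=score)
        if score(best) >= score(current):
            return current
        current = best
    return current


xs = [0, 1, 2, 3]
ys = [2, 5, 8, 11]   # generated from Y = 3X + 2


def score(state):
    return sse(state, xs, ys)


best_state = greedy_search((0, 0), score)
print(best_state, score(best_state))
```

Starting from (0, 0), the search reaches (3, 2) with a score of 0. A branch-and-bound or heuristic strategy would explore the same state space with a different rule for choosing and pruning moves.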

1.2.2. Data mining tasks
Data management strategy
- The data being mined
  - Small: all of it is processed at once in main memory
  - Large: it resides on disk, and only part of it is processed in main memory at a time
- The data management strategy determines how the data is stored, indexed, and accessed
  - The data mining algorithm must be efficient and scalable with respect to the data being mined.
  - Database technology
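
When the data does not fit in main memory, an algorithm can often be restructured to see only one chunk at a time. The sketch below computes a mean that way. It is an illustration of the idea only: the generator stands in for a file or database cursor, and the chunk size of 3 and the income values are arbitrary choices for the example.

```python
from itertools import islice


def read_in_chunks(records, chunk_size):
    """Yield successive lists of at most chunk_size records from any iterable."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def streaming_mean(records, chunk_size=3):
    """Mean computed while holding only one chunk in memory at a time."""
    count = 0
    total = 0.0
    for chunk in read_in_chunks(records, chunk_size):
        count += len(chunk)
        total += sum(chunk)
    return total / count if count else None


# A generator stands in for data that would be read from disk.
incomes = (value for value in [125, 100, 70, 120, 95, 60, 220, 85, 75, 90])
mean_income = streaming_mean(incomes)
print(mean_income)
```

Only the running count and total survive between chunks, so memory use stays constant however many records arrive. Database systems provide the same kind of access through cursors, paging, and indexes.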

1.2.3. Data mining processes
A data mining process is an iterative (and interactive) sequence of steps (phases) that begins with raw data and ends with knowledge of interest, that is, knowledge that meets the user's needs.
- Cross Industry Standard Process for Data Mining (CRISP-DM, at www.crisp-dm.org)
- SEMMA (Sample, Explore, Modify, Model, Assess), from the SAS Institute

1.2.3. Data mining processes
Why a data mining process is needed
- It gives a systematic way to carry out (plan and manage) a data mining project.
- It ensures that the effort spent on a data mining project is optimized.
- It keeps the evaluation and updating of the project's models continuous.

1.2.3. The CRISP-DM process
- An industry-standard process
  - Initiated in September 1996 and supported by more than 200 members
  - An open standard
- Supports industry/applications and existing data mining tools
- Focuses on business issues as well as technical analysis
- Provides a framework that guides the data mining process
- Grounded in experience from application domains

1.2.3. The CRISP-DM process
CRISP-DM is an iterative process that allows backtracking and consists of six phases:
- Business understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment

1.2.4. Data mining systems
Data mining systems are built on the broad view of data mining.
- Data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.
Major components a system may have:
- Database, data warehouse, World Wide Web, and information repositories
- Database or data warehouse server
- Knowledge base
- Data mining engine
- Pattern evaluation module
- User interface

1.2.4. Architecture of a data mining system
[Figure: architecture of a data mining system]

1.2.4. Data mining systems
Database, data warehouse, World Wide Web, and information repositories
- These are the data/information sources to be mined.
- In specific situations, this component is the input to data integration and data cleaning techniques.
Database or data warehouse server
- The component responsible for preparing the data appropriate to each data mining request.

1.2.4. Data mining systems
Knowledge base
- The component that holds the domain knowledge used to guide the search and to evaluate the resulting patterns.
- Domain knowledge can include concept hierarchies, user beliefs, constraints or thresholds, metadata, ...
Data mining engine
- The component containing the functional modules that perform the data mining tasks.

1.2.4. Data mining systems
Pattern evaluation module
- This component works with the interestingness measures (and their thresholds) to support the search and to evaluate patterns, so that the patterns found are the ones the user is interested in.
- This component may be integrated into the data mining engine.

1.2.4. Data mining systems
User interface
- The component that supports interaction between the user and the data mining system.
  - The user can specify a data mining query or task.
  - The user can be given information that helps focus the search and can mine more deeply based on intermediate mining results.
  - The user can also browse database/data warehouse schemas and data structures, evaluate the mined patterns, and visualize those patterns in different forms.
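
The way these components cooperate can be sketched with a few small classes. This is a hypothetical toy, not the design of any real system: the class names follow the component names on the slides, but their methods, the frequency-count "mining", the `min_count` threshold, and the sample rows are all invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class KnowledgeBase:
    # Domain knowledge reduced to a single threshold for this sketch.
    min_count: int = 2


@dataclass
class DataServer:
    # Stands in for the database or data warehouse server.
    rows: list

    def fetch(self, attribute):
        # Prepare the task-relevant data for a mining request.
        return [row[attribute] for row in self.rows if attribute in row]


class MiningEngine:
    def frequent_values(self, values):
        # A trivial mining task: count how often each value occurs.
        return Counter(values)


class PatternEvaluator:
    def __init__(self, knowledge_base):
        self.knowledge_base = knowledge_base

    def interesting(self, patterns):
        # Keep only patterns that reach the knowledge base's threshold.
        threshold = self.knowledge_base.min_count
        return {v: c for v, c in patterns.items() if c >= threshold}


def run_query(server, engine, evaluator, attribute):
    # Stands in for the user interface: the user names the attribute to mine.
    values = server.fetch(attribute)
    return evaluator.interesting(engine.frequent_values(values))


server = DataServer(rows=[
    {"status": "Single"},
    {"status": "Married"},
    {"status": "Single"},
    {"status": "Divorced"},
])
result = run_query(server, MiningEngine(), PatternEvaluator(KnowledgeBase()), "status")
print(result)
```

Keeping the evaluator separate from the engine mirrors the slides; as noted there, a real system may instead fold pattern evaluation into the mining engine so that uninteresting patterns are pruned during the search.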

1.2.4. Data mining systems
Characteristics used to assess a data mining system
- Data types
- System issues
- Data sources
- Data mining tasks and methodologies
- Coupling with data warehouse/database systems
- Scalability
- Visualization tools
- Data mining query language and graphical user interface

1.2.4. Data mining systems
Some data mining systems:
- Intelligent Miner (IBM)
- Microsoft data mining tools (Microsoft SQL Server 2000/2005/2008)
- Oracle Data Mining (Oracle 9i/10g/11g)
- Enterprise Miner (SAS Institute)
- Weka (the University of Waikato, New Zealand, www.cs.waikato.ac.nz/ml/weka)
- ...

1.2.4. Data mining systems
Data mining systems should be distinguished from:
- Statistical data analysis systems
- Machine learning systems
- Information retrieval systems
- Deductive database systems
- Database systems
- ...

1.3. Significance and role of data mining
[Figure: the evolution of database system technology]
- Data Collection and Database Creation (1960s and earlier)
- Database Management Systems (1970s-early 1980s)
- Advanced Database Systems (mid-1980s-present)
- Advanced Data Analysis: Data Warehousing and Data Mining (late 1980s-present)
- Web-based Database Systems (1990s-present)
- New Generation of Integrated Data and Information Systems (present-future)

1.3. Significance and role of data mining
- A modern technology in the field of information management
- Ubiquitous and invisible in many aspects of daily life
  - Working, shopping, searching for information, leisure, ...
- Applied in many applications across many different fields
- Supports scientists, educators, economists, businesses, customers, ...

1.4. Applications of data mining
- In business
- In finance and sales marketing
- In commerce and banking
- In insurance
- In science and biomedicine
- In control and telecommunications
- ...

1.5. Summary
- Data mining is the process of discovering interesting patterns from large amounts of data.
  - Mined patterns represent knowledge if they are easy to understand, valid with some degree of certainty, useful, and novel to the user.
  - The large amounts of data come from traditional/modern databases, data warehouses, or other information sources (spatial, time series, text, multimedia, web, ...).
- Data mining tasks include mining class/concept descriptions (data characterization and discrimination), mining association/correlation rules, classification, prediction, clustering, trend analysis, deviation and outlier analysis, similarity analysis, ...
- Five basic primitives specify a data mining task: task-relevant data, the kind of knowledge to be mined, background knowledge, interestingness measures, and knowledge presentation/visualization techniques.
- Four basic components make up a data mining algorithm: pattern or model structure, score function, optimization and search method, and data management strategy.

1.5. Summary
- Data mining is regarded as one part of the knowledge discovery process.
- The knowledge discovery process is an iterative sequence of steps: data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.
- Many fields are related to data mining: database technology, statistical theory, machine learning, information science, visualization, ...
- Related issues: data mining methodology, user interaction, scalability and performance, handling large amounts of diverse data types, and the deployment of data mining applications together with their social impact.

Q&A
