
The K-Means Algorithm and the Data Clustering Problem

Nguyen Van Chuc (chuc1803@gmail.com)


1. Introduction to Clustering Techniques in Data Mining
Clustering is a very important technique in data mining and belongs to the class of Unsupervised Learning methods in Machine Learning. There are many different definitions of the technique, but in essence clustering is a process that groups the given objects into clusters such that objects in the same cluster are similar to each other and objects in different clusters are dissimilar.
The goal of clustering is to uncover the intrinsic grouping of the data. Every clustering algorithm produces clusters. However, no single criterion is regarded as the best for judging the quality of a cluster analysis; the right criterion depends on the purpose of the clustering, such as data reduction, natural clusters, useful clusters, or outlier detection.
Clustering can be applied in many fields, for example:
Marketing: identifying groups of customers (potential customers, high-value customers, classifying and predicting customer behavior, and so on) who use the company's products or services, to help the company build a more effective business strategy;
Biology: grouping animals and plants based on their attributes;
Libraries: tracking readers and books, and predicting readers' needs;
Insurance, Finance: grouping users of insurance and financial services, predicting customer trends, and detecting financial fraud (identifying frauds);
WWW: document classification; grouping web users (clustering weblog);

Clustering techniques are classified as follows (see figure)

2. The K-Means Algorithm
K-Means is a very important and widely used clustering algorithm. Its main idea is to partition the given objects into K clusters (K is the number of clusters, a positive integer fixed in advance) such that the sum of squared distances from the objects to the centroid of their cluster is minimized.
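The objective described above can be written as a formula. This is the standard within-cluster sum of squares; the symbols are notation introduced here for illustration (x_i is an object, C_j is cluster j, c_j is its centroid, J is the quantity being minimized) and do not come from the original text.

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2,
\qquad
c_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

The second expression states that each centroid is the mean of the objects currently assigned to its cluster.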
The K-Means algorithm is described as follows


The K-Means algorithm proceeds through the following main steps:
1. Randomly choose K centroids for the K clusters. Each cluster is represented by its centroid.
2. Compute the distance from each object to the K centroids (Euclidean distance is usually used)
3. Assign each object to the nearest group
4. Recompute the centroid of each group
5. Repeat from step 2 until no object changes group
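The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the author's implementation; the function name `kmeans` and its parameters (`init_centroids`, `max_iter`, `seed`) are hypothetical names chosen for this sketch, and keeping the old centroid when a cluster becomes empty is an assumption the text does not address.

```python
import math
import random


def kmeans(points, k, init_centroids=None, max_iter=100, seed=0):
    """Cluster points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Step 1: pick K starting centroids (random objects unless given explicitly).
    centroids = [tuple(c) for c in (init_centroids or rng.sample(points, k))]
    labels = None
    for _ in range(max_iter):
        # Steps 2-3: assign every object to its nearest centroid (Euclidean).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 5: stop once no object changes group.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids
```

For instance, `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2, init_centroids=[(0, 0), (10, 10)])` puts the first two points in group 0 and the last two in group 1. Group labels are 0-based here, unlike the 1-based group numbers used in the text.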
An example illustrating the K-Means algorithm:
Suppose we have 4 drugs A, B, C, D, each described by 2 features X and Y as follows. Our goal is to group these drugs into 2 groups (K=2) based on their features.

Step 1. Initialize the centroids of the 2 groups. Suppose we choose A as the centroid of the first group (coordinates c1(1,1)) and B as the centroid of the second group (coordinates c2(2,1)).


Step 2. Compute the distance from each object to the centroid of each group (Euclidean distance)

Each column of the distance matrix (D) corresponds to one object (the first column to object A, the second column to object B, and so on). The first row holds the distances from the objects to the centroid of the first group (c1), and the second row holds the distances from the objects to the centroid of the second group (c2).
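For example, the distance from drug C=(4,3) to centroid c1(1,1) is 3.61 and to centroid c2(2,1) is 2.83, computed as follows:
d(C, c1) = sqrt((4 - 1)^2 + (3 - 1)^2) = sqrt(13) ≈ 3.61
d(C, c2) = sqrt((4 - 2)^2 + (3 - 1)^2) = sqrt(8) ≈ 2.83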
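The whole distance matrix for this first iteration can be recomputed with a short script. The coordinates of A, B and C are the ones given in the text; the original table of values did not survive, so D = (5, 4) is an assumed value used only to make the sketch runnable. The names `drugs`, `centroids` and `distance_matrix` are illustrative.

```python
import math

# A, B, C come from the text; D = (5, 4) is an assumed value (table missing).
drugs = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
centroids = {"c1": (1, 1), "c2": (2, 1)}

# One row per centroid, one column per object (A, B, C, D), as described above.
distance_matrix = {
    name: [round(math.dist(p, c), 2) for p in drugs.values()]
    for name, c in centroids.items()
}

for name, row in distance_matrix.items():
    print(name, row)
# c1 [0.0, 1.0, 3.61, 5.0]
# c2 [1.0, 0.0, 2.83, 4.24]
```

The third column reproduces the two distances for drug C quoted in the text; the fourth column depends on the assumed coordinates of D.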


Step 3. Assign each object to the nearest group

After the first iteration, group 1 contains the single object A and group 2 contains the remaining objects B, C, D.
Step 4. Recompute the centroid coordinates of the new groups from the coordinates of the objects in each group. Group 1 contains only object A, so its centroid is unchanged, c1(1,1). The centroid of group 2 is computed as follows:
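The centroid update is a coordinate-wise mean of the group's members. In the sketch below, B and C use the coordinates given in the text, while D = (5, 4) is an assumed value because the original table of values is missing; the helper name `centroid` is hypothetical.

```python
# B and C come from the text; D = (5, 4) is an assumed value (table missing).
group2 = {"B": (2, 1), "C": (4, 3), "D": (5, 4)}


def centroid(points):
    """Coordinate-wise mean of a collection of points."""
    points = list(points)
    return tuple(sum(axis) / len(points) for axis in zip(*points))


c2 = centroid(group2.values())
print(tuple(round(v, 2) for v in c2))  # (3.67, 2.67)
```

Under that assumption the new centroid is c2 = ((2 + 4 + 5) / 3, (1 + 3 + 4) / 3), roughly (3.67, 2.67).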


Step 5. Recompute the distances from the objects to the new centroids


Step 6. Assign the objects to groups

Step 7. Recompute the centroids of the new groups



Step 8. Recompute the distances from the objects to the new centroids

Step 9. Assign the objects to groups

We see that G2 = G1 (no object has changed group), so the algorithm stops, and the resulting grouping is as follows:
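The complete walk-through can be replayed in code to see where the assignments stop changing. As before, A, B and C use the coordinates given in the text, and D = (5, 4) is an assumed value standing in for the missing table, so the printed numbers are only as reliable as that assumption. Groups are numbered 0 and 1 in the sketch rather than 1 and 2, and the name `history` is illustrative.

```python
import math

# A, B, C come from the text; D = (5, 4) is an assumed value (table missing).
drugs = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
centroids = [drugs["A"], drugs["B"]]  # Step 1: c1 = A, c2 = B

history = []  # group assignment after each iteration (G0, G1, G2, ...)
while True:
    # Assign each drug to the index of its nearest centroid.
    groups = {
        name: min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for name, p in drugs.items()
    }
    history.append(groups)
    if len(history) > 1 and history[-1] == history[-2]:
        break  # no object changed group, so stop
    # Recompute each centroid as the mean of its members.
    for j in range(len(centroids)):
        members = [drugs[n] for n, g in groups.items() if g == j]
        centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))

for i, g in enumerate(history):
    print(f"G{i}:", g)
print("centroids:", centroids)
```

With the assumed value for D, the run records three assignments, the last two are identical, and the final grouping is {A, B} and {C, D} with centroids (1.5, 1.0) and (4.5, 3.5).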

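The advantage of K-Means is that it is simple and easy to understand and implement. Its limitations are that the quality of the result depends on the choice of the number of groups K (which must be fixed in advance), and that the cost of the repeated distance computations becomes large when the number of clusters K and the dataset are large.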
3. Deploying a Clustering Application with WeKa
In this example, I will show how to build a KnowledgeFlow that applies clustering based on the K-Means algorithm in the WeKa data mining software.
The data clustered in this example is a dataset used for classifying bank customers (data file bank.arff). bank.arff contains 11 attributes and 600 customers (instances). The structure and distribution of the data in bank.arff are shown below
You can download the file bank.arff here:

Our task is to use the K-Means algorithm to group the customers into K groups (K=5 in this example) based on their similarity across the 11 attributes.
We build a KnowledgeFlow in WeKa as follows:


Set the parameters of the K-Means algorithm, such as the number of clusters (K=5 in this example), the distance function (Euclidean distance in this example), and so on


The detailed clustering results are as follows:

PS. The next topic is SOM (Self Organizing Maps) in Clustering Techniques. Please send all comments to chucnv@ud.edu.vn.
