HỒ CHÍ MINH
UNIVERSITY OF SCIENCE
FACULTY OF INFORMATION TECHNOLOGY
COURSE PROJECT
NATURAL LANGUAGE PROCESSING
Topic:
Text Categorization
Text Classification (Chapter 16)
Based on the reference:
TP.HCM 01/2011
TABLE OF CONTENTS
1. Project Summary
2.4 Text Preprocessing
2.6 Evaluating the Classifier
2.6.1 Macro-Averaging
2.6.2 Micro-Averaging
3.1.1 Theorem
3.2.2.3 Example
3.2.3.3 Cross-validation
3.3.1 Entropy
3.3.2.3 Notation
3.3.2.4 Model
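1. Project Summary
[...] : D × C → Boolean
True if d belongs to class c
False if d does not belong to class c
[...] is a vector
and class c2 is the class of the documents with [...]
[...] The family of [...]
Training corpus: a corpus collected from many different sources.
Classification phase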
2.4 Text Preprocessing
Before a document is vectorized, that is, before it is used, it must be preprocessed. Preprocessing improves classification performance and reduces the complexity of the training algorithm.
Depending on the purpose of the classifier, different preprocessing methods are applied, such as:
Converting the text to lowercase
Removing special characters ([ ], [.], [,], [:], [], [], [;], [/], [[]], [~], [`], [!], [@], [#], [$], [%], [^], [&], [*], [(], [)]), digits, and arithmetic operators
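The two preprocessing steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `preprocess` is hypothetical, and the sketch simply drops every character that is not a letter or whitespace, which covers the listed punctuation, digits, and arithmetic operators but may remove more than the original list intended.

```python
import re

# Assumption: anything that is not a letter or whitespace counts as "special".
# [^\w\s] removes punctuation and operators; [\d_] removes digits and underscores.
_NON_LETTER = re.compile(r"[^\w\s]|[\d_]")
_SPACES = re.compile(r"\s+")

def preprocess(text: str) -> str:
    """Lowercase the text and strip special characters, digits and operators."""
    text = text.lower()
    text = _NON_LETTER.sub(" ", text)
    # Collapse the runs of spaces left behind by the removals.
    return _SPACES.sub(" ", text).strip()

cleaned = preprocess("Hello, World! 2 + 2 = 4")
```

Replacing removed characters with a space rather than deleting them keeps neighbouring words from being glued together.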
[...] is an n-dimensional vector [...]
Boolean weighting
tf*idf weighting
Entropy weighting
Figure 2.7: Three quantities commonly used in term (word) weighting
Three pieces of information are used when computing weights from word frequencies: term frequency ($tf_{ij}$, the number of occurrences of word $w_i$ in document $d_j$), document frequency ($df_i$, the number of documents that contain word $w_i$), and collection frequency ($cf_i$, the number of occurrences of word $w_i$ in the whole corpus). Here, [...]
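To make the three quantities concrete, the following sketch counts them for a tiny tokenized collection. The function name `term_statistics` and the three toy documents are hypothetical and serve only as an illustration.

```python
from collections import Counter

def term_statistics(docs):
    """Return (tf, df, cf) for a list of tokenized documents.

    tf[j][w] : occurrences of word w in document j
    df[w]    : number of documents that contain w
    cf[w]    : occurrences of w in the whole collection
    """
    tf = [Counter(doc) for doc in docs]
    df = Counter()
    cf = Counter()
    for counts in tf:
        df.update(counts.keys())  # each document counts once per distinct word
        cf.update(counts)         # add the raw occurrence counts
    return tf, df, cf

# Hypothetical toy collection of three documents.
docs = [["bit", "chip", "bit"], ["log", "bit"], ["var", "log", "log"]]
tf, df, cf = term_statistics(docs)
```

For any word, the document frequency can never exceed the collection frequency, since a document that contains the word contributes at least one occurrence.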
2.6 Evaluating the Classifier
After the optimal parameters of the classifier have been found (that is, once the classifier has been trained), the next task is to evaluate (test) how well it performs. The evaluation must be carried out on a corpus different from the training corpus, called the test set. Because the classifier has never seen the test set, the result on it is the only measurement that reflects the classifier's true ability.
For simplicity, we consider a binary (two-class) classifier. Such classifiers are usually evaluated by building the following table:
[...]
where [...] and the error (Error) is [...] class. However, when evaluating a classifier, people usually look only at the objects that belong to the class and are classified correctly; the objects that do not belong to the class usually receive little attention. Therefore, several other measures are defined.
The measures include:
[...]
Fallout: [...]
where:
P is the precision
R is the recall
[...]
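The measures named in this section can be sketched as follows. The definitions used are the usual textbook ones, not formulas recovered from the source. The count names `tp`, `fp`, `fn`, `tn` (true/false positives and negatives) and the function `binary_metrics` are assumptions, and the combined measure shown is the balanced F1 (harmonic mean of P and R), which may differ from a weighted variant used in the original.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Common evaluation measures for a binary classifier (assumes a non-empty test set)."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fallout = fp / (fp + tn) if fp + tn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": (tp + tn) / total,
        "error": (fp + fn) / total,
        "precision": precision,
        "recall": recall,
        "fallout": fallout,
        "f1": f1,
    }

# Hypothetical counts: 100 test documents, 12 of which belong to the class.
m = binary_metrics(tp=8, fp=2, fn=4, tn=86)
```

With these counts the accuracy is high mainly because of the many true negatives, which illustrates why precision and recall are preferred when the class of interest is rare.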
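2.6.1 Macro-Averaging
$$P^{macro} = \frac{1}{|C|}\sum_{i=1}^{|C|} P_i, \qquad R^{macro} = \frac{1}{|C|}\sum_{i=1}^{|C|} R_i$$
where |C| is the number of classes to be classified, and $P_i$, $R_i$ are the precision and recall of class i.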
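2.6.2 Micro-Averaging
$$P^{micro} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)}, \qquad R^{micro} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)}$$
where, for class i, $TP_i$ is the number of documents correctly assigned to the class, $FP_i$ the number wrongly assigned to it, and $FN_i$ the number wrongly left out of it.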
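The difference between the two averaging schemes is easiest to see on a small example. The sketch below assumes the usual definitions: macro-averaging takes the mean of the per-class precisions, while micro-averaging pools the counts of all classes before dividing. The function name and the counts are hypothetical, and every class is assumed to have at least one document assigned to it.

```python
def macro_micro_precision(per_class):
    """per_class: list of (tp, fp) pairs, one pair per class."""
    # Macro: compute precision per class, then average over classes.
    precisions = [tp / (tp + fp) for tp, fp in per_class]
    macro = sum(precisions) / len(precisions)
    # Micro: pool the counts over all classes, then compute one precision.
    total_tp = sum(tp for tp, _ in per_class)
    total_assigned = sum(tp + fp for tp, fp in per_class)
    micro = total_tp / total_assigned
    return macro, micro

# Hypothetical counts: one large, well-classified class and one small, poorly classified class.
macro, micro = macro_micro_precision([(90, 10), (1, 9)])
```

Macro-averaging gives every class the same weight, so the small class pulls the score down; micro-averaging gives every document the same weight, so the large class dominates.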
[...] today based on the statistical approach: the Naive Bayes algorithm, decision trees, Maximum Entropy Modeling, and kNN. This part also presents several examples of how these classification methods are carried out.
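3.1.1 Theorem
$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$$
where: [...]
According to Bayes' theorem:
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$
By the conditional independence property, for $X = (x_1, \ldots, x_n)$:
$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i)\,P(x_2 \mid C_i) \cdots P(x_n \mid C_i)$$
where: [...]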
3.1.2 Algorithm
[...]
Step 2: [...]
$$C^{*} = \arg\max_{C_i} \Big( P(C_i) \prod_{k=1}^{n} P(x_k \mid C_i) \Big)$$
Step 1:
Compute the probabilities P(Ci)
For C1 = yes
P(C1) = P(yes) = 9/14
For C2 = no
P(C2) = P(no) = 5/14
P(hot|no) = 2/5
P(cold|yes) = 3/9
P(cold|no) = 1/5
P(mild|yes) = 4/9
P(mild|no) = 2/5
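As a small worked illustration, the probabilities listed above can be combined into Naive Bayes scores for a single attribute value. The sketch uses only the priors and the two `mild` conditionals given in the text; the full example would multiply in one conditional per remaining attribute, and those values are not available here. The variable names are hypothetical.

```python
from fractions import Fraction

# Values taken from the example above.
prior = {"yes": Fraction(9, 14), "no": Fraction(5, 14)}
p_mild = {"yes": Fraction(4, 9), "no": Fraction(2, 5)}

# Unnormalized score P(C) * P(mild | C) for each class.
score = {c: prior[c] * p_mild[c] for c in prior}
best = max(score, key=score.get)
```

Exact fractions are used so that the intermediate values can be checked by hand: the scores are 2/7 for `yes` and 1/7 for `no`.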
To apply the Naive Bayes algorithm to text classification, the documents in the training set must first be preprocessed and vectorized. The preprocessing and vectorization methods were presented in the previous sections. However, because Naive Bayes relies on document probabilities and feature probabilities, this method uses vectorization by counting word frequencies (word frequency weighting).
[...]
or
[...]
Step 2: Classification
[...]
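The training and classification steps on word-frequency vectors can be sketched as below. This is not the exact formulation of this report: it assumes a common multinomial variant with add-one smoothing and log-probabilities. The function names `train` and `classify` and the toy count vectors are hypothetical.

```python
import math

def train(docs):
    """docs: list of (word_count_vector, class_label) pairs."""
    n_docs = len(docs)
    n_features = len(docs[0][0])
    class_docs = {}    # number of training documents per class
    class_counts = {}  # summed word counts per class
    for vec, label in docs:
        class_docs[label] = class_docs.get(label, 0) + 1
        sums = class_counts.setdefault(label, [0] * n_features)
        for k, count in enumerate(vec):
            sums[k] += count
    model = {}
    for label, sums in class_counts.items():
        total = sum(sums)
        log_prior = math.log(class_docs[label] / n_docs)
        # Add-one smoothing (an assumption) avoids zero probabilities.
        log_cond = [math.log((s + 1) / (total + n_features)) for s in sums]
        model[label] = (log_prior, log_cond)
    return model

def classify(model, vec):
    """Return the class with the highest log-score for the count vector vec."""
    def score(label):
        log_prior, log_cond = model[label]
        return log_prior + sum(c * lp for c, lp in zip(vec, log_cond))
    return max(model, key=score)

# Hypothetical training data: three word counts per document.
training = [
    ([5, 1, 0], "Math"),
    ([4, 0, 1], "Math"),
    ([0, 3, 4], "Comp"),
    ([1, 4, 5], "Comp"),
]
model = train(training)
label = classify(model, [3, 0, 1])
```

Working in log-space turns the product of many small probabilities into a sum, which avoids numerical underflow on long documents.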
Docs    Var, Bit, Chip, Log    Class
Doc1    42, 25, 56 [...]       Math
Doc2    10, 28, 45 [...]       Comp
Doc3    11, 25, 22 [...]       Comp
Doc4    33, 40, 48 [...]       Math
Doc5    28, 32, 60 [...]       Math
Doc6    22, 30 [...]           Comp
Training step:
Compute the probabilities of the classes Ci in the training set
[...]
Class C2 = Math
Total = 388
[...]
Determining the class of a new document
Compute the probabilities:
[...]
Result:
The document Docnew belongs to class Math, since max(Pnew) = 598.62
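Concept
A decision tree is a tree structure with:
[...]
3.2.2 [...]
Start from a single node representing all the samples
Otherwise, use an attribute measure to choose the attribute that best separates the samples into the classes
$$Entropy(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$
where:
o S: the set of samples (the training set)
o c: the number of classes in the samples
o $p_i$: the probability (proportion) of samples that belong to class $c_i$
$$Gain(S, A) = Entropy(S) - \sum_{v \in Value(A)} \frac{|S_v|}{|S|}\,Entropy(S_v)$$
where:
o Value(A): the set of possible values of attribute A.
o $S_v$: the subset of S for which A takes the value v.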
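The two attribute measures defined above can be sketched directly from their definitions. The sketch assumes the standard forms, entropy as the sum of -p_i log2 p_i and gain as the entropy of S minus the size-weighted entropy of the subsets S_v. The function names and the dictionary-based row format are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def gain(rows, labels, attribute):
    """Information gain of splitting on `attribute`.

    rows: list of dicts mapping attribute name -> value, aligned with labels.
    """
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    # Weighted entropy of the subsets S_v.
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# 9 "yes" and 5 "no" samples, the class proportions of the Play Tennis example.
tennis_entropy = entropy(["yes"] * 9 + ["no"] * 5)
```

An attribute that separates the classes perfectly has a gain equal to the entropy of S, while an attribute whose subsets all keep the original class proportions has a gain of zero.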
3.2.2.3 Example
Consider again the Play Tennis classification example introduced with the Naive Bayes method:
[...]
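3.2.3 [...]
$$w_{ij} = \mathrm{round}\left(10 \times \frac{1 + \log(tf_{ij})}{1 + \log(l_j)}\right)$$
where
$tf_{ij}$: the frequency of word i in document j
$l_j$: the length of document j
If word i does not occur in the document, $w_{ij}$ is set to 0.
[...] and is rounded to 5.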
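The weight described here depends on the term frequency and the document length, with a weight of 0 for absent words. The sketch below assumes that the weight has the form round(10 · (1 + log tf) / (1 + log l)) with natural logarithms. Both the form and the log base are assumptions, as are the function name and the sample inputs.

```python
import math

def term_weight(tf: int, doc_length: int) -> int:
    """Length-normalized, log-scaled term weight (assumed form; tf <= doc_length)."""
    if tf == 0:
        return 0  # words that do not occur in the document get weight 0
    return round(10 * (1 + math.log(tf)) / (1 + math.log(doc_length)))

# Hypothetical input: a word occurring 6 times in a document of 89 words.
w = term_weight(6, 89)
```

The logarithms damp the effect of very frequent words and very long documents, and the factor of 10 with rounding maps the weights onto a small integer scale.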
[...] represent documents. In this part we do not go into the details of this [...]; readers who need the details can refer to Section 5.3.3 of the book.
[...]
Repeat the computation steps above for each left and right branch to keep growing the tree until every node clearly determines which class its documents belong to. A node that contains only documents of a single class is called a leaf node, and reaching such nodes is the stopping condition of the algorithm.
Post-pruning: the approach of first growing the complete tree and then pruning it.
3.2.3.3 Cross-validation
In practice, if the training corpus is split into two parts, 80% for training and 20% for validation (pruning), the training accuracy drops and the reliability of the pruning is not guaranteed either. To address this, cross-validation is commonly used.
In cross-validation, not one but several decision trees are built. Each run uses the same corpus but holds out a different 20% of it. Put simply, the corpus is 100% of the data; in each run, 80% is used to train (grow) the tree and 20% to prune it. The next run does the same, but with a different 20% of the corpus held out. This gives 5 training runs (5-fold cross-validation).
After the 5 pruned decision trees have been obtained, their average size is computed. With this size, the tree is retrained on 100% of the data, and [...]
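The 80%/20% rotation described above can be sketched as an index generator. The function name `k_fold_splits` is hypothetical, and the sketch covers only the splitting, not the growing or pruning of the trees.

```python
def k_fold_splits(n_items: int, k: int = 5):
    """Yield (train_indices, held_out_indices) for each of the k folds."""
    indices = list(range(n_items))
    fold_size = n_items // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder when n_items is not divisible by k.
        stop = n_items if fold == k - 1 else start + fold_size
        held_out = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, held_out

# With 100 documents and k = 5, each run trains on 80 and holds out 20.
splits = list(k_fold_splits(100, k=5))
```

Every document is held out exactly once, so all of the data contributes both to growing and to pruning across the five runs. In practice the documents are usually shuffled before splitting.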
Remark:
The more uniform a probability distribution is, the greater the uncertainty, and hence the higher the entropy.
Maximum entropy theorem:
$$H(p_1, p_2, \ldots, p_M) \le \log(M)$$
Equality holds if and only if $p_1 = p_2 = \cdots = p_M = 1/M$, in which case the entropy reaches its maximum value.
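The bound can be checked numerically on a small example. The sketch assumes base-2 logarithms, so the entropy is measured in bits. The function name `H` mirrors the notation above, and the two distributions are hypothetical.

```python
import math

def H(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.25] * 4           # p1 = p2 = p3 = p4 = 1/M with M = 4
skewed = [0.7, 0.1, 0.1, 0.1]  # same M, less uniform

h_uniform = H(uniform)
h_skewed = H(skewed)
upper_bound = math.log2(4)
```

The uniform distribution attains the bound log(M), while the skewed one stays strictly below it.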
3.3.2 Application to Text Classification
3.3.2.1 Document Representation
To represent documents, we again use the method presented above. In this experiment, the author represents each document as a 20-dimensional vector whose dimensions are the 20 words with the highest χ² scores; each dimension holds the weight of the word, measured with tf.idf
[...]
[...], with [...] values in the class set C.
[...]
S: the training set
[...]: the probability of x
3.3.2.4 Model
There are many different models for classification with maximum entropy. Here we use the loglinear model, whose formula is as follows:
[...]
where:
K: the number of feature functions.
[...]
The task for the model is to determine the optimal weights [...] and use them [...]
[...] as follows:
1. Initialize the set of weights { [...] }. Set n = 1.
2. Compute the probability of each document vector with each class in the class set [...] using the formula:
[...]
3. Compute the expected value [...]
[...]
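Step 2 of the procedure computes a class probability for every document vector. As a rough sketch, assuming the common loglinear form in which the score of a class is the product of the weights α_i raised to the feature values f_i(x, c), normalized over the classes, the computation might look as follows. The function name, the two feature functions, and the weight values are all hypothetical.

```python
def class_probabilities(x, classes, alphas, features):
    """Return {c: p(c | x)} under an assumed loglinear model.

    alphas   : list of K positive weights
    features : list of K functions f_i(x, c), each returning 0 or 1
    """
    scores = {}
    for c in classes:
        score = 1.0
        for alpha, f in zip(alphas, features):
            score *= alpha ** f(x, c)  # a weight contributes only when its feature fires
        scores[c] = score
    z = sum(scores.values())  # normalizing constant over the classes
    return {c: s / z for c, s in scores.items()}

# Hypothetical binary features tying one word weight to one class each.
features = [
    lambda x, c: 1 if x[0] > 0 and c == "Math" else 0,
    lambda x, c: 1 if x[1] > 0 and c == "Comp" else 0,
]
alphas = [3.0, 2.0]
probs = class_probabilities([5, 0], ["Math", "Comp"], alphas, features)
```

An iterative procedure such as the one outlined in the steps above would then adjust the weights until the expected feature values under the model match those observed in the training set.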
Given a new document vector [...]
[...]