
VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY
UNIVERSITY OF SCIENCE
FACULTY OF INFORMATION TECHNOLOGY

NGUYEN MINH THANH 10 12 042

COURSE PROJECT
NATURAL LANGUAGE PROCESSING
Topic:

Text Categorization (Chapter 16)

Based on:

Foundations of Statistical Natural Language Processing
Christopher D. Manning, Hinrich Schütze

Ho Chi Minh City, 01/2011

CONTENTS

1. Project summary
2. The text classification problem
2.1 Introduction
2.2 Problem statement
2.3 General model
2.3.1 Training phase
2.3.2 Classification phase
2.4 Text preprocessing
2.5 Document representation
2.5.1 Vector space model
2.5.2 Term weighting
2.6 Evaluating the classifier
2.6.1 Macro-averaging
2.6.2 Micro-averaging
3. Text classification methods
3.1 Naive Bayes
3.1.1 Theorem
3.1.2 Algorithm
3.1.3 Application to text classification
3.2 Decision trees
3.2.1 Concept
3.2.2 Tree-building algorithm
3.2.2.1 The ID3 algorithm
3.2.2.2 Measures used by the algorithm
3.2.2.3 Example
3.2.3 Application to text classification
3.2.3.1 Document representation
3.2.3.2 Training phase
3.2.3.3 Cross-validation
3.2.3.4 Classification phase
3.3 Maximum entropy modeling
3.3.1 Entropy
3.3.1.1 Concept
3.3.1.2 Entropy of a random variable
3.3.2 Application to text classification
3.3.2.1 Document representation
3.3.2.2 Feature functions and constraints
3.3.2.3 Notation
3.3.2.4 Model
3.3.2.5 Training procedure: generalized iterative scaling
3.3.2.6 Classification phase
5. References

1. Project summary

This section gives a brief overview of the text classification problem as treated in the book Foundations of Statistical Natural Language Processing, and of the statistical methods used to solve it.

Text classification is an important problem in natural language processing. The task is to assign text documents to a set of predefined topics. The problem comes up constantly in practice. A typical example is a stock market analyst who must collect a large number of documents and articles about the market, read them, and form a judgment. The analyst cannot read every article and document first, decide which ones concern the stock market, and only then study those closely, because the volume of articles being published, especially on the internet, is far too large to read in full. Another practical example is spam filtering. Spam is so abundant that a user who had to read every message arriving in the mailbox would waste a great deal of time, so a system is needed that separates spam from legitimate mail.

Many methods have been proposed for this problem, including Naive Bayes, K-NN (K-Nearest-Neighbor), decision trees, artificial neural networks, and SVM (Support Vector Machine). Each gives fairly good results on this problem; to allow a fuller comparison, the following sections look at individual methods in detail.

This project describes the steps involved in text classification within natural language processing, several approaches to solving the problem, and the results obtained in the experiments reported by the authors of the book.

2. The text classification problem

This section describes the steps of a text classification task in detail: the representation model, the evaluation measures, and the methods used to assess classification results.

2.1 Introduction

As noted above, text classification is an important problem in natural language processing. The field contains many classification problems, such as part-of-speech tagging (POS tagging), word sense disambiguation, and prepositional phrase attachment.

Each classification problem operates on a different kind of object and has a different goal. In POS tagging and word sense disambiguation, the object being classified is a word (word level). In prepositional phrase attachment, it is a phrase. In text classification, it is a document (or text).

Figure 2.1: Classification problems in natural language processing

2.2 Problem statement

Text classification can be stated as follows. Given a set of documents $D = \{d_1, d_2, \dots, d_n\}$ and a set of predefined topics $C = \{c_1, c_2, \dots, c_m\}$, the task is to assign each document $d_i$ to the classes $c_j$ it belongs to. In other words, the goal is to find a function

$$f : D \times C \to \text{Boolean}, \qquad f(d, c) = \begin{cases} \text{true} & \text{if } d \text{ belongs to class } c \\ \text{false} & \text{if } d \text{ does not belong to class } c \end{cases}$$

In their experiments, the authors of the book use Reuters news articles together with Reuters' own list of news topics.

Figure 2.2: An example of a Reuters news story

The Reuters collection has more than 100 topics, and the experiments use 12,902 articles (documents).

2.3 General model

Many approaches to text classification have been studied, for example approaches based on graph theory, on rough set theory, and on statistics. All of them, however, rely on the general machine learning paradigms: supervised learning, unsupervised learning, and reinforcement learning.

Statistical text classification based on supervised learning consists of two phases: a training phase and a classification phase.
2.3.1 Training phase

We start from a training set in which every element is labeled with one or more classes. Each element is encoded with a representation model (see Document representation). Typically an element of the training set has the form $(\vec{x}, c)$, where $\vec{x}$ is the vector representing a training document and $c$ is its class.

We then define a model class and a training procedure. The model class is a parameterized family of classifiers, and the training procedure is an algorithm that selects from this family the parameter set that is optimal for the classifier. How do we judge that a parameter set is optimal? That question is addressed in the section Evaluating the classifier.

Figure 2.3: Model of the training phase

Input: the training corpus and the training algorithm
Output: the classification model (the classifier)

An example of a parameterized family for binary classification is $g(\vec{x}) = \vec{w} \cdot \vec{x} + w_0$. A binary classifier distinguishes only two classes: we call $c_1$ the class of documents with $g(\vec{x}) > 0$ and $c_2$ the class of documents with $g(\vec{x}) \le 0$. The parameters to be determined are the vector $\vec{w}$ and the threshold $w_0$.

Details of the classifier training phase

Figure 2.4: Details of the training phase


Where:

Training corpus: a corpus collected from a variety of sources.

Preprocessing: converts the documents in the corpus into a form suitable for classification.

Vectorization: encodes each document with a weighting model (see Document representation).

Feature selection: removes words (features) that carry no information from the documents, to improve classification performance and reduce the complexity of the training algorithm.

Training algorithm: the procedure that trains the classifier to find the optimal parameter set.

Evaluation: measures the performance (quality) of the classifier (see Evaluating the classifier).

The training procedure is iterated many times, looking for a better parameter set on each pass. Because the parameters start from an initial value, a poor initialization can lead the procedure to a parameter set that is only a local optimum.
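The iterative training of a binary classifier with a weight vector and a threshold can be sketched as follows. This is only an illustration: it assumes the linear decision function g(x) = w·x + w0 and uses a simple perceptron-style update as the training procedure, which is a stand-in chosen here and not a method prescribed by this report. The function names and the sample data are hypothetical.

```python
def predict(w, w0, x):
    """Return 'c1' if g(x) = w.x + w0 > 0, otherwise 'c2'."""
    g = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return "c1" if g > 0 else "c2"


def train_perceptron(samples, epochs=20, lr=1.0):
    """samples: list of (vector, label) pairs, label is 'c1' or 'c2'."""
    dim = len(samples[0][0])
    w, w0 = [0.0] * dim, 0.0          # initial parameter values (assumed zero)
    for _ in range(epochs):
        errors = 0
        for x, label in samples:
            if predict(w, w0, x) != label:
                # Move the decision boundary toward the misclassified sample.
                sign = 1.0 if label == "c1" else -1.0
                w = [wi + lr * sign * xi for wi, xi in zip(w, x)]
                w0 += lr * sign
                errors += 1
        if errors == 0:               # all training samples are classified correctly
            break
    return w, w0


# Hypothetical two-dimensional training vectors.
SAMPLES = [([2, 0], "c1"), ([3, 1], "c1"), ([0, 2], "c2"), ([1, 3], "c2")]
W, W0 = train_perceptron(SAMPLES)
```

Each pass over the data adjusts the parameters only when a sample is misclassified, which is one concrete instance of "repeat until the parameters stop improving".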
2.3.2 Classification phase

Once training is complete, the classification model is applied to the new documents that need to be classified.

Figure 2.5: Model of the classification phase

Details of the classification phase

Figure 2.6: Model of the classification phase

2.4 Text preprocessing

Before a document is vectorized, that is, before it is used, it must be preprocessed. Preprocessing improves classification performance and reduces the complexity of the training algorithm.

The preprocessing steps depend on the purpose of the classifier. Typical steps are:

Convert the text to lowercase.

Remove punctuation (if sentence splitting is not performed).

Remove special characters ([ ], [.], [,], [:], [], [], [;], [/], [[]], [~], [`], [!], [@], [#], [$], [%], [^], [&], [*], [(], [)]), digits, and arithmetic operators.

Remove stopwords (words that occur in almost every document), which carry no meaning for classification.
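The preprocessing steps listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the text is plain ASCII English, and the stopword list is a tiny hypothetical one; a real system would use a fuller, language-specific list.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def preprocess(text):
    """Lowercase, strip punctuation/digits/symbols, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # keep only letters and whitespace
    return [tok for tok in text.split() if tok not in STOPWORDS]
```

For example, `preprocess("The price of Oil rose 5% in 2010!")` keeps only the content words of the sentence.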

2.5 Document representation

One of the first tasks in text classification is choosing a suitable representation for documents. A document in raw form (a character string) has to be converted into another model that makes it easier to represent and compute with. Different classification algorithms call for different representations. One simple and widely used model is the vector space model, in which a document is represented as $\vec{x}$, an n-dimensional vector whose components measure the elements of the document.


2.5.1 Vector space model

The vector space model is one of the most widely used models in information retrieval, mainly because of its simplicity.

In this model, documents are represented in a high-dimensional space in which each dimension corresponds to a word. Each document D is represented by a feature vector $\vec{x} = (x_1, x_2, \dots, x_n)$, where n is the number of features (the dimensionality of the document vector) and $x_i$ is the weight of the i-th feature ($1 \le i \le n$).

When the training corpus contains many documents, we write $D_j$ for the j-th document in the corpus and $\vec{x}_j = (x_{1j}, x_{2j}, \dots, x_{nj})$ for its feature vector, where $x_{ij}$ is the i-th weight of the vector for $D_j$.


2.5.2 Term weighting

Another important issue in representing a document is how to compute the weights of its feature vector. There are several ways to do this, for example:

Word frequency weighting

Boolean weighting

tf*idf weighting

Entropy weighting

To keep things simple, we consider only word frequency weighting and tf*idf. The simplest scheme just counts how often each word occurs in the document, but there are many variations on weights of this kind.

Figure 2.7: Three quantities commonly used in term weighting

Frequency-based weighting uses three quantities: term frequency ($tf_{ij}$, the number of occurrences of word $w_i$ in document $d_j$), document frequency ($df_i$, the number of documents that contain $w_i$), and collection frequency ($cf_i$, the total number of occurrences of $w_i$ in the whole collection). By definition, $df_i \le cf_i$ and $\sum_j tf_{ij} = cf_i$.

Term frequency captures how salient a word is within a document. The higher the term frequency (the more often the word occurs in the document), the better the word describes the document's content. The second quantity, document frequency, can be read as an indicator of informativeness. A semantically focused word tends to occur several times in a document if it occurs in it at all, whereas semantically unfocused words are spread evenly across all documents.

Consider the following example from the New York Times corpus, which compares the words "try" and "insurance".

[Table: collection frequency and document frequency of "try" and "insurance"]

The two words have nearly the same collection frequency $cf$. Their document frequencies $df$ differ sharply, however: "insurance" occurs in only about half as many documents as "try". The reason is that "try" can be used in almost any topic, whereas "insurance" refers to a narrow concept that is relevant to only a small number of topics. A further property of semantically focused words is that, when they occur in a document, they tend to occur several times.

To obtain a weight that reflects all of this information, term frequency and document frequency are usually combined into a single weight. This kind of weight is called tf.idf. The formula that combines the two quantities is:

$$weight(i, j) = \begin{cases} (1 + \log(tf_{ij})) \, \log\dfrac{N}{df_i} & \text{if } tf_{ij} \ge 1 \\ 0 & \text{if } tf_{ij} = 0 \end{cases}$$

where N is the total number of documents. The first case applies to words that occur in the document, and the second to words that do not.
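A small sketch of a tf.idf weight of the kind described in this section: a damped term frequency multiplied by the log of the inverse document frequency, with weight 0 for words that do not occur in the document. The function name and the choice of the natural logarithm are assumptions made for this illustration.

```python
import math


def tf_idf(tf, df, n_docs):
    """Weight of a word occurring tf times in a document and in df of n_docs documents."""
    if tf == 0:
        return 0.0                      # word does not occur in the document
    return (1 + math.log(tf)) * math.log(n_docs / df)
```

A word that occurs in every document (df equal to n_docs) receives weight 0 however often it occurs, which is how the scheme discounts semantically unfocused words.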

2.6 Evaluating the classifier

Once the optimal parameters have been found (that is, once the classifier has been trained), the next task is to evaluate how well the classifier performs. The evaluation must be carried out on a corpus different from the training corpus, called the test set. Testing a classifier on data it has never seen is the only measurement that shows its true ability.

For simplicity, consider a binary (two-class) classifier. Such classifiers are usually evaluated by building the following contingency table:

                               In the class    Not in the class
Assigned to the class               a                 b
Not assigned to the class           c                 d

where:

a is the number of objects that belong to the class and were assigned to it by the classifier.

b is the number of objects that do not belong to the class but were assigned to it by the classifier.

c is the number of objects that belong to the class but were rejected by the classifier.

d is the number of objects that do not belong to the class and were rejected by the classifier.

Two important measures of classifier quality are accuracy, computed as $\frac{a + d}{a + b + c + d}$, and error, computed as $\frac{b + c}{a + b + c + d}$. These measures take every decision of the classifier into account. In practice, however, evaluation usually concentrates on the objects that belong to the class and were classified correctly, while objects outside the class receive much less attention. Several other measures are therefore defined:

Precision: $P = \frac{a}{a + b}$

Recall (coverage): $R = \frac{a}{a + c}$

Fallout: $\frac{b}{b + d}$

In some practical cases, looking at precision and recall separately gives an unbalanced picture. For convenience, the two are combined into a single overall measure, the F measure, defined as:

$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$$

where:

P is precision.

R is recall.

$\alpha$ is a factor that determines the relative weighting of precision and recall.

The value $\alpha = 0.5$ is usually chosen to weight P and R equally. With this value the measure simplifies to $F = \frac{2PR}{P + R}$.
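The measures of this section can be computed directly from the four counts a, b, c, d. The sketch below is illustrative; the function name and the example counts are hypothetical, and it assumes the denominators are nonzero.

```python
def evaluate(a, b, c, d, alpha=0.5):
    """Evaluation measures from the contingency counts of one binary classifier."""
    precision = a / (a + b)
    recall = a / (a + c)
    total = a + b + c + d
    return {
        "accuracy": (a + d) / total,
        "error": (b + c) / total,
        "precision": precision,
        "recall": recall,
        "fallout": b / (b + d),
        "f": 1.0 / (alpha / precision + (1 - alpha) / recall),
    }


# Hypothetical counts: 40 correct acceptances, 10 false acceptances,
# 20 false rejections, 30 correct rejections.
METRICS = evaluate(40, 10, 20, 30)
```

With alpha = 0.5 the value stored under "f" coincides with 2PR / (P + R).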

The measures above evaluate binary (two-class) classifiers. In practice, classifiers often have to handle many classes. To evaluate performance over all classes, a contingency table is built for each class and the results are combined with one of two methods: micro-averaging or macro-averaging.

2.6.1 Macro-averaging

Macro-averaging averages the per-class precision and recall values. A contingency table is built for each class, precision and recall are computed for each class, and these values are then averaged:

$$P_{macro} = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{a_i}{a_i + b_i}, \qquad R_{macro} = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{a_i}{a_i + c_i}$$

where |C| is the number of classes.

2.6.2 Micro-averaging

Micro-averaging pools the counts of all classes. A contingency table is built for each class, the tables are added together cell by cell, and precision and recall are computed from the resulting pooled table:

$$P_{micro} = \frac{\sum_{i=1}^{|C|} a_i}{\sum_{i=1}^{|C|} (a_i + b_i)}, \qquad R_{micro} = \frac{\sum_{i=1}^{|C|} a_i}{\sum_{i=1}^{|C|} (a_i + c_i)}$$
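The difference between the two averaging methods shows up clearly on a small example. The sketch below assumes one (a, b, c, d) tuple per class, with a, b, c, d as defined for the contingency table; the function name and the counts are hypothetical.

```python
def macro_micro(tables):
    """tables: one (a, b, c, d) tuple per class. Returns macro P, macro R, micro P, micro R."""
    n = len(tables)
    # Macro: average the per-class ratios.
    macro_p = sum(a / (a + b) for a, b, c, d in tables) / n
    macro_r = sum(a / (a + c) for a, b, c, d in tables) / n
    # Micro: pool the counts first, then take the ratio once.
    total_a = sum(a for a, b, c, d in tables)
    micro_p = total_a / sum(a + b for a, b, c, d in tables)
    micro_r = total_a / sum(a + c for a, b, c, d in tables)
    return macro_p, macro_r, micro_p, micro_r


# Hypothetical tables: one small class and one large class.
TABLES = [(10, 10, 0, 80), (90, 10, 10, 90)]
MACRO_P, MACRO_R, MICRO_P, MICRO_R = macro_micro(TABLES)
```

Macro-averaging gives every class the same weight, so the small class pulls macro precision down to 0.7, while micro-averaging is dominated by the large class.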

3. Text classification methods

This section presents several popular statistical methods for text classification: Naive Bayes, decision trees, maximum entropy modeling, and KNN. It also works through examples of how the methods are applied.

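3.1 Naive Bayes

3.1.1 Theorem

Naive Bayes is regarded as the simplest of the methods. A Bayes classifier predicts class membership probabilities, for example the probability that a given sample belongs to a particular class. It assumes that the attributes are independent of one another given the class (class-conditional independence).

The algorithm is based on Bayes' theorem:

$$P(Y \mid X) = \frac{P(X \mid Y) \, P(Y)}{P(X)}$$

where:

Y is a hypothesis, inferred once the new evidence X has been observed.

P(X): the probability that X occurs (the marginal probability of X).

P(Y): the probability that Y occurs (the prior probability of Y).

P(X|Y): the probability that X occurs given that Y occurs (the conditional probability, or likelihood, of X when Y holds).

P(Y|X): the posterior probability of Y given X.

Applied to classification, the ingredients are:

D: the training set, vectorized in the form $\vec{x} = (x_1, x_2, \dots, x_n)$.

$C_i$: the set of documents of D that belong to class $C_i$, with i = 1, 2, 3, ...

The attributes $x_1, x_2, \dots, x_n$ are assumed to be conditionally independent of one another given the class.

By Bayes' theorem:

$$P(C_i \mid X) = \frac{P(X \mid C_i) \, P(C_i)}{P(X)}$$

By conditional independence:

$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \dots \times P(x_n \mid C_i)$$

The classification rule for a new document $X_{new} = \{x_1, x_2, \dots, x_n\}$ is then:

$$\arg\max_{C_i} \left( P(C_i) \prod_{k=1}^{n} P(x_k \mid C_i) \right)$$

where:

$P(C_i)$ is estimated from the relative frequency of the documents of class $C_i$ in the training set.

$P(x_k \mid C_i)$ is estimated from the attribute counts collected during training.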

3.1.2 Algorithm

The Naive Bayes algorithm proceeds in two steps:

Step 1: Train Naive Bayes on the training set:

o Compute the probabilities P(Ci)
o Compute the probabilities P(xk|Ci)

Step 2: Assign Xnew to the class with the largest value of

$$P(C_i) \prod_{k=1}^{n} P(x_k \mid C_i)$$

Consider the classic example of predicting whether a player will play tennis, given known weather conditions. The training data for this example is a table of 14 samples:

[Table: training data for the tennis example]

Step 1:
Compute the probabilities P(Ci)

With C1 = "yes"
P(C1) = P(yes) = 9/14

With C2 = "no"
P(C2) = P(no) = 5/14

Compute the probabilities P(xk|Ci)

For the attribute Outlook, with values sunny, overcast, rain:

P(sunny|yes) = 2/9
P(sunny|no) = 3/5
P(overcast|yes) = 4/9
P(overcast|no) = 0/5
P(rain|yes) = 3/9
P(rain|no) = 2/5

For the attribute Temp, with values hot, cool, mild:

P(hot|yes) = 2/9
P(hot|no) = 2/5
P(cool|yes) = 3/9
P(cool|no) = 1/5
P(mild|yes) = 4/9
P(mild|no) = 2/5

For the attribute Humidity, with values normal, high:

P(normal|yes) = 6/9
P(normal|no) = 1/5
P(high|yes) = 3/9
P(high|no) = 4/5

For the attribute Wind, with values weak, strong:

P(weak|yes) = 6/9
P(weak|no) = 2/5
P(strong|yes) = 3/9
P(strong|no) = 3/5

Step 2: Classify Xnew = {sunny, cool, high, strong}

Compute:
P(yes) * P(Xnew|yes) = 9/14 * 2/9 * 3/9 * 3/9 * 3/9 ≈ 0.005
P(no) * P(Xnew|no) = 5/14 * 3/5 * 1/5 * 4/5 * 3/5 ≈ 0.021
Xnew is assigned to class No.
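The arithmetic of Step 2 can be checked with exact fractions. The sketch below hard-codes only the probabilities listed above that are needed for Xnew; the dictionary layout and function name are choices made for this illustration.

```python
from fractions import Fraction as F

# Priors and the conditional probabilities needed for Xnew, taken from Step 1.
PRIOR = {"yes": F(9, 14), "no": F(5, 14)}
COND = {
    "yes": {"sunny": F(2, 9), "cool": F(3, 9), "high": F(3, 9), "strong": F(3, 9)},
    "no": {"sunny": F(3, 5), "cool": F(1, 5), "high": F(4, 5), "strong": F(3, 5)},
}


def score(label, sample):
    """P(label) multiplied by the product of P(value | label) over the sample."""
    result = PRIOR[label]
    for value in sample:
        result *= COND[label][value]
    return result


X_NEW = ["sunny", "cool", "high", "strong"]
SCORES = {label: score(label, X_NEW) for label in PRIOR}
PREDICTION = max(SCORES, key=SCORES.get)
```

The two scores are not normalized probabilities; only their comparison matters for the decision.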
3.1.3 Application to text classification

To apply Naive Bayes to text classification, the training documents must first be preprocessed and vectorized, using the methods described in the earlier sections. Because Naive Bayes works with document probabilities and feature probabilities, we vectorize by counting word occurrences (word frequency weighting).

After vectorization, features must be selected for the training documents. There are many ways to do this, for example using statistical measures (see [1]), heuristics, or a dictionary.

After feature selection, the training algorithm is run. The steps can be summarized as follows.

Step 1: Training

From the training set, extract the vocabulary (the features).

Compute the probabilities $P(C_i)$ and $P(x_k \mid C_i)$:

$$P(C_i) = \frac{|docs_i|}{|total\ docs|}$$

$docs_i$: the number of training documents that belong to class $C_i$.

$total\ docs$: the number of documents in the training set.

$$P(x_k \mid C_i) = \frac{n_k}{|Text_i|}$$

or

$$P(x_k \mid C_i) = \frac{n_k + 1}{n + |Text_i|}$$

(smoothing with Laplace's law)

n: the number of distinct words in class $C_i$.

$n_k$: the number of occurrences of word $x_k$ in class $C_i$.

$|Text_i|$: the total number of word tokens (counting repetitions) in class $C_i$.

Step 2: Classification

$$\arg\max_{C_i} \left( P(C_i) \prod_{k \in positions} P(x_k \mid C_i) \right)$$

positions: the vocabulary of the training set.
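The two steps can be sketched as follows, using the Laplace-smoothed estimate with n, n_k and |Text_i| as defined above. Several points are assumptions of this sketch and not statements of the report: scores are accumulated in log space to avoid underflow, each word contributes once per occurrence in the new document, and the tiny corpus is hypothetical.

```python
import math
from collections import Counter


def train(docs):
    """docs: list of (Counter of word counts, class label) pairs."""
    class_docs = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_docs}
    for counts, label in docs:
        word_counts[label].update(counts)
    priors = {label: class_docs[label] / len(docs) for label in class_docs}
    return priors, word_counts


def cond_prob(word, label, word_counts):
    n_k = word_counts[label][word]               # occurrences of the word in the class
    n = len(word_counts[label])                  # distinct words in the class
    text_i = sum(word_counts[label].values())    # all word tokens in the class
    return (n_k + 1) / (n + text_i)              # Laplace smoothing


def classify(counts, priors, word_counts):
    def log_score(label):
        total = math.log(priors[label])
        for word, freq in counts.items():
            total += freq * math.log(cond_prob(word, label, word_counts))
        return total
    return max(priors, key=log_score)


# Hypothetical training corpus over the vocabulary var, bit, chip, log.
DOCS = [
    (Counter(var=5, log=1), "math"),
    (Counter(bit=4, chip=6), "comp"),
    (Counter(var=3, log=4), "math"),
    (Counter(chip=2, bit=5, log=1), "comp"),
]
PRIORS, WORD_COUNTS = train(DOCS)
```

Smoothing matters here because a word never seen in a class would otherwise force the whole product for that class to zero.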

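Consider an example. The training documents below have been vectorized (by simply counting occurrences) and reduced to the selected features.

Vocabulary (features): var, bit, chip, log

Table columns: Docs, Var, Bit, Chip, Log, Class. Some cells did not survive, so the values of each row are listed in their original order without column alignment:

Doc1: 42, 25, 56; class Math
Doc2: 10, 28, 45; class Comp
Doc3: 11, 25, 22; class Comp
Doc4: 33, 40, 48; class Math
Doc5: 28, 32, 60; class Math
Doc6: 22, 30; class Comp

Training step:
Compute the probability of each class $C_i$ in the training set:

P(Comp) = 3/6 = 0.5, P(Math) = 3/6 = 0.5

Compute the probabilities $P(x_k \mid C_i)$:

Class C1 = Comp
Total = 208
[Equations: $P(var \mid Comp)$, $P(bit \mid Comp)$, $P(chip \mid Comp)$, $P(log \mid Comp)$]

Class C2 = Math
Total = 388
[Equations: $P(var \mid Math)$, $P(bit \mid Math)$, $P(chip \mid Math)$, $P(log \mid Math)$]

Classification step: given a new document Docnew with a known feature vector, determine its class.

[Equations: feature vector of Docnew and the scores computed for Comp and Math]

Result:
The document Docnew belongs to class Math, since max(Pnew) = 598.62.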

3.2 Decision trees

3.2.1 Concept

A decision tree is a tree structure in which:

Each internal node corresponds to a test on an attribute.

Each branch represents an outcome of the test.

Leaf nodes represent classes or class distributions.

The topmost node of the tree is the root node.

Figure 3.1: A decision tree for classifying whether a customer buys a computer

3.2.2 Tree-building algorithm

3.2.2.1 The ID3 algorithm

General outline of decision tree induction:
1. Choose the best attribute according to a given selection measure.
2. Extend the tree by adding a new branch for each value of the attribute.
3. Sort the training samples into the leaf nodes.
4. If the samples are cleanly classified, stop; otherwise repeat steps 1-4 for the leaf nodes.
5. Prune unstable leaf nodes.

Figure 3.2: A detailed example of a decision tree

The ID3 strategy for building a decision tree in detail:

Start with a single node representing all the samples.

If the samples all belong to the same class, the node becomes a leaf and is labeled with that class.

Otherwise, use an attribute selection measure to choose the attribute that best separates the samples into classes.

Create one branch for each value of the chosen attribute and partition the samples accordingly.

Apply the same process recursively to build the subtree at each branch.

The process stops as soon as either of the following conditions holds:

o All samples at a given node belong to the same class.
o No attributes remain on which the samples can be partitioned further.
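3.2.2.2 Measures used by the algorithm

Entropy: characterizes the impurity of an arbitrary collection of samples.

$$Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

where:
o S: the set of samples (the training set)
o c: the number of classes in the samples
o $p_i$: the proportion of samples that belong to class $c_i$

Information gain: measures the expected reduction in entropy caused by splitting on an attribute A.

$$Gain(S, A) = Entropy(S) - \sum_{v \in Value(A)} \frac{|S_v|}{|S|} Entropy(S_v)$$

where:
o Value(A): the set of possible values of attribute A.
o $S_v$: the subset of S for which A has value v.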
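3.2.2.3 Example

Return to the tennis example used for Naive Bayes:

[Table: training data for the tennis example]

Call the samples of class Yes positive and the samples of class No negative. There are 9 positive and 5 negative samples, written [9+, 5-].

Following ID3, the first node represents all the samples of the training set.

Entropy of node S:

$$Entropy(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940$$

Information gain of the attributes of the training set, for example for Wind:

$$Gain(S, Wind) = Entropy(S) - \sum_{v \in \{weak, strong\}} \frac{|S_v|}{|S|} Entropy(S_v) = 0.940 - \frac{8}{14}(0.811) - \frac{6}{14}(1.000) \approx 0.048$$

where $S_{weak} = [6+, 2-]$ and $S_{strong} = [3+, 3-]$.

The same computation for the other attributes gives:

Gain(S, Outlook) ≈ 0.247
Gain(S, Humidity) ≈ 0.152
Gain(S, Wind) ≈ 0.048
Gain(S, Temp) ≈ 0.029

Based on these results, Outlook is chosen as the attribute on which to split the tree.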
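The entropy and information gain of every attribute can be recomputed from class counts alone. The sketch below derives the (yes, no) counts for each attribute value from the conditional probabilities listed in section 3.1.2 (for example P(sunny|yes) = 2/9 and P(sunny|no) = 3/5 give 2 positive and 3 negative samples for sunny); the function and variable names are choices made for this illustration.

```python
import math


def entropy(counts):
    """Entropy in bits of a node given its per-class sample counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def gain(parent, subsets):
    """Information gain of a split of `parent` into `subsets` of class counts."""
    total = sum(parent)
    remainder = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - remainder


S = (9, 5)  # [9+, 5-]
SPLITS = {
    "Outlook": [(2, 3), (4, 0), (3, 2)],   # sunny, overcast, rain
    "Temp": [(2, 2), (3, 1), (4, 2)],      # hot, cool, mild
    "Humidity": [(6, 1), (3, 4)],          # normal, high
    "Wind": [(6, 2), (3, 3)],              # weak, strong
}
GAINS = {attr: gain(S, subsets) for attr, subsets in SPLITS.items()}
BEST = max(GAINS, key=GAINS.get)
```

The attribute stored in BEST is the one ID3 would place at the root.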

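Continue in the same way for the two subsets:

Ssunny = {D1, D2, D3, D9, D11}
Srain = {D4, D5, D6, D10, D14}

This finally yields the complete decision tree:

[Figure: the complete decision tree for the tennis example]

Use the decision tree to classify the following new sample:

Xnew = {sunny, cool, high, strong}

Starting at the root node, which tests the attribute Outlook, Xnew has the value sunny, so we follow the left branch. At the Humidity node, Xnew has the value high, so we again follow the left branch => Xnew belongs to class No.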
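3.2.3 Application to text classification

3.2.3.1 Document representation

As discussed earlier, there are many ways to represent (vectorize) a document. In the reference text, the authors use a simple representation. Let $\vec{x}_j = (w_{1j}, \dots, w_{Kj})$ be the vector representing document j, where the $w_{ij}$ are the weights of the features in the document. The weight of a feature is computed as:

$$w_{ij} = round\left(10 \times \frac{1 + \log(tf_{ij})}{1 + \log(l_j)}\right)$$

where:

$tf_{ij}$: the frequency of word i in document j
$l_j$: the length of document j
If word i does not occur in the document, $w_{ij}$ is set to 0.

For example, if the word "profit" occurs 6 times in a document of length 89 words, its weight is $10 \times \frac{1 + \log 6}{1 + \log 89} \approx 5.09$, which is rounded to 5.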
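The worked example can be reproduced with a short function. This sketch assumes the weight is a length-normalized, log-damped term frequency scaled to the range 0 to 10 and computed with the natural logarithm, which is the base that yields the value 5 for the "profit" example; the function name is hypothetical.

```python
import math


def term_weight(tf, doc_length):
    """Integer weight of a word occurring tf times in a document of doc_length words."""
    if tf == 0:
        return 0                        # word absent from the document
    return round(10 * (1 + math.log(tf)) / (1 + math.log(doc_length)))
```

Dividing by the log of the document length keeps long documents from receiving systematically larger weights than short ones.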

A representation method alone does not settle how many features to use, or which ones. A simple option is to take the words that occur most often in the documents as features, keeping their number reasonable. In text classification, however, the most frequent words are not necessarily good features, so a better measure is needed to decide which features to keep.

In this chapter, the authors use the $\chi^2$ measure (the chi-square test): the 20 features with the highest $\chi^2$ scores are used to represent the documents. We do not go into the details of this measure here; see section 5.3.3 of the book.

Figure 3.3: An example of a document representation


3.2.3.2 Training phase

As described in the algorithm section, building the classification tree starts from a root node that contains all the documents to be classified. In the authors' experiment, 7681 training documents are used and the goal is to decide whether a document belongs to the topic "earnings". Of these documents, 2304 belong to the topic and the remaining 5377 do not.

The root node is represented as follows:

[Figure: the root node (node 1)]

where:

P(c|n1): the probability that a document at node 1 belongs to class c. The probability is computed as

$$P(c \mid n) = \frac{|\{d \in n : d \text{ belongs to } c\}| + 1}{|\{d \in n\}| + |C|}$$

(|C| is the number of classes)

To expand the root node, we compute the information gain of each feature of the documents. In the earlier example the attributes took discrete values, whereas in a text corpus the features are continuous quantities. How, then, should the value on which to split the corpus be chosen? This problem has no exact solution; in practice a heuristic search is used to find the best split value for each attribute. The information gain of a feature is computed as:

$$G(a, y) = H(t) - \left( p_L H(t_L) + p_R H(t_R) \right)$$

where:

G(a,y): the information gain of feature a with split value y.

H(t): the entropy at the parent node (the node under consideration).

pL: the proportion of elements passed to the left node.

pR: the proportion of elements passed to the right node.

H(tL): the entropy of the left node.

H(tR): the entropy of the right node.

For example, suppose that in the corpus above we select the feature "cts" and find that its best split value is 2. The left branch then contains the documents whose value is < 2 and the right branch those whose value is >= 2. The information gain of "cts" at split value 2 is computed from the following quantities:

Entropy of node 1: $H(t)$

Entropy of the left node: $H(t_L)$

Entropy of the right node: $H(t_R)$

The information gain of "cts" at split value 2 is then:

$$G(cts, 2) = H(t) - \left( p_L H(t_L) + p_R H(t_R) \right)$$

Repeating the computation for the remaining features shows that the information gain of "cts" is the highest. "cts" is therefore chosen as the splitting feature, with split value 2. The decision tree grows as follows:

[Figure: the decision tree after the first split on "cts"]

The same computation is repeated for the left and right branches, and the tree keeps growing until every node clearly identifies the class of the documents it contains. A node that contains only documents of a single class is a leaf node, and reaching such nodes is the stopping condition of the algorithm.

Figure 3.4: A simple decision tree


Once the decision tree has been fully grown, however, it must be pruned to avoid overfitting. Overfitting means that the tree classifies the training data correctly but has difficulty classifying other data correctly. There are two approaches to pruning: pre-pruning and post-pruning.

Pre-pruning: stop growing the tree before it reaches the point where it perfectly classifies the training data.

Post-pruning: grow the complete tree and prune it afterwards.

Although the first approach may seem the more accurate one, post-pruning has proved more useful in practice, because in the first approach it is hard to estimate precisely when to stop growing the tree.

Whichever approach is chosen, the question remains which criterion determines the final size of the tree. Two approaches are reduced-error pruning (Quinlan 1987) and rule post-pruning (C4.5, Quinlan 1993). Evaluating the effect of pruning requires a separate evaluation corpus, called the validation set, which is taken out of the training data. In the authors' experiment, with 9603 articles in total, 80% (7681) are used for training and 20% for validating the pruning.

Reduced-error pruning: every node of the tree is considered as a candidate for pruning. Pruning a node means removing the subtree rooted at that node, turning the node into a leaf, and assigning it the most common class among the training samples associated with it. A node is pruned only if the resulting tree performs no worse than the original. Pruning is repeated, always choosing the node whose removal most increases the accuracy of the decision tree, and continues until further pruning is harmful (reduces the accuracy of the tree).

Rule post-pruning: convert the learned decision tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node. Prune each rule by removing any condition whose removal improves accuracy. Sort the rules by accuracy and consider them in that order when classifying subsequent cases.

3.2.3.3 Cross-validation

In practice, splitting the training corpus into two fixed parts, 80% for training and 20% for validating the pruning, lowers training accuracy, and the pruning decisions made on a single validation set are not reliable either. Cross-validation is commonly used to address this.

In cross-validation we build not one decision tree but several. Each run uses the same corpus but holds out a different 20% of it: 80% of the data is used to grow the tree and the remaining 20% to prune it, and the next run does the same with a different 20%. This gives 5 training runs (5-fold cross-validation).

Once the 5 pruned trees are available, their average size is computed. The tree is then retrained on 100% of the data and pruned to this average size, which yields the final decision tree. As a last step, a fresh test set is used to evaluate the accuracy of the final tree.

Figure 3.5: Use of the corpus in cross-validation
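The way 5-fold cross-validation rotates the held-out 20% can be sketched with index lists. This is only an illustration of the data split: the function name is hypothetical, the folds are taken in order without shuffling, and growing and pruning the trees themselves is left out.

```python
def k_fold_splits(n_items, k=5):
    """Return k (training indices, held-out indices) pairs covering range(n_items)."""
    indices = list(range(n_items))
    fold_size = n_items // k
    splits = []
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder when n_items is not divisible by k.
        end = start + fold_size if fold < k - 1 else n_items
        held_out = indices[start:end]
        training = indices[:start] + indices[end:]
        splits.append((training, held_out))
    return splits


SPLITS_5 = k_fold_splits(10, k=5)
```

Every document is held out exactly once, so each pruned tree is validated on data it was not grown on.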


3.2.3.4 Classification phase

Rules are read off the decision tree by following each branch from the root node to a leaf node. These rules are then applied to classify documents.

3.3 Maximum entropy modeling

3.3.1 Entropy

3.3.1.1 Concept

Information entropy extends the concept of entropy from thermodynamics to information theory (Claude E. Shannon, 1948).

Entropy is a quantity that measures the uncertainty of an event or of a given probability distribution.

Examples:

A text that consists only of the character "a" has entropy 0, because the next character is always "a".

A text that consists of the two characters 0 and 1 in a completely random order has an entropy of 1 bit per character.

3.3.1.2 Entropy of a random variable

The entropy of a random variable X is the expected value of the surprisal of the values that X can take.

Consider a random variable X that takes M values with probabilities $p_1, p_2, \dots, p_M$.

The entropy of X is:

$$H(X) = -\sum_{i=1}^{M} p_i \log_2 p_i$$

Remarks:

The more skewed a probability distribution is (with some very small and some very large probabilities), the lower the uncertainty and the lower the entropy.

The more uniform a probability distribution is, the greater the uncertainty and the higher the entropy.

Maximum entropy theorem:
$H(p_1, p_2, \dots, p_M) \le \log_2(M)$
with equality if and only if $p_1 = p_2 = \dots = p_M = 1/M$, in which case the entropy reaches its maximum.
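The two character-stream examples and the maximum entropy bound can be checked numerically. The sketch below measures entropy from the empirical character frequencies of a string, which is an assumption of this illustration (a short string only approximates a random source); the function names are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(probabilities):
    """Entropy in bits of a discrete distribution; zero probabilities contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


def char_entropy(text):
    """Entropy per character, estimated from the character frequencies of `text`."""
    counts = Counter(text)
    return entropy_bits(c / len(text) for c in counts.values())
```

A uniform distribution over M values reaches the bound log2(M), while any skewed distribution over the same values stays below it.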
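3.3.2 Application to text classification

3.3.2.1 Document representation

Documents are represented with the same method as above. In this experiment, the authors again represent each document as a 20-dimensional vector over the 20 words with the highest $\chi^2$ scores, where each component is the tf.idf weight of the corresponding word:

$$\vec{x}_j = (w_{1j}, \dots, w_{20,j})$$

3.3.2.2 Feature functions and constraints

In a maximum entropy model, the training data is used to set up constraints. Each constraint expresses a characteristic of the training data and is defined through a feature function. The general form of a feature function is:

$$f_i(\vec{x}, c)$$

For this classification task, feature functions are defined on pairs $(\vec{x}, c)$, where $\vec{x}$ is the vector representing a training document and c is a value in the set of classes C:

$$f_i(\vec{x}_j, c) = \begin{cases} 1 & \text{if } w_{ij} > 0 \text{ and } c = 1 \\ 0 & \text{otherwise} \end{cases}$$

where $w_{ij}$ is the weight of the i-th word in document j.

Here we use binary feature functions that record whether or not a word is present in the document. Feature functions that reflect the magnitude of the weight could be used instead.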
3.3.2.3 Notation

S: the training set

$\tilde{p}(x)$: the empirical probability of x in S

$p(x)$: the probability of x

$E_{\tilde{p}} f_i$: the empirical expectation of feature $f_i$

$E_p f_i$: the expectation of feature $f_i$

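3.3.2.4 Model

There are several model families for maximum entropy classification. Here we use the loglinear model:

$$p(\vec{x}, c) = \frac{1}{Z} \prod_{i=1}^{K} \alpha_i^{f_i(\vec{x}, c)}$$

where:

K: the number of feature functions.

$\alpha_i$: the weight (value) of feature function $f_i$.

Z: the normalization constant.

The task is to determine the optimal weights $\alpha_i$ and then use them for classification. In the simplest case, to classify a new document $\vec{x}_{new}$ we use the trained weights $\alpha_i$ to compute $p(\vec{x}_{new}, 0)$ and $p(\vec{x}_{new}, 1)$, and choose the class with the larger probability as the class of $\vec{x}_{new}$.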


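3.3.2.5 Training procedure: generalized iterative scaling

Generalized iterative scaling (GIS) is a procedure for finding the probability distribution p of loglinear form that has maximum entropy. Its main goal is to find the optimal weights $\alpha_i$ that satisfy the constraints:

$$E_p f_i = E_{\tilde{p}} f_i$$

The expectation is defined as:

$$E_p f_i = \sum_{\vec{x}, c} p(\vec{x}, c) \, f_i(\vec{x}, c)$$

and is approximated by:

$$E_p f_i \approx \frac{1}{N} \sum_{j=1}^{N} \sum_{c} p(c \mid \vec{x}_j) \, f_i(\vec{x}_j, c)$$

where

$$p(c \mid \vec{x}) = \frac{\prod_i \alpha_i^{f_i(\vec{x}, c)}}{\sum_{c'} \prod_i \alpha_i^{f_i(\vec{x}, c')}}$$

The empirical expectation is defined as:

$$E_{\tilde{p}} f_i = \sum_{\vec{x}, c} \tilde{p}(\vec{x}, c) \, f_i(\vec{x}, c) = \frac{1}{N} \sum_{j=1}^{N} f_i(\vec{x}_j, c_j)$$

where N is the number of documents in the training set. Note that these two expectations of the feature functions are not binary values but real values that measure how strongly the constraints apply.

The algorithm requires the sum of the feature values for every vector $\vec{x}$ and every class c to be a constant C:

$$\forall \vec{x}, c: \quad \sum_i f_i(\vec{x}, c) = C$$

To meet this requirement, C is defined as the largest value of this sum:

$$C = \max_{\vec{x}, c} \sum_{i=1}^{K} f_i(\vec{x}, c)$$

Since not every sum of feature values equals the constant C, one more feature function $f_{K+1}$ is added to the set:

$$f_{K+1}(\vec{x}, c) = C - \sum_{i=1}^{K} f_i(\vec{x}, c)$$

The training procedure runs as follows:

1. Initialize the weights $\{\alpha_i^{(1)}\}$ (with $1 \le i \le K+1$; the superscript (1) marks the weights of the first iteration and is incremented on each pass). Any initial value can be used, but the usual choice is 1, that is, $\{\alpha_i^{(1)} = 1\}$.

Compute the empirical expectation $E_{\tilde{p}} f_i$ of every feature function.

Set n = 1.

2. Compute the probability $p^{(n)}(\vec{x}, c)$ of every document vector with every class, using the current weights $\alpha_i^{(n)}$:

$$p^{(n)}(\vec{x}, c) = \frac{1}{Z} \prod_{i=1}^{K+1} \left(\alpha_i^{(n)}\right)^{f_i(\vec{x}, c)}$$

3. Compute the expectation $E_{p^{(n)}} f_i$ (with $1 \le i \le K+1$) of every feature function with the approximation given above:

$$E_{p^{(n)}} f_i \approx \frac{1}{N} \sum_{j=1}^{N} \sum_{c} p^{(n)}(c \mid \vec{x}_j) \, f_i(\vec{x}_j, c)$$

4. Update the weights $\alpha_i$:

$$\alpha_i^{(n+1)} = \alpha_i^{(n)} \left( \frac{E_{\tilde{p}} f_i}{E_{p^{(n)}} f_i} \right)^{1/C}$$

5. If the weights have converged, stop; otherwise increase n by 1 and return to step 2.

The result of the algorithm is the set of weights $\alpha_i$.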
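The GIS loop can be sketched on a toy problem. Everything concrete here is an assumption of the illustration: the binary features fire when a word is present and the class is 1, the two classes are coded 0 and 1, the loop runs a fixed number of iterations where a convergence test would normally be used, and the nine three-word documents are hypothetical. The toy data is chosen so that every feature has a nonzero empirical expectation, which the update in step 4 needs.

```python
import math

CLASSES = (0, 1)


def base_features(x, c):
    """f_i(x, c) = 1 if word i is present in x and c == 1, otherwise 0."""
    return [1.0 if (w > 0 and c == 1) else 0.0 for w in x]


def make_features(docs):
    """Extend the base features with the correction feature f_{K+1} = C - sum(f_i)."""
    big_c = max(sum(base_features(x, c)) for x in docs for c in CLASSES)

    def features(x, c):
        f = base_features(x, c)
        return f + [big_c - sum(f)]

    return features, big_c


def conditional(x, alphas, features):
    """p(c | x) for each class under the loglinear model."""
    scores = [math.prod(a ** fi for a, fi in zip(alphas, features(x, c)))
              for c in CLASSES]
    z = sum(scores)
    return [s / z for s in scores]


def expectations(docs, labels, alphas, features):
    """Empirical and (approximate) model expectations of every feature."""
    n = len(docs)
    empirical = [0.0] * len(alphas)
    model = [0.0] * len(alphas)
    for x, y in zip(docs, labels):
        for i, fi in enumerate(features(x, y)):
            empirical[i] += fi / n
        for c, p in zip(CLASSES, conditional(x, alphas, features)):
            for i, fi in enumerate(features(x, c)):
                model[i] += p * fi / n
    return empirical, model


def gis(docs, labels, iterations=2000):
    features, big_c = make_features(docs)
    alphas = [1.0] * (len(docs[0]) + 1)           # step 1: all weights start at 1
    for _ in range(iterations):
        empirical, model = expectations(docs, labels, alphas, features)
        alphas = [a * (e / m) ** (1.0 / big_c)    # step 4: multiplicative update
                  for a, e, m in zip(alphas, empirical, model)]
    return alphas, features


def classify(x, alphas, features):
    probs = conditional(x, alphas, features)
    return CLASSES[probs.index(max(probs))]


# Hypothetical word-presence vectors and their classes.
DOCS = [[1, 1, 0], [1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1],
        [1, 0, 1], [0, 1, 0], [0, 1, 0], [0, 0, 1]]
LABELS = [1, 1, 0, 0, 0, 1, 0, 1, 1]
ALPHAS, FEATURES = gis(DOCS, LABELS)
```

After training, the model expectations of the features should match the empirical ones closely, which is exactly the constraint GIS is designed to satisfy.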
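3.3.2.6 Classification phase

Given a new document vector $\vec{x}_{new}$, use the trained weights to compute the two probabilities $P(0 \mid \vec{x}_{new})$ and $P(1 \mid \vec{x}_{new})$ with the formula:

$$P(c \mid \vec{x}) = \frac{\prod_i \alpha_i^{f_i(\vec{x}, c)}}{\sum_{c'} \prod_i \alpha_i^{f_i(\vec{x}, c')}}$$

After the two probabilities have been computed, the class with the higher probability is assigned to the new document.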

5. References

[1] Christopher D. Manning, Hinrich Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, 1999.
[2] Ho Van Quan, Faculty of Information Technology, Ho Chi Minh City University of Technology, Lecture Notes on Information Theory.
[3] Kostas Fragos, Yannis Maistros, Christos Skourlas, A Weighted Maximum Entropy Language Model for Text Classification.
[4] Kamal Nigam, John Lafferty, Andrew McCallum, Using Maximum Entropy for Text Classification.
