
Science & Technology Development, Vol 12, No. 07 - 2009

GRAPH-BASED MODEL FOR TEXT REPRESENTATION

Nguyen Hoang Tu Anh, Nguyen Tran Kim Chi, Nguyen Hong Phi
University of Science, VNU-HCM
(Manuscript received April 9, 2008; revised September 26, 2008)

ABSTRACT: Text representation is a very important preprocessing step in many fields, such as text mining, information retrieval and natural language processing. This paper surveys models that represent text as graphs. A graph model can retain structural information such as the position, order of occurrence and proximity of words, all of which are discarded by the traditional vector space model. We build an experimental Vietnamese text classification system based on the graph representation of text.

Keywords: graph model, text representation, text classification.

1. INTRODUCTION

Today, representation models are used to solve most text-related problems. They act as an intermediary between natural language in textual form and the programs that process it in text mining, information retrieval and natural language processing. Once re-expressed, a text becomes an intuitive, simple data structure that can be processed. Representation models have therefore developed continuously, capturing more of what people want to express while also improving efficiency of use.

Traditional text representation models, namely the bag-of-words model and the vector space model, are the most widely used. The vector space model [7] represents a text as a feature vector over the terms (words) that occur in the whole document collection. Feature weights are usually computed with the TF*IDF measure. However, this model does not capture important structural information such as the order in which words occur, the neighborhood of a word, or the position of a word in the text.

To overcome these limitations, graph models have been proposed. They are considered promising because they exploit important structural information that the bag-of-words and vector space models ignore. The graph model for text representation, specifically the conceptual graph model (Conceptual Graphs, CGs), was first presented by John F. Sowa in 1976 [9]. Since then, graph models have developed continuously from the idea of CGs, have been applied to a wide range of text processing problems, and have become quite diverse. For each type of problem, the most suitable components of the text become the vertices of the graph, and the most effective relationship between vertices is chosen to build its edges. A vertex can represent a sentence, a word, or a combination of sentences and words. An edge can express different relationships between vertices, such as order of occurrence, co-occurrence frequency, position of occurrence, or similarity.

The purpose of this paper is to study and systematize the variants of the graph-based text representation model in order to give readers an overview of it. In addition, we apply the graph representation experimentally to the problem of Vietnamese text classification. The rest of the paper is organized as follows. Section 2 gives an overview of graph-based text representation models. Section 3 introduces a text classification system that combines the graph model with a frequent subgraph mining algorithm. Section 4 presents the experimental results of the system, followed by the conclusion.


2. MODELING TEXT AS GRAPHS

A number of text processing studies around the world now use graph models. These models are fairly diverse, and each has its own distinguishing features. After studying and synthesizing them, we introduce the main graph models for text representation, which share the following general properties. Each graph represents one document or a set of documents. A vertex can be a sentence, a word, or a combination of sentences and words. The edges connecting vertices are undirected or directed and express the relationships in the graph. A vertex label is usually the occurrence frequency of the vertex. An edge label is the name of the conceptual link between two vertices, the co-occurrence frequency of the two vertices within some scope, or the name of the region in which the vertex occurs.

For example, in information extraction the vertices are words [11] or words combined with sentences [14], and the edges express co-occurrence frequency. In text classification the vertices are words, and the edges express the order of occurrence of words or the position of words in the text [1] [5] [8]. In text summarization the vertices are sentences, and the edges express the similarity between sentences [6].

Because words retain the most structural information, graph models whose vertices are words have been studied in the most depth and have the most variants. We group the main graph models as follows:

- Graph models whose vertices are the words of the text (models 1 to 10).
  - Graph models that use a semantic network (models 1, 2, 3). The advantage of this group is that it models text intuitively and logically, expresses the semantic relationships between concepts, and gives more accurate information retrieval results.
  - Graph models that do not use a semantic network (models 4 to 10). This group exploits the structural information of the text (order of occurrence, position, and neighborhood of words) quickly and simply, and does not depend on a semantic network, so classification and clustering applications are easy to implement.
- Graph models whose vertices are sentences (model 11). The strength of this model is that it stores the links between sentences and the order in which sentences occur, and it supports well the unsupervised selection of important sentences for a summary.
- Graph models whose vertices are sentences and words (model 12). This model exploits the relationship between words and sentences, as well as the co-occurrence of words within sentences, to improve information extraction.

Table 1 summarizes the main characteristics and basic application areas of the graph-based text representation models.

Some of the models introduced above extend other models. For example, the standard graph extends the simple graph, and the n-distance graph extends the simple n-distance graph, with the edge label being the position of the word in the document structure. Below we describe in detail several representative models whose vertices represent words: the conceptual graph, the star graph, the undirected frequency graph, the simple graph, and the simple n-distance graph.


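Table 1. Description of graph-based text representation models

| No. | Model name | Vertex types | Vertex meaning | Vertex label | Edge meaning | Directed | Edge label | Application area |
|---|---|---|---|---|---|---|---|---|
| 1 | Conceptual graph (CGs) | 2 | Word | No | Concept link | | No | Information retrieval, database design |
| 2 | Improved undirected CGs | 1 | Word | No | Concept link | No | No | Web information search |
| 3 | Improved conceptual graph | 1 | Word | No | Concept link | | Yes (grammatical structure) | Document clustering |
| 4 | Star graph | | Word / structure | Yes (occurrence frequency) | Link between a word and the central structure vertex | | Yes (position of the word in the document structure) | Email classification |
| 5 | Undirected frequency graph | | Word | Yes (occurrence frequency) | Link between words that co-occur in a structure | No | Yes (co-occurrence frequency) | Web information search |
| 6 | Simple graph | | Word | Yes (word name) | Word a occurs immediately before word b | Yes | No | Document classification and clustering |
| 7 | Simple n-distance graph | | Word | No | Fewer than n words lie between word a and the later word b | Yes | No | Document classification |
| 8 | n-distance graph | | Word | No | Fewer than n words lie between word a and the later word b | Yes | Yes (number of words between a and b, plus 1) | Document classification |
| 9 | Standard graph | | Word | Yes (word name) | Word a occurs immediately before word b | Yes | Yes (position of the word in the document structure) | Document classification and clustering |
| 10 | Frequency graph | | Word | Yes (occurrence frequency) | Word a occurs immediately before word b | Yes | Yes (frequency with which the two words occur consecutively) | Document classification |
| 11 | Sentence-vertex graph | | Sentence | Yes (vertex weight) | Link between two sentences that share a word | Yes/No | Yes (similarity between the two sentences) | Text summarization |
| 12 | Bipartite graph | | Sentence, word | No | The word occurs in the sentence | No | Yes (occurrence frequency of the word in the sentence) | Information extraction |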

2.1. Conceptual graph model (Conceptual Graphs - CGs)

The conceptual graph model uses a semantic network to represent text as a graph. Each word in the text is a concept and is drawn as a square vertex. Oval vertices express the relationships between concepts. Square vertices are connected to one another according to the relationships in the semantic network, with an oval vertex as the intermediary. The advantage of CGs is that they model text intuitively, accurately and logically. Their limitation is that they are fairly complex, require deep and specialized semantic analysis, and depend on the domain.


Example 1: Consider the sentence "John is going to Boston by bus."

Figure 1. Example of the conceptual graph model [15]

The conceptual graph model represents this sentence as shown in Figure 1, where the concepts are [Go], [Person: John], [City: Boston] and [Bus], and the relations are (Agnt) for the agent, (Dest) for the destination and (Inst) for the instrument.

2.2. Star graph model

In a star graph, the central vertex summarizes the structure of the document. Once the central vertex has been established, the remaining vertices are added. Apart from the central vertex, every vertex represents a word in the text. A word vertex is connected to the central vertex by an edge for each region of the document in which the word occurs. The edges are labeled and express the relationship between the vertices they connect. For example, when modeling a document, the edge labels can be "title" and "contains", as in Figure 2. When applied to classification in general, and to email classification in particular, the strength of the star graph model is that it captures the structural information of an email (title part, body part) and the relationship between words and these structural parts (co-occurrence of words in the title, the body, and so on).
Figure 2. Example of the star graph model (a central "Document" vertex connected to word vertices by edges labeled "title" or "contains")
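The star graph construction described in Section 2.2 can be sketched as follows. This is a minimal illustration, not the implementation of any cited system: the function name `build_star_graph`, the whitespace tokenization and the sample sections are all assumptions made for the example.

```python
def build_star_graph(sections, center="Document"):
    """Return labeled edges (word, section_label, center).

    `sections` maps a structural part of the document (for example
    "title" or "contains") to its text. Each distinct word in a part
    gets one edge to the central vertex, labeled with that part.
    Tokenization by whitespace is an assumption of this sketch.
    """
    edges = set()
    for label, text in sections.items():
        for word in text.lower().split():
            edges.add((word, label, center))
    return edges


# Hypothetical document with a title part and a body part.
email = {
    "title": "global warming warning",
    "contains": "climate warning global temperature",
}
star = build_star_graph(email)
```

A word that occurs in both parts, such as "warning" here, ends up with two edges to the central vertex, one per label, which is how the model records co-occurrence of a word across structural parts.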

2.3. Undirected graph model using occurrence frequency

In the undirected graph model using occurrence frequency, both vertices and edges are labeled, and each label is the occurrence frequency of the corresponding vertex or edge. A vertex label is the occurrence frequency of the word in the text. An edge connects two vertices if the two words co-occur within a unit (a sentence, a group of words, or a page) and their co-occurrence frequency exceeds a given threshold. The edge label is the co-occurrence frequency of the two words within that unit. Figure 3 shows an example of this model. Its advantage is that it exploits word-to-word relationships in the document structure as well as word frequencies, and it supports fast information search.

Figure 3. Example of the undirected graph model using occurrence frequency [11]
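A minimal sketch of the undirected frequency graph of Section 2.3 is given below. It assumes that the co-occurrence unit is the sentence and that words are whitespace tokens; the function name `build_frequency_graph` and the parameter `min_cooccurrence` are hypothetical names chosen for the example.

```python
from collections import Counter
from itertools import combinations


def build_frequency_graph(sentences, min_cooccurrence=1):
    """Build vertex and edge frequency labels from a list of sentences.

    Vertex label: number of times the word occurs in the text.
    Edge label: number of sentences in which the two words co-occur.
    An edge is kept only if its label exceeds `min_cooccurrence`.
    """
    vertex_freq = Counter()
    edge_freq = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        vertex_freq.update(words)
        # Sorting gives each unordered pair a single canonical key.
        for a, b in combinations(sorted(set(words)), 2):
            edge_freq[(a, b)] += 1
    edges = {pair: n for pair, n in edge_freq.items() if n > min_cooccurrence}
    return vertex_freq, edges


vertex_freq, edges = build_frequency_graph(
    ["graph model text", "graph model", "text mining graph"],
    min_cooccurrence=1,
)
```

Because the graph is undirected, each pair is stored once in sorted order, so ("graph", "model") and ("model", "graph") are the same edge.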

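2.4. Directed graph model with unlabeled edges

This model is also called the simple graph model [8]. Each vertex represents a distinct word and appears only once in the graph, even if the word occurs many times in the text. The vertex label is unique and is the word itself. After text preprocessing, if word a stands immediately before word b, there is an edge from vertex a to vertex b (except when the two words are separated by punctuation). The strength of the model is that it stores structural information such as the order of occurrence and position of words in the text, and it improves the effectiveness of text classification and clustering.

Example 2: Consider the sentence "Microsoft sẽ giới thiệu hệ điều hành Vista và trưng bày các công nghệ bổ trợ được xây dựng để cải tiến hệ điều hành" ("Microsoft will introduce the Vista operating system and showcase supporting technologies built to improve the operating system"). Figure 4 shows the representation of this text after function words and low-weight words have been removed.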
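Figure 4. Example of the simple graph model (vertices: "giới thiệu" (introduce), "hệ điều hành" (operating system), "Vista", "xây dựng" (build), "cải tiến" (improve))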
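The simple graph construction of Section 2.4 can be sketched as below. The sketch assumes whitespace tokens stand in for words (a real Vietnamese pipeline would need word or syllable segmentation first) and that stop-word removal happens before adjacency is computed; `build_simple_graph` and `stop_words` are hypothetical names.

```python
import re


def build_simple_graph(text, stop_words=frozenset()):
    """Return (vertices, edges) of the simple directed graph.

    Each distinct word is one vertex. There is an edge (a, b) when
    a stands immediately before b inside a punctuation-free segment,
    after the words in `stop_words` have been dropped.
    """
    vertices, edges = set(), set()
    for segment in re.split(r"[.,;:!?]", text.lower()):
        words = [w for w in segment.split() if w not in stop_words]
        vertices.update(words)
        edges.update(zip(words, words[1:]))
    return vertices, edges


vertices, edges = build_simple_graph(
    "Microsoft introduces Vista, Microsoft improves Vista."
)
```

Since vertex labels are unique, the whole graph is determined by its edge set, which is what later makes subgraph comparison cheap.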

2.5. Directed graph model with unlabeled edges, where an edge is a distance of n between two words in the text

This model is also known as the simple n-distance model. In this representation the user supplies a parameter n. Instead of considering only the word A that stands immediately before word B, we also consider the n words that precede B. An edge is built between two words when at most (n-1) words occur between them (except when the words are separated by punctuation). The advantage of the model is that it exploits the relationships between words and the neighborhood of a word within a sentence, and it can be applied to text classification.

Example 3: Consider the sentence "Cánh đồng lúa xanh bát ngát" ("The green rice field stretches out endlessly"). With n=2, Figure 5 shows the representation of this sentence.
Figure 5. Example of the simple n-distance graph model (vertices: "cánh đồng", "lúa", "xanh", "bát ngát")
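The edge rule of the simple n-distance model can be sketched as follows, applied to the words of Example 3. The sketch assumes the sentence has already been segmented into words and contains no punctuation; the function name `build_n_distance_graph` is hypothetical.

```python
def build_n_distance_graph(words, n):
    """Return the directed edges of the simple n-distance graph.

    An edge (a, b) exists when a precedes b with at most n - 1 words
    between them, i.e. b is one of the n words that follow a.
    `words` is one punctuation-free, already segmented sequence.
    """
    edges = set()
    for i, a in enumerate(words):
        for b in words[i + 1 : i + n + 1]:
            edges.add((a, b))
    return edges


words = ["cánh đồng", "lúa", "xanh", "bát ngát"]
edges_n2 = build_n_distance_graph(words, n=2)
edges_n1 = build_n_distance_graph(words, n=1)
```

With n=1 the rule reduces to the simple graph of Section 2.4, where only immediately adjacent words are connected.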

The remaining models are variants of the models above, with the differences described in Table 1.

3. VIETNAMESE TEXT CLASSIFICATION SYSTEM

Text classification is the process of assigning a text to one or more predefined topics. Vietnamese text classification is an important research area that has attracted attention recently. Vietnamese differs from English in that word boundaries are not simply spaces, so word segmentation is required first, and Vietnamese word segmentation is itself a hard problem. A second difficulty is that there is not yet a standard Vietnamese corpus, comparable to Reuters or Newsgroups, on which classification results can be compared. There has recently been notable progress on Vietnamese text classification [3] [10]. However, these studies are all based on the vector space model.

To exploit the advantages of the graph model, we build an experimental Vietnamese text classification system that represents text as graphs and uses a frequent subgraph mining algorithm to determine the features of each topic. To avoid depending on word segmentation, and because a word is made up of one or more syllables [2], we use syllables as the vertices of the graph. During training, the input of the system is a set of training documents D = {d1, d2, ..., dn} already divided by topic, and a set of topics C = {c1, c2, ..., cr}. During classification, the topic of a new document is determined from its similarity to the topic features. Figure 6 shows the main model of the classification system, in which:

- (b): The documents in D are modeled as a set of graphs G = {g1, g2, ..., gn}. We use the simple graph model, with each syllable as a vertex. Thanks to the graph model, splitting the text into syllables without word segmentation still preserves the structure of the words in the text.
- (c): Within each topic, we find the set of frequent subgraphs whose frequency exceeds the minimum support threshold minsupp. We use the gSpan algorithm [12] to find frequent subgraphs because it is regarded as fast and can be adapted to directed graphs. The most complex task in frequent subgraph mining is graph isomorphism, which has NP complexity when vertex labels are not unique. However, with the simple graph representation, where vertex labels are unique, the complexity of the algorithm drops to O(n^2).
- (d): Combining the subgraphs from all topics gives the set of frequent subgraphs S = {s1, s2, ..., sm}.
- (e): A feature vector is built for each topic as an m-dimensional binary vector over the set S. If a frequent subgraph in S appears in the set of frequent subgraphs of the topic, the corresponding component of the vector is 1; otherwise it is 0. This yields the set of binary feature vectors F = {f1, f2, ..., fr}.
- (g): A new document is represented as a graph and then converted into an m-dimensional binary vector v0 corresponding to the m frequent subgraphs in S. We use matching with the Dice measure [4] to compute the distance between v0 and each topic feature vector. The new document is assigned to the topic with the largest measure. The Dice measure between a topic feature vector and v0 is computed as:

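$$\mathrm{Dice}(v_0, f_j) = \frac{2\,|v_0 \cap f_j|}{|v_0| + |f_j|} \tag{1}$$

where f_j ∈ F; |v_0| and |f_j| are the total numbers of features with value 1 in v_0 and f_j respectively; and |v_0 ∩ f_j| is the number of features with value 1 in both vectors.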
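Formula (1) and the decision rule of step (g) can be sketched as follows for binary vectors stored as Python lists. The function names `dice` and `classify`, and the two toy topic vectors, are assumptions made for the example rather than part of the described system.

```python
def dice(v0, f):
    """Dice measure between two equal-length binary vectors (formula 1)."""
    common = sum(1 for a, b in zip(v0, f) if a == 1 and b == 1)
    total = sum(v0) + sum(f)
    # Two all-zero vectors share nothing; return 0.0 instead of dividing by zero.
    return 2 * common / total if total else 0.0


def classify(v0, topic_vectors):
    """Return the topic whose feature vector has the largest Dice measure."""
    return max(topic_vectors, key=lambda topic: dice(v0, topic_vectors[topic]))


# Hypothetical feature vectors over m = 5 frequent subgraphs.
topic_vectors = {
    "Sports": [1, 1, 0, 0, 1],
    "Health": [0, 0, 1, 1, 1],
}
v0 = [1, 0, 0, 0, 1]
predicted = classify(v0, topic_vectors)
```

Here v0 shares two features with "Sports" and one with "Health", giving Dice values of 0.8 and 0.4, so the document is assigned to "Sports".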

Figure 6. Diagram of the text classification system
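Steps (c) to (g) can be illustrated with a deliberately simplified sketch. It is not gSpan: as a stand-in, it mines only frequent single edges (subgraphs of size one), which is enough to show how S, F and v0 are built. It assumes each document graph is a set of directed edges between uniquely labeled syllable vertices, and that support is the fraction of a topic's graphs containing the edge; all names here are hypothetical.

```python
from collections import Counter


def frequent_edges(graphs, minsupp):
    """Edges whose support within `graphs` exceeds minsupp.

    Simplified stand-in for frequent subgraph mining: only single
    edges are considered, not larger subgraphs as gSpan would find.
    """
    counts = Counter(edge for g in graphs for edge in set(g))
    return {e for e, c in counts.items() if c / len(graphs) > minsupp}


def train(graphs_by_topic, minsupp):
    """Return the pattern list S and the binary topic vectors F."""
    per_topic = {t: frequent_edges(gs, minsupp) for t, gs in graphs_by_topic.items()}
    S = sorted(set().union(*per_topic.values()))
    F = {t: [1 if s in freq else 0 for s in S] for t, freq in per_topic.items()}
    return S, F


def to_vector(graph, S):
    """Binary vector v0 of a new document graph over the patterns in S."""
    return [1 if s in graph else 0 for s in S]


# Hypothetical syllable graphs (diacritics omitted), two documents per topic.
graphs_by_topic = {
    "Sports": [{("bong", "da"), ("doi", "bong")}, {("bong", "da")}],
    "Health": [{("suc", "khoe")}, {("suc", "khoe"), ("benh", "vien")}],
}
S, F = train(graphs_by_topic, minsupp=0.5)
v0 = to_vector({("bong", "da"), ("tran", "dau")}, S)
```

Because vertex labels are unique, testing whether a pattern occurs in a document graph is a set membership check rather than a general isomorphism search, which reflects the complexity reduction noted in step (c).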

4. EXPERIMENTAL RESULTS

To evaluate the graph-based text representation model, we collected a data set of 2500 text files, which are article summaries taken from online newspapers such as VnExpress (http://www.vnexpress.net), TuoiTre Online (http://www.tuoitre.com.vn) and ThanhNien Online (http://www.thanhnien.com.vn). The data set covers the 6 topics listed in Table 2. After text preprocessing (sentence splitting, syllable splitting and removal of function words), we obtained an average of 40 vertices per graph.

To evaluate the classification results, we use recall, precision and the F1 measure, which balances the two [13]. We use k-fold cross-validation and run the experiments on a Pentium 1.5G computer with 256MB of memory.

Table 2. Training data set

| No. | Topic | Number of documents |
|---|---|---|
| 1 | Society | 400 |
| 2 | Science | 350 |
| 3 | Sports | 450 |
| 4 | Business | 450 |
| 5 | Culture | 400 |
| 6 | Health | 450 |

The experimental results are presented in Table 3. The average training time is 2.8 seconds per document, and the average classification time, measured from the start of preprocessing a new document until classification is complete, is 0.9 seconds per document.

Table 3. Experimental results (5-fold validation)

| Topic | Recall | Precision | F1 |
|---|---|---|---|
| Society | 0.79 | 0.915 | 0.848 |
| Science | 0.705 | 0.8 | 0.75 |
| Sports | 0.86 | 0.946 | 0.901 |
| Business | 0.866 | 0.843 | 0.854 |
| Culture | 0.8 | 0.941 | 0.856 |
| Health | 0.702 | 0.85 | 0.769 |
| Average | 0.787 | 0.888 | 0.833 |
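For reference, the F1 measure is the harmonic mean of precision and recall. The sketch below shows the standard definitions and recomputes the F1 value of the Society row from its reported precision and recall. The function names and the sample counts are assumptions of the example; how the reported values were aggregated across folds is not stated in the paper.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Society row of Table 3: precision 0.915, recall 0.79.
society_f1 = round(f1_score(0.915, 0.79), 3)
```

The recomputed value for the Society row is 0.848, which matches the table.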

We implemented the k-nearest neighbor (k-NN) algorithm on the vector space model with the Cosine measure [7] for comparison with our graph-based text representation model. Figure 7 compares the classification results of the two models across the topics. The graph-based text representation model gives better classification results.
Figure 7. Classification results by topic (F1 measure, on a scale from 0.0 to 1.0, of the vector model and the graph model for the topics Society, Science, Sports, Business, Culture and Health)
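The baseline used for comparison, k-NN over the vector space model with the Cosine measure, can be sketched as follows. The paper does not state the value of k or the exact term weighting used in this baseline, so the sketch assumes raw term counts stored in dictionaries and a majority vote among the k most similar training documents; `cosine` and `knn_classify` are hypothetical names.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(query, training, k=3):
    """Majority label among the k training vectors most similar to `query`.

    `training` is a list of (vector, label) pairs.
    """
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training vectors with raw term counts as weights.
training = [
    ({"ball": 2, "team": 1}, "Sports"),
    ({"ball": 1, "goal": 1}, "Sports"),
    ({"doctor": 2, "health": 1}, "Health"),
]
query = {"ball": 1, "team": 1}
label = knn_classify(query, training, k=1)
```

Unlike the graph representation, these vectors carry no information about word order or adjacency, which is the structural information the graph models are designed to keep.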


5. CONCLUSION

This paper studies and synthesizes models that represent text as graphs. We built an experimental Vietnamese text classification system based on the graph representation of text. The graph model makes it possible to store important structural information about a text, such as the position, co-occurrence and order of words. The experimental results show that the graph model gives better classification results than the traditional vector space model. To evaluate this more accurately, we plan to collect and build a larger experimental data set. We also plan to apply different types of graph model to the classification problem in order to determine which is most suitable.

GRAPH BASED MODEL FOR TEXT REPRESENTATION


Nguyen Hoang Tu Anh, Nguyen Tran Kim Chi, Nguyen Hong Phi
University of Science, VNU-HCM

ABSTRACT: Text representation is a very important preprocessing step in various domains such as text mining, information retrieval and natural language processing. In this paper we survey graph-based text representation models. A graph-based model can capture structural information such as the location, order and proximity of term occurrences, which is discarded under the standard vector space representation. We have tested this graph model in a Vietnamese text classification system.

Keywords: graph model, text representation, text classification.

REFERENCES

[1]. Aery M., INFOSIFT: adapting graph mining techniques for document classification, University of Texas at Arlington, 12/2004.
[2]. Đinh Điền, Natural Language Processing (in Vietnamese), VNU-HCM Publishing House, (2004).
[3]. Đỗ Phúc, A study on applying frequent itemsets and association rules to Vietnamese text classification with semantic consideration (in Vietnamese), Science & Technology Development Journal, Vol. 9, No. 2, pp. 23-32, (2006).
[4]. Khreisat L., Arabic Text Classification Using N-Gram Frequency Statistics: A Comparative Study, WORLDCOMP'06 DMIN'06, (2006).
[5]. Markov A., Last M., A Simple, Structure-Sensitive Approach for Web Document Classification, Proc. of AWIC 2005, LNAI 3528, pp. 293-298, (2005).
[6]. Mihalcea R., Tarau P., TextRank: Bringing Order into Texts, Proc. of EMNLP'04, pp. 404-411, (2004).
[7]. Salton G., Automatic Text Processing: the Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, Reading, MA, (1989).
[8]. Schenker A., Last M., Bunke H., Kandel A., Classification of Web Documents Using Graph Matching, International Journal of Pattern Recognition and Artificial Intelligence, Special Issue on Graph Matching in Computer Vision and Pattern Recognition, Vol. 18, No. 3, pp. 475-479, (2004).


[9]. Sowa J.F., Conceptual Graphs for a DataBase Interface, IBM Journal of Research and Development 20(4), 336-357, July, (1976).
[10]. Thanh V. Nguyen, Hoang K. Tran, Thanh T.T. Nguyen, Hung Nguyen, Word Segmentation for Vietnamese Text Categorization, Poster Proc. of RIVF'06, pp. 113-118, (2006).
[11]. Tomita J., Nakawatase H., Ishii M., Graph-based Text Database for Knowledge Discovery, Poster Proc. of WWW'04, pp. 454-455, (2004).
[12]. Yan X., Han J., gSpan: Graph-Based Substructure Pattern Mining, Proc. of IEEE ICDM'02, pp. 721-723, (2002).
[13]. Yang Y., Liu X., A re-examination of text categorization methods, Proc. of ACM SIGIR'99, pp. 42-49, (1999).
[14]. Zha H., Generic Summarization and Keyphrase Extraction Using Mutual Reinforcement Principle and Sentence Clustering, Proc. of ACM SIGIR'02, pp. 113-120, (2002).
[15]. http://www.jfsowa.com/cg/cgexamp.htm
