
VIETNAM NATIONAL UNIVERSITY, HANOI - COLLEGE OF TECHNOLOGY

Nguyễn Thị Thu Hằng

WEB DOCUMENT CLUSTERING METHODS AND THEIR APPLICATION TO SEARCH ENGINES

MASTER'S THESIS

Hanoi, 2007

Major: Information Technology. Code: 1.01.10

SCIENTIFIC SUPERVISOR: Assoc. Prof. Dr. Hà Quang Thụy

Nguyễn Thị Thu Hằng - Master's thesis - College of Technology - 2007.

Acknowledgements
With these first lines, I would like to express my most sincere and profound gratitude to my teacher, Dr. Hà Quang Thụy, who devotedly guided and advised me and gave me the best possible conditions from the beginning of this work until its completion. I also thank all my loved ones in my family and all my friends, who always helped and encouraged me whenever I ran into difficulties and dead ends. Finally, I sincerely thank my colleagues at the IT Center of NHNo&PTNT VN, whose invaluable advice helped me resolve the difficulties and obstacles I met while writing this thesis.


DECLARATION
I declare that the results obtained in this thesis are my own work and are not copied from others. Throughout the thesis, what is presented is either my own or synthesized from multiple sources. All references have clear origins and are properly cited. I take full responsibility for this declaration and accept any disciplinary action prescribed for it.

Hanoi, 1 November 2007

Nguyễn Thị Thu Hằng


TABLE OF CONTENTS
LIST OF ABBREVIATIONS
LIST OF FIGURES AND TABLES
INTRODUCTION
CHAPTER 1 - OVERVIEW OF WEB DATA MINING
1.1. Web data mining
1.1.1. Introduction to data mining
1.1.2. Web data and the need to exploit information
1.1.3. Characteristics of Web data
1.1.4. Approaches to Web data mining
1.1.5. The need for Web document clustering
1.2. Information retrieval model
1.2.1. Introduction
1.2.2. The information retrieval process in the system
1.2.3. Applying clustering to the retrieval system
1.3. Conclusion of Chapter 1
CHAPTER 2 - WEB CLUSTERING ALGORITHMS
2.1. Basic concepts of document clustering algorithms
2.2. Evaluation criteria for clustering algorithms
2.3. Characteristics of Web clustering algorithms
2.3.1. Data model
2.3.2. Similarity measures
2.3.3. Clustering model
2.4. Some typical Web clustering techniques
2.4.1. Hierarchical clustering
2.4.2. Partitional clustering
2.5. Requirements for Web clustering algorithms
2.5.1. Extracting characteristic information
2.5.2. Overlapping clusters
2.5.3. Performance
2.5.4. Noise tolerance
2.5.5. Incrementality
2.5.6. Presentation of results
2.6. Automatic Vietnamese word segmentation
2.6.1. Some difficulties in clustering Vietnamese Web pages
2.6.2. Syllables and words in Vietnamese
2.6.3. The fnTBL method for automatic Vietnamese word segmentation
2.6.4. The Longest Matching method
2.6.5. Combining fnTBL and Longest Matching
2.7. Conclusion of Chapter 2
CHAPTER 3 - THE SUFFIX TREE CLUSTERING ALGORITHM AND THE DOCUMENT CLUSTERING TREE ALGORITHM
3.1. Introduction to incremental Web page clustering algorithms
3.2. The suffix tree clustering algorithm
3.2.1. Description
3.2.2. The STC algorithm
3.3. Clustering with the document clustering tree
3.3.1. Introduction
3.3.2. Feature extraction and document clustering
3.3.3. The document clustering tree (DC-tree)
3.4. Conclusion of Chapter 3
CHAPTER 4 - EXPERIMENTAL SOFTWARE AND EXPERIMENTAL RESULTS
4.1. Introduction
4.2. Database design
4.3. The experimental program
4.4. Experimental results
4.5. Conclusion of Chapter 4


LIST OF ABBREVIATIONS
AHC: Agglomerative Hierarchical Clustering
CSDL: Database (Vietnamese abbreviation)
DF: Document Frequency
DC-tree: Document Clustering Tree
fnTBL: Fast Transformation-Based Learning
FCM: Fuzzy C-Means
FCMdd: Fuzzy C-Medoids
IR: Information Retrieval
IDF: Inverse Document Frequency
KDD: Knowledge Discovery in Databases
STC: Suffix Tree Clustering
TF: Term Frequency
UPGMA: Unweighted Pair-Group Method using Arithmetic averages


LIST OF FIGURES AND TABLES


INTRODUCTION
The World Wide Web is an enormous information repository whose potential is regarded as unlimited. Web mining has recently become a topical research problem, attracting many research groups around the world to study and propose new models and methods for building effective tools that help users synthesize information and discover knowledge from the huge collection of Web pages on the Internet.

Web document clustering is a typical Web mining problem. It aims to partition a set of documents into subsets that share common properties, and clustering the Web pages returned by a search engine is particularly useful [4-6, 8-15, 18, 19, 22, 24]. As is well known, the set of Web pages that a search engine returns for a query is generally very large. A clustering algorithm used in this setting therefore needs one very important property, "incrementality": the algorithm does not have to run only on the complete data set but can proceed from part of the data to all of it [4, 6, 11, 14, 15, 24]. This allows clustering to start while the search engine is still delivering result pages.

The thesis surveys incremental Web clustering methods and carries out experiments that integrate these results into a search-engine-style program that downloads Web pages. It also takes some first steps in applying clustering to Vietnamese Web pages: an experimental program was built and used to run Vietnamese Web clustering experiments.

Apart from the Introduction, the Conclusion and the Appendices, the thesis is divided into four main chapters:

Chapter 1 - Overview of Web data mining. This chapter introduces the most basic concepts and gives an overview of Web data mining. It also briefly describes an information retrieval system and the need for clustering in such a system.


Chapter 2 - Web clustering algorithms. This chapter gives an overview of Web clustering algorithms, their characteristics, and the requirements they must meet. The measures applied to Web clustering algorithms are also presented here, together with some basic background on the Vietnamese language.

Chapter 3 - The suffix tree clustering algorithm and the document clustering tree algorithm. This chapter analyzes incremental Web clustering algorithms in depth. The thesis concentrates on two of them: the STC algorithm and the clustering algorithm that uses the DC-tree structure.

Chapter 4 - Experimental software and experimental results. This chapter presents the results of Web clustering experiments with the experimental software, which is based on the DC-tree clustering algorithm. The program is written in C# on Microsoft's .NET Framework and uses SQL Server 2000 to store the database. The software runs and produces clustering results; however, because of time constraints, the thesis has not yet carried out a formal evaluation of those results.

The Conclusion summarizes the results of the thesis and the directions for further research. The thesis has obtained some encouraging initial results in studying and implementing incremental Web clustering algorithms, but it inevitably contains shortcomings. The author would greatly appreciate comments and suggestions that help improve the research results.


CHAPTER 1 - OVERVIEW OF WEB DATA MINING

1.1. Web data mining

1.1.1. Introduction to data mining

The concept of data mining

Data mining is defined as a process of distilling or discovering knowledge from a large amount of data. The term "data mining" alludes to finding a small set of valuable items in a large mass of raw data. A distinction is made between "data mining" and "knowledge discovery in databases" (KDD), according to which data mining is only one step of the KDD process. The KDD process consists of the following steps:
* Data cleaning: remove noise and unnecessary data.
* Data integration: combine the different data sources.
* Data selection: select from the database the data relevant to the analysis.
* Data transformation: convert the data into forms suitable for processing.
* Data mining: one of the most important steps, in which intelligent methods are used to extract data patterns.
* Pattern evaluation: assess the results against some measure.
* Knowledge presentation: present the results to the user in a visual form.


Figure 1. Steps in the KDD process


Approaches and techniques in data mining

Data mining is divided into the following main approaches:
* Concept description: oriented toward describing, aggregating and summarizing concepts. Example: text summarization.
* Association rules: rules that represent knowledge in a fairly simple form. Example: "50% of the people who buy a computer also buy a printer". Association rules are widely used in business, medicine, bioinformatics, finance and the stock market, and so on.
* Classification and prediction: assign an object to one of a set of classes known in advance. Example: classifying geographic regions by weather data. This approach usually relies on machine learning techniques such as decision trees and artificial neural networks. Classification is also called supervised learning.
* Clustering: group objects into clusters whose number and names are not known in advance. Clustering is also called unsupervised learning.
* Sequential/temporal pattern mining: similar to association rule mining, but with ordering and time taken into account. This approach is widely used in finance and the stock market because of its strong predictive character.
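The association rule example above ("50% of the people who buy a computer also buy a printer") can be made concrete with a small sketch. It assumes the standard definitions of support and confidence, which the text does not spell out, and the transaction data and function names are invented for illustration only.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent:
    support(antecedent and consequent) / support(antecedent)."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0


# Hypothetical shopping baskets (illustrative data, not from the thesis).
baskets = [
    {"computer", "printer"},
    {"computer"},
    {"computer", "printer", "mouse"},
    {"mouse"},
    {"computer", "mouse"},
]

# Four baskets contain a computer and two of those also contain a printer,
# so the rule computer -> printer holds for half of the computer buyers.
rule_confidence = confidence(baskets, {"computer"}, {"printer"})
```

In this reading, the "50%" in the example is the confidence of the rule, while the support tells how often the rule applies at all.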


Applications of data mining

Although data mining is a new approach, it has attracted the attention of many researchers and developers thanks to its practical applications. Some typical applications are [7,16]:
* Data analysis and decision support
* Medical treatment
* Text mining and Web mining
* Bioinformatics
* Finance and the stock market
* Insurance
* Pattern recognition
* and others.

1.1.2. Web data and the need to exploit information

The rapid growth of the Internet and of intranets has produced a huge volume of hypertext data (Web data). As the content and the number of Web pages on the Internet change and grow by the day and by the hour, finding information becomes ever harder for users. The need to search for information in unstructured databases (including text data) has developed mainly alongside the Internet. With the Internet, people have become familiar with Web pages and the countless pieces of information they carry.

In recent years the Internet has become one of the main channels for science, economic information, commerce and advertising. One reason for this growth is the low cost of publishing a Web page on the Internet. Compared with other services, such as selling or advertising in a newspaper or magazine, a Web page costs far less and reaches millions of users around the world far more quickly. The Web can be seen as an encyclopedia: the information on Web pages is diverse in both content and form. The Internet can also be seen as a virtual society that holds information on every aspect of economic and social life, presented as text, images, sound and so on.

This diversity and sheer volume, however, give rise to information overload. Users cannot find by themselves the addresses of the Web pages that contain the information they need, so a utility is required that manages the content of Web pages and finds the addresses of pages whose content matches the searcher's request. Such utilities manage Web page data as unstructured objects. We are already familiar with several of them, such as Yahoo, Google and AltaVista.

On the other hand, suppose we run Web pages on topics such as informatics, sports, socio-economic issues and construction. Based on the content of the documents that customers view or download, and after classifying these requests, we learn which content on our site customers focus on, and we can then add more documents on the topics they care about. In turn, customers who are served according to their needs will pay more attention to our system. Given these practical needs, Web page classification and search remain topical problems that call for further research.

Web mining can thus be understood as the extraction of components of interest, or components judged useful, together with potential information, from resources or activities related to the World Wide Web [25, 26]. Intuitively, Web mining can be viewed as a combination of data mining, natural language processing and Web technology:

Web mining = data mining + natural language processing + World Wide Web.

1.1.3. Characteristics of Web data
* The Web appears too large to be organized into a data warehouse that serves data mining.
* Web pages are far more complex than traditional text documents.


* The Web is a highly dynamic information resource.
* The Web serves a large and diverse community of users.
* Only a very small part of the information on the Web is truly useful.

1.1.4. Approaches to Web data mining

Following the analysis above of the characteristics and content of hypertext, Web data mining concentrates on the components found in a Web page, namely:

1. Web content mining
Web content mining has two parts:
a. Web page content: only the words in the documents are used, without considering the links between documents. This is text mining.
b. Search result: mining of search results. After a search engine has found the Web pages that satisfy the user's request, an equally important task remains: ordering the results by how close they are to the content being sought. This, too, is Web content mining.

2. Web structure mining
Mining based on the hyperlinks between related documents.

3. Web usage mining
a. General access pattern tracking: analyze Web logs to discover the access patterns of users of a Web site.
b. Customized usage tracking: analyze users' access patterns at each point in time to learn the access tendencies of each type of user at different times.


This thesis concentrates mainly on Web content mining and is oriented toward clustering the set of Web pages that search engines return as results.

1.1.5. The need for Web document clustering

One of the important problems in Web mining is Web clustering. Broadly speaking, Web clustering is the automatic generation of "clusters" (classes) of documents based on the similarity between documents. The classes are not known in advance; the user may specify only the number of classes required, and the system returns the documents grouped into sets, or clusters, each containing documents that are similar to one another. Put simply, Web clustering is clustering applied to a set of documents taken from the Web.

There are two document clustering scenarios. In the first, clustering is carried out over an entire existing database that contains a very large number of Web documents, and the algorithm must cluster the whole data set in that database. This scenario is usually called off-line clustering. The second scenario usually applies to a small set of documents, namely the documents a search engine returns for a user query. Here clustering is carried out on-line, in the sense that it proceeds over successive portions of the documents as they are received. The algorithm must then be "incremental": it starts clustering before all documents are available, and subsequent clustering does not have to reprocess the data already clustered. Because the set of documents on the Web is extremely large, on-line clustering is the more suitable approach, and it requires an incremental clustering algorithm.

Query processing and the ranked results returned by search engines depend on computing the similarity between the query and the documents. Although a query is related to some extent to the documents being sought, it is usually too short and prone to ambiguity. Web queries are known to contain only two to three words on average, which causes ambiguity. For example, the query "star" is highly ambiguous: the retrieved documents relate to astronomy, plants, animals, popular media and sports figures. The documents returned for such a single-word query differ greatly in how similar they are to one another. For this reason, if the search engine clusters the results by topic, users can quickly understand the query results or go straight to a specific topic.

1.2. Information retrieval model

1.2.1. Introduction

With the rapid development of information technology, ever more information is stored on computers. Information retrieval (IR) systems are therefore needed so that users can find the information they need, as accurately and as quickly as possible, in these huge data stores, of which the Internet is one. The goal of a retrieval system is to provide a tool that returns to the user the documents in the data store that are relevant to the query [3,23,25,26]. This is a need common to almost all languages, and Vietnamese is no exception. Unlike other languages, Vietnamese has many distinctive features and is very hard to process by computer, so there have been very few projects on Vietnamese retrieval systems, even though the demand for searching documents in the Vietnamese body of knowledge is very large.

1.2.2. The information retrieval process in the system

The process in an information retrieval system is as follows [3,23,26]:
1. The user wants to see documents related to some topic.
2. The user provides a description of that topic in the form of a query.
3. From this query, the system extracts index terms.
4. These index terms are matched against the index terms of the documents that were processed beforehand.
5. The documents with the highest degree of relevance are returned to the user.

The purpose of an information retrieval system is to find and display to users a set of information that satisfies their needs. We formally call the information needed the "query" and the information selected the "documents". Every approach in IR has two main components: (1) techniques for representing information (queries and documents), and (2) a method for comparing these representations. The aim is to automate the examination of documents by computing the correlation between queries and documents. The process is considered successful when it returns results similar to those a human would produce when comparing the query with the documents.

A problem that retrieval systems often face is that the words users put in a query differ considerably from the words in the documents that contain the information they are looking for. This is called the "paraphrase problem". To address it, the system builds representation functions that process queries and documents differently in order to reach a certain degree of compatibility.

Figure 2. Model of an information retrieval system

Let the domain of the query representation function $q$ be $Q$, the set of possible queries, and let its range be $R$, the unified space of information representations. Let the domain of the document representation function $d$ be $D$, the set of documents; its range is also $R$. The domain of the comparison function $c$ is $R \times R$ and its range is $[0,1]$, the set of real numbers from 0 to 1. In an ideal retrieval system:

$$c(q(\mathit{query}), d(\mathit{doc})) = j(\mathit{query}, \mathit{doc}), \quad \forall\, \mathit{query} \in Q,\ \forall\, \mathit{doc} \in D,$$

where $j: Q \times D \to [0,1]$ represents the user's judgment of the relationship between the two pieces of information, computed according to some criterion (for example, similarity of content or similarity of type). Figure 2 illustrates this relationship.

There are two kinds of retrieval system: exact-match and ranked. The model above describes both approaches. In an exact-match retrieval system, the range of $c$ is restricted to the two values 0 and 1, and $c$ is interpreted as a binary decision on whether a document satisfies the Boolean expression specified by the query. Exact-match IR systems usually return an unordered set of documents that satisfy the user's query, and most current retrieval systems work this way. The detailed operation of such a system is described later.

In a ranked IR system, documents are ordered by decreasing relevance. There are three kinds of ranked retrieval system: "ranked Boolean", "probabilistic" and "similarity-based". In all three the range of $c$ is $[0,1]$, but they differ in how the "retrieval status value" is computed:
* In a ranked Boolean system, this value is the degree to which the information satisfies the Boolean expression specified by the other piece of information (the query).
* In a probabilistic system, the notion is slightly different: the value is the probability that the information is relevant to a query. Many probabilistic retrieval systems are designed to accept queries expressed in natural language rather than as a Boolean expression.
* In a similarity-based system, the retrieval status value is obtained by computing the degree of similarity between the contents of the two pieces of information.

For exact-match retrieval systems, evaluation is based mainly on judging relevance. Assume that $j$ is binary-valued and given in advance. In other words, we assume that a document either is or is not relevant to the query, and that the relevance determined by a human is correct. Under this assumption, the effectiveness of exact-match retrieval systems is evaluated with two statistics, precision and recall. Precision is the fraction of the selected documents that are actually relevant to the information the user needs; recall is the fraction of the relevant documents that the retrieval system correctly identifies as relevant. Put differently, precision equals one minus the false alarm rate, while recall measures the completeness of the retrieval. These two measures are discussed again in detail in the later section on evaluation criteria for clustering algorithms.

Evaluating the effectiveness of ranked retrieval systems is more complicated. A common effectiveness measure for these systems is "average precision". It is computed by choosing successively larger sets of documents from the top of the ranked list so that they yield evenly spaced recall values between 0 and 1; five-, seven- or eleven-point measures are the ones usually used. Precision is then computed for each set. The process is repeated for each query, and the mean precision at each recall level is computed. The mean of these values is then computed and reported as a characteristic of the system. The larger the average precision the better, and comparisons are meaningful only when the same document collection and queries are used. Average precision, however, also masks the variation among queries with different characteristics (for example, different numbers of relevant documents). Moreover, relevant documents tend to cluster near the top of the ranked list, so precision usually falls each time the document set is enlarged to raise recall.

1.2.3. Applying clustering to the retrieval system

Given the need for clustering Web documents analyzed above, when we build a retrieval system we also integrate a clustering module into it. Document clustering serves as an alternative way of organizing the returned data: instead of sifting through a long ordered list of documents to find the relevant ones, users need only look within the areas that interest them. The retrieval system thereby becomes more useful to its users.
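The precision, recall and average precision measures described in Section 1.2.2 can be sketched as follows. This is a minimal illustration rather than the evaluation procedure of any particular system: it assumes binary relevance judgments, and it uses interpolated precision (the best precision at any recall at or above each level) to obtain a value at evenly spaced recall points, which is a common convention that the text does not specify. All names and data are hypothetical.

```python
def precision_recall(selected, relevant):
    """Precision and recall of an unordered result set (exact-match case)."""
    selected, relevant = set(selected), set(relevant)
    hits = len(selected & relevant)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def n_point_average_precision(ranking, relevant, points=11):
    """Mean interpolated precision at `points` evenly spaced recall levels.

    Assumption: precision at a recall level is taken as the highest
    precision reached at any rank whose recall is at least that level.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    curve = []  # (recall, precision) after each rank position
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        curve.append((hits / len(relevant), hits / rank))
    total = 0.0
    for i in range(points):
        level = i / (points - 1)
        reachable = [p for r, p in curve if r >= level]
        total += max(reachable) if reachable else 0.0
    return total / points


# Hypothetical ranked list for one query; d1 and d3 are the relevant documents.
ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
p, r = precision_recall(ranking[:2], relevant)
ap = n_point_average_precision(ranking, relevant)
```

To characterize a system as the text describes, `n_point_average_precision` would be averaged over many queries run against the same document collection.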

1.3. Conclusion of Chapter 1

The growth of the Internet has made users' need to search, exploit, organize, access and maintain information more frequent. Users of Web search engines are often forced to sift through a long ordered list of text snippets returned by the search engine. The requirement to cluster documents, and Web documents in particular, has thus become a problem for researchers to study and solve. The following chapters examine the issues related to this clustering problem.


CHAPTER 2 - WEB CLUSTERING ALGORITHMS

2.1. Overview of document clustering algorithms

As presented in Chapter 1, document clustering has been, and continues to be, studied as a way of improving search engine performance by pre-clustering the entire collection. M. Steinbach and co-authors [4] give an overview of document clustering algorithms. According to [4], a great many document clustering algorithms have appeared in the literature. Agglomerative Hierarchical Clustering (AHC) algorithms are the most frequently used. These algorithms are usually slow when applied to a large set of documents. The single-link and group-average methods typically have a time complexity of about $O(n^2)$, while complete-link takes about $O(n^3)$. Many halting criteria have been proposed for AHC algorithms, but they are usually based on fixed, predetermined decisions. The algorithms are very sensitive to the halting criterion: when the algorithm mistakenly merges several good clusters, the result may be meaningless to the user. In Web clustering, where query results can be extremely varied (in the number, length, type and relevance of the documents), this sensitivity easily leads to poor results. Another characteristic of Web clustering is that we frequently receive many outliers. This kind of noise can further reduce the effectiveness of the halting criteria in common use.

Linear-time clustering algorithms are the candidates for meeting the speed requirement of on-line clustering [11]. They include the K-Means algorithm, with time complexity $O(nkT)$, where $k$ is the number of clusters and $T$ is the number of iterations, and the Single Pass method, $O(nK)$, where $K$ is the number of clusters created. One strength of K-Means is that, unlike AHC algorithms, it can work with overlapping clusters. Its main drawback is that it is considered most effective when the clusters produced are approximately spherical with respect to the similarity measure used, and there is no reason to believe that documents should fall into approximately spherical clusters. The Single Pass method has the same problem; it is also order dependent and tends to produce large clusters. According to [4,11], it is nevertheless the best-known incremental clustering algorithm.

Buckshot and Fractionation are two fast, linear-time clustering algorithms developed by Cutting in 1992 [4]. Fractionation is an approximation of AHC in which the search for the two closest clusters is not performed globally but locally, within bounded regions. This algorithm obviously shares the weaknesses of AHC: arbitrary halting criteria and poor performance when there are many outliers. Buckshot is a K-Means algorithm in which the initial cluster centroids are created by applying AHC clustering to a sample of the documents. Using a sample is risky when someone may be interested in small clusters, which may not be represented in the sample. Moreover, although both algorithms are fast, neither is incremental.

All the algorithms mentioned above treat a document as a set of words rather than as an ordered sequence of words, and therefore lose important information. Phrases have long been used to supplement word-based indexing in IR systems. The use of lexical atoms and syntactic phrases has been shown to improve precision without hurting recall. Phrases generated by simple statistical methods have also been used successfully. These methods, however, have not yet been widely applied to document clustering.

In addition, the algorithm that uses the DC-tree [24] (Document Clustering Tree) can cluster documents without a training set. With the DC-tree, an incoming data object is not forced into a lower level of the tree when no similar child node exists for it. This prevents dissimilar data from being placed together. As a result, the clustering algorithm based on the DC-tree structure is less sensitive to the order in which documents are inserted and tolerates "noise" documents well.
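The Single Pass method mentioned above can be illustrated with a minimal sketch. The details are assumptions made for the example, not taken from the thesis or from [4,11]: documents are represented by raw term counts, similarity is the cosine between a document and the summed vector of a cluster, and the threshold of 0.5 is arbitrary.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(docs, threshold=0.5):
    """Assign each document, in arrival order, to the most similar existing
    cluster, or open a new cluster if no similarity reaches the threshold."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [doc index]}
    for index, text in enumerate(docs):
        vector = Counter(text.lower().split())
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vector, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": Counter(vector), "members": [index]})
        else:
            # The summed vector stands in for the centroid: cosine similarity
            # is unaffected by scaling, so no division by cluster size is needed.
            best["centroid"].update(vector)
            best["members"].append(index)
    return clusters


# Hypothetical snippets for the ambiguous query "star".
docs = [
    "star planet galaxy",
    "galaxy star telescope",
    "film star actor",
    "actor film award",
]
clusters = single_pass(docs)
```

Each document is compared only with the clusters that exist when it arrives, which gives the $O(nK)$ cost and the incremental behaviour noted in the text, and also explains the order dependence: presenting the same documents in a different order can produce different clusters.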


On the Web, there have been several attempts to cope with the large number of documents returned by search engines. Many search engines offer query refinement features. For example, AltaVista suggests words that should be added to or removed from the query. These words are organized into groups, but the groups do not represent clusters of documents. The Northern Light search engine (www.nlsearch.com) provides "Custom Search Folders"; each folder is named with a single word or a two-word phrase and contains all the documents that include that name. Northern Light does not reveal the method used to create the folders, nor its cost.

In Chapter 3, the thesis studies in depth two incremental clustering algorithms that are suitable for clustering Web pages and, moreover, are easy to apply to Vietnamese: the suffix tree clustering (STC) algorithm and the clustering algorithm that uses the DC-tree.

2.2. Evaluation criteria for clustering algorithms

The results of any clustering algorithm should be evaluated with an informative quality measure that indicates the "goodness" of the resulting clusters. The evaluation depends on what prior knowledge we have about the classification of the data objects (that is, whether the data are labelled or no classification is available). If the data have not been classified beforehand, we must use internal quality criteria, which allow sets of clusters to be compared without reference to external knowledge. On the other hand, if the data are labelled, we use this classification to compare the clustering result with the original classes; such a measure is known as an external quality measure. We review two external quality criteria (entropy and F-measure) and one internal quality criterion (overall similarity).

Entropy

One external quality measure is entropy, which provides a measure of "goodness" for un-nested clusters or for the clusters at one level of a hierarchical clustering. Entropy tells us how homogeneous a cluster is. The more homogeneous a cluster, the lower its entropy, and vice versa. The entropy of a cluster that contains only one object (perfect homogeneity) is 0.


Coi P l mt kt qu phn chia ca mt thut ton phn cm bao gm m phn cm. Vi tt c phn cm j trong P, chng ta cn tnh ton pij , vi pij l kh nng mt thnh vin ca phn cm j thuc vo lp i. Entropy ca mi phn cm j c tnh ton s dng cng thc chun: E j = pij log( pij ) , trong vic tnh
i

tng c thc hin vi tt c cc lp. Tng entropy ca mt tp cc phn cm c tnh ton nh l tng cng entropy ca mi phn cm c tnh ton da theo kch c ca mi phn cm: E P =
Nj E j , trong Nj l kch c ca j =1 N
m

phn cm j v N l tng s lng i tng d liu. Nh ni trn, chng ta cn phi to ra cc phn cm vi cc entropy cng nh cng tt v entropy l mt thc o v ng nht (tng t) ca cc i tng d liu trong phn cm. F-measure o cht lng ngoi th hai l o F (F-measure), mt o gp tng v s chnh xc v kh nng nh li t thng tin thu v. S chnh xc v kh nng nh li ca mt phn cm j i vi lp i c nh ngha l:
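$$P = \mathrm{precision}(i, j) = \frac{N_{ij}}{N_j}, \qquad R = \mathrm{recall}(i, j) = \frac{N_{ij}}{N_i},$$

where $N_{ij}$ is the number of members of class i in cluster j, $N_j$ is the number of members of cluster j and $N_i$ is the number of members of class i. The F-measure of a class i is defined as:

$$F(i) = \frac{2PR}{P + R}$$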

For each class i, we take the largest F-measure value over all clusters j, and this value is the score of class i. The F-measure of the clustering result P is the weighted average of the F-measures of the classes i:

$$F_P = \frac{\sum_{i} |i| \times F(i)}{\sum_{i} |i|}$$


where |i| is the number of objects in class i. The higher the F-measure, the better the clustering, because it reproduces the original classes more accurately.

Overall Similarity

A widely used internal quality measure is the overall similarity, which is used when no external information such as class labels is available. It measures the cohesiveness of a cluster through the weighted pairwise similarity inside the cluster:

$$\frac{1}{|S|^2} \sum_{x \in S} \sum_{y \in S} sim(x, y),$$

where S is the cluster under consideration and sim(x, y) is the similarity between two objects x and y.
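As an illustration of the two external measures, the following minimal sketch computes the entropy and the F-measure of a clustering. It assumes that each cluster is given simply as a list of the true class labels of its members; the function names are chosen for this example only, and precision is taken as the share of a cluster that belongs to the class.

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """Entropy E_j of one cluster, given the class labels of its members."""
    total = len(labels)
    return sum(-(c / total) * math.log(c / total) for c in Counter(labels).values())


def total_entropy(clusters):
    """Size-weighted entropy E_P of a clustering (a list of label lists)."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)


def f_measure(clusters):
    """Weighted F-measure F_P: each class is scored by its best-matching cluster."""
    class_sizes = Counter(label for c in clusters for label in c)
    n = sum(class_sizes.values())
    score = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for c in clusters:
            n_ij = c.count(cls)
            if n_ij == 0:
                continue
            p = n_ij / len(c)  # share of the cluster that belongs to the class
            r = n_ij / n_i     # share of the class captured by the cluster
            best = max(best, 2 * p * r / (p + r))
        score += n_i * best
    return score / n


clusters = [["a", "a", "b"], ["b", "b"]]
print(total_entropy(clusters), f_measure(clusters))
```

A clustering that reproduces the classes exactly has entropy 0 and F-measure 1, which gives a quick sanity check for both functions.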

2.3. Properties of Web clustering algorithms


Before analyzing and comparing different algorithms, we need to define some properties of the algorithms and identify the problem domains to which these properties apply. An analysis of Web document clustering methods follows right after this section.

2.3.1. Data model

Most clustering algorithms require the data set to be clustered in the form of a set of vectors X = {x1, x2, ..., xn}, where each vector xi, i = 1, ..., n, represents a single object of the data set and is called a feature vector. How the necessary features are extracted into the feature vector depends heavily on the domain. The dimensionality of the feature vector is a key factor in the running time of the algorithm as well as in its size. However, some domains inherently have to accept a high dimensionality. There are methods that reduce the problems associated with dimensionality, such as principal component analysis. The method of Krishnapuram [8] can reduce a 500-dimensional feature vector to a 10-dimensional one; however, its accuracy has not been verified. From here on we focus on the representation of document data and on how to extract the right features.

a, Document data model


Most document clustering methods use the vector space model to represent document objects. Each document is represented by a vector d in the vector space, d = {tf1, tf2, ..., tfn}, where tfi (i = 1, ..., n) is the term frequency (TF) of the term ti in the document. To represent all documents with the same set of terms, we extract all the terms found in the whole collection and use them as our feature vector. Some methods combine the term frequency with the inverse document frequency (TF-IDF). The document frequency dfi is the number of documents, in a collection of N documents, in which the term ti occurs. The inverse document frequency (idf) component is defined as log(N/dfi). The weight of the term ti in a document is defined as wi = tfi × log(N/dfi) [24]. To keep the size of the feature vector reasonable, only the n terms with the largest weights over all documents are used as the n features. Wong and Fu [24] showed that they could reduce the number of representative terms by choosing only the terms that have sufficient coverage in the data set.

Some algorithms [9,24] avoid using term frequencies (or term weights) and use a binary feature vector instead, in which each term weight is 1 or 0 depending on whether the term is present in the document or not. Wong and Fu [24] argue that the average term frequency in Web documents is less than 2 (according to experimental statistics), so it does not indicate the real importance of a term; a scheme with binary weights is therefore more suitable for this problem domain.

Before feature extraction, the document collection is cleaned by removing stop words (words that occur very frequently but carry no meaning, such as "và", "với", ...) and by applying a stemming algorithm that converts different word forms into one equivalent canonical form.
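The weighting scheme wi = tfi × log(N/dfi) and the binary alternative can be sketched as follows. This is a minimal illustration that assumes whitespace tokenization and an optional stop-word set; the function names, the `stop_words` parameter and the sample documents are assumptions made for the example, and no stemming is performed.

```python
import math
from collections import Counter


def tfidf_vectors(docs, stop_words=()):
    """Return (vocabulary, vectors) with weights w_i = tf_i * log(N / df_i)."""
    tokenized = [[t for t in doc.lower().split() if t not in stop_words]
                 for doc in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # missing terms count as 0
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


def binary_vectors(docs, vocab):
    """Binary feature vectors: 1 if the term occurs in the document, else 0."""
    vectors = []
    for doc in docs:
        present = set(doc.lower().split())
        vectors.append([1 if t in present else 0 for t in vocab])
    return vectors


docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print(vectors[0])
```

Note that a term occurring in every document (here "ate") receives weight 0, which is the intended effect of the idf component.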


[Figure: an example of stop words]

Another model for representing documents is the N-gram model. It treats a document as a sequence of characters and slides a window of n characters over it to extract all sequences of n consecutive characters. The N-gram model tolerates minor spelling errors because of the redundancy in its representation. It also handles minor language-dependence problems when used together with a stemming algorithm. Similarity in this approach is based on the number of n-grams shared by two documents.

Finally, a newer model introduced by Zamir and Etzioni [5] is a phrase-based approach. This model finds common phrase suffixes between documents and builds a suffix tree in which each node represents part of a phrase (a suffix node) and is associated with the documents containing that phrase suffix. This approach clearly captures the relations between words, which is very valuable for finding the similarity between documents.

b, Numerical data model

A more straightforward data model is the numerical model. Depending on the problem context, a number of features are extracted, and each feature is represented as a range of numeric values. The feature vector is usually of a reasonable size, which depends on the problem being analyzed. Feature ranges are usually normalized so that every feature has the same effect when a distance measure is computed. Similarity in this case is straightforward, because computing the distance between two vectors is very simple [17].

c, Categorical data model

This model is usually found in database clustering problems. Typically the attributes of a database table are categorical, and some attributes are numeric. Statistics-based clustering approaches are used to work with this kind of data. The ITERATE algorithm is one example of working with categorical data on statistical data [18]. The K-modes algorithm is another good example [19].

d, Mixed data model

Depending on the problem domain, the objects that represent the feature data sometimes do not all have the same type. A combination of numerical, categorical, spatial or text data may be used. In this case, it is important to devise a method that captures all the information efficiently. A conversion process should be applied to convert one data type into another. Sometimes a data type cannot be converted, and the algorithm must then be modified to work with the other data types [18].

2.3.2. Similarity measure

The key factor in the success of any clustering algorithm is its similarity measure. To group data objects, a proximity measure is used to find which objects (or clusters) are similar to each other. A large number of similarity measures have been reported in the literature; here we review only the most common ones. The (dis)similarity between two objects is computed with a distance function, and sometimes with a dissimilarity function. Given two feature vectors x and y, we need to find the similarity (or dissimilarity) between them. A very widely used class of distance functions is the family of Minkowski distances [7], described below:
$$\|x - y\|_p = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

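where $x, y \in R^n$. This distance function actually describes an infinite family of distances indexed by p, a parameter that takes values greater than or equal to 1. Some common values of p and the corresponding distance functions are:

p = 1: Hamming distance (also known as the Manhattan or city-block distance; it equals the Hamming distance for binary vectors)

$$\|x - y\|_1 = \sum_{i=1}^{n} |x_i - y_i|$$

p = 2: Euclidean distance

$$\|x - y\|_2 = \sqrt{\sum_{i=1}^{n} |x_i - y_i|^2}$$

p = ∞: Tschebyshev distance

$$\|x - y\|_\infty = \max_{i=1,2,\dots,n} |x_i - y_i|$$

A similarity measure that is used very often, especially in document clustering, is the cosine correlation measure (used in [4], [15] and [13]), defined as:

$$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$$

where $\cdot$ denotes the dot product of the vectors and $\|\cdot\|$ denotes the length of a vector. Another commonly used measure is the Jaccard measure (used in [8], [9]), defined as:

$$d(x, y) = \frac{\sum_{i=1}^{n} \min(x_i, y_i)}{\sum_{i=1}^{n} \max(x_i, y_i)}$$

which, in the case of binary feature vectors, simplifies to:

$$d(x, y) = \frac{|x \cap y|}{|x \cup y|}$$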
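The measures above translate directly into code. The following is a small sketch using plain Python lists as feature vectors; the function names are chosen for this illustration, and the Jaccard function assumes non-negative components with at least one non-zero entry.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance of order p (p >= 1); p = math.inf gives the max norm."""
    if math.isinf(p):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms


def jaccard(x, y):
    """Jaccard measure: sum of component-wise minima over sum of maxima."""
    return sum(min(a, b) for a, b in zip(x, y)) / sum(max(a, b) for a, b in zip(x, y))


x = [1, 0, 1, 1]
y = [1, 1, 0, 1]
print(minkowski(x, y, 1), minkowski(x, y, 2), minkowski(x, y, math.inf))
print(cosine(x, y), jaccard(x, y))
```

For the two binary vectors in the example, the Jaccard measure reduces to the number of shared terms (2) divided by the number of terms present in either vector (4).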


Note that the term "distance" must not be confused with "similarity". The two terms are opposites, although both tell us how alike two objects are: similarity decreases as distance increases. A further point to note is that many algorithms use the distance (or similarity) function to compute the similarity between two clusters, between a cluster and an object, or between two objects. Computing the distance between two clusters (or between clusters and objects) requires a feature vector that represents the cluster.

Clustering algorithms commonly use a similarity matrix. A similarity matrix of size N × N records the distances (or similarities) between every pair of objects. Obviously the similarity matrix is symmetric, so we only need to store its upper right or its lower left part.

2.3.3. Clustering model

Every clustering algorithm assumes some cluster structure. Sometimes the cluster structure is not stated explicitly and follows from the nature of the clustering algorithm itself. For example, the k-means algorithm assumes spherical (or convex) clusters. This is due to the way k-means finds cluster centers and updates the member objects. If care is not taken, we can end up with elongated clusters, where the result is a few large clusters and many very small ones. Wong and Fu [16] propose a solution that keeps the cluster size within a certain range, but keeping the cluster size within a range is not always worthwhile. A dynamic model for finding clusters regardless of their structure is CHAMELEON, proposed by Karypis [13].

Depending on the problem, we may want disjoint clusters or overlapping clusters. In the context of document clustering, overlapping clusters are usually desirable because documents tend to have more than one topic (for example, a document may contain information about car racing and about car companies). One example of a system that produces overlapping clusters is the suffix tree clustering system (STC) proposed by Zamir and Etzioni [5]. Another way to produce overlapping clusters is fuzzy clustering, in which objects can belong to different clusters with different degrees of membership [8].

2.4. Typical Web clustering techniques


Clustering techniques are divided into two main groups: hierarchical clustering and partitional clustering.

2.4.1. Hierarchical clustering

Hierarchical clustering techniques produce a sequence of nested partitions, with a single root cluster at the top and singleton clusters of individual objects at the bottom. Clusters at an upper level contain the clusters below them in the hierarchy. The result of a hierarchical clustering algorithm can be viewed as a tree, called a dendrogram (Figure 3).

Figure 3: An example dendrogram produced by hierarchical clustering

Depending on the direction in which the hierarchy is built, we distinguish two kinds of hierarchical clustering methods: agglomerative and divisive. The agglomerative method is the one used in most hierarchical clustering.

a, Agglomerative hierarchical clustering (AHC)

This method starts with the set of objects as singleton clusters and then, at each step, merges the two most similar clusters. The process is repeated until the number of remaining clusters reaches a given threshold or, if the complete hierarchy is required, until only one cluster remains. Agglomerative clustering works in a greedy manner: the pair of document groups chosen for merging is the pair considered most similar according to some criterion. The method is relatively simple, but it requires a clear definition of the distance between two clusters. The three most commonly used ways of computing this distance are listed below.

Single Linkage Method: the similarity between two clusters S and T is computed from the minimal distance between the members of the two clusters. This method is also known as the nearest neighbour clustering method.

$$\|T - S\| = \min_{x \in T,\ y \in S} \|x - y\|$$

Complete Linkage Method: the similarity between two clusters S and T is computed from the maximal distance between the members of the two clusters. This method is also known as the furthest neighbour clustering method.

$$\|T - S\| = \max_{x \in T,\ y \in S} \|x - y\|$$

Average Linkage Method: the similarity between two clusters S and T is computed from the average distance between the members of the two clusters. This method takes into account all pairs of distances between the objects of the two clusters. It is also known as UPGMA (Unweighted Pair-Group Method using Arithmetic averages).

$$\|T - S\| = \frac{\sum_{x \in T,\ y \in S} \|x - y\|}{|S| \cdot |T|}$$
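The three linkage definitions, together with the greedy merging loop of AHC, can be sketched as follows. This is a naive illustration that assumes points are tuples of numbers compared with the Euclidean distance; the function names are hypothetical, and the loop recomputes all pairwise cluster distances at every step, so it is only suitable for small inputs.

```python
import math
from itertools import combinations


def distance(x, y):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def single_link(s, t):
    return min(distance(x, y) for x in t for y in s)


def complete_link(s, t):
    return max(distance(x, y) for x in t for y in s)


def average_link(s, t):
    return sum(distance(x, y) for x in t for y in s) / (len(s) * len(t))


def agglomerate(points, k, linkage):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # start with singleton clusters
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


points = [(0, 0), (0, 1), (5, 0), (5, 1), (11, 0)]
print(agglomerate(points, 3, single_link))
print(agglomerate(points, 2, complete_link))
```

Passing the linkage as a function keeps the merging loop identical for all three methods, which mirrors the fact that they differ only in how the distance between two clusters is defined.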


Karypis [13] argues against the methods above, claiming that they use a static model of the interconnectivity and closeness of the data, and proposes a dynamic model that avoids these problems. The system, called CHAMELEON, merges two clusters only if the interconnectivity and closeness between the clusters are closely related to the interconnectivity and closeness inside the clusters.

Agglomerative techniques usually take time on the order of $\Omega(n^2)$, because by nature they consider all possible pairs of clusters. The Scatter/Gather system introduced by Cutting [15] uses group-average agglomerative clustering to find seed clusters for a partitional clustering algorithm. However, to avoid the quadratic running time, it is applied only to a small sample of the documents to be clustered. In addition, the group-average method is reported by Steinbach [4] to be better than most other similarity methods because of its stability.

b, Divisive hierarchical clustering

These methods work top-down: they start with the whole data set as one cluster and, at each step, split a cluster until only singleton clusters of individual objects remain. They typically differ in two respects: (1) which cluster is split next and (2) how the split is performed. Usually an exhaustive search is carried out to find the cluster to split according to some criterion. A simpler approach is to split the largest cluster, to split the cluster with the least average similarity, or to use a criterion based on both size and average similarity. Steinbach [4] ran an experiment with these strategies and found that the differences between them are very small, so they resorted to splitting the largest remaining cluster.

Splitting a cluster requires deciding which objects go into which sub-cluster. One method finds the two sub-clusters with k-means, which results in a hybrid technique called bisecting k-means [4]. Another, statistics-based approach is used by the ITERATE algorithm [18]; however, a cluster does not have to be split into exactly two sub-clusters, and it can be split into several sub-clusters depending on the structure of the objects.

2.4.2. Partitional clustering

This class of clustering algorithms works by identifying the potential clusters simultaneously while iteratively updating the clusters to optimize some objective function. The best-known algorithms of this class are K-means and its variants. K-means starts by randomly selecting k seed clusters and then assigns each object to the cluster whose mean is nearest to it. The algorithm then recomputes the cluster means and the new memberships of the objects, and repeats. The process continues for a fixed number of iterations or until no change is detected in the cluster means [17]. K-means algorithms have complexity O(nkT), where T is the number of iterations. However, a major drawback of K-means is that it assumes a spherical cluster structure and cannot be applied to data domains in which the cluster structures are not spherical.

A variant of K-means that allows clusters to overlap is Fuzzy C-means (FCM). Instead of binary membership relations between objects and cluster representatives, FCM allows different degrees of membership [17]. Krishnapuram [8] proposes a modified version of FCM called Fuzzy C-Medoids (FCMdd), in which the means are replaced by medoids. This algorithm is relatively fast, has complexity O(n^2) and runs faster than FCM.

Because the seed clusters are chosen at random, these algorithms, in contrast to hierarchical clustering, produce results that are not stable across runs. Some methods improve on this by first finding "good" initial seed clusters and only then running these algorithms. The Scatter/Gather system [15] is a very good example.

One approach that combines partitional and hierarchical clustering is the Bisecting K-means algorithm mentioned in the previous section. It is a divisive algorithm in which a cluster is split by using K-means to find two sub-clusters. Steinbach shows that the performance of Bisecting K-means is superior to that of regular K-means as well as UPGMA [4].

Note that an important feature of hierarchical algorithms is that most of them can be updated incrementally: new objects can be placed in the relevant clusters very easily by following a path in the tree to the appropriate location. STC [5] and DC-tree [24] are two examples of such algorithms. Partitional algorithms, on the other hand, usually require a global update of the cluster means and possibly of the member objects. Incremental updating is essential for on-line applications.

One way to formulate the clustering task is to partition the document collection into k subsets, or clusters, $D_1, \dots, D_k$ so as to minimize the intra-cluster distance

$$\sum_{i} \sum_{d_1, d_2 \in D_i} \delta(d_1, d_2)$$

or to maximize the intra-cluster similarity

$$\sum_{i} \sum_{d_1, d_2 \in D_i} \rho(d_1, d_2),$$

where $\delta$ denotes a distance measure and $\rho$ a similarity measure between documents.

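If an internal representation of the documents is available, the same representation can be used to define a representation of the clusters within the same model. For example, if documents are represented in the vector space model, a cluster of documents can be represented by the centroid (average) of its document vectors. When a cluster representation is available, a possible goal is to partition D into $D_1, \dots, D_k$ so as to minimize

$$\sum_{i} \sum_{d \in D_i} \delta(d, \vec{D_i})$$

or to maximize

$$\sum_{i} \sum_{d \in D_i} \rho(d, \vec{D_i}),$$

where $\vec{D_i}$ is the vector representation of cluster i, $\delta$ is a distance measure and $\rho$ is a similarity measure. Assigning document d to cluster i can be viewed as setting a Boolean variable $z_{d,i}$ to 1. This generalizes to soft clustering, in which $z_{d,i}$ is a real number between 0 and 1. In that setting, we may want to find the $z_{d,i}$ that minimize $\sum_{i} \sum_{d} z_{d,i}\, \delta(d, \vec{D_i})$ or maximize $\sum_{i} \sum_{d} z_{d,i}\, \rho(d, \vec{D_i})$.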
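K-means is the standard procedure for the hard-assignment version of this objective with centroids as cluster representations. The sketch below is a minimal illustration that uses the squared Euclidean distance and takes the initial centroids as an explicit argument so that the run is reproducible; in practice they are usually drawn at random from the data, for example with `random.sample(points, k)`. All names are chosen for this example.

```python
def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans(points, centroids, max_iter=100):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid as the mean of its members."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # no membership changed: converged
            break
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return assignment, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids = kmeans(points, [(0, 0), (0, 1)])
print(assignment)
print(centroids)
```

Each pass costs on the order of n·k distance computations, which is where the O(nkT) bound for T iterations comes from.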

The partitioning can be carried out in two ways. One can start with each document in a group of its own and merge groups of documents until the number of partitions is suitable; this is called bottom-up clustering. Alternatively, one can declare the desired number of partitions and assign the documents to the partitions; this is called top-down clustering. A bottom-up technique repeatedly merges groups of similar documents until the desired number of clusters is reached, whereas a top-down technique iteratively refines the assignment of documents to a preset number of clusters. The bottom-up technique is usually slower, but it can be used on a small sample of the documents to seed the initial clusters before the top-down algorithm proceeds.

2.5. Requirements for Web clustering algorithms


Following the discussion of clustering algorithms above, it is necessary to identify the requirements for document clustering algorithms; this helps us design more efficient and practical solutions geared towards these requirements. The requirements are listed below.

2.5.1. Extraction of informative features

The core of any clustering problem lies mostly in the choice of a representative set of features for the data model. The set of extracted features must be informative enough to represent the actual data being analyzed. Otherwise, however good the algorithm is, it will be useless if it works with features that carry no information. Moreover, reducing the number of features is very important, because the dimensionality of the feature space always affects the performance of the algorithm. A comparison by Yang and Pedersen [20] of the effectiveness of feature extraction methods for text categorization showed that the document frequency (DF) thresholding method gives better results than the other methods and also requires less computation. Furthermore, as mentioned above, Wong and Fu [24] showed that they could reduce the number of representative terms by choosing only the terms that are significant in the document collection.

The document model is also very important. Most commonly used models are based on the distinct terms extracted from the set of all documents, together with the term frequencies and document frequencies described in the previous section. Another document model is the phrase-based model, such as the one proposed by Zamir and Etzioni [5], which finds phrase suffixes shared between documents using a suffix tree structure.

2.5.2. Overlapping clustering

Any document collection, especially in the Web domain, tends to contain documents that cover one or several topics. When clustering documents, it is necessary to place each document in the clusters relevant to it, which means that some documents may belong to more than one cluster. An overlapping clustering model allows such multi-topic documents to be clustered. Very few algorithms allow overlapping clusters; among them are fuzzy clustering [8] and suffix tree clustering [5]. In cases where each document is required to belong to exactly one cluster, a non-overlapping algorithm is used, or a set of disjoint clusters can be derived from fuzzy clustering after defuzzifying the cluster memberships.

2.5.3. Performance

In the Web domain, a single search query may return hundreds and sometimes thousands of Web pages. It is essential to cluster these results in an acceptable time. Note that some proposed systems cluster only the snippets returned by most search engines rather than the whole Web pages [5]. This is a reasonable strategy for clustering search results quickly, but it is not acceptable for document clustering, because snippets do not provide enough information about the actual content of the documents. An online clustering algorithm should be able to run in linear time if possible. An offline algorithm usually aims at producing clusters of higher quality.

2.5.4. Noise tolerance

A problem that may arise with many clustering algorithms is the presence of noise and outliers. A good clustering algorithm must be able to cope with this kind of noise and produce high-quality clusters that are not affected by it. In hierarchical clustering, for example, the nearest neighbour and furthest neighbour distance computations are very sensitive to outliers and should therefore be avoided if possible. The average linkage method is the most suitable for noisy data.

2.5.5. Incrementality

A very desirable feature in domains such as the Web is the ability to update the clusters incrementally. New documents must be placed in the corresponding clusters without re-clustering the whole document collection. Modified documents should be reprocessed and moved to the corresponding clusters where applicable. It is worth remembering that the more efficient the incremental update, the better the overall performance.

2.5.6. Presentation of results

A clustering algorithm is good if it can present a concise and accurate description of the clusters it produces to the user. The cluster summaries should be representative enough of the corresponding content that the user can decide quickly which cluster is of interest.

2.6. The problem of automatic Vietnamese word segmentation


For English, words are separated on whitespace. For Vietnamese, however, this task is considerably harder. The structure of Vietnamese is complex, and words cannot be separated simply on whitespace. There are currently many tools for Vietnamese word segmentation; each method has its own strengths and weaknesses and, of course, no segmentation method can yet be regarded as the best. This section presents some methods for Vietnamese word segmentation. Before that, we consider some difficulties in clustering Vietnamese Web pages.

2.6.1. Some difficulties in clustering Vietnamese Web pages

Today we are familiar with many tools that support information retrieval, such as Google, Yahoo Search, AltaVista, ... However, these tools were built abroad, so they only work well for the needs of their original users. We also have some tools that support retrieval of Vietnamese information, such as Vinaseek, Netnam, ... These tools also segment words mainly on whitespace, so the search quality has not improved. In general, to build a Vietnamese information retrieval system, we face difficulties in Vietnamese word segmentation and in identifying the Vietnamese character encoding. These are also the difficulties in clustering Vietnamese documents, because the first step of clustering is precisely Vietnamese word segmentation [1].

The problem of Vietnamese character encodings

Unlike English, Vietnamese has many character encodings that need to be handled. Some Vietnamese search tools support encodings very well; Vinaseek, for example, supports all encodings (VNI, TCVN3, ViQR, ...).

Difficulties in Vietnamese word segmentation

Word segmentation can be said to be the hardest stage in building a Vietnamese information retrieval system and in clustering Vietnamese documents. For English, words are identified simply by splitting on whitespace. For example, the sentence "I am a student" is split into 4 words: I, am, a, student. For Vietnamese, however, splitting on whitespace only yields syllables ("tiếng"). A word can be made up of one or more syllables. A word must have a complete meaning and a stable structure. The sentence "Tôi là một sinh viên" is segmented into 4 words: Tôi, là, một, sinh viên. Here the word "sinh viên" is formed from the two syllables "sinh" and "viên".

Many methods are currently used for Vietnamese word segmentation. However, because of the complexity of Vietnamese grammar, no method has yet reached 100% accuracy, and which method is best is still a matter of debate.

Other difficulties

Vietnamese has words that share a meaning but differ in sound (synonyms). Current tools do not support the identification of synonyms, so the returned results are incomplete. Conversely, there are words that sound the same but differ in meaning (homonyms). Systems return the documents that contain the words segmented from the query without determining whether they are really relevant, so the returned results are inaccurate. Some words occur very frequently but carry no meaning in the document. Words such as "và", "với", "nhưng", ... have a very high frequency in any text.


Returning the documents that contain these words would give useless, unnecessary results. We therefore need to remove these words before searching.

2.6.2. Syllables and words in Vietnamese

Phonetically, a "tiếng" is a syllable. A syllable consists of lower-level units called phonemes. Each phoneme is written with a character called a letter. Semantically, the syllable is the smallest meaningful unit, although some syllables have no meaning. Grammatically, the syllable is the unit from which words are built. When syllables are used to form words, there are two cases:
- Words of one syllable: called simple words. In this case a word has only one syllable. Examples: ông, bà, cha, mẹ, ...
- Words of two or more syllables: called compound words. In this case a word has two or more syllables. Examples: xã hội, an ninh, hợp tác xã, ...
The word is the smallest unit used to form a sentence. Sentences are built from words, not from syllables.

2.6.3. The fnTBL method for automatic Vietnamese word segmentation

The main idea of transformation-based learning (TBL) is that, to solve a given problem, we apply transformations; at each step, the transformation that gives the best result is selected and applied again to the problem. The algorithm stops when no more transformations can be selected. The fnTBL system (Fast Transformation-based learning) uses two main files [1]:
- The training data file: the training data file is prepared by hand and must be accurate. Each sample (template) is placed on a separate line. For example, the training data for determining the part of speech of a text could be formatted as follows:
Công ty danhtu
ISA danhturieng
bị dongtu
kiểm tra. dongtu
In this example, each sample consists of two parts: the first part is the word and the second part is the corresponding part of speech (danhtu: noun, danhturieng: proper noun, dongtu: verb).
- The rule template file: each rule is placed on one line, and the fnTBL system applies the rule templates to the training data. For example:
chunk_-2 chunk_-1 => chunk
Applied to part-of-speech tagging, with chunk_-2 = verb, chunk_-1 = numeral and chunk = noun, this rule means: if the two preceding words are a verb and a numeral, change the part of speech of the current word to noun.

Application to Vietnamese word segmentation: the fnTBL method can be applied to Vietnamese word segmentation by changing only some of the formats.
- Building the training data file: the training data file for Vietnamese word segmentation has the following form:
Vì B
sao B
công B
ty I
ISA B
bị B
đặt B
vào B
tình B
trạng I
...
The characters B and I are called chunks and have the following meaning:


A syllable with chunk = B is a syllable that begins a word (begin).
A syllable with chunk = I is a syllable that lies inside a word (inside).
- Building the rule template file: after studying words in Vietnamese, we arrive at 3 rule templates for Vietnamese word segmentation:
chunk_0 word_0 => chunk
chunk_0 word_-1 word_0 => chunk
chunk_0 word_0 word_1 => chunk

a, The learning process

(1) Build a dictionary of words from the training data
(2) Initialize the words
(3) Derive the rule set

In step (1), a statistical pass over the available training data gives the dictionary of syllables (the lexicon). A syllable can occur in words with different chunks, so we record the number of occurrences of each syllable with each chunk. For example, in the word "công ty" the syllable "công" has chunk = B, but in the word "của công" the syllable "công" has chunk = I.

In step (2), a copy of the training data without chunks is created by deleting all the chunks. This new data set is then re-initialized with the most frequent chunk of each syllable according to the dictionary.

In step (3), the training data are compared with the data set under consideration. Based on the given rule templates, candidate rules are derived; each candidate rule is applied to the data set under consideration and scored (based on the number of errors that arise when the result is compared with the training data, which serve as the gold standard). The rule with the highest score above a given threshold is added to the list of selected rules. The result is a set of selected rules of the following form:
SCORE: 414 RULE: chunk_0=B word_0=t => chunk=I
SCORE: 312 RULE: chunk_0=B word_-1=của word_0=công => chunk=I
SCORE: 250 RULE: chunk_0=B word_0=ho => chunk=I
SCORE: 231 RULE: chunk_0=B word_0=ng => chunk=I
SCORE: 205 RULE: chunk_0=B word_0=nghiệp => chunk=I
SCORE: 175 RULE: chunk_0=B word_-1=phát word_0=triển => chunk=I
SCORE: 133 RULE: chunk_0=B word_-1=xã word_0=hội => chunk=I
SCORE: 109 RULE: chunk_0=B word_-1=đầu word_0=tư => chunk=I
SCORE: 100 RULE: chunk_0=B word_0=th => chunk=I
Line 2 gives the rule: if the current syllable is "công" (word_0=công), the preceding syllable is "của" (word_-1=của) and the chunk of the current syllable is B (chunk_0=B), then change the chunk of the current syllable to I, which means that "của công" must be one word. The whole learning process is described as follows:

Figure 4. The learning process
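To make the B/I chunk representation and the rule format concrete, the sketch below builds the lexicon from (syllable, chunk) pairs, assigns each syllable its most frequent chunk, and applies one learned rule. It is only an illustration of the idea: the function names and the dictionary-based rule encoding are assumptions made for this example and are not the interface or file format of the fnTBL toolkit, and the rule-learning (scoring) step is not shown.

```python
from collections import Counter, defaultdict


def build_lexicon(training):
    """Map each syllable to its most frequent chunk in the training pairs."""
    counts = defaultdict(Counter)
    for syllable, chunk in training:
        counts[syllable][chunk] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


def initial_chunks(syllables, lexicon, default="B"):
    """Unknown syllables are assumed to start a word (hypothetical default)."""
    return [lexicon.get(s, default) for s in syllables]


def matches(syllables, chunks, i, conditions):
    """Check conditions such as {'chunk_0': 'B', 'word_-1': 'của'} at position i."""
    for key, wanted in conditions.items():
        kind, offset = key.split("_")
        j = i + int(offset)
        if not 0 <= j < len(syllables):
            return False
        actual = chunks[j] if kind == "chunk" else syllables[j]
        if actual != wanted:
            return False
    return True


def apply_rules(syllables, chunks, rules):
    """Apply each (conditions, new_chunk) rule in order to every position."""
    for conditions, new_chunk in rules:
        chunks = [new_chunk if matches(syllables, chunks, i, conditions) else c
                  for i, c in enumerate(chunks)]
    return chunks


def to_words(syllables, chunks):
    """Join syllables into words: B starts a word, I continues the previous one."""
    words = []
    for s, c in zip(syllables, chunks):
        if c == "I" and words:
            words[-1] += " " + s
        else:
            words.append(s)
    return words


training = [("Vì", "B"), ("sao", "B"), ("công", "B"), ("ty", "I"), ("ISA", "B"),
            ("bị", "B"), ("đặt", "B"), ("vào", "B"), ("tình", "B"), ("trạng", "I")]
lexicon = build_lexicon(training)
rules = [({"chunk_0": "B", "word_-1": "của", "word_0": "công"}, "I")]

syllables = ["của", "công"]
chunks = apply_rules(syllables, initial_chunks(syllables, lexicon), rules)
print(to_words(syllables, chunks))
```

The same three functions (`initial_chunks`, `apply_rules`, `to_words`) are what would be run on a new document once a rule list has been learned.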


b, Identifying words in a new document

(1) The new document must have the same format as the training data file, that is, one syllable per line.
(2) Based on the dictionary, the most frequent chunk is assigned to each syllable of the new document.
(3) The rules obtained in the learning phase are applied to the document under consideration, which yields the complete words.

The phase of identifying words in a new document is described as follows:

Figure 5. The phase of identifying words in a new document

2.6.4. The Longest Matching method

The Longest Matching method segments words using an available dictionary [1]. To segment Vietnamese text with this method, we move from left to right and choose the word with the most syllables that is present in the dictionary, then continue with the next word until the end of the sentence. In this way we can easily and correctly segment phrases and sentences such as: "hợp tác|mua bán", "thành lập|nước|Việt Nam|dân chủ|cộng hoà", ... However, the method segments incorrectly in cases such as: "học sinh học sinh học" is segmented as "học sinh|học sinh|học", "một ông quan tài giỏi" is segmented as "một|ông|quan tài|giỏi", "trước bàn là một ly nước" is segmented as "trước|bàn là|một|ly|nước", ...

2.6.5. Combining fnTBL and Longest Matching

The two methods fnTBL and Longest Matching can be combined to obtain the best segmentation result. First the text is segmented with Longest Matching, and the output of this method is then used as the input from which fnTBL learns its rules.
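The greedy left-to-right procedure of the Longest Matching method (section 2.6.4) can be sketched in a few lines. The dictionary contents and the `max_len` limit on the number of syllables per word are assumptions made for this illustration; a syllable that is not in the dictionary is kept as a one-syllable word.

```python
def longest_matching(syllables, dictionary, max_len=4):
    """Segment a list of syllables by always taking the longest dictionary word."""
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate in dictionary:
                words.append(candidate)
                i += n
                break
    return words


dictionary = {"học sinh", "sinh học", "học", "hợp tác", "mua bán"}
print("|".join(longest_matching("hợp tác mua bán".split(), dictionary)))
print("|".join(longest_matching("học sinh học sinh học".split(), dictionary)))
```

The second call reproduces the failure case discussed above: because the method commits to "học sinh" at the first position, it never considers the reading "học sinh|học|sinh học", even though "sinh học" is in the dictionary.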

2.7. Conclusion of chapter 2


This chapter has presented, in general terms, the material related to Web document clustering, giving an overview from which to start solving the problem. It has also presented approaches to the difficulties of clustering Vietnamese Web documents. On this basis, the thesis concentrates on typical incremental Web clustering algorithms.


CHAPTER 3 - THE SUFFIX TREE CLUSTERING ALGORITHM AND THE DOCUMENT CLUSTER TREE ALGORITHM


3.1. Introduction to incremental Web page clustering algorithms
We are concerned with the problem of clustering documents for Web pages. Traditionally, document classification is carried out by hand. To assign a document to an appropriate class, a person first analyzes the content of the document, so a large amount of human effort is required. Some research has addressed the automatic classification of text documents. One direction is to classify text with machine learning techniques. However, these algorithms rely on a set of positive and negative training examples to learn the text classes, and the quality of the resulting classes depends on having suitable training examples. There are a great many terms and classes on the World Wide Web (or simply the Web), and many new terms and concepts are created every day. It is impossible to have domain experts define training examples to learn a classifier for each class in this way.

To carry out document classification automatically, clustering techniques are used. The appeal of cluster analysis is that it can find clusters directly from the input data without relying on any predefined information, such as training examples provided by domain experts. In this chapter, the thesis presents some clustering algorithms that are suitable for clustering Web pages because of their incremental nature, namely the Suffix Tree Clustering algorithm (STC) and the algorithm that uses the Document Cluster Tree (DC-Tree).


3.2. The suffix tree clustering algorithm


3.2.1. Description

A suffix tree (also called a PAT tree or, in an earlier form, a position tree) is a data structure that represents the suffixes of a given string in a way that allows many important string operations to be implemented particularly fast. The suffix tree for a string S is a tree whose edges are labeled with strings, such that each suffix of S corresponds to exactly one path from the root of the tree to a leaf. It is thus a radix tree for the suffixes of S. Building such a tree for the string S takes time and space linear in the length of S. Once it is built, several operations can be performed quickly, for example locating a substring in S, locating a substring if a certain number of errors is allowed, locating matches for a regular expression pattern, ... Suffix trees also provide one of the linear-time solutions to the longest common substring problem. These speed-ups come at a cost in memory, however, because storing the suffix tree of a string requires more space than storing the string itself.

Figure 6. Suffix tree for the string BANANA

The suffix tree for the string BANANA, padded with a terminal $. There are six paths from the root to a leaf (drawn as boxes), corresponding to the six suffixes A$, NA$, ANA$, NANA$, ANANA$ and BANANA$. The numbers in the boxes give the start position of the corresponding suffix. Suffix links are drawn as arrows.
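The six suffixes of BANANA$ and their start positions can be listed with a few lines of code. The sketch below is only meant to show what the tree indexes: it enumerates and sorts the suffixes explicitly, which takes quadratic time and space, whereas a real suffix tree is built in linear time. Positions are 0-based here, which is an assumption of this example, and the function names are hypothetical.

```python
def suffixes(text):
    """All suffixes of text + '$' (except '$' itself) with 0-based start positions."""
    s = text + "$"
    return sorted((s[i:], i) for i in range(len(text)))


def find_all(text, pattern):
    """Start positions of pattern in text: every occurrence of a substring
    is a prefix of some suffix."""
    return sorted(i for suffix, i in suffixes(text) if suffix.startswith(pattern))


for suffix, start in suffixes("BANANA"):
    print(start, suffix)
print(find_all("BANANA", "ANA"))
```

The substring search relies on the same observation that makes suffix trees useful: a pattern occurs in S exactly when it is a prefix of one of the suffixes of S.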

3.2.2. The STC algorithm

The Suffix Tree Clustering (STC) algorithm [5] is a linear-time clustering algorithm based on identifying the phrases that are common to groups of documents. A phrase in this context is an ordered sequence of one or more words. We define a base cluster as a set of documents that share a common phrase. STC has three logical steps: (1) "cleaning" the documents, (2) identifying base clusters using a suffix tree, and (3) combining the base clusters into clusters.

Step 1: Pre-processing. In this step, the string of text representing each document is transformed with a stemming algorithm (for example, removing prefixes and suffixes and reducing plural forms to singular). Sentence boundaries are marked (identified through punctuation and HTML tags). Tokens that are not words (such as numbers, HTML tags and punctuation) are stripped. The original document strings are kept, together with pointers from the start of each word in the transformed string to its position in the original string. These pointers make it possible to display the original text for the key phrases found in the transformed text.

Step 2: Identifying base clusters. The identification of base clusters can be viewed as the creation of an index of phrases for the document collection. This is done efficiently with a data structure called a suffix tree. The data structure can be built in time linear in the size of the document collection, and it can be built incrementally as the documents are being read. A suffix tree of a string S is a compact trie containing all the suffixes of S. The algorithm treats documents as strings of words, not of characters, so each suffix contains one or more words. In more detail, a suffix tree is described as follows:
1. A suffix tree is a rooted, directed tree.
2. Each internal node has at least 2 children.
3. Each edge is labeled with a non-empty substring of S. The label of a node is the concatenation of the labels of the edges on the path from the root to that node.
4. No two edges out of the same node have labels that begin with the same word.
5. For each suffix s of S, there exists a suffix-node whose label is s.

The suffix tree of a set of strings is a compact trie containing all the suffixes of all the strings in the set. Each suffix-node is marked to indicate the string it comes from, and the label of the suffix-node is a suffix of that string. For clustering, we build the suffix tree of all the sentences of all the documents in the collection. For example, we can build the suffix tree for the set of strings {"cat ate cheese", "mouse ate cheese too", "cat ate mouse too"}.
- The nodes of the suffix tree are drawn as circles.
- Each suffix-node has one or more boxes attached to it that indicate the string it comes from.
- Each box contains 2 numbers (the first number indicates the string the suffix comes from, the second number indicates which suffix of that string labels the suffix-node).

Figure 7: Suffix tree of the strings "cat ate cheese", "mouse ate cheese too" and "cat ate mouse too".

Figure 7 marks six nodes, a to f. Each of these nodes represents a group of documents and a phrase that is common to all of them. The label of the node is the common phrase, and the documents that tag the suffix-nodes descending from the node make up the document group. Each node therefore represents a base cluster. Moreover, every possible base cluster (containing two or more documents) appears as a node in the suffix tree. The following table lists nodes a to f of Figure 7 and the corresponding base clusters.
Table 1: The six nodes from Figure 7 and the corresponding base clusters.

Node   Phrase       Documents
a      cat ate      1, 3
b      ate          1, 2, 3
c      cheese       1, 2
d      mouse        2, 3
e      too          2, 3
f      ate cheese   1, 2
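The contents of Table 1 can be reproduced with a short sketch. It does not build a suffix tree (and is therefore not linear-time); instead it enumerates every word phrase and keeps those that would label an internal node of the tree, that is, phrases followed by at least two distinct continuations and shared by at least two documents. The function name and the per-document end marker are assumptions made for the illustration.

```python
from collections import defaultdict


def base_clusters(sentences):
    """Map each shared phrase to the sorted 1-based numbers of the
    documents containing it, keeping only phrases that would label an
    internal suffix-tree node."""
    docs_of = defaultdict(set)   # phrase -> documents containing it
    next_of = defaultdict(set)   # phrase -> distinct continuations
    for doc_id, sentence in enumerate(sentences, start=1):
        words = sentence.split()
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                phrase = " ".join(words[i:j])
                docs_of[phrase].add(doc_id)
                # A unique end marker per document plays the role of
                # the string terminator in the suffix tree.
                if j < len(words):
                    next_of[phrase].add(words[j])
                else:
                    next_of[phrase].add(("$", doc_id))
    return {
        phrase: sorted(docs)
        for phrase, docs in docs_of.items()
        if len(docs) >= 2 and len(next_of[phrase]) >= 2
    }


if __name__ == "__main__":
    example = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
    for phrase, docs in sorted(base_clusters(example).items()):
        print(phrase, docs)
```

Note that "cat" alone is not reported: it is always followed by "ate", so the tree only has a node for "cat ate", exactly as in node a of the table.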

Each base cluster is assigned a score that is a function of the number of documents it contains and of the words that make up its phrase. The score s(B) of a base cluster B with phrase P is:

s(B) = |B| · f(|P|) (*)

where |B| is the number of documents in base cluster B and |P| is the number of words in P that have a non-zero score. The score of a word is determined as follows. The algorithm maintains a stoplist that is extended with words typical of the Internet (for example previous, java, frames, mail). Words that appear in the stoplist, or that appear in too few documents (3 or fewer) or in too many (more than 40% of the collection), receive a score of 0. The function f in formula (*) penalizes single-word phrases, is linear for phrases of two to six words, and is constant for longer phrases.

Step 3: Combining base clusters. Documents may share more than one phrase. As a result, the document sets of different base clusters may overlap and may even be identical. To avoid a proliferation of nearly identical clusters, the third step of the algorithm merges base clusters whose document sets overlap heavily (note that the common phrases are not considered in this step).
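Returning to the score in formula (*), the sketch below shows one way it could be computed. The text only states the shape of f (penalty for single words, linear from two to six words, constant afterwards), so the concrete values used here, including 0.5 for a single word, are assumptions, as are the function names and the toy document frequencies.

```python
def effective_length(phrase, doc_freq, n_docs, stoplist):
    """Number of words in the phrase with a non-zero score: words in the
    stoplist, in 3 or fewer documents, or in more than 40% of the
    collection are ignored."""
    count = 0
    for word in phrase.split():
        df = doc_freq.get(word, 0)
        if word in stoplist or df <= 3 or df > 0.4 * n_docs:
            continue
        count += 1
    return count


def f(length):
    """Assumed shape of f: 0 for empty phrases, a penalty (0.5) for
    single words, linear for 2..6 words, constant for longer phrases."""
    if length <= 0:
        return 0.0
    if length == 1:
        return 0.5
    return float(min(length, 6))


def score(cluster_docs, phrase, doc_freq, n_docs, stoplist):
    """s(B) = |B| * f(|P|) for a base cluster with the given phrase."""
    return len(cluster_docs) * f(effective_length(phrase, doc_freq, n_docs, stoplist))


if __name__ == "__main__":
    df = {"suffix": 10, "tree": 12, "java": 8, "the": 90}
    print(score({1, 2, 3}, "the suffix tree java", df, 100, {"java"}))
```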


The algorithm defines a similarity measure between base clusters based on the overlap of their document sets. Let Bm and Bn be two base clusters of sizes |Bm| and |Bn|, and let |Bm ∩ Bn| be the number of documents common to both. The similarity of Bm and Bn is 1 if:
+) |Bm ∩ Bn| / |Bm| > 0.5 and
+) |Bm ∩ Bn| / |Bn| > 0.5
Otherwise the similarity is 0. Consider the following illustration, which continues the example of Figure 7. Each node is a base cluster, and two nodes are connected when their similarity is 1. A cluster is defined as a connected component of the base cluster graph. Each cluster consists of the union of the documents of all its base clusters.

Figure 7: Base cluster graph for the suffix tree example and Table 1.

In this example there is a single connected component, and therefore one cluster. If we assume that the word "ate" is in the stoplist, base cluster b is discarded because its phrase has a score of 0. The graph then has three connected components, giving three clusters.

The time needed to pre-process the documents in step 1 of STC is clearly linear in the size of the collection. With Ukkonen's algorithm, the time needed to insert the documents into the suffix tree is also linear in the size of the collection, as is the number of nodes that can be affected by these insertions. The total running time of STC is therefore linear in the size of the collection, that is, O(n), where n is the size of the document collection.
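The merging step, and the two outcomes stated for the example (one cluster with b, three clusters without it), can be checked with a short sketch. The base clusters are taken from Table 1; the function names are hypothetical and the connected components are found with a plain depth-first search.

```python
def similar(docs_a, docs_b, threshold=0.5):
    """Binary similarity: True when the overlap exceeds the threshold
    relative to both base clusters."""
    common = len(docs_a & docs_b)
    return common / len(docs_a) > threshold and common / len(docs_b) > threshold


def merge_base_clusters(base):
    """base maps a base-cluster name to its set of documents. Returns one
    (names, documents) pair per connected component of the graph."""
    names = sorted(base)
    seen = set()
    clusters = []
    for start in names:
        if start in seen:
            continue
        seen.add(start)
        component, stack = [], [start]
        while stack:
            current = stack.pop()
            component.append(current)
            for other in names:
                if other not in seen and similar(base[current], base[other]):
                    seen.add(other)
                    stack.append(other)
        documents = set().union(*(base[name] for name in component))
        clusters.append((sorted(component), sorted(documents)))
    return clusters


TABLE_1 = {
    "a": {1, 3}, "b": {1, 2, 3}, "c": {1, 2},
    "d": {2, 3}, "e": {2, 3}, "f": {1, 2},
}

if __name__ == "__main__":
    print(merge_base_clusters(TABLE_1))
    without_b = {k: v for k, v in TABLE_1.items() if k != "b"}
    print(merge_base_clusters(without_b))
```

Without b, clusters a and c share only one of two documents each, which does not exceed the 0.5 threshold, so a stays on its own.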


3.3. Clustering with the document cluster tree


3.3.1. Introduction

In clustering with the document cluster tree, a document is typically represented by a feature vector. Typically, each feature corresponds to a keyword or phrase that appears in the document collection, and each entry of the vector stores a weight for the corresponding feature of the document. Once the feature vectors of the documents have been extracted, a clustering algorithm can be applied to the set of vectors, as in conventional large-scale data clustering. The resulting document classes, together with their representative features (for example keywords or key phrases with sufficient document support in the cluster), are then presented to the user.

This thesis introduces a tree structure called the DC-tree (Document Clustering Tree), which can cluster documents without a training set [24]. With the DC-tree, an incoming data object is not forced to be inserted at a lower level when no similar child node exists for it. This prevents dissimilar data from being placed together. As a result, the clustering algorithm based on the DC-tree is stable under document insertions and tolerant of noisy documents.

The method can be useful in several ways:
(1) As a pre-processing step for Web page classification, so that the user can choose an appropriate class before searching, which makes the search more focused and more efficient.
(2) For online classification: when a search returns a large number of results, the technique can classify the results and give the user better guidance for future searches.
(3) For incremental Web page classification after the repository has been updated.

3.3.2. Feature extraction and document clustering

The first task is to identify a good feature extraction method suited to the Web environment. This section presents one such method. It also describes how documents and document clusters are represented, and finally presents the method used to estimate clustering quality.

a, Document feature extraction


The feature extraction method proposed for the Web document clustering algorithm does not depend on term frequency. The method balances several factors to find the best trade-off between coverage and the number of features used to represent documents. In our problem, the purpose of clustering is to help information retrieval by narrowing the scope of a search. In that setting the user may not want too many clusters in the result. Clusters that are too large or too small are also undesirable: clusters that are too large do not help narrow the search, while clusters that are too small increase the total number of clusters and may even amount to noise. A parameter k is used to set the approximate size of a cluster, so the number of clusters is approximately N/k, where N is the total number of documents. The proposed method consists of the following steps:
1. Randomly select a subset of documents of size m from the corpus.
2. Extract the set of words that appear at least once in these documents. Remove stop words and merge words with the same root by stemming.
3. Count the document frequency of the words extracted in step 2.
4. Set lower = k and upper = k.
5. Select all words whose document frequency lies between lower and upper.
6. Check whether the coverage of these words is greater than a predefined threshold. If so, stop. Otherwise set lower = lower - 1 and upper = upper + 1, and go back to step 5.

To extract representative features from the documents, a set of sample documents is chosen at random for the feature extraction in step 1. Experiments [24] indicate that this method can extract a good set of features for Web document clustering. A stop word list is commonly used to remove words that carry little meaning, and stemming is commonly used to merge words into a common form. Because shorter feature vectors lead to shorter clustering times, steps 4 to 6 try to minimize the number of features while still reaching reasonable coverage. Assume the user wants the resulting clusters to contain about k documents. In the ideal case, a feature of a cluster appears only in that cluster, so the document frequency of the feature is k. We therefore first select the features whose document frequency equals k, by setting lower and upper to k in step 4. The range [lower, upper] is then widened iteratively in step 6 to ensure sufficient coverage by the resulting feature set. Note that N/k is only a rough guide; the actual number of clusters produced may differ from N/k. The method also uses a coverage threshold to ensure that the selected features cover enough documents. The experiments in [24] suggest that 0.8 is a fairly good value for the coverage threshold.
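Steps 3 to 6 of the procedure above can be sketched as follows. The sketch assumes that the sample has already been drawn and stemmed (steps 1 and 2), that coverage means the fraction of sample documents containing at least one selected feature, and that the loop should stop once the range spans every observed document frequency; none of these details is spelled out in the text, and the function name is hypothetical.

```python
from collections import Counter


def select_features(sample_docs, k, coverage_threshold=0.8, stop_words=frozenset()):
    """sample_docs: iterable of word collections, one per sampled document.
    Returns (features, lower, upper, coverage)."""
    stop = set(stop_words)
    docs = [set(words) - stop for words in sample_docs]
    # Step 3: document frequency of every remaining word.
    doc_freq = Counter(word for doc in docs for word in doc)
    max_df = max(doc_freq.values(), default=0)
    # Step 4: start with the ideal frequency k.
    lower = upper = k
    while True:
        # Step 5: words whose document frequency lies in [lower, upper].
        features = {w for w, df in doc_freq.items() if lower <= df <= upper}
        # Step 6: coverage check (assumed definition, see lead-in).
        covered = sum(1 for doc in docs if doc & features)
        coverage = covered / len(docs) if docs else 0.0
        exhausted = lower <= 1 and upper >= max_df  # assumed safety stop
        if coverage > coverage_threshold or exhausted:
            return sorted(features), lower, upper, coverage
        lower, upper = lower - 1, upper + 1


if __name__ == "__main__":
    sample = [{"a", "x"}, {"a", "y"}, {"b", "x"}, {"b", "z"}, {"c"}]
    print(select_features(sample, k=2, coverage_threshold=0.7))
    print(select_features(sample, k=2, coverage_threshold=0.8))
```

In the second call the initial range [2, 2] covers exactly 80% of the sample, which is not greater than the threshold, so the range is widened once to [1, 3].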

b, Document representation


In our algorithm, a document Di is represented in the form Di = (Wi, IDi), where IDi is the document identifier that can be used to retrieve document Di, and Wi is the feature vector of the document: Wi = (wi1, wi2, ..., win). Here n is the number of extracted features and wij is the weight of the j-th feature, for j ∈ {1, 2, ..., n}. The algorithm uses a binary weighting scheme: wij = 1 if Di contains the j-th feature, and wij = 0 otherwise. As mentioned in the feature extraction section above, a typical Web page does not contain many words, so the frequency of a word does not reflect its actual importance. A binary weighting scheme is therefore the most suitable one for our problem.

c, Document cluster (DC)


A document cluster (DC) entry is a triple of information that we maintain about a set of documents in the same cluster:
(1) the number of documents
(2) the set of document identifiers
(3) the feature vector of the cluster

Definition 1 (DC): Given N documents in a cluster, {D1, D2, ..., DN}, the DC entry of a node is defined as the triple DC = (N, ID, W), where N is the number of documents in the cluster, ID is the set of identifiers of the documents in the cluster, i.e. ID = {ID1, ID2, ..., IDN}, and W is the feature vector of the document cluster, i.e. W = (w1, w2, ..., wn), where $w_j = \sum_{i=1}^{N} w_{ij}$ and n is the number of extracted features.

This triple not only summarizes the document frequencies within the cluster but can also be used to assess the similarity between two clusters. The following lemma provides a flexible way to merge two clusters into one and to obtain the DC entry of the merged cluster.

Lemma 1 [24] (Additivity): Let DC1 = (N1, ID1, W1) and DC2 = (N2, ID2, W2) be the DC entries of two disjoint document clusters, where disjoint means that a document does not belong to more than one cluster at the same time. Then the new DC entry, DCnew, of the cluster formed by merging the two disjoint clusters is DCnew = (N1 + N2, ID1 ∪ ID2, W1 + W2), where W1 + W2 = (w11 + w21, w12 + w22, ..., w1n + w2n) and n is the number of extracted features.

d, Evaluation techniques

To evaluate the quality of the clustering result, we use the F-measure [23]. The evaluation method is as follows. For each manually labeled topic T in the document collection, assume that a cluster X corresponding to that topic has been formed.
N1 = number of documents of topic T in cluster X
N2 = number of documents in cluster X
N3 = total number of documents of topic T
P = Precision(X, T) = N1/N2
R = Recall(X, T) = N1/N3
The F-measure for topic T is defined as follows:

$F(T) = \frac{2PR}{P + R}$
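As a quick check of the definition, the sketch below computes precision, recall and F for one cluster and one topic, both given as sets of document identifiers. The function name and the convention of returning 0 when the cluster and the topic share no documents are assumptions.

```python
def f_measure(cluster_docs, topic_docs):
    """F = 2PR / (P + R), with P = N1/N2 and R = N1/N3 as defined above."""
    n1 = len(cluster_docs & topic_docs)   # documents of the topic in the cluster
    if n1 == 0:
        return 0.0                        # assumed convention for no overlap
    precision = n1 / len(cluster_docs)    # N1 / N2
    recall = n1 / len(topic_docs)         # N1 / N3
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    cluster = {1, 2, 3, 4}
    topic = {3, 4, 5, 6, 7, 8}
    print(f_measure(cluster, topic))
```

Here P = 2/4 and R = 2/6, so F works out to 0.4 up to floating-point rounding.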

To evaluate a topic T, we take the cluster with the highest F-measure as the cluster C for T, and that F-measure becomes the score of topic T. The overall F-measure [22] of the resulting cluster tree is the weighted average of the F-measures of the topics:

$\text{Overall\_F\_Measure} = \frac{\sum_{T \in M} |T| \times F(T)}{\sum_{T \in M} |T|}$

where M is the set of topics, |T| is the number of documents of topic T, and F(T) is the F-measure of topic T.

3.3.3. The document cluster tree (DC-tree)

This section introduces a Web document clustering algorithm built on the document cluster tree (DC-tree). In a DC-tree, each node can be regarded as a document cluster. The tree structure is used to guide an incoming document object to a suitable document cluster (DC) at the leaf nodes. It is similar to the B+-tree [2] in that the index records at the leaf nodes contain pointers to data objects, but it is not a height-balanced tree. The structure is designed this way because assigning a document to a cluster then requires visiting only a small number of nodes.

A DC-tree is a tree with four parameters: the branching factor (B), two similarity thresholds (S1 and S2, where 0 ≤ S1, S2 ≤ 1) and the minimum number of children of a node (M). A non-leaf node contains at most B entries of the form (DCi, Childi), where i = 1, 2, ..., B, Childi is a pointer to the i-th child node or to a document, and DCi is the DC entry of the sub-cluster represented by the i-th child node or document. A non-leaf node therefore represents a cluster made up of all the sub-clusters described by its entries.
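Since every entry of the tree carries a DC triple, and the triple of a parent can be obtained from those of its children through the additivity lemma of Section 3.3.2, it may help to see the triple as a small data type. The class and function names below are hypothetical; the sketch only restates the definition DC = (N, ID, W) and the merge rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DCEntry:
    n: int            # N: number of documents in the cluster
    ids: frozenset    # ID: identifiers of those documents
    w: tuple          # W: per-feature document counts


def from_document(doc_id, binary_vector):
    """Wrap a single document (binary feature vector) in a DC entry."""
    return DCEntry(1, frozenset([doc_id]), tuple(binary_vector))


def merge(a, b):
    """Additivity: (N1 + N2, ID1 union ID2, W1 + W2) for disjoint clusters."""
    if a.ids & b.ids:
        raise ValueError("clusters must be disjoint")
    if len(a.w) != len(b.w):
        raise ValueError("feature vectors must have the same length")
    return DCEntry(a.n + b.n, a.ids | b.ids, tuple(x + y for x, y in zip(a.w, b.w)))


if __name__ == "__main__":
    d1 = from_document("d1", [1, 0, 1])
    d2 = from_document("d2", [1, 1, 0])
    print(merge(d1, d2))
```

Because the merged vector holds document counts per feature, dividing an entry of w by n gives the fraction of documents in the cluster that contain that feature.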


A DC leaf node contains at most B entries of the form (DCi, Doci), where i ∈ {1, 2, ..., B}, Doci is a pointer to a document or to a set of documents, and DCi is the DC entry of the corresponding sub-cluster. We call the set of documents under one pointer a document leaf node, to distinguish it from a tree leaf node, or DC leaf node (see Figure 8). A DC leaf node also represents a cluster made up of all the sub-clusters described by its DC entries. The DC-tree allows an incoming document entry to be inserted as a new document leaf node at different levels of the tree, so the DC-tree is not a height-balanced tree. Figure 8 shows an example of a DC-tree of height 2 with B = 3 and M = 2. Note that the tree is not balanced. Two thresholds are used when building the tree:

Figure 8. Example of a DC-tree

Threshold S1: Poor clustering results (for example, documents of different classes placed in the same subtree or cluster) can be caused by the order in which documents are inserted. To prevent this, S1 is used to decide whether the incoming entry E may be passed down to the next level during insertion. If the current node has a child whose similarity to the incoming document is greater than S1, the new document is passed to that child node. Otherwise the new document is added to the current node as a new leaf node.


Threshold S2: Because the DC-tree is used for clustering documents rather than as an index, there is no need to force every leaf entry to point to a single document. To reduce insertion time, a new document may be merged with a leaf entry if their similarity is greater than the threshold S2. The merge uses the additivity property of Lemma 1. It reduces the number of node insertions and splits, so insertion time decreases. The DC-tree can be regarded as a representation of the data set in which each document leaf is not a single data point but a cluster of data points (a leaf may hold many data points as long as the S2 threshold is satisfied).

With the DC-tree defined in this way, the size of the tree is a function of the thresholds S1 and S2. If we set S1 = 0 and S2 to a value greater than 1, the DC-tree behaves like a balanced tree such as the B+-tree or the R-tree [21]. Deletion and node merging are similar to those of the B+-tree.

A. Insertion

The following algorithm inserts a document object into the DC-tree. The document object may be a single document or a cluster of documents represented by a DC entry (E). If the document object is a single document, it is first wrapped in a DC entry (E). The insertion algorithm proceeds as follows:
1. Identify the appropriate leaf node: Starting from the root, E descends the DC-tree by choosing the closest child node whose similarity is greater than S1. If no such child exists, E is inserted as a document leaf node into an empty entry of the node. If there is no empty entry, the node must be split.
2. Modify the leaf node: On reaching a leaf node of the DC-tree, we find the leaf entry closest to E, denoted Li, and check whether it can be merged with E without violating the similarity threshold S2. If so, the entry for Li is merged with E. Note that the DC entry of the new cluster can be computed from the DC entries of Li and E using Lemma 1. Otherwise, E is added to the leaf node. If the leaf node has room for this entry we are done; otherwise the leaf node must be split.
3. Modify the path from the leaf to the root: After E has been inserted into a leaf node, the non-leaf entries on the path from the root to the leaf must be updated. If no split occurred, this is done by adding DC entries to reflect the addition of E. Splitting a leaf requires inserting a new non-leaf entry into the parent node, corresponding to the newly created leaf. If the parent has room for this entry, then at all higher levels we only need to update the DC entries to reflect the addition of E. In general, the parent may have to be split as well, and so on up to the root. If the root is split, the height of the tree increases by 1 and a new root is created.

B. Node splitting

To add a new entry to a full node containing B entries, the B + 1 entries must be divided between two nodes. The division should be made so that the similarity between the two new nodes is as small as possible and the similarity between the documents within each node is as large as possible. We use an algorithm that partitions a set of B + 1 entries into two groups. The simplest approach is to generate all possible groupings and choose the best one, but the number of groupings can be very large, approximately $2^{B-1}$. The node splitting algorithm below uses a selection heuristic instead. It is similar to the method used in the R-tree:
1. Pick a seed for each group: For each pair of entries E1 and E2, compute the similarity between them. Choose the pair with the lowest similarity as the first elements of the two groups. To break ties, choose the pair with the largest number of documents.
2. Check the termination condition: If all entries have been assigned to a group, stop. If one group has so few entries that all unassigned entries must be assigned to it to reach the minimum number of entries M, assign them to it and stop.
3. Select a group: For each entry E that is not yet in a group, compute the similarity between E and the seed entry of each group. Add the entry to the group with which it has the highest similarity. Go back to step 2.
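The node splitting heuristic above can be sketched as follows. The section does not define the similarity function between entries, so cosine similarity between feature vectors is used here purely as an assumption; the function names are hypothetical, entries are reduced to their feature vectors, and the tie-breaking rule on document counts is left out for brevity.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Assumed similarity measure between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def split_node(vectors, min_entries):
    """Partition the indices of B + 1 entry vectors into two groups."""
    # Step 1: the least similar pair seeds the two groups.
    seed_a, seed_b = min(
        combinations(range(len(vectors)), 2),
        key=lambda pair: cosine(vectors[pair[0]], vectors[pair[1]]),
    )
    groups = ([seed_a], [seed_b])
    remaining = [i for i in range(len(vectors)) if i not in (seed_a, seed_b)]
    while remaining:
        # Step 2: if a group needs every remaining entry to reach the
        # minimum M, give it all of them and stop.
        for group in groups:
            if len(group) + len(remaining) <= min_entries:
                group.extend(remaining)
                remaining = []
                break
        if not remaining:
            break
        # Step 3: assign the next entry to the group with the closer seed.
        index = remaining.pop(0)
        to_a = cosine(vectors[index], vectors[seed_a])
        to_b = cosine(vectors[index], vectors[seed_b])
        groups[0 if to_a >= to_b else 1].append(index)
    return groups


if __name__ == "__main__":
    entries = [(1, 1, 0, 0), (1, 0, 0, 0), (0, 0, 1, 1), (0, 0, 0, 1), (1, 1, 1, 0)]
    print(split_node(entries, min_entries=2))
```

The pairwise seed search costs O(B^2) similarity computations, which is far cheaper than examining the roughly 2^(B-1) possible groupings.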


C. Deletion and node merging

The deletion algorithm is similar to that of the B+-tree. If, after an entry has been removed, the number of remaining entries is greater than or equal to the minimum number of entries M, the deletion is complete. Otherwise node merging is applied, which means that the node is merged with its siblings. Merging a node may in turn require removing an entry from its parent node. This process can propagate up to the root, and the height of the tree may decrease if necessary.

D. Identifying interesting clusters

Cluster identification starts from the root of the tree. A breadth-first search is used to discover interesting clusters. An interesting cluster is defined as a cluster that contains representative features and whose size lies within a predefined range. The lower and upper values found by our feature extraction method can be used to determine the range of cluster sizes. Let l and u be the lower and upper bounds on the cluster size. Then l and u can be determined by the following formulas:

(1) $l = \mathit{lower} \times \frac{N}{m}$

(2) $u = \mathit{upper} \times \frac{N}{m}$

where N is the size of the data set and m is the size of the sample data set used during feature extraction. The range can also be adjusted manually to obtain a good clustering result. Once an interesting cluster has been identified, the sub-clusters in its child nodes no longer need to be visited.

A representative feature is defined as a feature that has sufficient support in the cluster. That is, the document frequency of a representative feature must be greater than a predefined threshold, which we call the representative threshold. These features are then used to represent the cluster.
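The size bounds and the representative-feature test can be illustrated with a few lines of code. The sketch assumes that the representative threshold is compared with the fraction of documents in the cluster that contain the feature (w_j / N), which the text does not state explicitly; the function names and sample numbers are hypothetical.

```python
def size_bounds(lower, upper, n_total, m_sample):
    """l = lower * N / m and u = upper * N / m."""
    scale = n_total / m_sample
    return lower * scale, upper * scale


def representative_features(w, n_docs, rep_threshold):
    """Indices of features whose share of the cluster's documents
    exceeds the representative threshold (assumed to be a ratio)."""
    return [j for j, count in enumerate(w) if count / n_docs > rep_threshold]


def is_interesting(n_docs, w, bounds, rep_threshold):
    """A cluster is interesting when its size lies within the bounds and
    it has at least one representative feature."""
    low, high = bounds
    has_features = bool(representative_features(w, n_docs, rep_threshold))
    return low <= n_docs <= high and has_features


if __name__ == "__main__":
    bounds = size_bounds(lower=8, upper=12, n_total=1000, m_sample=100)
    print(bounds)
    print(representative_features((50, 10, 45), 100, 0.4))
    print(is_interesting(100, (50, 10, 45), bounds, 0.4))
```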


3.4. Conclusion of Chapter 3

This chapter presented in detail two incremental clustering algorithms, STC and DC-tree. Along with remarks on each algorithm, the thesis indicates which clustering algorithm is suitable for Web documents in a search engine setting. The experimental implementation of the algorithm and the evaluation of its results are presented in the next chapter.


CHAPTER 4 - EXPERIMENTAL SOFTWARE AND EXPERIMENTAL RESULTS

4.1. Introduction
Within the scope of this thesis, I apply the document clustering algorithm based on the DC-tree structure in my experimental program. To test DC-tree clustering, I implemented the algorithm in C# on the Microsoft .NET Framework, using SQL Server 2000 to store the database. The main functions of the program are:

- Building the dictionary data
Following the idea of clustering by phrases, the program builds a dictionary that serves the Longest Matching word segmentation algorithm. Initially, the entries are built from the Vietnamese-English dictionary data available at http://www.stardict.org. The data can be extended and corrected over time to improve clustering quality.

- Fetching data from the Internet
The data to be clustered is fetched from the Internet independently of the clustering itself. The program has a predefined depth threshold n for fetching data. After the administrator supplies a URL, the program automatically retrieves the content of the page at that URL, then parses the content to find the other URLs in the page. The process is repeated for the URLs found until depth n is reached. With a suitable depth n, the entire content of a website can be retrieved.

- Word segmentation and clustering
This function lets the program segment and cluster newly fetched data. It is carried out in three steps:
Step 1: Segment words with the Longest Matching algorithm and the prepared dictionary.
Step 2: Segment words with the fnTBL algorithm, using the output of the Longest Matching algorithm.
Step 3: Cluster with the DC-tree algorithm, using a similarity function based on the extracted phrases.

- Searching the clustering results
Searching uses a two-step algorithm:
Step 1: Compute the similarity between the query string and the features of each cluster. If the similarity is greater than a threshold S1, apply step 2 to that cluster.
Step 2: Find the documents in the cluster whose similarity to the query string is higher than a threshold S2.
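Step 1 of the segmentation function relies on Longest Matching. A minimal sketch of the usual greedy form of that technique is shown below: at each position it takes the longest run of syllables that appears in the dictionary, falling back to a single syllable. The function name, the max_len limit and the toy dictionary are assumptions and do not reproduce the thesis program.

```python
def longest_matching(syllables, dictionary, max_len=4):
    """Greedy left-to-right segmentation of a list of syllables into
    dictionary words, preferring the longest match at each position."""
    tokens = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, shrinking down to one syllable.
        for size in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + size])
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    words = {"phân cụm", "tài liệu", "tìm kiếm", "máy tìm kiếm"}
    text = "phân cụm tài liệu cho máy tìm kiếm"
    print(longest_matching(text.split(), words))
```

Greedy matching can choose the wrong boundary when words overlap, which is presumably one reason a second, trainable pass such as fnTBL is applied to its output.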

4.2. Database design


The database of the program is designed as shown in the figure below. The purpose of each table is described as follows.

Table: Dictionary
This table holds the Vietnamese dictionary data.

Field name          Data type   Description
PhraseID            Int         Primary key of the table.
Phrase              Nvarchar    The phrase to be stored.
PhraseDescription   Ntext       Description of the phrase. Not used at present. Holds the English meaning after conversion from the StarDict dictionary.


Table: Documents
This table holds the documents fetched by the program.

Field name    Data type   Description
DocID         Int         Primary key of the table.
Source        Nvarchar    Source address of the original document. Used as an index to avoid duplicate documents.
Snipet        Ntext       Excerpt of the document, used for clustering.
IsTokenized   Bit         Indicates whether the document has been segmented into words.
IsClustered   Bit         Indicates whether the document has been clustered.

Table: DocumentIndex
This table links documents to the dictionary data.

Field name   Data type   Description
DocIndexID   Int         Primary key of the table.
PhraseID     Int         Foreign key, references the Dictionary table.
DocID        Int         Foreign key, references the Documents table.
Score        Float       Similarity/frequency of the keyword in the document, computed by a similarity function.


Table: Nodes
Holds the nodes of the DC-tree.

Field name     Data type   Description
NodeID         Int         Primary key of the table.
NodeParentID   Int         The parent node in the tree.
ClusterID      Int         The cluster the node belongs to.

Table: Node-Document
A node can contain many documents and, conversely, a document can appear under many nodes. This table represents that many-to-many relationship.

Field name       Data type   Description
NodeDocumentID   Int         Primary key of the table.
NodeID           Int         Foreign key, references the Nodes table.
DocID            Int         Foreign key, references the Documents table.

Table: Clusters
Holds the clusters that have been found.

Field name   Data type   Description
ClusterID    Int         Primary key of the table. Gives the sequence number of the cluster.


The entity relationship diagram of the tables is shown below:

Figure 7. Entity relationship diagram of the experimental program

4.3. The experimental program

Applying the study of clustering theory, each processing step in our experimental program is separated into its own part. Corresponding to the main functions described above, the program consists of four main modules: Dictionary, Data Fetching, Clustering and Search.


- Dictionary module: displays all the words in the Vietnamese dictionary. With the initial data taken from the Vietnamese-English dictionary at http://www.stardict.org, we obtain a fairly complete dictionary of Vietnamese words. Words can also be added or removed when necessary. The set of words in this dictionary is used in the word segmentation step for the documents to be clustered.

Figure: Screen for updating and editing the Dictionary

- Data Fetching module: To build the repository of Web documents, data must be fetched. The user enters the URL of a website, and the system automatically finds and retrieves all of its content down to a predefined depth n.
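The depth-limited fetching described for this module can be sketched as a breadth-first traversal. To keep the example self-contained and testable without network access, the page downloader is passed in as a function; that fetch parameter, the regular expression used to find links and the example.test URLs are all assumptions for illustration, not part of the thesis program.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Simplified link pattern; a real crawler would use an HTML parser.
HREF = re.compile(r'''href=["'](.*?)["']''', re.IGNORECASE)


def crawl(start_url, fetch, max_depth):
    """Fetch start_url and every page reachable from it within max_depth
    link hops. fetch(url) must return the page's HTML, or None on failure.
    Returns a dict mapping each fetched URL to its HTML."""
    pages = {}
    seen = {start_url}               # avoids fetching the same source twice
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        if depth == max_depth:
            continue                 # depth n reached: do not follow links
        for href in HREF.findall(html):
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages


if __name__ == "__main__":
    site = {
        "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
        "http://example.test/a": '<a href="/c">C</a>',
        "http://example.test/b": "",
        "http://example.test/c": "deep page",
    }
    print(sorted(crawl("http://example.test/", site.get, max_depth=1)))
```

The seen set plays the same role as the Source field of the Documents table, which is described as a guard against duplicate documents.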


Figure: Screen for fetching data from the Internet

- Clustering module: After the data has been fetched, the documents are clustered. The system performs the clustering automatically. When clustering is run again on a newly fetched data set, the old data that was clustered earlier does not need to be clustered again. Clustering only has to be performed on the new data, building on the results of the earlier runs. The algorithm uses the following parameters:
M: minimum number of children of a node, M = 8
B: branching factor of the tree, B = 20
S2: similarity threshold 2, S2 = 1.0
S1: similarity threshold 1, S1 = 0.3
repThreshold: threshold for representative features, repThreshold = 0.4
MCS: minimum cluster size, MCS = 100


Figure: Screen for clustering data fetched from the Internet

- Search module: The user enters the keyword to search for. The system finds the documents related to the keyword.


Figure: Screen for the Search function.

4.4. Conclusion of Chapter 4

This chapter presented the experimental implementation of the clustering algorithm for Vietnamese Web documents, using the DC-tree data structure described in Chapter 3. The program is written in C# on the Microsoft .NET Framework and uses SQL Server 2000 to store its database. The program performs clustering with reasonably good results.


CONCLUSION

The thesis has covered several topics in Web clustering and achieved the following results:
- It gave an overview of the Web clustering problem and of Web clustering solutions (requirements, techniques, evaluation), with particular attention to the incremental nature of Web clustering algorithms.
- It presented two incremental Web clustering algorithms, STC and DC-tree, and analyzed the basic concepts and foundations on which these algorithms are developed.
- It built experimental software that clusters documents with the DC-tree algorithm. The DC-tree search engine system developed in the thesis has been put on the web, with tools for recording user queries, the clusters found and the links that users follow. The system runs and is able to cluster Web documents.

Because of limits on time and ability, the thesis has not yet evaluated the clustering quality of the system. In the future, we will carry out more thorough evaluations. We plan to produce statistics based on the behavior of the system in practice. In addition, we may study approaches to handling synonyms in Vietnamese.


REFERENCES

Vietnamese
[1]. Đinh Điền, Xử lý ngôn ngữ tự nhiên, NXB Giáo Dục.

English
[2]. Clement T. Yu and Weiyi Meng (1998), Principles of Database Query Processing for Advanced Applications, Morgan Kaufmann Publishers, Inc.
[3]. Gerard Salton and Michael J. McGill, Introduction to Modern Information Retrieval.
[4]. M. Steinbach, G. Karypis, V. Kumar (2000), A Comparison of Document Clustering Techniques, TextMining Workshop, KDD.
[5]. O. Zamir and O. Etzioni (1998), Web Document Clustering: A Feasibility Demonstration, Proc. of the 21st ACM SIGIR Conference, 46-54.
[6]. O. Zamir, O. Etzioni, O. Madani, R. M. Karp (1997), Fast and Intuitive Clustering of Web Documents, Proc. of the 3rd International Conference on Knowledge Discovery and Data Mining.
[7]. K. Cios, W. Pedrycz, R. Swiniarski (1998), Data Mining Methods for Knowledge Discovery, Kluwer Academic Publishers.
[8]. R. Krishnapuram, A. Joshi, L. Yi (1999), A Fuzzy Relative of the k-Medoids Algorithm with Application to Web Document and Snippet Clustering, Proc. IEEE Intl. Conf. Fuzzy Systems, Korea.
[9]. Z. Jiang, A. Joshi, R. Krishnapuram, L. Yi (2000), Retriever: Improving Web Search Engine Results Using Clustering, Technical Report, CSEE Department, UMBC.
[10]. T. H. Haveliwala, A. Gionis, P. Indyk (2000), Scalable Techniques for Clustering the Web, Extended Abstract, WebDB2000, Third International Workshop on the Web and Databases, in conjunction with ACM SIGMOD2000, Dallas, TX.
[11]. A. Bouguettaya (1996), On-Line Clustering, IEEE Trans. on Knowledge and Data Engineering.
[12]. A. K. Jain and R. C. Dubes (1988), Algorithms for Clustering Data, John Wiley & Sons.
[13]. G. Karypis, E. Han, V. Kumar (1999), CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling, IEEE Computer 32.
[14]. O. Zamir and O. Etzioni (1999), Grouper: A Dynamic Clustering Interface to Web Search Results, Proc. of the 8th International World Wide Web Conference, Toronto, Canada.
[15]. D. R. Cutting, D. R. Karger, J. O. Pedersen, J. W. Tukey (1993), Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, In Proceedings of the 16th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[16]. R. Michalski, I. Bratko, M. Kubat (1998), Machine Learning and Data Mining: Methods and Applications, John Wiley & Sons Ltd.
[17]. J. Jang, C. Sun, E. Mizutani (1997), Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, Prentice Hall.
[18]. G. Biswas, J. B. Weinberg, D. Fisher (1998), ITERATE: A Conceptual Clustering Algorithm for Data Mining, IEEE Transactions on Systems, Man and Cybernetics.
[19]. Z. Huang (1997), A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining, Workshop on Research Issues on Data Mining and Knowledge Discovery.
[20]. Y. Yang and J. Pedersen (1997), A Comparative Study on Feature Selection in Text Categorization, In Proc. of the 14th International Conference on Machine Learning.
[21]. A. Guttman (1984), R-trees: A Dynamic Index Structure for Spatial Searching, In Proceedings of ACM SIGMOD.
[22]. Bjornar Larsen and Chinatsu Aone (1999), Fast and Effective Text Mining Using Linear-time Document Clustering, In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA.
[23]. C. J. van Rijsbergen (1979), Information Retrieval, Butterworth & Co (Publishers) Ltd.
[24]. Wai-chiu Wong and Ada Fu (2000), Incremental Document Clustering for Web Page Classification, IEEE 2000 Int. Conf. on Info. Society in the 21st Century: Emerging Technologies and New Challenges (IS2000), Japan.
[25]. Pierre Baldi, Paolo Frasconi, Padhraic Smyth (2003), Modeling the Internet and the Web: Probabilistic Methods and Algorithms, Wiley.
[26]. Seán Slattery (2002), Hypertext Classification, PhD Thesis (CMU-CS-02-142), School of Computer Science, Carnegie Mellon University.
