MASTER'S THESIS
Hanoi, 2007
Acknowledgments
With these opening lines, I would like to express my most sincere and profound gratitude to my supervisor, Dr. Ha Quang Thuy, who devotedly guided and advised me and gave me the best possible conditions from the start of this work until its completion. I would also like to thank all my beloved family members and all my friends, who always helped and encouraged me whenever I ran into difficulties or dead ends. Finally, I sincerely thank my colleagues at the IT Center of NHNo&PTNT VN (Vietnam Bank for Agriculture and Rural Development), whose extremely useful advice helped me resolve the difficulties and obstacles that arose while writing this thesis.
DECLARATION
I declare that the results presented in this thesis are my own work and have not been copied from anyone else. Everything presented in the thesis is either my own or has been synthesized from multiple sources. All references have a clear origin and are cited properly. I take full responsibility for this declaration and accept any disciplinary action prescribed by the regulations should it prove false.
Hanoi, 1 November 2007
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
LIST OF FIGURES AND TABLES
INTRODUCTION
CHAPTER 1 - OVERVIEW OF WEB DATA MINING
  1.1. Web data mining
    1.1.1. Introduction to data mining
    1.1.2. Web data and the need for information retrieval
    1.1.3. Characteristics of Web data
    1.1.4. Approaches to Web data mining
    1.1.5. The need for Web document clustering
  1.2. Information retrieval model
    1.2.1. Introduction
    1.2.2. The information retrieval process in the system
    1.2.3. Applying clustering to the retrieval system
  1.3. Conclusion of Chapter 1
CHAPTER 2 - WEB CLUSTERING ALGORITHMS
  2.1. Basic notions of document clustering algorithms
  2.2. Evaluation criteria for clustering algorithms
  2.3. Properties of Web clustering algorithms
    2.3.1. Data model
    2.3.2. Similarity measures
    2.3.3. Clustering model
  2.4. Some typical Web clustering techniques
    2.4.1. Hierarchical clustering
    2.4.2. Partitional clustering
  2.5. Requirements for Web clustering algorithms
    2.5.1. Extracting characteristic information
    2.5.2. Overlapping clustering
    2.5.3. Performance
    2.5.4. Noise tolerance
    2.5.5. Incrementality
    2.5.6. Presentation of results
  2.6. Automatic Vietnamese word segmentation
    2.6.1. Some difficulties in clustering Vietnamese Web pages
    2.6.2. Syllables and words in Vietnamese
    2.6.3. The fnTBL method for automatic Vietnamese word segmentation
    2.6.4. The Longest Matching method
    2.6.5. Combining fnTBL and Longest Matching
  2.7. Conclusion of Chapter 2
CHAPTER 3 - THE SUFFIX TREE CLUSTERING ALGORITHM AND THE DOCUMENT CLUSTERING TREE ALGORITHM
  3.1. Introduction to incremental Web page clustering algorithms
  3.2. The suffix tree clustering algorithm
    3.2.1. Description
    3.2.2. The STC algorithm
  3.3. Clustering with the document clustering tree
    3.3.1. Introduction
    3.3.2. Feature extraction and document clustering
    3.3.3. The document clustering tree (DC-tree)
  3.4. Conclusion of Chapter 3
CHAPTER 4 - EXPERIMENTAL SOFTWARE AND EXPERIMENTAL RESULTS
  4.1. Introduction
  4.2. Database design
  4.3. The experimental program
  4.4. Experimental results
  4.5. Conclusion of Chapter 4
LIST OF ABBREVIATIONS
AHC: Agglomerative Hierarchical Clustering
CSDL: database (Vietnamese: co so du lieu)
DF: Document Frequency
DC-tree: Document Clustering Tree
fnTBL: Fast Transformation-Based Learning
FCM: Fuzzy C-Means
FCMdd: Fuzzy C-Medoids
IR: Information Retrieval (model)
IDF: Inverse Document Frequency
KDD: Knowledge Discovery in Databases
STC: Suffix Tree Clustering
TF: Term Frequency
UPGMA: Unweighted Pair-Group Method using Arithmetic averages
INTRODUCTION
The World Wide Web is an enormous information repository whose potential is regarded as unlimited. Web mining has recently become a topical research problem, attracting many research groups around the world to study and propose new models and methods for building effective tools that help users synthesize information and find knowledge in the huge collection of Web pages on the Internet.

Web document clustering is a typical problem in Web mining. It partitions a set of documents into subsets that share common properties, and clustering the Web pages returned by a search engine is a particularly useful case [4-6, 8-15, 18, 19, 22, 24]. The set of Web pages that a search engine returns in response to a query is generally very large, so the clustering algorithm used here needs one very important property, "incrementality": the algorithm does not have to run only on the complete data set, but can proceed from part of the data towards all of it [4, 6, 11, 14, 15, 24]. This allows the algorithm to run while the search engine is still delivering result pages.

The thesis surveys incremental Web clustering methods and carries out several experiments that integrate these research results into a search-engine-style program that downloads Web pages. It also takes some first steps in applying clustering to Vietnamese Web pages: it builds an experimental program and runs clustering experiments on Vietnamese Web pages.

Apart from the Introduction, the Conclusion and the Appendices, the thesis is divided into four main chapters:

Chapter 1 - Overview of Web data mining. This chapter introduces the most basic material and gives an overview of Web data mining. It also briefly describes an information retrieval system and the need to apply clustering to such a system.
Chapter 2 - Web clustering algorithms. This chapter gives an overview of Web clustering algorithms, their characteristics, and the requirements placed on them. The measures applied to Web clustering algorithms are also presented, together with some basic background on the Vietnamese language.

Chapter 3 - The suffix tree clustering algorithm and the document clustering tree algorithm. This chapter analyzes incremental Web clustering algorithms in depth. The thesis concentrates on two incremental algorithms: STC and the clustering algorithm based on the DC-tree structure.

Chapter 4 - Experimental software and experimental results. This chapter presents the results of Web clustering experiments run with the experimental software, which is based on the DC-tree clustering algorithm. The program is written in C# on Microsoft's .NET Framework and uses SQL Server 2000 to store the database. The software runs and produces clusters; however, because of limited time, the thesis has not yet carried out a formal evaluation of the clustering results.

The Conclusion summarizes the results of the thesis and the directions for further research on its topics.

The thesis has achieved some promising initial results in studying and implementing incremental Web clustering algorithms, but it inevitably contains shortcomings. Comments and suggestions are very welcome so that the author can improve the research results.
Applications of data mining
Although data mining is a new approach, it has attracted the attention of many researchers and developers thanks to its practical applications. Some typical applications are [7,16]:
* Data analysis and decision support
* Medical treatment
* Text mining and Web mining
* Bioinformatics
* Finance and the stock market
* Insurance
* Pattern recognition
* and so on.

1.1.2. Web data and the need for information retrieval
The rapid growth of the Internet and of intranets has produced an enormous volume of hypertext data (Web data). As the content and the number of Web pages on the Internet change and grow by the day and by the hour, finding information becomes harder and harder for users. The need to search an unstructured database (including text data) can be said to have developed mainly alongside the Internet. Through the Internet, people have become familiar with Web pages and the countless pieces of information they carry. In recent years the Internet has become one of the channels for science, economic information, commerce and advertising. One reason for this growth is the low cost of publishing a Web page on the Internet. Compared with other services such as selling or advertising in a newspaper or magazine, a Web page costs far less and reaches millions of users all over the world much faster. The Web can be thought of as an encyclopedia. The information on Web pages is diverse in both content and form. The Internet can also be seen as a virtual society that contains information on every aspect of economic and social life, presented as text, images, sound and so on.

This diversity and sheer volume of information has, however, given rise to the problem of information overload. People cannot find the addresses of the Web pages that contain the information they need by themselves, so a utility is required that manages the content of Web pages and finds the addresses of pages whose content matches the searcher's request. Such utilities manage Web page data as unstructured objects. We are already familiar with several of them: Yahoo, Google, AltaVista, and others.

On the other hand, suppose we have Web pages on topics such as informatics, sports, economics and society, and construction. Based on the content of the documents that customers view or download, and after classifying those requests, we learn which content on our site customers concentrate on, and we can then add more documents on the topics they care about. Conversely, customers who are served in line with their needs will pay more attention to our system. Because of these practical needs, classifying and searching Web pages remains a topical problem that calls for further research.

Web mining can therefore be understood as extracting the components that are of interest or judged useful, together with potential information, from resources or activities related to the World Wide Web [25, 26]. Intuitively, Web mining can be viewed as the combination of data mining, natural language processing and Web technology:

Web mining = Data mining + Natural language processing + World Wide Web.

1.1.3. Characteristics of Web data
* The Web appears to be too large to organize into a data warehouse for data mining.
* Web pages are far more complex than traditional text documents.
* The Web is a highly dynamic information resource.
* The Web serves a large and diverse community of users.
* Only a very small part of the information on the Web is actually useful.

1.1.4. Approaches to Web data mining
Following the analysis above of the characteristics and content of hypertext, Web data mining concentrates on the components found in Web pages, namely:

1. Web content mining
Web content mining has two parts:
a. Web page content: only the words in the text are used, without considering the links between documents. This is text mining.
b. Search result: mining of search results. After a search engine has found the Web pages that satisfy the user's request, an equally important task remains: ordering the results by how close they are to the content being sought. This is also Web content mining.

2. Web structure mining
Mining based on the hyperlinks between related documents.

3. Web usage mining
a. General access pattern tracking: analyzing Web logs to discover users' access patterns within a Web site.
b. Customized usage tracking: analyzing users' access patterns at each point in time to learn the access trends of each group of users at different times.
This thesis concentrates mainly on Web content mining and is directed at clustering the set of Web pages returned as search results by search engines.

1.1.5. The need for Web document clustering
One of the important problems in Web mining is Web clustering. Broadly speaking, Web clustering is the automatic generation of "clusters" (classes) of documents based on the similarity between documents. The classes are not known in advance; the user may specify only the number of classes wanted, and the system returns the documents in groups, each containing documents that are similar to one another. Put simply, Web clustering is clustering over a set of documents taken from the Web.

There are two document clustering scenarios. The first is clustering an entire existing database that contains a very large number of Web documents; the algorithm has to cluster the whole data set in that database. This scenario is usually called offline clustering. The second is usually applied to a small set of documents, namely the documents returned by a search engine for a user query. In this case clustering is carried out online, in the sense that it proceeds on each portion of the documents as it arrives. The algorithm must then be "incremental": it must start clustering before all the documents are available, and later clustering steps must not have to reprocess data that has already been clustered. Because the set of documents on the Web is extremely large, online clustering is the more suitable approach, and it requires an incremental clustering algorithm.

Query processing and the ranked results returned by search engines depend on computing the similarity between the query and the documents. Although a query is related to some extent to the documents sought, it is usually too short and therefore prone to ambiguity. Web queries are known to contain only two to three words on average, which causes ambiguity. For example, the query "star" is highly ambiguous: the documents retrieved relate to astronomy, plants, animals, popular media and sports figures, among others. The documents returned for such a single-word query differ greatly from one another. For this reason, if the search engine clusters the results by topic, users can quickly make sense of the query results or go straight to a specific topic.
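The "incremental" property described above can be illustrated with a minimal single-pass sketch: each search result is assigned to a cluster as soon as it arrives, and earlier results are never reprocessed. This is not the STC or DC-tree algorithm studied later in the thesis; the class name `IncrementalClusterer`, the whitespace tokenization, the cosine similarity on raw term counts and the threshold value 0.3 are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class IncrementalClusterer:
    """Hypothetical single-pass clusterer: documents are processed one by one."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed similarity cut-off
        self.centroids = []         # summed term counts of each cluster
        self.members = []           # document ids of each cluster

    def add(self, doc_id, text):
        """Assign one newly arrived document; return the index of its cluster."""
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for idx, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = idx, sim
        if best is not None and best_sim >= self.threshold:
            # Similar enough: update only the chosen cluster.
            self.centroids[best].update(vec)
            self.members[best].append(doc_id)
            return best
        # Otherwise open a new cluster for this document.
        self.centroids.append(Counter(vec))
        self.members.append([doc_id])
        return len(self.centroids) - 1


# Made-up snippets for the ambiguous query "star", arriving one at a time.
clusterer = IncrementalClusterer(threshold=0.3)
clusterer.add("d1", "star astronomy telescope galaxy")
clusterer.add("d2", "movie star hollywood actor")
clusterer.add("d3", "galaxy telescope star cluster")
clusterer.add("d4", "hollywood actor award")
print(clusterer.members)
```

A single-pass scheme like this depends on the order in which documents arrive, which is one reason the thesis goes on to examine more robust incremental algorithms.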
precise statement of the information need is the query, and the items of information selected are the documents. Every approach in IR has two main components: (1) techniques for representing information (queries and documents), and (2) a method for comparing those representations. The goal is to automate the process of examining documents by computing the correlation between queries and documents. The process is judged successful when it returns results similar to those a human would produce when comparing the query with the documents.

A problem that often arises in retrieval systems is that the words users put in a query frequently differ greatly from the words used in the documents that contain the information they are looking for. This is called the "paraphrase problem". To address it, the system uses representation functions that process queries and documents differently in order to reach some degree of compatibility.
Figure 2. Model of an information retrieval system

Let the domain of the query representation function $q$ be $Q$, the set of possible queries, and let its range be $R$, the unified space for representing information. Let the domain of the document representation function $d$ be $D$, the set of documents; its range is also $R$. The domain of the comparison function $c$ is $R \times R$ and its range is $[0,1]$, the set of real numbers from 0 to 1. In an ideal retrieval system,

\[ c(q(\mathit{query}), d(\mathit{doc})) = j(\mathit{query}, \mathit{doc}), \quad \forall\, \mathit{query} \in Q,\ \forall\, \mathit{doc} \in D, \]

where $j: Q \times D \to [0,1]$ represents the user's judgment of the relationship between the two pieces of information, based on some criterion (for example, similarity of content or similarity of type). Figure 2 illustrates this relationship.

There are two kinds of retrieval system: exact-match systems and ranking systems. The model above can describe both approaches. In an exact-match system the range of $c$ is restricted to the two values 0 and 1, which are interpreted as a binary decision on whether a document satisfies the Boolean expression specified by the query. Exact-match IR systems usually return an unordered set of documents that satisfy the user's query; most current retrieval systems work this way. The detailed operation of the system is described in a later section.

In a ranking IR system, documents are ordered by decreasing relevance. There are three kinds of ranking system: "ranked Boolean", "probabilistic" and "similarity based". In all three the range of $c$ is $[0,1]$, but they differ in how the "retrieval status value" is computed:
* In a ranked Boolean system, the value is the degree to which the information satisfies the Boolean expression specified by the query.
* In a probabilistic system the notion is slightly different: the value is the probability that the information is relevant to a query. Many probabilistic retrieval systems are designed to accept queries expressed in natural language rather than as a Boolean expression.
* In a similarity-based system, the retrieval status value is obtained by computing the degree of similarity of the information content.

For exact-match systems, evaluation relies mainly on relevance judgments. Assume that $j$ is binary-valued and given in advance. In other words, we assume that each document either is or is not relevant to the query, and that the relevance determined by a human is correct. Under this assumption, the effectiveness of an exact-match system is evaluated with two statistics, precision and recall. Precision is the fraction of the selected documents that are actually relevant to the user's information need; recall is the fraction of the relevant documents that the retrieval system correctly selects. In other words, precision equals one minus the false alarm rate (the fraction of selected documents that are not relevant), while recall measures how complete the retrieval is. These two measures are discussed in more detail later, in the section on evaluation criteria for clustering algorithms.

Evaluating the effectiveness of a ranking system is more complicated. A common effectiveness measure for such systems is average precision. It is computed by choosing successively larger sets of documents from the top of the ranked list so that recall takes values between 0 and 1; measurements at 5, 7 or 11 recall points are commonly used. Precision is then computed for each set. The procedure is repeated for each query, and the mean precision at each recall level is computed. The average of these values is then calculated and reported as a single characteristic of the system. The higher the average precision the better, and a comparison is only meaningful when the same set of documents and queries is used. Average precision does, however, mask the variation between queries with different characteristics (for example, different numbers of relevant documents). Furthermore, relevant documents tend to concentrate at the top of the ranked list, so precision usually falls as the set of documents is enlarged to raise recall.

1.2.3. Applying clustering to the retrieval system
Given the need for clustering Web documents analyzed above, when we build a retrieval system we also integrate a clustering module into it. Clustering offers another way of organizing the returned data: instead of sifting through a long ordered list of documents to find the relevant ones, users only need to look within the areas that interest them. The retrieval system thus becomes more useful to its users.
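The precision, recall and average precision measures described in Section 1.2.2 can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and precision at each recall level is taken as the best precision achieved at that recall or higher (a common interpolation convention that the thesis itself does not specify).

```python
def precision_recall(selected, relevant):
    """Precision and recall of an unordered (exact-match) result set."""
    selected, relevant = set(selected), set(relevant)
    hits = len(selected & relevant)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranking, relevant, points=11):
    """Mean precision over evenly spaced recall levels for one ranked list.

    Assumption: precision at a recall level is the maximum precision
    observed at any rank whose recall is at least that level.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    curve = []  # (recall, precision) after each prefix of the ranking
    hits = 0
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        curve.append((hits / len(relevant), hits / k))
    total = 0.0
    for i in range(points):
        level = i / (points - 1)
        candidates = [p for r, p in curve if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / points


p, r = precision_recall(["a", "b", "c", "d"], ["a", "c", "e"])
ap = average_precision(["a", "x", "b"], ["a", "b"])
print(p, r, ap)
```

To characterize a whole system, `average_precision` would be averaged over several queries run against the same document collection, as the text notes.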
approximate rounding of the similarity measure used. This means there is no reason to believe that the documents should be assigned to such approximate clusters. The Single-Pass method suffers from this problem too, as well as from order dependence and a tendency to produce large clusters. According to [4,11], it is the best-known incremental clustering algorithm.

Buckshot and Fractionation are two fast, linear-time clustering algorithms developed by Cutting in 1992 [4]. Fractionation is an approximation of AHC in which the search for the two closest clusters is performed not globally but locally, within bounded regions. The algorithm obviously shares the weaknesses of AHC: arbitrary stopping conditions and poor performance when there are many unrelated items. Buckshot is a K-Means algorithm in which the initial cluster centers are created by applying AHC to a sample of the documents. Using a sample is risky, because users may be interested in small clusters that may not be represented in the sample. Moreover, although both algorithms are fast, neither is incremental.

All the algorithms mentioned above treat a document as a set of words rather than an ordered sequence of words, and so lose important information. Phrases have long been used to provide index terms in IR systems. Lexical atoms and syntactic phrases have been shown to improve precision without hurting recall. Phrases generated by simple statistical methods have also been used successfully. These methods, however, have not yet been widely applied to document clustering.

In addition, the algorithm based on the DC-tree [24] (Document Clustering Tree) can cluster documents without a training set. With a DC-tree, an incoming data object is not forced into a lower level (position) of the tree when no similar child node exists for it. This prevents dissimilar data from being placed together. As a result, the clustering algorithm based on the DC-tree structure is stable with respect to the insertion of additional documents and tolerant of "noisy" documents.
On the Web, there have been several attempts to cope with the large number of documents returned by search engines. Many search engines offer query refinement features. For example, AltaVista suggests words that should be added to or removed from the query. These words are organized into groups, but the groups do not represent clusters of documents. The Northern Light search engine (www.nlsearch.com) provides "Custom Search Folders"; each folder is named by a single word or a two-word phrase and contains all the documents that include that name. Northern Light does not disclose the method used to create the folders or its cost.

In Chapter 3, the thesis studies in depth two incremental clustering algorithms that are suitable for clustering Web pages and, moreover, easy to apply to Vietnamese: the suffix tree clustering (STC) algorithm and the clustering algorithm based on the DC-tree.
Let $P$ be the partition produced by a clustering algorithm, consisting of $m$ clusters. For every cluster $j$ in $P$ we compute $p_{ij}$, the probability that a member of cluster $j$ belongs to class $i$. The entropy of each cluster $j$ is computed with the standard formula

\[ E_j = -\sum_{i} p_{ij} \log(p_{ij}), \]

where the sum is taken over all classes. The total entropy of a set of clusters is the sum of the entropies of the individual clusters, weighted by the size of each cluster:

\[ E_P = \sum_{j=1}^{m} \frac{N_j}{N} E_j, \]

where $N_j$ is the size of cluster $j$ and $N$ is the total number of data objects. As noted above, we want to produce clusters whose entropy is as small as possible, because entropy measures the homogeneity (similarity) of the data objects in a cluster.

F-measure
The second external quality measure is the F-measure, which combines the ideas of precision and recall from information retrieval. The precision and recall of a cluster $j$ with respect to a class $i$ are defined as

\[ P = \mathrm{precision}(i,j) = \frac{N_{ij}}{N_j}, \qquad R = \mathrm{recall}(i,j) = \frac{N_{ij}}{N_i}, \]

where $N_{ij}$ is the number of members of class $i$ in cluster $j$, $N_j$ is the number of members of cluster $j$, and $N_i$ is the number of members of class $i$. The F-measure of a class $i$ is defined as

\[ F(i) = \frac{2PR}{P+R}. \]

For each class $i$ we take the largest F value over all clusters $j$, and this value is the score of class $i$. The F-measure of the clustering result $P$ is the weighted average of the F-measures of the classes:

\[ F_P = \frac{\sum_{i} |i| \times F(i)}{\sum_{i} |i|}, \]

where $|i|$ is the number of objects in class $i$. The higher the F-measure, the better the clustering, because the clusters map more accurately onto the original classes.

Overall Similarity
A frequently used internal quality measure is overall similarity, which is applied when no external information such as labelled classes is available. It measures the cohesiveness of a cluster using the weighted similarity inside the cluster:

\[ \mathrm{OverallSimilarity}(S) = \frac{1}{|S|^2} \sum_{x \in S} \sum_{y \in S} \mathrm{sim}(x, y), \]

where $S$ is the cluster under consideration and $\mathrm{sim}(x,y)$ is the similarity between the two objects $x$ and $y$.
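The three quality measures above can be sketched in a few lines. The sketch assumes that each cluster is given as a list of the true class labels of its members, that the logarithm is natural (the text does not fix a base), and that the similarity function `sim` is supplied by the caller; all function names are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """E_j = -sum_i p_ij * log(p_ij) for one cluster given as class labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def total_entropy(clusters):
    """E_P = sum_j (N_j / N) * E_j over all clusters."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)


def f_measure(clusters):
    """F_P: size-weighted average over classes of the best F(i) over clusters."""
    n = sum(len(c) for c in clusters)
    class_sizes = Counter(label for c in clusters for label in c)
    score = 0.0
    for label, n_i in class_sizes.items():
        best = 0.0
        for c in clusters:
            n_ij = c.count(label)
            if n_ij == 0:
                continue
            p = n_ij / len(c)  # precision(i, j) = N_ij / N_j
            r = n_ij / n_i     # recall(i, j)    = N_ij / N_i
            best = max(best, 2 * p * r / (p + r))
        score += n_i * best
    return score / n


def overall_similarity(cluster, sim):
    """(1 / |S|^2) * sum of sim(x, y) over all ordered pairs in the cluster."""
    size = len(cluster)
    return sum(sim(x, y) for x in cluster for y in cluster) / (size * size)


# Made-up example: two clusters over the classes "a" and "b".
clusters = [["a", "a", "b"], ["b", "b"]]
print(total_entropy(clusters), f_measure(clusters))
```

A perfect clustering, such as `[["a", "a"], ["b"]]`, gives a total entropy of 0 and an F-measure of 1.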
Most document clustering methods use the vector space model to represent document objects. Each document is represented by a vector $d$ in the vector space, $d = \{tf_1, tf_2, \ldots, tf_n\}$, where $tf_i$ ($i = 1, \ldots, n$) is the term frequency (TF) of term $t_i$ in the document. To represent all documents over the same set of terms, we extract all the terms found across the whole collection and use them as the feature vector. Some methods combine term frequency with inverse document frequency (TF-IDF). The document frequency $df_i$ is the number of documents, in a collection of $N$ documents, in which term $t_i$ occurs. The inverse document frequency (idf) component is defined as $\log(N/df_i)$, and the weight of term $t_i$ in a document is defined as $w_i = tf_i \times \log(N/df_i)$ [24]. To keep the size of the feature vector manageable, only the $n$ terms with the largest weights across all documents are used as the $n$ features. Wong and Fu [24] showed that the number of representative terms can be reduced by keeping only the terms that have sufficient coverage in the data set.

Some algorithms [9,24] avoid term frequencies (or term weights) altogether by using a binary feature vector, in which each term weight is 1 or 0 depending on whether the term occurs in the document. Wong and Fu [24] argue that the average term frequency in Web documents is less than 2 (according to their experiments and statistics), so it does not indicate the real importance of a term; a binary weighting scheme is therefore more appropriate for this problem domain.

Before feature extraction, the document set is cleaned by removing stop words (words that occur very frequently but carry no meaning, such as "and", "with", ...) and by applying a stemming algorithm that converts the different forms of a word into a single equivalent canonical form.
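A minimal sketch of the weighting $w_i = tf_i \times \log(N/df_i)$ described above follows. The whitespace tokenization, the tiny English stop-word list and the function name `tf_idf_vectors` are assumptions for illustration only; stemming, Vietnamese word segmentation and the selection of the top $n$ terms are omitted.

```python
import math
from collections import Counter

# Illustrative stop-word list only; a real system would use a fuller one.
STOP_WORDS = {"and", "with", "the", "of"}


def tf_idf_vectors(documents):
    """Return (vocabulary, vectors) with weights w_i = tf_i * log(N / df_i)."""
    tokenised = [
        [t for t in doc.lower().split() if t not in STOP_WORDS]
        for doc in documents
    ]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vocabulary = sorted(df)
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = tf_idf_vectors(
    ["web mining and web clustering", "clustering of documents"]
)
print(vocabulary)
print(vectors)
```

Note that a term occurring in every document (here "clustering") receives weight 0, since $\log(N/df_i) = \log 1 = 0$. The binary scheme favoured by Wong and Fu would instead store 1 for every term present in the document.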
[Figure: an example of stop words]

Another model for representing documents is the N-gram model. It treats a document as a string of characters and slides a window of $n$ characters over it to extract every sequence of $n$ consecutive characters. N-grams tolerate minor spelling errors because of the redundancy in the representation. The model also copes with minor language-dependence issues when it is used together with a stemming algorithm. Similarity in this approach is based on the number of n-grams shared by two documents.

Finally, a newer model introduced by Zamir and Etzioni [5] is a phrase-based approach. It looks for phrase suffixes shared between documents and builds a suffix tree in which each node represents part of a phrase (a suffix node) and is associated with the documents that contain that phrase suffix. This approach clearly captures information about the relationships between words, which is very valuable when computing the similarity between documents.

b, Numerical data model
A more straightforward data model is the numerical model. Depending on the problem context, a number of features are extracted, each represented as an interval of numbers. The feature vector is always of a manageable size, which depends on the problem being analyzed. The feature intervals are usually normalized so that every feature has the same influence when the distance measure is computed. Similarity in this case is straightforward, because computing the distance between two vectors is very simple [17].

c, Categorical data model
This model is usually found in database clustering problems. The attributes of a database table are usually categorical, with a few numerical attributes. Statistics-based clustering approaches are used to work with this kind of data. The ITERATE algorithm is one example of working with categorical data on a statistical basis [18]. The K-modes algorithm is another good example [19].

d, Mixed data model
Depending on the problem domain, the features that represent data objects are sometimes not all of the same type; a combination of numerical, categorical, spatial or text data may be used. In this case the important issue is to devise a method that captures all the information efficiently. A conversion process should be applied to convert one data type into another. Sometimes a data type cannot be converted, in which case the algorithm must be modified to work with the other data types [18].

2.3.2. Similarity measures
A key factor in the success of any clustering algorithm is its similarity measure. To group data objects, a proximity measure is used to find the objects (or clusters) that are similar to one another. A large number of similarity measures are described in the literature; here we review only the most common ones.

The (dis)similarity between two objects is computed with a distance function, sometimes also called a dissimilarity function. Given two feature vectors $x$ and $y$, we need to find the degree of similarity (or dissimilarity) between them. A very commonly used class of distance functions is the family of Minkowski distances [7], described below:
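\[ \|x - y\|_p = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}, \]

where $x, y \in \mathbb{R}^n$. This distance function actually describes an infinite family of distances indexed by $p$, a parameter that takes values greater than or equal to 1. Some common values of $p$ and the corresponding distance functions are:

$p = 1$: Hamming distance
\[ \|x - y\|_1 = \sum_{i=1}^{n} |x_i - y_i| \]

$p = 2$: Euclidean distance
\[ \|x - y\|_2 = \sqrt{\sum_{i=1}^{n} |x_i - y_i|^2} \]

$p = \infty$: Tschebyshev distance
\[ \|x - y\|_\infty = \max_{i=1,2,\ldots,n} |x_i - y_i| \]

A similarity measure that is often used, especially in document clustering, is the cosine correlation measure (used in [4], [15] and [13]), defined as

\[ \cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}, \]

where $\cdot$ denotes the vector dot product and $\|\cdot\|$ denotes the length of a vector. Another commonly used measure is the Jaccard measure (used in [8], [9]), defined as

\[ d(x, y) = \frac{\sum_{i=1}^{n} \min(x_i, y_i)}{\sum_{i=1}^{n} \max(x_i, y_i)}, \]

which for binary feature vectors simplifies to

\[ d(x, y) = \frac{|x \cap y|}{|x \cup y|}. \]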
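The distance and similarity measures above translate directly into code. The sketch below assumes dense vectors of equal length given as Python sequences, and non-negative components for the Jaccard measure; the function names are chosen for illustration.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance (sum_i |x_i - y_i|^p)^(1/p), for p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def chebyshev(x, y):
    """Limit case p = infinity: max_i |x_i - y_i|."""
    return max(abs(a - b) for a, b in zip(x, y))


def cosine(x, y):
    """Cosine measure: x . y / (||x|| * ||y||); 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def jaccard(x, y):
    """Jaccard measure: sum_i min(x_i, y_i) / sum_i max(x_i, y_i)."""
    denominator = sum(max(a, b) for a, b in zip(x, y))
    return sum(min(a, b) for a, b in zip(x, y)) / denominator if denominator else 0.0


x, y = [1, 0, 2], [0, 1, 2]
print(minkowski(x, y, 1), minkowski(x, y, 2), chebyshev(x, y))
print(cosine(x, y), jaccard(x, y))
```

For binary vectors, `jaccard` counts shared terms divided by terms present in either document, which matches the simplified set form of the measure.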
Note that the term "distance" should not be confused with "similarity". The two terms are opposites in how they express the likeness of two objects: similarity decreases as distance increases. Note also that many algorithms use the distance (or similarity) function to compute the similarity between two clusters, between a cluster and an object, or between two objects. Computing the distance between two clusters (or between a cluster and an object) requires a feature vector that represents the cluster.

Clustering algorithms commonly use a similarity matrix. An $N \times N$ similarity matrix records the distances (or similarities) between every pair of objects. The similarity matrix is obviously symmetric, so only its upper-right or lower-left triangle needs to be stored.

2.3.3. Clustering model
Every clustering algorithm assumes some cluster structure. Sometimes the structure is not stated explicitly and follows from the nature of the algorithm itself. For example, the k-means algorithm assumes spherical (or convex) clusters, because of the way k-means searches for cluster centers and updates object memberships. If care is not taken, clustering can end up with elongated clusters, where the result consists of a few large clusters and many very small ones. Wong and Fu [16] proposed a way of keeping the cluster size within a certain range, but keeping the cluster size within a range is not always worthwhile. A dynamic model for finding clusters regardless of their structure is CHAMELEON, proposed by Karypis [13].

Depending on the problem, clusters may be disjoint or overlapping. In document clustering, overlapping clusters are usually desirable because documents tend to cover more than one topic (for example, a document may contain information about both car racing and car companies). One example of a method that produces overlapping clusters is the suffix tree clustering (STC) system proposed by Zamir and Etzioni [5]. Another way to produce overlapping clusters is fuzzy clustering, in which objects can belong to different clusters with different degrees of membership [8].
Ty thuc vo nh hng ca vic xy dng th t, chng ta c th ch ra cc phng thc ca phn cm theo th bc: tch t (Agglomerative) hay chia x (Divisive). Phng thc tch t c s dng trong hu ht cc phn cm theo th bc. a, Phn cm tch t theo th bc (AHC) Phng thc ny bt u vi tp cc i tng l cc phn cm n l, tip , ti mi bc kt ni 2 phn cm ging giau nht vi nhau. Qu trnh ny
- 30 -
c lp li cho n khi s lng phn cm cn li t n mt ngng cho php hoc l nu cn phi hon thnh ton b th bc th qu trnh ny s tip tc cho n khi ch cn 1 phn cm. Phn cm tch t lm vic theo m hnh tham n (greedy), trong cp nhm ti liu c chn cho vic tch t l cp m c coi l ging nhau nht theo mt s tiu chun no . Phng thc ny tng i n gin nhng cn phi nh ngha r vic tnh khong cch gia 2 phn cm. C 3 phng thc hay c dng nht tnh ton khong cch ny c lit k pha di. Phng thc kt ni n (Single Linkage Method): tng t gia 2 phn cm S v T c tnh ton da trn khong cch ngn nht (minimal) gia cc thnh phn nm trong cc phn cm tng ng. Phng thc ny cn c gi l phng php phn cm lng ging gn nht (nearest neighbour).
$$\|T - S\| = \min_{x \in T,\; y \in S} \|x - y\|$$
Complete Linkage Method: the similarity between two clusters S and T is computed from the maximal distance between members of the respective clusters. This method is also known as furthest neighbour clustering.
$$\|T - S\| = \max_{x \in T,\; y \in S} \|x - y\|$$
Average Linkage Method: the similarity between two clusters S and T is computed from the average distance between members of the respective clusters. This method considers all pairwise distances between objects of the two clusters. It is also known as UPGMA (Unweighted Pair-Group Method using Arithmetic averages).
$$\|T - S\| = \frac{\sum_{x \in T,\; y \in S} \|x - y\|}{|S| \cdot |T|}$$
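The three linkage definitions can be compared side by side with a small sketch. It assumes that objects are numeric tuples and that the distance between two objects is Euclidean; the function names and the sample points are illustrative and do not come from the thesis.

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)


def single_linkage(S, T):
    """Smallest distance between any member of T and any member of S."""
    return min(dist(x, y) for x, y in product(T, S))


def complete_linkage(S, T):
    """Largest distance between any member of T and any member of S."""
    return max(dist(x, y) for x, y in product(T, S))


def average_linkage(S, T):
    """Mean of all |S| * |T| pairwise distances (UPGMA)."""
    return sum(dist(x, y) for x, y in product(T, S)) / (len(S) * len(T))


# Two small clusters on a line (hypothetical data).
S = [(0.0, 0.0), (1.0, 0.0)]
T = [(4.0, 0.0), (6.0, 0.0)]

print(single_linkage(S, T), complete_linkage(S, T), average_linkage(S, T))
```

For these points the pairwise distances are 3, 4, 5 and 6, so the three measures differ only in how they summarize the same set of distances.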
Karypis [13] objects to the methods above, arguing that they rely on a static model of the interconnectivity and closeness of the data, and proposes a dynamic model that avoids these problems. The system, called CHAMELEON, merges two clusters only if the interconnectivity and closeness between them are high relative to the interconnectivity and closeness inside each cluster.

Agglomerative techniques typically require Ω(n²) time, because they consider all possible pairs of clusters. The Scatter/Gather system introduced by Cutting [15] uses group-average agglomerative clustering to find seed clusters for a partitional clustering algorithm. To avoid quadratic running time, however, it applies this step only to a small sample of the documents to be clustered. In addition, Steinbach [4] reports that the group-average method performs better than most other similarity measures because of its stability.

b, Divisive hierarchical clustering

These methods work top-down: they start with the whole data set as one cluster and split a cluster at each step until only singleton clusters of individual objects remain. They generally differ in two respects: (1) which cluster is split next and (2) how it is split. Usually an exhaustive search is performed to find the cluster to split according to some criterion. Simpler alternatives are to split the largest cluster, to split the cluster with the lowest average similarity, or to use a criterion that combines size and average similarity. Steinbach [4] ran experiments on these strategies and found that the differences between them are small, so they settled on splitting the largest remaining cluster.

Splitting a cluster requires deciding which objects go into which sub-cluster. One method finds the two sub-clusters with k-means, giving a hybrid technique called bisecting k-means [4]. Another, statistics-based approach is used by the ITERATE algorithm [18]; it does not necessarily split a cluster into two sub-clusters, but may split it into several, depending on the cohesion of the objects.

2.4.2. Partitional clustering

This class of algorithms works by identifying all potential clusters at once while iteratively updating them to optimize some objective function. The best-known members of the class are K-means and its variants. K-means starts by randomly choosing k seed clusters and then assigns each object to the cluster with the nearest mean. The algorithm repeatedly recomputes the cluster means and the memberships of the objects. The process continues for a fixed number of iterations or until no change in the cluster means is detected [17]. K-means has complexity O(nkT), where T is the number of iterations. A major drawback of K-means is that it assumes a spherical cluster structure and cannot be applied to data domains where the cluster structure is not spherical.

A variant of K-means that allows clusters to overlap is Fuzzy C-means (FCM). Instead of a binary membership relation between objects and clusters, FCM allows varying degrees of membership [17]. Krishnapuram [8] proposed a modified version of FCM called Fuzzy C-Medoids (FCMdd), in which means are replaced by medoids. This algorithm is relatively fast, with complexity O(n²), and is reported to run faster than FCM.

Because their seed clusters are chosen at random, these algorithms, unlike hierarchical clustering, do not produce stable results from one run to the next. Some methods improve on this by finding good initial seed clusters before applying the algorithm. The Scatter/Gather system [15] is a good example.

One approach that combines partitional and hierarchical clustering is the bisecting K-means algorithm mentioned in the previous section. It is a divisive algorithm in which each split uses K-means to find two sub-clusters. Steinbach shows that bisecting K-means performs very well compared with both regular K-means and UPGMA [4].

An important property of hierarchical algorithms is that most of them can be updated incrementally: a new object can easily be placed in the relevant clusters by following a path to the appropriate position. STC [5] and DC-tree [24] are two examples of such algorithms. Partitional algorithms, by contrast, usually require a global update of the cluster means and possibly of the object memberships. Incremental updating is essential for on-line applications.

One way of approaching clustering is to partition the document set into k subsets, or clusters, D1, ..., Dk so as to minimize the intra-cluster distance
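$$\sum_{i=1}^{k} \; \sum_{d_1, d_2 \in D_i} \delta(d_1, d_2),$$

where $\delta(d_1, d_2)$ denotes the distance between documents $d_1$ and $d_2$.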
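If an internal representation of the documents is available, it can also be used to define a representation of clusters in the same model. For example, if documents are represented in the vector space model, a cluster of documents can be represented by the centroid (mean) of its document vectors. When a cluster representation is available, a possible goal is to partition D into D1, ..., Dk so as to minimize

$$\sum_{i=1}^{k} \; \sum_{d \in D_i} \delta(d, D_i)$$

or to maximize

$$\sum_{i=1}^{k} \; \sum_{d \in D_i} \rho(d, D_i),$$

where $\delta$ is a distance and $\rho$ a similarity between a document and a cluster representation. Assigning document d to cluster i can be viewed as setting a Boolean variable $z_{d,i}$ to 1. This generalizes to soft clustering, where $z_{d,i}$ is a real number between 0 and 1. In that setting, we may wish to find the $z_{d,i}$ that minimize

$$\sum_{i=1}^{k} \; \sum_{d \in D} z_{d,i}\, \delta(d, D_i)$$

or maximize

$$\sum_{i=1}^{k} \; \sum_{d \in D} z_{d,i}\, \rho(d, D_i).$$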
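As an illustration of the centroid-based objective, the K-means loop described in Section 2.4.2 can be sketched as follows. The sketch makes several assumptions that are not in the thesis: documents are numeric tuples, the distance is Euclidean, and the seeds are passed in explicitly rather than drawn at random so that the run is repeatable. Note also that standard K-means is a heuristic for the squared-distance form of the objective; the `objective` helper simply evaluates the sum of distances to the centroids.

```python
from math import dist


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def k_means(points, seeds, max_iter=100):
    """Assign points to the nearest centroid and recompute centroids
    until the assignment stops changing or max_iter is reached."""
    centroids = list(seeds)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no membership changed
        assignment = new_assignment
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = centroid(members)
    return assignment, centroids


def objective(points, assignment, centroids):
    """Sum over all points of the distance to their cluster centroid."""
    return sum(dist(p, centroids[a]) for p, a in zip(points, assignment))


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]  # hypothetical data
assignment, centroids = k_means(points, seeds=[(0.0, 0.0), (5.0, 5.0)])
print(assignment, centroids, objective(points, assignment, centroids))
```

With random seeds in place of the fixed ones, different runs may converge to different partitions, which is the instability noted for this family of algorithms.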
Partitioning can be carried out in two ways. The first starts with each document in its own group and merges groups until the number of partitions is suitable; this is called bottom-up clustering. The second declares the desired number of partitions and assigns documents to them; this is called top-down clustering. A bottom-up technique repeatedly merges groups of similar documents until the desired number of clusters is reached, whereas a top-down technique iteratively refines the assignment of documents to a preset number of clusters. Bottom-up techniques are usually slower, but they can be run on a small sample to initialize the clusters before a top-down algorithm proceeds.
as in the model proposed by Zamir and Etzioni [5], which searches for phrases shared between documents using a suffix tree structure.

2.5.2. Overlapping clusters

Documents in any data set, and especially on the web, tend to cover one or more topics. When clustering documents, each document should be placed in every cluster relevant to it, which means that some documents may belong to more than one cluster. An overlapping clustering model allows such multi-topic documents to be clustered. Very few algorithms allow overlapping clusters; among them are fuzzy clustering [8] and suffix tree clustering [5]. Where each document is required to belong to exactly one cluster, a non-overlapping algorithm is used, or a set of disjoint clusters can be derived from a fuzzy clustering after the memberships have been defuzzified.

2.5.3. Efficiency

On the web, a single search query may return hundreds and sometimes thousands of pages. Clustering these results within an acceptable time is essential. Note that some proposed systems cluster only the snippets returned by most search engines rather than the full web pages [5]. This is a reasonable strategy for clustering search results quickly, but it is not acceptable for document clustering, because snippets do not carry enough information about the actual content of the documents. An online clustering algorithm should run in linear time if possible. An offline algorithm usually aims at producing clusters of higher quality.

2.5.4. Noise tolerance

A problem that can affect many clustering algorithms is the presence of noise and outliers. A good clustering algorithm must be able to cope with this kind of noise and still produce high-quality clusters. In hierarchical clustering, for example, nearest-neighbour and furthest-neighbour distance computations are very sensitive to outliers and should therefore be avoided where possible. The average linkage method is the most suitable for noisy data.

2.5.5. Incrementality

A very desirable property in domains such as the web is the ability to update clusters incrementally. New documents should be added to the appropriate clusters without re-clustering the whole document set. Modified documents should be reprocessed and moved to the appropriate clusters where applicable. The more efficient the incremental update, the better the overall performance.

2.5.6. Presentation of results

A clustering algorithm is good if it can present the user with a concise and accurate description of the clusters it produces. Cluster summaries should be representative of the cluster content, so that users can quickly decide which clusters are of interest to them.
so search has not yet improved either. In general, building a Vietnamese information retrieval system is complicated by Vietnamese word segmentation and by the identification of Vietnamese character encodings. The same difficulties arise when clustering Vietnamese documents, because the first step of clustering is also word segmentation [1].

The problem of Vietnamese encodings

Unlike English, Vietnamese has many character encodings that must be handled. Some Vietnamese search tools support encodings very well, for example Vinaseek, which supports all encodings (VNI, TCVN3, ViQR, ...).

Difficulties in Vietnamese word segmentation

Word segmentation is arguably the hardest stage in building a Vietnamese information retrieval system and in clustering Vietnamese documents. In English, words are identified simply by splitting on whitespace. For example, the sentence "I am a student" is split into 4 words: I, am, a, student. In Vietnamese, however, splitting on whitespace yields only syllables (tiếng). A word may consist of one or more syllables, and it must have a complete meaning and a stable structure. The sentence "Tôi là một sinh viên" is split into 4 words: Tôi, là, một, sinh viên. Here the word "sinh viên" is formed from the two syllables "sinh" and "viên".

Many methods are currently used for Vietnamese word segmentation. Because of the complexity of Vietnamese grammar, however, no method has reached 100% accuracy, and which method is best is still a matter of debate.

Other difficulties

Vietnamese has synonyms, that is, words with the same meaning but different sounds. Current tools do not identify synonyms, so the results returned are incomplete. Conversely, there are homonyms, words with the same sound but different meanings. Systems return documents that contain the segmented words of the query without determining whether they are actually relevant, so the results returned are inaccurate.

Some words occur very frequently but carry no meaning in the document. Words such as "và", "với", "nhưng", ... have a very high frequency in any text.
Returning documents that contain these words would give useless, unnecessary results. We therefore need to remove these words before searching.

2.6.2. Syllables and words in Vietnamese

Phonetically, a "tiếng" is a syllable. A syllable consists of lower-level units called phonemes, and each phoneme is written with a character called a letter. Semantically, the syllable is the smallest meaningful unit, although some syllables have no meaning. Grammatically, the syllable is the unit from which words are formed. There are two cases:

- One-syllable words, called simple words. In this case a word has exactly one syllable. Examples: ông, bà, cha, mẹ, ...
- Words of two or more syllables, called compound words. Examples: xã hội, an ninh, hợp tác xã, ...

The word is the smallest unit from which sentences are built. When forming sentences we use words, not syllables.

2.6.3. Automatic Vietnamese word segmentation with fnTBL

The main idea of transformation-based learning (TBL) is to solve a problem by applying transformations: at each step, the transformation that gives the best result is selected and applied again to the problem. The algorithm stops when no more transformations can be selected. The fnTBL system (Fast Transformation-based learning) uses two main files [1]:

- The training data file: the training data is prepared by hand and must be accurate. Each sample (template) is placed on its own line. For example, training data for part-of-speech tagging of a text may be formatted as follows:

    Công ty danhtu
    ISA danhturieng
    bị dongtu
    kiểm tra. dongtu

  In this example each sample has two parts: the first is the word and the second is its part of speech.

- The rule template file: each rule is placed on its own line, and fnTBL applies the rule templates to the training data. For example:

    chunk_-2 chunk_-1 => chunk

  Applied to part-of-speech tagging with chunk_-2 = verb, chunk_-1 = numeral and chunk = noun, this rule means: if the two preceding words are a verb and a numeral, change the part of speech of the current word to noun.

Application to Vietnamese word segmentation: the fnTBL method can be applied to Vietnamese word segmentation by changing only a few formats.

- Building the training data file: the data file for Vietnamese word segmentation has the following form:

    Vì B
    sao B
    công B
    ty I
    ISA B
    bị B
    đặt B
    vào B
    tình B
    trạng I
    ...

  The characters B and I are called chunks and have the following meaning:
A syllable with chunk = B begins a word (begin). A syllable with chunk = I lies inside a word (inside).

- Building the rule template file: after studying words in Vietnamese, we built 3 rule templates for Vietnamese word segmentation:

    chunk_0 word_0 => chunk
    chunk_0 word_-1 word_0 => chunk
    chunk_0 word_0 word_1 => chunk

a, Learning process

(1) Build the dictionary from the training data
(2) Initialize the chunks
(3) Extract the rule set

In step (1), a lexicon of syllables is obtained from the available training data by counting. A syllable can occur in words with different chunks, so we record how many times each syllable occurs with each chunk. For example, in the word "công ty" the syllable "công" has chunk = B, whereas in the word "của công" the syllable "công" has chunk = I.

In step (2), a copy of the training data without chunks is created by deleting all chunk labels. This new data set is then re-initialized with the most frequent chunk of each syllable according to the lexicon.

In step (3), the training data is compared with the data set under consideration. From the given rule templates we derive candidate rules; each candidate rule is applied to the data set under consideration and scored (based on the number of errors that arise when the result is compared with the training data, which serves as the gold standard). The rule with the highest score, provided it exceeds a given threshold, is added to the list of selected rules. The result is a set of selected rules of the following form:

    SCORE: 414 RULE: chunk_0=B word_0=t => chunk=I
    SCORE: 312 RULE: chunk_0=B word_-1=của word_0=công => chunk=I
    SCORE: 250 RULE: chunk_0=B word_0=ho => chunk=I
    SCORE: 231 RULE: chunk_0=B word_0=ng => chunk=I
    SCORE: 205 RULE: chunk_0=B word_0=nghiệp => chunk=I
    SCORE: 175 RULE: chunk_0=B word_-1=phát word_0=triển => chunk=I
    SCORE: 133 RULE: chunk_0=B word_-1=xã word_0=hội => chunk=I
    SCORE: 109 RULE: chunk_0=B word_-1=u word_0=t => chunk=I
    SCORE: 100 RULE: chunk_0=B word_0=th => chunk=I

The second rule in the list reads: if the current syllable is "công" (word_0=công), the preceding syllable is "của" (word_-1=của) and the chunk of the current syllable is B (chunk_0=B), then change the chunk of the current syllable to I, which means that "của công" must be a single word.

The whole learning process is described as follows:
Figure 4. The learning process
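As a rough illustration of how a selected rule acts on chunk-tagged syllables, the following sketch applies rules of the listed form and then joins the syllables into words. It is not the fnTBL implementation: the rule representation (a dictionary of conditions plus a target chunk) and the function names are assumptions made for this example, and the two rules used are the "của công" and "phát triển" rules from the list.

```python
def apply_rule(syllables, chunks, rule):
    """Apply one rule at every position where all of its conditions hold.

    A rule is (conditions, new_chunk); conditions may use the keys
    chunk_0, word_-1, word_0 and word_1, as in the three rule templates.
    """
    conditions, new_chunk = rule
    result = list(chunks)
    for i, (syllable, chunk) in enumerate(zip(syllables, chunks)):
        context = {
            "chunk_0": chunk,
            "word_0": syllable,
            "word_-1": syllables[i - 1] if i > 0 else None,
            "word_1": syllables[i + 1] if i + 1 < len(syllables) else None,
        }
        if all(context[key] == value for key, value in conditions.items()):
            result[i] = new_chunk
    return result


def chunks_to_words(syllables, chunks):
    """Join syllables into words: B starts a word, I continues the previous one."""
    words = []
    for syllable, chunk in zip(syllables, chunks):
        if chunk == "I" and words:
            words[-1] += " " + syllable
        else:
            words.append(syllable)
    return words


rules = [
    ({"chunk_0": "B", "word_-1": "của", "word_0": "công"}, "I"),
    ({"chunk_0": "B", "word_-1": "phát", "word_0": "triển"}, "I"),
]

syllables = ["của", "công", "phát", "triển"]
chunks = ["B", "B", "B", "B"]  # hypothetical initial tagging
for rule in rules:
    chunks = apply_rule(syllables, chunks, rule)

print(chunks_to_words(syllables, chunks))
```

In the real system the initial chunks would come from the lexicon (the most frequent chunk of each syllable) rather than being set to B everywhere.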
b, Segmenting a new document

(1) The new document must be formatted like the training data file, that is, one syllable per line.
(2) Using the lexicon, assign the most frequent chunk to each syllable in the new document.
(3) Apply the rules obtained in the learning phase to the document; this yields the complete words.

The segmentation of a new document is described as follows:
Figure 5. Segmenting a new document

2.6.4. The Longest Matching method

The Longest Matching method segments words using an existing dictionary [1]. To segment Vietnamese text, it scans from left to right, chooses the word with the most syllables that is present in the dictionary, and then continues with the next word until the end of the sentence. In this way, phrases and sentences such as "hợp tác|mua bán" and "thành lập|nước|Việt Nam|dân chủ|cộng hoà" are easily segmented correctly. However, the method segments incorrectly in cases such as the following: "học sinh học sinh học" is segmented as "học sinh|học sinh|học", "một ông quan tài giỏi" is segmented as "một|ông|quan tài|giỏi", and "trước bàn là một ly nước" is segmented as "trước|bàn là|một|ly|nước".

2.6.5. Combining fnTBL and Longest Matching

The fnTBL and Longest Matching methods can be combined to obtain the best segmentation results. The text is first segmented with Longest Matching, and the output of this method is then used as the input from which fnTBL learns its rules.
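The left-to-right procedure of Section 2.6.4 can be sketched in a few lines. This is a minimal illustration, not the thesis's implementation: the function name, the `max_syllables` cut-off and the three-entry dictionary are assumptions chosen to reproduce the "học sinh học sinh học" example.

```python
def longest_matching(sentence, dictionary, max_syllables=4):
    """Greedy left-to-right segmentation: at each position take the longest
    run of syllables found in the dictionary, falling back to one syllable."""
    syllables = sentence.split()
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first and shorten it one syllable at a time.
        for n in range(min(max_syllables, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate in dictionary:
                words.append(candidate)
                i += n
                break
    return words


dictionary = {"học sinh", "sinh học", "học"}  # hypothetical mini-dictionary
print("|".join(longest_matching("học sinh học sinh học", dictionary)))
```

Because the scan is greedy, the word "sinh học" is never considered once "học sinh" has been matched at the second position, which is exactly the failure mode described above.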
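Figure 6. Suffix tree for the string BANANA

The suffix tree for the string BANANA, with $ appended at the end. The six paths from the root to a leaf (shown as boxes) correspond to the six suffixes A$, NA$, ANA$, NANA$, ANANA$ and BANANA$.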
3.2.2. The STC algorithm

Suffix Tree Clustering (STC) [5] is a linear-time clustering algorithm based on identifying phrases that are common to documents. A phrase in this context is an ordered sequence of one or more words. We define a base cluster as a set of documents that share a common phrase.

STC has 3 logical steps: (1) document cleaning, (2) identifying base clusters using a suffix tree, and (3) merging the base clusters into clusters.

Step 1: Pre-processing. In this step, the text string representing each document is transformed using a stemming algorithm (for example, removing prefixes and suffixes and reducing plurals to singular). Sentence boundaries are marked (identified from punctuation and HTML tags), and tokens that are not words (such as numbers, HTML tags and punctuation) are removed. The original document strings are kept, together with pointers from the start of each word in the transformed string to its position in the original string. These pointers make it possible to display the original text for the transformed key phrases.

Step 2: Identifying base clusters. Identifying the base clusters can be viewed as building an index of phrases for the document set. This is done efficiently with a data structure called a suffix tree. The structure can be built in time linear in the size of the document set, and it can be built incrementally as the documents are read. A suffix tree of a string S is a compact trie containing all the suffixes of S. The algorithm treats documents as strings of words, not of characters, so each suffix contains one or more whole words. More precisely:

1. A suffix tree is a rooted, directed tree.
2. Each internal node has at least 2 children.
3. Each edge is labelled with a non-empty substring of S. The label of a node is the concatenation of the labels of the edges on the path from the root to that node.
4. No two edges leaving the same node have labels that begin with the same word.
5. For each suffix s of S, there is a suffix-node whose label is s.

The suffix tree of a set of strings is a compact trie containing all the suffixes of all the strings in the set. Each suffix-node is marked to indicate the string it comes from, and the label of the suffix-node is a suffix of that string. For clustering, we build the suffix tree of all the sentences of all the documents in the document set. For example, a suffix tree can be built for the set of strings {"cat ate cheese", "mouse ate cheese too", "cat ate mouse too"}.

- The nodes of the suffix tree are drawn as circles.
- Each suffix-node has one or more boxes attached to it that indicate the string it belongs to.
- Each box contains 2 numbers (the first indicates the string the suffix comes from; the second indicates which suffix of that string labels the suffix-node).
Figure 7: Suffix tree of the strings "cat ate cheese", "mouse ate cheese too", "cat ate mouse too".
Some of the nodes are labelled a to f. Each of these nodes represents a group of documents and a phrase that is common to all of them. The label of the node is the common phrase, and the group of documents consists of the documents that tag the suffix-nodes descending from the node. Each node therefore represents a base cluster. Moreover, every possible base cluster (containing 2 or more documents) appears as a node in the suffix tree. The following table lists the nodes a to f of Figure 7 and their corresponding base clusters.
Table 1: The six nodes from Figure 7 and their corresponding base clusters.

    Node   Phrase       Documents
    a      cat ate      1, 3
    b      ate          1, 2, 3
    c      cheese       1, 2
    d      mouse        2, 3
    e      too          2, 3
    f      ate cheese   1, 2
Each base cluster B with phrase P is assigned a score; in Zamir and Etzioni [5] this score has the form s(B) = |B| · f(|P|) (*), where |B| is the number of documents in base cluster B and |P| is the number of words in phrase P that have a non-zero score. Word scores are assigned as follows. The algorithm keeps a stoplist that is extended with Internet-specific words (for example "previous", "java", "frames", "mail"). Words that appear in the stoplist, or that appear in too few documents (3 or fewer) or in too many (more than 40% of the document set), are given a score of 0. The function f in formula (*) penalizes single-word phrases, is linear for phrases of 2 to 6 words, and is constant for longer phrases.

Step 3: Merging base clusters

Documents may share more than one phrase. As a result, the document sets of different base clusters may overlap and may even be identical. To avoid producing many nearly identical clusters, the third step of the algorithm merges base clusters whose document sets overlap heavily (note that the common phrases are not considered in this step).
The algorithm defines a similarity measure between base clusters based on the overlap of their document sets. Let Bm and Bn be two base clusters of sizes |Bm| and |Bn|, and let |Bm ∩ Bn| be the number of documents common to both. The similarity between Bm and Bn is 1 if:

+) |Bm ∩ Bn| / |Bm| > 0.5 and
+) |Bm ∩ Bn| / |Bn| > 0.5

Otherwise the similarity is 0.

Consider the following illustration, which continues the example of Figure 7. Each node is a base cluster, and two nodes are connected when their similarity is 1. A cluster is defined as a connected component of the base cluster graph, and it contains the union of the documents of all its base clusters.
In this example there is one connected component and therefore 1 cluster. If the word "ate" were in the stoplist, base cluster b would be discarded because the score of its phrase would be 0. The graph would then have 3 connected components, representing 3 clusters.

The time needed to pre-process the documents in step 1 of STC is clearly linear in the size of the document set. With Ukkonen's algorithm, the time needed to insert the documents into the suffix tree is also linear in the size of the document set, as is the number of nodes that can be affected by these insertions. The total running time of STC is therefore linear in the size of the document set, that is, O(n), where n is the size of the document set.
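The merging step can be checked against the example with a short sketch. It takes the base clusters of Table 1 as plain sets of document numbers and finds connected components with a small union-find; the function names and this particular way of computing components are assumptions for illustration, not the thesis's implementation.

```python
def similar(bm, bn):
    """Binary similarity: 1 (True) only if the overlap exceeds half of each cluster."""
    common = len(bm & bn)
    return common / len(bm) > 0.5 and common / len(bn) > 0.5


def merge_base_clusters(base_clusters):
    """Return the document sets of the connected components of the
    base cluster graph (an edge joins two similar base clusters)."""
    names = list(base_clusters)
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the component representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, m in enumerate(names):
        for n in names[i + 1:]:
            if similar(base_clusters[m], base_clusters[n]):
                parent[find(m)] = find(n)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), set()).update(base_clusters[name])
    return list(clusters.values())


# Base clusters a to f from Table 1.
table1 = {
    "a": {1, 3}, "b": {1, 2, 3}, "c": {1, 2},
    "d": {2, 3}, "e": {2, 3}, "f": {1, 2},
}
all_clusters = merge_base_clusters(table1)

# Same data with base cluster b removed, as if "ate" were on the stoplist.
without_b = {k: v for k, v in table1.items() if k != "b"}
reduced_clusters = merge_base_clusters(without_b)

print(len(all_clusters), len(reduced_clusters))
```

Base cluster b overlaps strongly with every other base cluster, which is why its presence ties everything into a single component and its removal leaves {a}, {c, f} and {d, e}.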
feature extraction method. The representations of documents and of document clusters are also described. Finally, the method used to evaluate clustering quality is presented.
stop words are typically used to remove words that carry little meaning. Stemming is typically used to conflate words that have similar forms. Because shorter feature vectors lead to shorter clustering times, steps 4 and 6 try to minimize the number of features while still obtaining reasonable coverage from the features. Suppose the user wants each resulting cluster to contain about k documents. In the ideal case, a feature of a cluster appears only in that cluster, so the document frequency of the feature is k. We therefore first select the features whose document frequency equals k, by setting lower and upper to k in step 4. The range {lower, upper} is then widened iteratively in step 6 to ensure sufficient coverage by the resulting feature set. Note that N/k is only a rough guide; the actual number of clusters in the clustering result may differ from N/k. The method also uses a coverage threshold to ensure that the selected features provide enough coverage. According to the experiments in [24], 0.8 is a fairly good value for the coverage threshold.
(3) Cluster feature vector

Definition 1 (DC): Given N documents in a cluster, {D1, D2, ..., DN}, the DC value of a node is defined as a triple DC = (N, ID, W), where N is the number of documents in the cluster, ID is the set of identifiers of the documents in the cluster, for example ID = {ID1, ID2, ..., IDN}, and W is the feature vector of the document cluster, for example W = (w1, w2, ..., wn), where

$$w_j = \sum_{i=1}^{N} w_{ij},$$

and n is the number of extracted features.

This triple not only summarizes the document frequencies in the cluster, but can also be used to evaluate the similarity between two clusters. The following lemma provides a flexible way of merging two clusters and gives the DC value of the merged cluster.

Lemma [24] (Additivity): Let DC1 = (N1, ID1, W1) and DC2 = (N2, ID2, W2) be the DC values of two disjoint document clusters, where disjoint means that no document belongs to more than one cluster at the same time. Then the new DC value, DCnew, of the cluster formed by merging the two disjoint clusters is DCnew = (N1 + N2, ID1 ∪ ID2, W1 + W2), where W1 + W2 = (w11 + w21, w12 + w22, ..., w1n + w2n) and n is the number of extracted features.

d, Evaluation techniques

To evaluate the quality of the clustering result, we use the F-measure [23]. The evaluation method is as follows. For each manually labelled topic T in the document set, suppose that a cluster X corresponding to the topic has been formed.

    N1 = number of documents of topic T in cluster X
    N2 = number of documents in cluster X
    N3 = total number of documents of topic T
    P = Precision(X, T) = N1 / N2
    R = Recall(X, T) = N1 / N3

The F-measure for topic T is defined as follows:
$$F(T) = \frac{2PR}{P + R}$$
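A small worked example may help fix the definitions of N1, N2, N3, P, R and F. The sketch below treats a cluster and a topic as sets of document identifiers; the function name and the sample sets are hypothetical.

```python
def f_measure(cluster, topic):
    """F-measure of cluster X against topic T, both given as sets of document ids."""
    n1 = len(cluster & topic)   # documents of topic T found in cluster X
    if n1 == 0:
        return 0.0              # no overlap: precision and recall are both 0
    precision = n1 / len(cluster)   # N1 / N2
    recall = n1 / len(topic)        # N1 / N3
    return 2 * precision * recall / (precision + recall)


cluster_x = {1, 2, 3, 4}           # hypothetical cluster of 4 documents
topic_t = {3, 4, 5, 6, 7, 8}       # hypothetical topic of 6 documents
print(f_measure(cluster_x, topic_t))
```

Here N1 = 2, so P = 2/4 and R = 2/6, and the F-measure works out to 0.4 (up to floating-point rounding).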
To score a topic T, we take the cluster with the highest F-measure as the cluster C for T, and that F-measure becomes the score of topic T. The overall F-measure [22] of the clustering result tree is the weighted average of the F-measures of the individual topics:

$$\text{Overall\_F\_Measure} = \frac{\sum_{T \in M} |T| \cdot F(T)}{\sum_{T \in M} |T|},$$

where M is the set of topics, |T| is the number of documents of topic T, and F(T) is the F-measure for topic T.

3.3.3. The document cluster tree (DC-tree)

This section introduces an algorithm that clusters web documents by means of a document cluster tree (DC-tree). In a DC-tree, each node can be regarded as a document cluster. The tree structure is used to guide the insertion of a document object into an appropriate document cluster (DC) at the leaf nodes. It resembles the B+-tree [2] in that the index records at the leaf nodes contain pointers to the data objects, but it is not a height-balanced tree. The structure is designed so that assigning a document to a cluster requires visiting only a small number of nodes.

A DC-tree is a tree with 4 parameters: the branching factor (B), two similarity thresholds (S1 and S2, where 0 ≤ S1, S2 ≤ 1) and the minimum number of children of a node (M). A non-leaf node contains at most B entries of the form (DCi, Childi), where i = 1, 2, ..., B, Childi is a pointer to the i-th child node or to a document, and DCi is the DC value of the sub-cluster represented by the i-th child node or document. A non-leaf node thus represents a cluster made up of all the sub-clusters represented by its entries.
A DC leaf node contains at most B entries of the form (DCi, Doci), where i ∈ {1, 2, ..., B}, Doci is a pointer to a document or to a set of documents, and DCi is the DC entry of the corresponding sub-cluster. The set of documents under one pointer is called a document leaf node, to distinguish it from a tree leaf node, or DC leaf node (see Figure 8). A DC leaf node also represents a cluster made up of all the sub-clusters described by its DC entries. The DC-tree allows a document entry to be inserted as a new document leaf node at different levels of the tree. The DC-tree is therefore not a height-balanced tree. Figure 8 shows an example DC-tree of height 2 with B = 3 and M = 2; note that the tree is not balanced. Two thresholds are used when building the tree:
Figure 8. An example of a DC-tree

Threshold S1: To prevent poor clustering results caused by the order in which documents are inserted (for example, documents of different classes ending up in the same subtree or cluster), S1 is used to decide whether an incoming document E may be passed down to the next level during insertion. If the current node has a child whose similarity to the incoming document is greater than S1, the new document is passed to that child node. Otherwise, the new document is added to the current node as a new leaf node.
Threshold S2: Because the DC-tree is used for document clustering and not as an index, there is no need to force every leaf entry to point to a single document. To reduce insertion time, a new document may be merged into a leaf entry if its similarity to that entry is greater than the threshold S2. The merge uses the addition described in Lemma 1; it reduces the number of node insertions and split operations, and so reduces insertion time.

The DC-tree can be regarded as a representation of the data set in which each document leaf is not a single data point but a cluster of data points (a leaf may hold many data points as long as the threshold S2 is satisfied). With this definition, the size of the tree is a function of the thresholds S1 and S2. If we set S1 = 0 and S2 to a value greater than 1, the DC-tree behaves like a balanced tree such as the B+-tree or the R-tree [21]. The deletion and merging algorithms are similar to those of the B+-tree.

A. Insertion

The following algorithm inserts a document object into the DC-tree. The document object may be a single document or a cluster of documents represented by a DC entry (E). If the document object is a single document, it is first wrapped in a DC entry (E). Insertion proceeds in the following steps:

1. Identify the appropriate leaf node: Starting from the root, E descends the DC-tree by choosing the closest child node whose similarity is greater than S1. If no such child exists, E is inserted as a new document leaf node into an empty entry of the node. If there is no empty entry, the node must be split.

2. Modify the leaf node: On reaching a leaf node of the DC-tree, we find the leaf entry closest to E, denoted Li, and test whether it can be merged with E without violating the similarity threshold S2. If so, E is merged into the entry Li. Note that the DC entry of the new cluster can be computed from the DC entries of Li and E using Lemma 1. Otherwise, E is added to the leaf node. If the leaf node has room for this entry, we are done; otherwise the leaf node must be split.

3. Modify the path from the leaf node to the root: After E has been inserted into a leaf node, the non-leaf entries on the path from the root to that leaf must be updated. If no split has occurred, this is done by adding DC entries to reflect the addition of E. Splitting a leaf node requires inserting a new non-leaf entry into the parent node to describe the newly created leaf. If the parent node has room for this entry, then at all higher levels we only need to update the DC entries to reflect the addition of E. In general, we may have to split the parent node and even the root. If the root is split, the height of the tree increases by 1 and a new root is created.

B. Node splitting

To add a new entry to a full node containing B entries, the set of B+1 entries must be divided between 2 nodes. The division should be made so that the similarity between the 2 new nodes is as small as possible and the similarity between documents within the same node is as large as possible. We use an algorithm that splits a set of B+1 entries into 2 sets. The simplest approach is to generate all possible groupings and choose the best one. The number of groupings can be very large, however, approximately 2^(B-1). The node-splitting algorithm below uses a selection procedure instead and is similar to the method used in the R-tree:

1. Choose a seed for each set: For each pair of entries E1 and E2, compute the similarity between them. Choose the pair with the lowest similarity as the first elements of the 2 sets. To break ties, choose the pair with the largest number of documents.

2. Check the termination condition: If all entries have been assigned to a set, stop. If one set has so few entries that all the remaining entries must be assigned to it to reach the minimum number of entries M, assign them to it and stop.

3. Choose the set for the next entry: For each entry E that is not yet in a set, compute the similarity between E and the seed entry of each set. Assign the entry to the set with which its similarity is greatest. Go back to step 2.
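The three splitting steps can be sketched as follows. Several details are assumptions made for this example and are not fixed by the thesis: an entry is reduced to a pair (document count, feature vector), similarity is taken to be the cosine of the feature vectors, and unassigned entries are processed one at a time in their original order.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def split_entries(entries, min_entries):
    """Split a list of (doc_count, feature_vector) entries into two groups
    of entry indices, following steps 1 to 3 of the splitting procedure."""
    # Step 1: the seeds are the least similar pair; on ties, prefer the
    # pair that covers the most documents.
    s1, s2 = min(
        combinations(range(len(entries)), 2),
        key=lambda p: (cosine(entries[p[0]][1], entries[p[1]][1]),
                       -(entries[p[0]][0] + entries[p[1]][0])),
    )
    groups = ([s1], [s2])
    remaining = [i for i in range(len(entries)) if i not in (s1, s2)]

    while remaining:
        # Step 2: if a group needs every remaining entry to reach the
        # minimum size, give it all of them and stop.
        for group in groups:
            if len(group) + len(remaining) <= min_entries:
                group.extend(remaining)
                remaining = []
                break
        if not remaining:
            break
        # Step 3: assign the next entry to the group whose seed is most similar.
        e = remaining.pop(0)
        sims = [cosine(entries[e][1], entries[g[0]][1]) for g in groups]
        groups[sims.index(max(sims))].append(e)
    return groups


# Four hypothetical entries: two near the first axis, two near the second.
entries = [(1, (1.0, 0.0)), (1, (0.9, 0.1)), (1, (0.0, 1.0)), (1, (0.1, 0.9))]
print(split_entries(entries, min_entries=2))
```

In this example entries 0 and 2 are orthogonal and become the seeds; entry 1 joins the first group by similarity, and entry 3 is then forced into the second group by the minimum-size rule of step 2.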
C. Deletion and node merging

The deletion algorithm is similar to that of the B+-tree. If, after an entry has been removed, the number of remaining entries is greater than or equal to the minimum number of entries M, the deletion is complete. Otherwise node merging is applied, which means that the node is merged with its siblings. Merging may in turn be required when an entry is removed from the parent node. This process may propagate up to the root, and the height of the tree may decrease if necessary.

D. Identifying interesting clusters

Cluster identification starts from the root of the tree. A breadth-first search is used to discover the interesting clusters. An interesting cluster is defined as a cluster that contains representative features and whose size lies within a predefined range. The lower and upper bound values found by our feature extraction method can be used to determine the range of cluster sizes. Let l and u be the lower and upper bounds of the cluster size; then l and u can be determined by the following formulas:

$$l = \text{lower} \times \frac{N}{m} \quad (1) \qquad\qquad u = \text{upper} \times \frac{N}{m} \quad (2)$$

where N is the size of the data set and m is the size of the sample data set used during feature extraction. This range can also be adjusted manually to obtain a good clustering result. Once an interesting cluster has been identified, the sub-clusters in its child nodes no longer need to be traversed.

A representative feature is defined as a feature with sufficient support in the cluster, that is, the document frequency of a representative feature must exceed a predefined threshold. We call this threshold the representative threshold. These features are then used to represent the cluster.
This function is carried out in 3 steps:

Step 1: Segment words using the Longest Matching algorithm with the prepared dictionary.
Step 2: Segment words using the fnTBL algorithm on the data returned by Longest Matching.
Step 3: Cluster with the DC-tree algorithm, using a similarity function based on the extracted phrases.

- Searching the clustering results

The search uses an algorithm with 2 steps:

Step 1: Compute the similarity between the query string and the features of each cluster; if the similarity is greater than a threshold S1, apply step 2 to that cluster.
Step 2: Find the documents in the cluster whose similarity to the query string is greater than a threshold S2.
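The two-step search can be sketched as a pair of nested filters. The data layout (a cluster as a dictionary with a feature vector and a mapping from document identifiers to vectors), the use of cosine similarity and the threshold values are all assumptions for illustration; the thesis does not specify them.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def search(query_vec, clusters, s1, s2):
    """Step 1: keep clusters whose features are similar enough to the query.
    Step 2: inside those clusters, keep documents similar enough to the query."""
    hits = []
    for cluster in clusters:
        if cosine(query_vec, cluster["features"]) > s1:
            for doc_id, doc_vec in cluster["docs"].items():
                if cosine(query_vec, doc_vec) > s2:
                    hits.append(doc_id)
    return hits


# Two hypothetical clusters over a two-term vocabulary.
clusters = [
    {"features": (1.0, 0.0), "docs": {"d1": (1.0, 0.0), "d2": (1.0, 1.0)}},
    {"features": (0.0, 1.0), "docs": {"d3": (0.0, 1.0)}},
]
print(search((1.0, 0.0), clusters, s1=0.5, s2=0.8))
```

Filtering at the cluster level first means that documents in clusters unrelated to the query are never compared with it, which is the main benefit of searching over clustered results.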
Table Documents

This table holds the documents fetched by the program.

    Field         Data type   Description
    DocID         Int         Primary key of the table.
    Source        Nvarchar    Source address of the original document. Used as an index to avoid duplicate documents.
    Snipet        Ntext       Excerpt of the document, used for clustering.
    IsTokenized   Bit         Indicates whether the document has been segmented into words.
    IsClustered   Bit         Indicates whether the document has been clustered.
Table DocumentIndex

This table links the documents to the dictionary data.

    Field        Data type   Description
    DocIndexID   Int         Primary key of the table.
    PhraseID     Int         Foreign key referencing the Dictionary table.
    DocID        Int         Foreign key referencing the Documents table.
    Score        Float       Similarity/frequency of the keyword in the document, computed by a similarity function.
Table Nodes

Holds the nodes of the DC-tree.

    Field          Data type   Description
    NodeID         Int         Primary key of the table.
    NodeParentID   Int         Parent node in the tree.
    ClusterID      Int         Cluster to which the node belongs.

Table Node-Document

A node can contain many documents and, conversely, a document can appear in many nodes. This table represents that many-to-many relationship.

    Field   Data type   Description
                        Primary key of the table.
                        Foreign key referencing the Nodes table.
                        Foreign key referencing the Documents table.

Table Clusters

Holds the clusters that have been found.

    Field       Data type   Description
    ClusterID   Int         Primary key of the table. Gives the sequence number of the cluster.
Figure 7. Entity-relationship diagram of the experimental program

4.3. The experimental program
Applying the theoretical study of clustering, our experimental program separates each processing step into its own part. Corresponding to the main functions described above, the program consists of four main modules: Dictionary, Data retrieval, Clustering and Search.
- Dictionary module: displays all the words in the Vietnamese dictionary. With initial data taken from the Vietnamese-English dictionary at http://www.stardict.org, we obtain a fairly complete dictionary of Vietnamese words. Words can still be added or removed if necessary. The words in this dictionary are used in the word segmentation step for the documents to be clustered.
Figure: Screen for updating and editing the Dictionary

- Data retrieval module: to build a repository of Web documents, we fetch data from the Web. The user enters the URL of a web page, and the system automatically finds and downloads all the content of the page down to a predefined depth n.
Figure: Screen for fetching data from the Internet

- Clustering module: after the data has been fetched, the documents are clustered. The system performs the clustering automatically. In a later run on newly fetched data, the old data that has already been clustered does not need to be clustered again; clustering is performed only on the new data, on top of the results of the previous runs. The algorithm uses the following parameters:

    M: minimum number of children of a node                 M = 8
    B: branching factor of the tree                         B = 20
    S2: similarity threshold 2                              S2 = 1.0
    S1: similarity threshold 1                              S1 = 0.3
    repThreshold: threshold for representative features     repThreshold = 0.4
    MCS: minimum cluster size                               MCS = 100
Figure: Screen for clustering data fetched from the Internet

- Search module: the user enters the keyword to search for, and the system finds the documents related to the keyword.
CONCLUSION

The thesis presents a number of topics in Web clustering and achieves the following results:
- It gives an overview of the Web clustering problem and of Web clustering solutions (requirements, techniques, evaluation), with particular attention to the incremental nature of Web clustering algorithms.
- It presents two incremental Web clustering algorithms, STC and DC-tree, and analyzes the basic knowledge and foundations on which these algorithms are built.
- It builds experimental software that clusters documents using the DC-tree algorithm. The DC-tree search engine developed in the thesis has been deployed on the Web and includes a tool that records user queries, the clusters found, and the links that users follow.

The system is operational and is able to cluster Web documents. Because of limits on time and capacity, the thesis has not yet evaluated the clustering quality of the system. In the future we will carry out more thorough evaluations, and we plan to report statistics based on the behavior of the system in real use. In addition, we may investigate approaches to handling synonyms in Vietnamese.