
CONTENTS

CHAPTER 1: OVERVIEW OF INFORMATION EXTRACTION
1.1 Objectives and scope of the report
1.2 Introduction to information extraction (IE)
1.3 Information extraction (IE) and information retrieval (IR)
1.4 Related research and applications
1.5 Basic steps of an IE system
1.6 Information extraction methods
1.7 Evaluation methods

CHAPTER 2: INFORMATION EXTRACTION PROBLEMS AND METHODS
2.1 Introduction
2.2 Keyphrase extraction
2.2.1 Introduction
2.2.2 Scope of application
2.2.3 The automatic keyphrase generation problem
2.2.4 The KEA algorithm
2.2.4.1 Candidate phrase selection
2.2.4.2 Feature calculation
2.2.4.3 Training
2.2.4.4 Extracting keyphrases
2.2.5 The KIP algorithm
2.3 Named entity recognition
2.3.1 Concept
2.3.2 Approaches and popular systems
2.4 Relation recognition
2.4.1 Concept
2.4.2 Approaches and related research

CHAPTER 3: METADATA EXTRACTION
3.1 Introduction
3.2 The concept of metadata
3.3 The Dublin Core Metadata standard
3.4 Metadata extraction and related research
3.5 The approach of this work
3.5.1 System architecture
3.5.2 Rule-based metadata extraction
3.5.3 JAPE rules for extracting metadata from scientific papers
3.6 Experiments and evaluation

CHAPTER 4: CONCLUSIONS AND FUTURE WORK
4.1 Conclusions
4.2 Future work

REFERENCES

CHAPTER 1: OVERVIEW OF INFORMATION EXTRACTION

1.1 Objectives and scope of the report
The overall goal is to find and propose a knowledge representation model for text documents. The model comprises knowledge components such as metadata describing the provenance and structure of the document (title, author, publisher, year of publication, subject, storage location, ...), keyphrases, entities, and the relations between entities that represent the document content. Such a model supports intelligent querying and the search for related information and documents in a collected, organized document repository. The task of this report is to study methods and tools for extracting information and knowledge from documents and loading them into the model, in preparation for organizing textual knowledge in support of query processing.
Based on this goal, we survey the problems, methods, and tools for extracting information from text, namely:
- Extraction of keywords and keyphrases
- Extraction of entities (named and unnamed)
- Extraction of relations
- Extraction of the structural components and metadata of documents

1.2 Introduction to information extraction (IE)
Commonly cited definitions of information extraction include the following:

- According to Jim Cowie and Yorick Wilks [2], IE is the name given to the process of selectively structuring and combining data that is found, explicitly stated, in one or more text documents.

- According to Line Eikvil [1], IE is a narrow research area within natural language processing that grew out of the task of identifying specific pieces of information in a natural-language document. The purpose of IE is to convert text into a structured form. Information is extracted from different document sources and represented in a uniform format. IE systems do not aim to understand the input text; their main task is to find the relevant information we are looking for.

- Also according to Line Eikvil [1], the core component of an IE system is a set of rules and patterns used to identify the relevant information to be extracted.

- According to Dr. Alexander Yates of the University of Washington [3], IE is the process of retrieving structured information from unstructured text.

- According to the IE experts of the GATE project (footnote 1), IE systems analyze text in order to extract the required information in predefined forms, such as events, entities, and relations.

In summary, information extraction can be understood as a technique and research area related to information retrieval, data mining, and natural language processing. Its main goal is to find structured information in unstructured or semi-structured text. IE converts the information in such text into a structured form that can be represented formally, for example as a structured XML file or as a structured table (such as a table in a database).
Once data and information from different sources, including the internet, can be represented formally and in a structured way, data analysis and data mining techniques can be applied to discover useful patterns. For example, restructuring online advertisements and sales listings can help advise and guide users when shopping. Extracting and restructuring job postings and job-seeker listings helps analyze occupational information and employment trends, which supports both job seekers and recruiters.

Footnote 1: http://gate.ac.uk/ie/

Information extraction does not require the system to understand the content of a text document, but the system must be able to analyze the document and find the relevant information it is looking for. IE techniques can be applied to any document collection from which we need to draw the essential information and the related facts. Text repositories about a particular domain on the internet are a typical example: the information may exist in many different places and in many different formats. Surveys and applications in that domain benefit greatly if the relevant information is extracted, integrated into a uniform format, and represented in a structured way. The information from the internet is then loaded into a structured database that serves various analysis and mining applications.
Current research on information extraction from text focuses on:
- Terminology extraction: finding the main relevant terms that express the semantics, content, and subject of a document or a collection of documents.
- Named entity recognition: methods for identifying objects and entities such as person names, company names, organization names, place names, and locations.
- Relationship extraction: determining the relations between the entities identified in a document, for example the location of an organization or company, or the workplace of a person. Given the text "James Gosling has worked for Sun Microsystems, located in Silicon Valley, since 1984", information extraction methods should identify the entities, their types, and the relations between them as follows:
  - PERSON works for ORGANIZATION: the two entities identified are "James Gosling" and "Sun Microsystems", and the relation between them is "works for".
  - ORGANIZATION located in LOCATION: the two entities identified are "Sun Microsystems" and "Silicon Valley", and the relation between them is "located in".
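The target output of the example above can be pictured as typed triples. The following is only a minimal sketch, assuming an English rendering of the example sentence and two hand-written regular-expression patterns; the names `Relation`, `PATTERNS`, and `extract_relations` are hypothetical and are not taken from any IE system discussed in this report.

```python
import re
from typing import NamedTuple


class Relation(NamedTuple):
    """A typed (subject, predicate, object) triple."""
    subject: str
    subject_type: str
    predicate: str
    obj: str
    obj_type: str


# A capitalized word sequence, used here as a crude stand-in for a named entity.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Hypothetical hand-written patterns; they cover only this sentence shape.
PATTERNS = [
    (re.compile(rf"\b(?P<s>{NAME}) (?:has worked for|works for|joined) (?P<o>{NAME})"),
     "PERSON", "works_for", "ORGANIZATION"),
    (re.compile(rf"\b(?P<s>{NAME}), located in (?P<o>{NAME})"),
     "ORGANIZATION", "located_in", "LOCATION"),
]


def extract_relations(text):
    """Return every relation matched by the patterns above."""
    relations = []
    for pattern, s_type, predicate, o_type in PATTERNS:
        for m in pattern.finditer(text):
            relations.append(Relation(m["s"], s_type, predicate, m["o"], o_type))
    return relations


sentence = ("James Gosling has worked for Sun Microsystems, "
            "located in Silicon Valley, since 1984.")
for rel in extract_relations(sentence):
    print(rel)
```

Real systems replace the regular expressions with named entity recognizers and learned or grammar-based relation rules; the triple structure is the part that carries over.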
1.3 Information extraction (IE) and information retrieval (IR)
Information extraction finds structured, needed information inside a document. Information retrieval, by contrast, finds relevant documents, or relevant parts of documents, in a local repository such as a digital library or on the internet, and returns them to the user in response to a specific query.
Intelligent text retrieval looks for methods that return better results, approximately or exactly matching the user's need. For example, given a user query, the system may find the parts of a document that match the query (a paragraph or a sentence); a more intelligent system may answer the query or question directly with the exact information requested.
1.4 Related research and applications
Most artificial intelligence systems depend heavily on their knowledge sources and inference mechanisms. Beyond inference capability, the richer the knowledge source, the better the system can respond. The Web is a huge and seemingly endless data repository that holds useful knowledge from many domains, updated and developed by people, but this knowledge is scattered and exists in many different forms. The question is how to extract the necessary, useful knowledge and organize and manage it effectively so that it helps solve the problems people pose. The answer is to develop Web information extraction systems [8][9]. According to Dr. Alexander Yates of the University of Washington [3], Web Information Extraction (WIE) systems promise to bridge the gap between the Web and artificial intelligence. WIE helps build knowledge bases from the WWW, which can then be used in further research and applications. Some typical examples of WIE research and applications follow.
Job-search support systems [4]. When a user looks for a job using Google Search, the search engine clearly does not truly understand or satisfy the request. What the user really cares about includes: which companies are hiring for a given title or profession, information about the hiring companies, whom to contact, each company's policies and benefits, and feedback and reviews from past and current employees. All of this information needs to be extracted, consolidated, and presented to the user systematically (Figure 1).

Figure 1: Information extraction for job-search support (source: reference [4])
Another application is extracting and filtering relevant information to improve information search [4]. In the example of Figure 2, a user looking for jobs as a baker enters the string "baker job opening" into Google. Google returns a great deal of irrelevant information, such as job postings from the Mt. Baker school and from the firm Baker Hostetler. These results have nothing to do with the baker's trade. The system should instead return links to pages or companies that are hiring bakers. In this case the task of IE is to extract the links relevant to the user's search need.

Figure 2: Job search based on a search engine (source: reference [4])
IE applied to finding answers in question answering (QA) systems based on search engine results. A recent research approach develops QA systems that analyze the results returned by search engines in order to find the exact answer to the input question. For example, when a user asks "Which city is the capital of Vietnam?", the search engines return a large number of results, and the system must extract the answer the user expects, namely "Hanoi" or "Hanoi City". This is one way of applying IE techniques in QA (Figure 3).

Figure 3: Question answering based on search engine results


IE applied in shopping support and recommendation systems. When users want to buy an item, they care about product information (prices at different shops, product quality, feedback from other users), supplier information (after-sales policy, service quality, ...), and so on. Users must spend a lot of time searching and then extracting and consolidating the information themselves before they can make a purchase decision. An IE system that extracts and consolidates information according to given requirements and criteria is therefore very valuable in such intelligent commerce systems.
IE applied to extracting information from scientific papers, such as author names and titles from the header of a paper, as well as information from the reference section, in order to build systems for indexing and searching scientific papers. A widely used search system for scientific papers is Citeseer (Figure 4).

Figure 4: The Citeseer scientific paper search system


Another project, DBLP, at the University of Trier in Germany (footnote 2), builds a database of scientific papers from conferences and journals, together with links to the personal pages of researchers, to support the search for scientific papers. According to its author, the database is built manually from proceedings and journals (students are hired to check and update the data). The DBLP database currently contains about 1.4 million scientific papers from reputable conferences and journals published by ACM, IEEE, Springer, ScienceDirect, and others (Figure 5).

Footnote 2: http://dblp.uni-trier.de/

Figure 5: The DBLP index database


1.5 Basic steps of an IE system
According to Dr. Diana Maynard [5], most IE systems generally carry out the following steps:
- Pre-processing
  o Format detection
  o Tokenization
  o Word segmentation
  o Sense disambiguation
  o Sentence splitting
  o POS tagging
- Named entity detection
  o Entity detection
  o Coreference
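Two of the pre-processing steps listed above, sentence splitting and tokenization, can be sketched with the Python standard library alone. This is a deliberately naive illustration under stated assumptions (English text, a sentence ending in ".", "!" or "?" followed by a capitalized word); the function names are hypothetical, and real pipelines such as GATE use far more careful rules.

```python
import re


def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace and an uppercase letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


text = ("James Gosling joined Sun Microsystems in 1984. "
        "The company is located in Silicon Valley.")
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

The later steps (POS tagging, entity detection, coreference) consume these token lists, which is why errors made at this stage propagate through the whole pipeline.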
1.6 Information extraction methods
According to [1][5], current extraction methods fall into two main approaches: the knowledge engineering approach and the automatic training (machine learning) approach.

Knowledge engineering approach | Automatic training approach
Based on hand-crafted rules and patterns. | Based on statistical machine learning.
Developed by experienced linguists and domain experts. | Developers do not need to be expert in the language or the domain.
Relies on intuition and observation. Achieves better performance. Development can be time-consuming. | Requires a large amount of well-annotated training data.
Hard to adjust when changes occur. | When changes occur, the whole training set may need to be re-annotated.
1.7 Evaluation methods
According to [1], the evaluation of information extraction tasks was addressed, and attracted much attention, in the Message Understanding Conferences (MUC), which were initiated and funded by the defense research projects agency of the US Department of Defense (footnote 3). MUC was funded to encourage research into new methods for information extraction. To evaluate the extracted information, the experts adopted measures based on those used in information retrieval (IR): precision and recall.

Footnote 3: http://en.wikipedia.org/wiki/DARPA


Recall (R): the fraction of the correct information that was actually extracted, i.e. the ratio of the number of correct answers found to the total number of correct answers possible.
Precision (P): a measure of the reliability of the extracted information, i.e. the ratio of the number of correct answers found to the total number of answers found.

$$R = \frac{tp}{tp + tn}, \qquad P = \frac{tp}{tp + fp}$$

where
tp: number of correct results found
tn: number of correct results that were not found (conventionally called false negatives, fn)
fp: number of results found that are not correct
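As a small worked example of the two ratios above, the sketch below compares a set of extracted items against a gold-standard set. The function name and the sample sets are made up for illustration; the parameter `missed` plays the role of tn in the text (correct results that were not found).

```python
def precision_recall(tp, fp, missed):
    """Return (precision, recall) from counts.

    tp     -- correct results found
    fp     -- results found that are not correct
    missed -- correct results not found (written tn in the text above)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + missed) if (tp + missed) else 0.0
    return precision, recall


# Hypothetical gold standard and system output for one document.
gold = {"James Gosling", "Sun Microsystems", "Silicon Valley", "1984"}
found = {"James Gosling", "Sun Microsystems", "Java"}

tp = len(found & gold)      # 2 correct results found
fp = len(found - gold)      # 1 result found that is not correct
missed = len(gold - found)  # 2 correct results not found

p, r = precision_recall(tp, fp, missed)
print(f"P = {p:.3f}, R = {r:.3f}")
```

Here two of the three extracted items are correct (P = 2/3), but only two of the four gold items were found (R = 1/2).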

P and R lie in the interval [0, 1], and the best value is 1. P and R are related and influence each other: by accepting a lower R we can usually reach a higher P, and vice versa. When comparing or evaluating a system or a method, both P and R must be taken into account. According to Line Eikvil [1], comparing two parameters at the same time is neither simple nor easy, so the two measures are combined into a single measure, the F-measure (F):

$$F = \frac{(\beta^2 + 1)\, P R}{\beta^2 P + R}$$

The parameter β determines the relative weight of recall R and precision P. IE experts usually use β = 1 when computing F. In that case P and R are weighted equally, and the performance of a system at different values of recall and precision is summarized in a single number that is easy to compare.

With β = 1 the F-measure becomes:

$$F = \frac{2 P R}{P + R}$$
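The combined measure can be computed directly from P and R. The sketch below assumes the standard general form F = (β² + 1)PR / (β²P + R), which reduces to 2PR / (P + R) for β = 1; the function name `f_measure` is chosen here for illustration.

```python
def f_measure(precision, recall, beta=1.0):
    """F-measure; beta > 1 weights recall more, beta < 1 weights precision more."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


# With beta = 1 the result is the harmonic mean of P and R.
print(f_measure(2 / 3, 0.5))          # balanced F
print(f_measure(2 / 3, 0.5, beta=2))  # recall counts more
```

Because F is a harmonic mean when β = 1, a system cannot score well by maximizing only one of the two measures.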


CHAPTER 2: INFORMATION EXTRACTION PROBLEMS AND METHODS

2.1 Introduction
As noted earlier, information extraction is a specialized research area within natural language processing. Its problems and methods therefore originate from, and resemble, the methods and techniques used in natural language processing.
This chapter summarizes our survey of the problems related to extracting information from text (keywords, keyphrases, named entities, relations between entities, ...) and of the approaches used to solve them.
2.2 Keyphrase extraction
2.2.1 Introduction
Keyphrases are regarded as a main component, or a form of metadata, that expresses the content of a text document [7]. Most research on keyphrase extraction aims to find good features for encoding text [19][20][21], for use in text classification, clustering, summarization, and search systems. Depending on the characteristics of each language, different methods are used to find keyphrases. Most methods rely on traditional natural language processing techniques such as text pre-processing, paragraph, sentence, and word segmentation, syntactic analysis, semantic analysis, statistics, and machine learning.
To our knowledge, research on extracting phrases as features of Vietnamese text, for use in classification, summarization, and document search systems, began in the 2000s. Well-known results include: Dinh Dien and Hoang Kiem (2001) on Vietnamese word segmentation [27]; Hoang Kiem and Nguyen Tuan Dang (2002) on finding frequent phrases for encoding and clustering Vietnamese text, using a fuzzy approach to phrase segmentation together with n-gram statistics [26]; Hoang Kiem and Huynh Ngoc Tin (2003) on extracting frequent phrases by analyzing text with a list of Vietnamese function words and n-gram statistics [22][25]; Dong Thi Bich Thuy and Ho Bao Quoc (2003), who proposed n-gram phrases combined with a word list as encoding features for a Vietnamese text retrieval system [24]; and Do Phuc and Hoang Kiem (2004), who found frequent word sequences with suffix trees in order to extract main ideas for Vietnamese text summarization [23].
Earlier extraction work relied mostly on syntactic analysis, sentence splitting, and tf*idf frequency statistics to extract phrases. The results were still not really good and contained a lot of noise (meaningless phrases, or phrases that do not express the semantics of the document). Accurately identifying keyphrases, and the boundaries of keywords and keyphrases, in Vietnamese documents remains a hard problem that is still under active research.
For English, the classical approach is still tf*idf frequency, alongside statistical machine learning algorithms combined with natural language processing techniques such as POS tagging and syntactic analysis, together with domain dictionaries. The algorithms most widely used in the research community for English keyphrase extraction are KEA [17][18] and KIP [7][14].
2.2.2 Scope of application
Keywords and keyphrases are useful in the following ways:
- Large text repositories such as digital libraries grow very quickly, which increases the value of summary information.
- They help users get a sense of the content of a document and of a document collection.
- In information retrieval, they describe the documents returned by a query and guide the user's search.
- They serve as a basis for search indexes.
- They serve as features for document classification and clustering.
Assigning keyphrases to documents: keyphrases are usually assigned by hand, that is, authors assign keyphrases to the documents they write. Professional indexers usually choose phrases from a predefined controlled vocabulary.
The problem arises with documents that have no keyphrases. Manual assignment takes a lot of time and effort and requires domain expertise, so techniques for automatic extraction are much needed.

2.2.3 The automatic keyphrase generation problem
Keyphrase assignment: find and select, from a predefined controlled vocabulary, the keyphrases that best describe the document. The training data is a set of documents associated with each phrase in the vocabulary, from which a classifier is built.
Keyphrase extraction: use information retrieval and lexical processing techniques to select keyphrases from the document itself, rather than from phrases predefined in a controlled vocabulary.
2.2.4 The KEA algorithm
Turney (2000) is regarded as the first to address keyphrase extraction with supervised learning [15][16], while other studies used heuristics, n-gram analysis, and methods such as neural networks [11][12][13]. KEA [17][18] is an algorithm for extracting keyphrases from text. KEA identifies a list of candidate phrases using lexical methods, computes feature values for each candidate, and then uses a machine learning algorithm to predict which candidates are keyphrases. KEA is currently regarded as a simple and highly effective algorithm for keyphrase extraction [6][11]. It uses the Naive Bayes learning method for training and for extracting keyphrases.
According to its authors, KEA is language-independent.
The KEA algorithm can be summarized in the following steps:
Step 1: Candidate phrase extraction. KEA extracts candidate n-grams (1 to 3 words long) that neither begin nor end with a stop word. For keyphrase assignment with a predefined vocabulary (controlled indexing), KEA keeps only the candidates that match terms defined in the vocabulary. KEA then discards the candidates that violate the stop-word rule and reduces each remaining candidate to its stemmed form (stemming).

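[Figure 7 placeholder. Recoverable flow: document collection and domain thesaurus -> candidate extraction -> candidate phrases -> feature calculation -> training? If yes: build the Naive Bayes model from pre-labelled keyphrases. If no: compute probabilities with the model -> keyphrases.]

Figure 7: Flowchart of the KEA algorithm (source: http://www.nzdl.org/Kea/description.html)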

Step 2: Feature calculation. For each candidate phrase, KEA computes the following four feature values:
- TFxIDF: expresses how important a candidate phrase is in the current document compared with the other documents in the collection. The higher the TFxIDF of a candidate, the more likely it is to be a keyphrase.
- First occurrence: according to the authors, candidates whose first occurrence is near the beginning or the end of the document are more likely to be keyphrases.
- Phrase length: the number of words in the phrase. According to the authors, phrases of length 2 are usually preferred.
- Node degree (relatedness): the number of phrases in the candidate list that are semantically related to the current phrase. It is computed with the help of the predefined vocabulary. The higher the degree of a candidate, the more likely it is to be a keyphrase.
Step 3: Training and model building. A set of training documents whose keyphrases were assigned by their authors is used to build the model. Starting from the candidate list obtained with the n-gram, stop-word removal, and stemming techniques described above, KEA marks each candidate as a positive example (a keyphrase) or a negative example (not a keyphrase). The model is built by computing the feature values described above for the positive and the negative examples, and it reflects the distribution of feature values in each class.
Step 4: Keyphrase extraction. KEA uses the model built in Step 3 and computes the feature values of the candidate phrases of a new document. It then computes the probability that each candidate is a keyphrase. The candidates with the highest probabilities are placed in the keyphrase list. The user can specify how many keyphrases are wanted per document.
2.2.4.1 Candidate phrase selection
Candidate phrases are selected in three steps.
Input cleaning: the input files are cleaned up and normalized, and the initial phrase boundaries are determined. The input string is split into tokens as follows:
- Punctuation marks, brackets, and numbers are replaced by phrase boundaries.
- Apostrophes are removed.
- Hyphenated words are split in two.
- Remaining characters that are not part of a token are deleted (no token may be without letters).
Result:
- A set of lines
- Each line is a sequence of tokens (each token contains at least one letter)
- Acronyms that contain separating periods are kept as single tokens (for example C4.5)
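The cleaning rules above can be approximated in a few lines. This is a rough sketch under stated assumptions, not KEA's actual implementation: it handles ASCII English text only, treats any token without a letter as a phrase boundary, and uses a hypothetical function name `clean_input`.

```python
import re

# A token is a run of letters/digits, optionally with internal periods (e.g. "C4.5").
# Any other non-space character is treated as punctuation.
TOKEN_OR_PUNCT = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*|[^\sA-Za-z0-9]")


def clean_input(text):
    """Return a list of lines; each line is a list of tokens between phrase boundaries."""
    text = text.replace("'", "")   # remove apostrophes
    text = text.replace("-", " ")  # split hyphenated words in two
    lines, current = [], []
    for match in TOKEN_OR_PUNCT.finditer(text):
        token = match.group()
        if re.search(r"[A-Za-z]", token):
            current.append(token)  # a real token: contains at least one letter
        elif current:
            lines.append(current)  # punctuation or a number closes the current line
            current = []
    if current:
        lines.append(current)
    return lines


print(clean_input("Quinlan's C4.5 (1993) is a well-known decision-tree learner."))
```

On this input the sketch yields two lines: one holding "Quinlans" and "C4.5", and one holding the seven tokens from "is" to "learner".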
Phrase identification: KEA considers all subsequences in each line and decides which of them are suitable candidate phrases. Some other methods try to identify noun phrases, but KEA uses the following rules to identify phrases:
- Maximum length: a candidate phrase usually has at most 3 words.
- A candidate phrase cannot be a proper name.
- A candidate phrase must not begin or end with a stopword.
All sequences of consecutive words in each line are checked against these three rules. The result is a set of candidate phrases.
Example:

Line | Candidate phrases
the programming by demonstration method | programming; demonstration; method; programming by demonstration; demonstration method; programming by demonstration method
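The phrase identification rules can be sketched as a filter over all consecutive word sequences of a line. The stop-word list below is a tiny made-up sample, the proper-name rule is omitted, and `candidate_phrases` is a hypothetical name. Since the example table lists a four-word candidate, the sketch calls the function with `max_len=4` to reproduce it; with the usual limit of 3 the longest candidate is dropped.

```python
# A tiny illustrative stop-word list; KEA ships its own, much larger list.
STOPWORDS = {"the", "by", "a", "an", "of", "and", "in", "to", "for"}


def candidate_phrases(tokens, max_len=3, stopwords=STOPWORDS):
    """Return all n-grams up to max_len words that neither start nor end with a stop word."""
    candidates = []
    n = len(tokens)
    for length in range(1, max_len + 1):
        for start in range(n - length + 1):
            gram = tokens[start:start + length]
            if gram[0].lower() in stopwords or gram[-1].lower() in stopwords:
                continue  # violates the stop-word rule
            candidates.append(" ".join(gram))
    return candidates


line = "the programming by demonstration method".split()
print(candidate_phrases(line, max_len=4))
```

Note that a stop word may still appear inside a candidate, as in "programming by demonstration"; only the first and last positions are checked.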

Stemming: the last step in identifying candidate phrases is stemming with the Lovins (1968) algorithm, which strips suffixes. This lets the system treat different variants of a phrase as one (for example, "cut elimination" becomes "cut elim"). The system also uses stemming when comparing the keyphrases produced by KEA with the keyphrases defined by the authors.
2.2.4.2 Feature calculation
Features are computed for each candidate phrase and are used in both training and extraction. Two features are used: the tf*idf frequency and the first occurrence of the phrase.
TF*IDF (t): this feature compares the frequency of a phrase in a document with its frequency in the whole collection. The fewer documents contain a phrase, the more likely the phrase is a keyphrase of the current document. KEA creates a file that stores the document frequency values for this feature.

$$TF \times IDF = \frac{freq(P, D)}{size(D)} \times \left( -\log_2 \frac{df(P)}{N} \right)$$

freq(P, D) is the number of times phrase P occurs in document D
size(D) is the number of words in document D
df(P) is the number of documents in the collection that contain phrase P
N is the size of the collection
First occurrence (d: distance): the second feature is the number of words that precede the first occurrence of the phrase, divided by the size of the document (total number of words). The value of this feature lies in the interval [0, 1].
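The two features can be computed from simple counts. The sketch below assumes the usual KEA form TF*IDF = freq(P, D) / size(D) * (-log2(df(P) / N)) and works on already tokenized (and, in KEA, stemmed) text; the function names are hypothetical.

```python
import math


def tfidf(freq_in_doc, doc_size, doc_freq, corpus_size):
    """TF*IDF of a phrase: relative frequency in the document times -log2 of its document rate."""
    return (freq_in_doc / doc_size) * -math.log2(doc_freq / corpus_size)


def first_occurrence(doc_tokens, phrase_tokens):
    """Number of words before the first occurrence, divided by document length; None if absent."""
    n = len(phrase_tokens)
    for i in range(len(doc_tokens) - n + 1):
        if doc_tokens[i:i + n] == phrase_tokens:
            return i / len(doc_tokens)
    return None


doc = "keyphrase extraction with naive bayes is simple and naive bayes is fast".split()
phrase = ["naive", "bayes"]
freq = sum(doc[i:i + 2] == phrase for i in range(len(doc) - 1))

# Assume the phrase occurs in 5 documents of a 1000-document collection.
print(tfidf(freq, len(doc), 5, 1000))
print(first_occurrence(doc, phrase))
```

A phrase that is frequent in the document but rare in the collection gets a high TF*IDF, and a phrase that appears early gets a distance close to 0.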
2.2.4.3 Training
The training step uses a set of training documents whose keyphrases have been identified by their authors. For each document in the training set, the candidate phrases are identified and the feature values of each candidate are computed. To reduce the size of the training set, the authors ignore phrases that occur only once in a document. Each candidate is labelled as a keyphrase or not a keyphrase according to the keyphrases specified by the author. Training produces a model that is then used to predict the class of new examples from the values of the two features. The authors experimented with several machine learning methods and chose Naive Bayes for KEA because, in their view, this probabilistic method is simple yet gives quite good results.
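2.2.4.4 Extracting keyphrases
To extract keyphrases from a new document, KEA identifies the candidate phrases and their feature values and then applies the model built during training. The model determines the probability that each candidate is a keyphrase. KEA then performs a post-processing step to select the best possible set of keyphrases.
When the Naive Bayes model is applied to a candidate with feature values t (TF*IDF) and d (distance), the following two quantities are computed:

$$P[\text{yes}] = \frac{Y}{Y + N} \, P_{TF \times IDF}[t \mid \text{yes}] \, P_{distance}[d \mid \text{yes}], \qquad P[\text{no}] = \frac{N}{Y + N} \, P_{TF \times IDF}[t \mid \text{no}] \, P_{distance}[d \mid \text{no}] \qquad (1)$$

Y: the number of phrases that are keyphrases (specified by the authors)
N: the number of candidate phrases that are not keyphrases
The overall probability that a candidate is a keyphrase is computed as:

$$p = \frac{P[\text{yes}]}{P[\text{yes}] + P[\text{no}]} \qquad (2)$$

After the probability p has been computed, the candidates are ranked by p. Two post-processing steps follow. First, TF*IDF is the deciding value when two candidates have the same probability p. Second, the authors remove from the list any phrase that is a subphrase of a phrase with a higher probability. From the remaining list, the algorithm selects the r phrases with the highest probability (r is the number of keyphrases requested).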
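The probability computation and the two post-processing steps can be sketched as follows. This assumes the standard Naive Bayes combination P[yes] = Y/(Y+N) * P[t|yes] * P[d|yes] (and likewise for "no"), with the conditional probabilities already looked up from the trained model; the function names, the input tuples, and the sample numbers are all hypothetical.

```python
def keyphrase_probability(y, n, p_t_yes, p_d_yes, p_t_no, p_d_no):
    """Combine class priors and per-feature likelihoods into p = P[yes] / (P[yes] + P[no])."""
    p_yes = y / (y + n) * p_t_yes * p_d_yes
    p_no = n / (y + n) * p_t_no * p_d_no
    return p_yes / (p_yes + p_no)


def is_subphrase(short, long):
    """True if the words of `short` occur contiguously inside the longer phrase `long`."""
    s, l = short.split(), long.split()
    return len(s) < len(l) and any(l[i:i + len(s)] == s for i in range(len(l) - len(s) + 1))


def select_keyphrases(scored, r):
    """scored: list of (phrase, p, tfidf). Rank by p, break ties by tfidf,
    drop subphrases of higher-ranked phrases, and return the top r phrases."""
    ranked = sorted(scored, key=lambda c: (c[1], c[2]), reverse=True)
    kept = []
    for phrase, _p, _t in ranked:
        if any(is_subphrase(phrase, better) for better in kept):
            continue
        kept.append(phrase)
    return kept[:r]


candidates = [
    ("machine learning", 0.9, 0.2),
    ("learning", 0.6, 0.3),
    ("keyphrase extraction", 0.9, 0.4),
    ("naive bayes", 0.5, 0.1),
]
print(select_keyphrases(candidates, r=3))
```

In the sample, "keyphrase extraction" outranks "machine learning" because of its higher TF*IDF at equal p, and "learning" is dropped as a subphrase of "machine learning".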
2.2.5 The KIP algorithm
2.2.5.1 Idea
A noun phrase that contains keywords or keyphrases of a specific domain is likely to be a keyphrase in that domain. The more keywords or keyphrases a noun phrase contains, the more likely it is to be a keyphrase. The system builds in advance a lexical database that stores the keywords and keyphrases of a specific domain. The keywords in this predefined dictionary are used to compute a score, or weight, for each noun phrase. The candidates with the higher scores are then selected as keyphrases.
2.2.5.2 Description of the algorithm
In brief, KIP consists of the following steps: extract candidate noun phrases from the input document, then examine the composition of each candidate and compute its score. The candidates with the higher scores are selected as keyphrases.
The score of a noun phrase is computed from the following factors:
- Its frequency of occurrence in the document
- The composition of the noun phrase (which words and sub-phrases it contains)
- How strongly the words and sub-phrases that make up the noun phrase relate to the domain of the document
KIP consists of three main components: a POS tagger, a noun phrase extractor, and a keyphrase extraction tool.
* POS tagger: KIP uses the widely used POS tagging method of Brill [32].
* Noun phrase extraction: the noun phrase extractor relies on the POS tags assigned in the previous step and extracts noun phrases that match the pattern {[A]} {N}
(A: adjective; N: noun; {}: repeated one or more times; []: optional)
* Keyphrase extraction: to compute weights for the noun phrases, the algorithm builds a lexical dictionary that contains keywords and keyphrases of a specific domain together with initial weights. The dictionary consists of two lists: a list of keyphrases (each containing one or more words) and a list of keywords (single words obtained by splitting the keyphrases of the first list).
Weight of a noun phrase: W_NP = F x S
F: the frequency of the noun phrase in the document.
S: the sum of the weights of the individual words and of the possible sub-phrases in the candidate:

$$S = \sum_{i} W_i + \sum_{j} P_j$$

W_i: the weight of a word in the noun phrase
P_j: the weight of a sub-phrase in the noun phrase
The purpose of computing the weights of all individual words and sub-phrases is to determine whether a sub-phrase is a keyphrase already defined in the dictionary. If it is, the noun phrase under consideration becomes more important. KIP looks up the keyword and keyphrase lists of the domain dictionary to obtain the weights of the individual words (W_i) and of the sub-phrases (P_j).
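The weight W_NP = F x S can be illustrated with two small lookup tables standing in for the domain dictionary. This is a sketch under assumptions that are not stated in the text: words and sub-phrases missing from the dictionary contribute 0, sub-phrases are all contiguous word sequences of length 2 or more (including the whole phrase), and the weights and the function name are invented for the example.

```python
def noun_phrase_weight(phrase_tokens, frequency, word_weights, phrase_weights):
    """W_NP = F * S, where S sums the dictionary weights of single words (W_i)
    and of contiguous sub-phrases of two or more words (P_j)."""
    word_sum = sum(word_weights.get(word, 0.0) for word in phrase_tokens)
    sub_sum = 0.0
    n = len(phrase_tokens)
    for length in range(2, n + 1):
        for start in range(n - length + 1):
            sub = " ".join(phrase_tokens[start:start + length])
            sub_sum += phrase_weights.get(sub, 0.0)
    return frequency * (word_sum + sub_sum)


# Hypothetical domain dictionary: a keyword list and a keyphrase list with weights.
word_weights = {"digital": 0.5, "library": 0.8}
phrase_weights = {"digital library": 0.9}

score = noun_phrase_weight(["digital", "library", "system"], 3, word_weights, phrase_weights)
print(score)  # 3 * (0.5 + 0.8 + 0.9)
```

A noun phrase built from several dictionary entries accumulates weight from each of them, which matches the idea that phrases containing more domain keywords are better keyphrase candidates.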

2.3 Named entity recognition

2.3.1 Concept
Named entity recognition (NER) (footnote 4) is an information extraction task that seeks to locate, identify, and classify elements of unstructured text into predefined entity categories such as person names, organizations, locations, time expressions, numbers, monetary values, percentages, and so on. Named entities have many applications, especially in areas such as text understanding, machine translation, information retrieval, and automatic question answering.
2.3.2 Approaches and popular systems
Most current named entity recognition systems apply text mining and natural language processing techniques and follow one of these main approaches:
- Grammar-based techniques: the grammar rules are hand-crafted with the help of linguists, which takes a lot of time. The rules have to be changed whenever the application domain or the language changes.
- Statistical learning models: less dependent on the language and independent of domain experts, but they require a large, well-prepared training set in order to build an optimal classifier.
- Combinations of machine learning and natural language processing techniques.
Popular named entity recognition systems include:
- Stanford NER (footnote 5): builds the CRFClassifier on the conditional random field (CRF) model.
- GATE-ANNIE (footnote 6): a subsystem of the GATE framework (General Architecture for Text Engineering), one of the largest projects of the Department of Computer Science at the University of Sheffield, UK. The system is based on dictionaries, ontologies, and hand-built rules for annotating elements in text. Named entities are identified during the annotation of the text.

Footnote 4: http://en.wikipedia.org/wiki/Named_entity_recognition
Footnote 5: http://nlp.stanford.edu/ner/index.shtml
Footnote 6: http://gate.ac.uk/ie/annie.html
2.4 Relation recognition
2.4.1 Concept
Research on entity and relation extraction has been funded and promoted by the MUC (Message Understanding Conferences) and ACE (Automatic Content Extraction) programs. Relation extraction first received attention at the 7th MUC in 1998 and has attracted growing interest since then. Relation extraction is the task of determining the semantic relations between entities in a text or in a sentence, for example the location of an organization or company, or the workplace of a person. From the text "James Gosling has worked for Sun Microsystems, located in Silicon Valley, since 1984" we can identify the entities, their types, and the relations between them as follows:
- PERSON works for ORGANIZATION: the two entities identified are "James Gosling" and "Sun Microsystems", and the relation between them is "works for".
- ORGANIZATION located in LOCATION: the two entities identified are "Sun Microsystems" and "Silicon Valley", and the relation between them is "located in".
2.4.2 Approaches and related research
Most relation extraction methods are rule-based, feature-based, or kernel-based. Some related studies are:
- Rule-based and linguistic feature-based methods rely mainly on natural language processing techniques, linguistic rules, syntax, and lexical, syntactic, and semantic features to determine relations. Typical systems are described in [28][29].
- Kernel methods rely on tree kernels to exploit structural features. Typical studies [30][31] build relation kernels over parse trees. The kernel matches nodes from the root down to the leaves, layer by layer, in a recursive top-down manner.
Most current research concentrates on extracting relations between named entities. Relations between unnamed entities, or between named and unnamed entities, have not yet received much attention. Ontology-based extraction of entities and relations is an approach that currently attracts interest in the research community, and it is the direction this work follows.

CHAPTER 3: METADATA EXTRACTION

3.1 Introduction
Metadata (title, author name, publisher, year of publication, ...) is widely used in digital libraries to describe resources (books, newspapers, journals, documents, theses, dissertations, ...). Metadata makes it easy to classify documents and to search for them in a directed way. In our view, in a knowledge representation model for text, metadata can be regarded as one component of the model, together with other components such as keyphrases, entities, and relations.
The digital libraries of educational organizations and universities keep expanding, with electronic documents that are diverse in genre, format, and subject. Investing in standards, methods, and software to organize, collect, classify, manage, and exploit these documents effectively is necessary and useful, and many individuals, research groups, and organizations have invested in this research in recent years [33][34][40].
According to [35], the current standard for data exchange on the internet, approved by the US national standards organization to replace older standards that are no longer suitable, is ANSI/NISO Z39.85-2001. The standard describes data by means of 15 fields, known as Dublin Core Metadata (footnote 7). These are the most common and most useful fields that accompany digital documents exchanged on the internet.
Extracting and creating metadata for electronic documents helps arrange the documents systematically and lets users find them easily. Creating metadata by hand costs a lot of time and effort. According to [41], it would take one person 60 years to create metadata for one million documents.
Our research goal is to find a method and build a tool that identifies the metadata elements of an electronic document. Being able to identify metadata automatically actively supports the construction of a knowledge model for text documents and the cataloguing of electronic documents. Metadata also lets us look for links between documents. For example, once the metadata of a paper has been identified, we can find out which documents cite it and how often. On that basis each paper can be given a score, which helps considerably when ranking papers in search results. In addition, the metadata of documents in a particular domain can help enrich a domain ontology. For example, the metadata of computer science publications can be used to enrich an ontology of computer science (Computer Science Ontology, CSOnt).
In this chapter we present an approach to extracting metadata from scientific papers that is based on layout structure information and on rules built from patterns. We also build an automatic metadata extraction tool that can be used together with digital library software.
Section 3.2 presents the basic concepts of metadata. Section 3.3 introduces the Dublin Core Metadata standard, which is gradually being adopted in digital libraries and is replacing earlier standards. Section 3.4 presents research related to automatic metadata extraction for organizing digital data. Section 3.5 presents our approach, the architecture of the extraction system, and the rules defined with JAPE Grammar and the ANNIE plug-in of GATE. Section 3.6 presents the experimental results of the proposed method and of the tool we built.

Footnote 7: http://dublincore.org/
3.2 The Concept of Metadata
Metadata is used to describe information resources. The term "meta" comes from a Greek word
denoting something of a more fundamental or higher nature. The most general definition,
widely used in the information technology community, is: metadata is data about other data,
or more briefly, data about data.
In specific contexts, experts have offered different views of metadata:
- According to Chris Taylor, director of library information access services at the
  University of Queensland (footnote 8), metadata is structured data used to describe the
  characteristics of a resource. A metadata record consists of a number of predefined
  elements that describe the properties of the resource and information about it. Each
  element can have one or more values.
- According to Dr. Warwick Cathro of the National Library of Australia (footnote 9), a
  metadata element describes an information resource or supports access to an information
  resource.
In short, metadata can be understood as information used to describe information resources.
3.3 The Dublin Core Metadata Standard
Dublin Core Metadata (footnote 10) is a well-known metadata standard that is widely used
among researchers and digital library experts. It was first proposed in 1995 by the Dublin
Core Metadata Initiative. "Dublin" refers to Dublin, Ohio, USA, which hosted the OCLC/NCSA
Metadata Workshop in 1995. "Core" means a list of core elements for describing resources
(metadata elements); these elements can be extended.
According to [35], in September 2001 the Dublin Core metadata element set was issued as a US
standard, The Dublin Core Metadata Element Set, ANSI/NISO Z39.85-2001.
Dublin Core Metadata comprises 15 basic elements [35], described in detail in the table
below.
Footnote 8: http://www.library.uq.edu.au/iad/ctmeta4.html
Footnote 9: http://www.nla.gov.au/nla/staffpaper/cathro3.html
Footnote 10: http://dublincore.org/

The basic elements of the Dublin Core Metadata standard

No. | Element | Description
1 | Title | The title or heading of the document.
2 | Creator | The author of the document, including both individual and corporate authors.
3 | Subject | The topic the document addresses, used to classify the document. It may be expressed as words or phrases (subject heading scheme) or as classification numbers (classification scheme).
4 | Description | A summary or description of the document's content. It may include an abstract, notes, a table of contents, or a passage of text that clarifies the content.
5 | Publisher | The publisher or issuing body of the document; it may be the name of a person, an agency, an organization, a service, ...
6 | Contributor | The names of those who collaborated on and contributed to the content of the document; they may be individuals, organizations, ...
7 | Date | The date on which the document was issued.
8 | Type | The nature of the document, described with category terms such as: home page, article, report, dictionary, ...
9 | Format | The physical presentation of the document; it may include the carrier medium, size or length, and data type (.doc, .html, .jpg, xls, software, ...).
10 | Identifier | Information that identifies the document, sources that refer to it, or a character string that locates the resource: URL (Uniform Resource Locator, beginning with http://), URN (Uniform Resource Name), ISBN (International Standard Book Number), ISSN (International Standard Serial Number), SICI (Serial Item & Contribution Identifier), ...
11 | Source | Information about the origin of the document: a reference to the source from which the described document was extracted or created. The source may also be given as a URL, URN, ISBN, ISSN, ...
12 | Language | Information about language; describes the main language of the document.
13 | Relation | Information about related documents. A URL, URN, ISBN, ISSN, ... may be used.
14 | Coverage | Information about the scope, scale, or extent of the document. The scope may be a place, a space or a time period, coordinates, ...
15 | Rights | Information about the copyright of the document.
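A record built from the 15 elements in the table can be held in a simple mapping. The
following Python sketch is only an illustration: the names DC_ELEMENTS and make_record are
hypothetical, and the sample values are invented for the example rather than taken from any
real catalogue. It assumes that every element is optional and may carry several values.

```python
# The 15 Dublin Core elements, in the order used in the table.
DC_ELEMENTS = (
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
)


def make_record(**fields):
    """Build a record; each keyword argument must be a list of values."""
    unknown = set(fields) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    # Elements that were not supplied are stored as empty lists.
    return {name: list(fields.get(name, [])) for name in DC_ELEMENTS}


# Invented sample values, for illustration only.
record = make_record(
    Title=["An Example Paper"],
    Creator=["A. Author", "B. Author"],
    Language=["en"],
)
```

Storing every element as a list keeps single-valued and multi-valued elements uniform, at
the cost of a little verbosity for elements such as Title.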

3.4 Metadata Extraction and Related Work
Metadata extraction is a narrower research area within information extraction. Most current
metadata extraction methods fall into two main approaches: methods based on machine
learning [10][36][38][42] and methods based on rules [39][41][43]. These methods are applied
together with the dictionaries and ontologies that have emerged and developed alongside them.
According to [36], typical machine learning methods for metadata extraction include logic
programming, Hidden Markov Models, Support Vector Machines (SVM), and other statistical
learning methods. In [36], the authors use SVM to extract metadata from scientific papers.
Their extraction process has two steps. In the first step they use SVM to classify the lines
in the header of a document (everything above the introduction). In the second step they
extract metadata from the lines classified in the first step, using rules about punctuation
and capital letters combined with dictionaries. The experiments reported in [36] show that
their method gives better results than the other machine learning methods they compared
against.
In [38], the authors propose a metadata extraction method that uses CRF (Conditional Random
Fields). According to the experimental evaluation in [38], their method performs comparably
to the SVM method in [36]. Taken together, the experimental results in [36][38] show that the
CRF and SVM methods are comparable in performance: precision ranges from 86% to 99%, recall
from 45% to 100%, and accuracy from 96% to 100% (the results differ from one metadata field
to another).
In [42], a package named PDF2gsdl is presented. It extracts only the title and the authors
from papers in PDF format, and it can be combined with the Greenstone digital library
software (footnote 11) to create metadata automatically for the documents in a digital
library. The work in [42] applies machine learning, building a neural classifier that uses
features such as layout information, font size, and position. It was tested on a data set of
45 papers taken from conference proceedings, with an accuracy of about 93% for titles and
about 70% for authors.

Footnote 11: http://www.greenstone.org/
The machine learning methods mentioned above give quite impressive results for metadata
extraction. However, creating a labeled training set for a machine learning method requires
a great deal of effort and cost for sampling and labeling. This is the reason for investing
in methods and systems based on rules, dictionaries, and ontologies [37][39][41][43].
In [37], the authors propose a method for extracting the logical structure (title, authors,
sections, definitions, theorems, ...) of papers in mathematics. On top of it they built a
browser that helps users read mathematical papers easily. The proposed algorithm has two
steps. First, it identifies special regions in the document (page numbers, sections,
footnotes at the bottom of the page, captions of tables and figures) using keywords, font
styles, and spacing in the layout. Then detailed information is identified within these
regions based on the style, position, and layout of each region. The authors experimented on
29 mathematical papers and report an accuracy of 93%.
In [39], the authors propose a method for enriching an ontology about artists by searching
for and extracting related personal information (date of birth, place of birth, affiliation,
wedding date, career history, etc.) from internet search results. To do so, they split the
text of the search results into sentences and then use the GATE Framework to recognize
entities such as PERSON, LOCATION, and TIME. They combine this with an existing ontology,
the Artequakt Ontology (CONCEPT-RELATION-CONCEPT) [39], to recognize the relations among the
PERSON, LOCATION, and TIME entities in the sentences of the search results.
Each approach has its own strengths and weaknesses. Machine learning methods require a lot of
time for sampling and labeling, and they need a large amount of training data to give good
results. Methods based on rules and patterns are simpler and easier to implement, but
obtaining good results also takes considerable expert effort in surveying the data and
defining the rules. The rules must also be changed when new kinds of data appear that the
existing rules cannot handle. In practice, for each specific problem one chooses an approach
and a method suited to the problem at hand.
3.5 Our Approach
Our approach builds rules and patterns from the structure and layout information of a
document and combines them with the dictionaries, ontologies, and libraries available in
GATE to extract metadata from scientific documents.
3.5.1 System Architecture

Figure 8: Architecture of the metadata extraction system

3.5.2 Rule-Based Metadata Extraction
Extracting metadata from the header of a scientific document

Figure 9: Steps for extracting metadata from the header of a paper

Extracting metadata from the reference section of a scientific document

Figure 10: Steps for extracting metadata from the reference section of a paper

3.5.3 JAPE Rules for Extracting Metadata from Scientific Papers

3.5.3.1 Rule for identifying the Abstract keyword
Rule: AbstractKeyword
Priority: x
(
  ({SpaceToken.kind=="control"})+
  ({Token.string=="Abstract\u2014"} | {Token.string=="ABSTRACT\u2014"} |
   {Token.string=="Abstract"} | {Token.string=="ABSTRACT"})
  ({Token.string=="."})?
):abstract_Keyword
-->
:abstract_Keyword.AbstractKeyword = {rule = "AbstractKeyword"}
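
For readers without a GATE installation, the intent of the rule above can be approximated
with a regular expression over plain text. This Python sketch is an approximation under
stated assumptions, not a translation of the JAPE rule: it treats a "control" SpaceToken as
a line break, and the names ABSTRACT_KEYWORD and find_abstract_keyword are hypothetical.

```python
import re

# One or more line breaks, then the keyword as a whole word,
# an optional em dash (U+2014), and an optional period.
ABSTRACT_KEYWORD = re.compile(r"[\r\n]+(?:Abstract|ABSTRACT)\b\u2014?\.?")


def find_abstract_keyword(text):
    """Return the (start, end) span of the first Abstract keyword, or None."""
    match = ABSTRACT_KEYWORD.search(text)
    return match.span() if match else None
```

As in the JAPE rule, the matched span includes the preceding line break, so the keyword is
only recognized at the start of a line.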

3.5.3.2 Rule for identifying the References keyword
Rule: ReferencesKeyword
Priority: x
(
  ({SpaceToken.kind=="control"})+
  (
    {Token.kind=="number"}
    ({Token.string=="."})?
    ({SpaceToken.kind=="space"})+
  )?
  ({Token.string=="References"} | {Token.string=="REFERENCES"} |
   {Token.string=="reference"} | {Token.string=="REFERENCE"})
):referencesKeyword
-->
:referencesKeyword.ReferencesKeyword = {rule = "ReferencesKeyword"}
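
The same idea can be sketched outside GATE: a heading such as "References" or
"7. References" at the start of a line marks where the reference section begins. The Python
below is a rough regular-expression approximation, assuming that a "control" SpaceToken is a
line break; the names REFERENCES_KEYWORD and find_references_start are hypothetical.

```python
import re

# Line break(s), an optional section number such as "7." followed by spaces,
# then one of the four keyword spellings accepted by the rule.
REFERENCES_KEYWORD = re.compile(
    r"[\r\n]+(?:\d+\.?[ ]+)?(?:References|REFERENCES|reference|REFERENCE)\b"
)


def find_references_start(text):
    """Return the offset just after the References heading, or None."""
    match = REFERENCES_KEYWORD.search(text)
    return match.end() if match else None
```

Everything after the returned offset can then be passed on to the step that splits the
section into individual references.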

3.5.3.3 Rule for splitting the references
Rule: ReferencesBreak
Priority: x
(
  (
    {SpaceToken.kind=="control"}
    (
      (
        ({Token.string=="["})
        ({Token} | {SpaceToken.kind=="space"})+
        ({Token.string=="]"})
      ):referenceBreak_1
      |
      (
        ({Token.string=="("})
        {Token.kind=="number", Token.length < 3}
        ({Token.string==")"})
      ):referenceBreak_2
      |
      (
        {Token.kind=="number", Token.length < 3}
        {Token.string=="."}
      ):referenceBreak_3
    )
  )
  |
  (
    ({Token.string=="References"} | {Token.string=="REFERENCES"} |
     {Token.string=="."} | {Token.kind=="number"} | {Lookup.majorType=="year"})
    (({SpaceToken.kind=="control"})+):referenceBreak_4
    ({Person} | {Lookup.majorType=="person_first"})
  )
)
-->
:referenceBreak_1.ReferenceBreak_1 = {rule = "ReferencesBreak"},
:referenceBreak_2.ReferenceBreak_2 = {rule = "ReferencesBreak"},
:referenceBreak_3.ReferenceBreak_3 = {rule = "ReferencesBreak"},
:referenceBreak_4.ReferenceBreak_4 = {rule = "ReferencesBreak"}
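
The first three alternatives of the rule above recognize numbered markers at the start of a
line: "[...]", "(n)", and "n." with a number of at most two digits. The Python sketch below
approximates only these three cases with a regular expression. The fourth alternative
depends on GATE's Person annotations and gazetteer lookups and is deliberately left out. The
names REFERENCE_BREAK and split_references are hypothetical.

```python
import re

# A line break followed by a marker: [12] or [Abc99], (12), or 12.
# The marker is matched with a lookahead so that it stays with its entry.
REFERENCE_BREAK = re.compile(
    r"[\r\n]+\s*(?=\[[^\]\r\n]+\]|\(\d{1,2}\)|\d{1,2}\.)"
)


def split_references(section):
    """Split the text of a reference section into one string per entry."""
    # Prepend a line break so the first entry is delimited like the others.
    parts = REFERENCE_BREAK.split("\n" + section.strip())
    # Collapse the internal line breaks and indentation of wrapped entries.
    return [" ".join(part.split()) for part in parts if part.strip()]
```

Entries that continue over several lines stay together because a continuation line does not
begin with one of the three markers.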

3.5.3.4 Rule for identifying the email line
Rule: LineEmailAnnotation
Priority: x
(
  (
    {Token.string=="{"}
    (
      {Token}
      ({SpaceToken.kind=="space"})?
    )+
    ({SpaceToken.kind=="control"})?
  )?
  (
    {Token}
    ({SpaceToken.kind=="space"})?
  )+
  (
    {Token.string=="@"} | {Address.kind=="email"} | {Token.string=="}"}
  )
  ({SpaceToken.kind=="space"})?
  (
    {Token}
    ({SpaceToken.kind=="space"})?
  )+
):lineEmailAnnotation
-->
:lineEmailAnnotation.LineEmailAnnotation = {rule = "LineEmailAnnotation"}
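
The rule above marks a whole line as an email line, including the grouped form that paper
headers often use, for example {a, b}@domain. As a complement, the Python sketch below shows
one way the individual addresses might be pulled out of such a line. It is an assumption
about a possible follow-up step, not part of the JAPE grammar. The regular expressions are
deliberately simple, and the names EMAIL, GROUPED, and extract_emails are hypothetical.

```python
import re

# A plain address such as a.author@example.org.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# A grouped form such as {a, b}@example.org.
GROUPED = re.compile(r"\{([^{}]+)\}\s*@\s*([\w-]+(?:\.[\w-]+)+)")


def extract_emails(line):
    """Return the email addresses found in one line of a paper header."""
    emails = []
    for names, domain in GROUPED.findall(line):
        emails += [f"{n.strip()}@{domain}" for n in names.split(",") if n.strip()]
    # Remove the grouped forms before looking for plain addresses.
    rest = GROUPED.sub(" ", line)
    emails += EMAIL.findall(rest)
    return emails
```

A line for which the function returns an empty list is unlikely to be an email line.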

3.5.3.5 Rule for identifying the affiliation line
Rule: LineAffiliationAnnotation
Priority: x
(
  (
    {Token.string=="Dept"} | {Token.string=="dept"} |
    {Token.string=="University"} | {Token.string=="university"} |
    {Token.string=="Faculty"} | {Token.string=="FACULTY"} |
    {Lookup.majorType=="location"} |
    {Lookup.majorType=="org_key"} | {Lookup.majorType=="org_base"} |
    {Lookup.majorType=="cdg"} |
    {Lookup.majorType=="facility_key", !Token.string=="Hall"} |
    (
      (
        {Token.kind=="number", Token.length>=3}
        {SpaceToken.kind=="space"}
      )
      |
      (
        {Token.kind=="number"}
        ({SpaceToken.kind=="space"})?
        ({Token.kind=="punctuation", Token.subkind=="dashpunct"})
        ({SpaceToken.kind=="space"})?
        {Token.kind=="number"}
      )
    )
  )
  ({SpaceToken.kind=="space"})?
  (
    {Token}
    ({SpaceToken.kind=="space"})?
  )*
):lineAffiliationAnnotation
-->
:lineAffiliationAnnotation.LineAffiliationAnnotation = {rule = "LineAffiliationAnnotation"}
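
The rule above is triggered by literal keywords, by gazetteer lookups, or by number patterns
that look like street or postal numbers. The Python sketch below mirrors only the literal
keywords and the two number patterns. The gazetteer lookups (location, org_key, org_base,
cdg, facility_key) have no equivalent here, so this is a much weaker test than the JAPE
rule. The names AFFILIATION_KEYWORDS, NUMBER_HINT, and is_affiliation_line are hypothetical.

```python
import re

# The literal keywords listed in the rule, matched as whole words.
AFFILIATION_KEYWORDS = re.compile(
    r"\b(?:Dept|dept|University|university|Faculty|FACULTY)\b"
)
# A number of three or more digits followed by a space,
# or two numbers joined by a dash, such as 12-34.
NUMBER_HINT = re.compile(r"\b\d{3,} |\b\d+ ?- ?\d+\b")


def is_affiliation_line(line):
    """Heuristically decide whether a header line describes an affiliation."""
    return bool(AFFILIATION_KEYWORDS.search(line) or NUMBER_HINT.search(line))
```

Because the number patterns are loose, such a test is best applied only to the lines of the
paper header, not to the whole document.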

3.5.3.6 Rule for splitting authors from the author line
Rule: Author
Priority: 40
(
  (
    {Person}
    |
    (
      {Token.string!=",", Token.string!="and", Token.kind!="number"}
    )+
  ):author
)
-->
:author.Author = {rule = "Author"}
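
In the second alternative of the rule above, an author is a run of tokens that contains no
comma, no "and", and no number. The Python sketch below expresses the same idea on a plain
string by treating those three things as separators. It ignores the Person annotation used
by the first alternative, and the name split_authors is hypothetical.

```python
import re


def split_authors(author_line):
    """Split an author line on commas, the word 'and', and numbers."""
    # Numbers usually appear as affiliation markers after a name.
    parts = re.split(r",|\band\b|\d+", author_line)
    # Normalize whitespace and drop the empty pieces left between separators.
    return [" ".join(part.split()) for part in parts if part.strip()]
```

The word boundary around "and" keeps names that merely contain those letters, such as
"Anderson", in one piece.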

3.6 Experiments and Evaluation
We downloaded documents and scientific papers from digital libraries and computer science
journals such as ACM, Springer, IEEE, and CiteSeer for the experiments. We ran the
experiments on 200 of the downloaded papers. To evaluate the approach we use the traditional
measures of information retrieval: recall (R), precision (P), and the F-measure.
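$$R = \frac{tp}{tp + tn}, \qquad P = \frac{tp}{tp + fp}, \qquad F = \frac{2 \cdot P \cdot R}{P + R}$$

where:
tp: the number of correct results that were found
tn: the number of correct results that were not found (usually called false negatives)
fp: the number of results that were found but are not correct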

The experimental results were measured on several of the main metadata attributes of the
Dublin Core Metadata standard and are shown in the table below:

Metadata | Precision (%) | Recall (%) | F-Measure (%)
Title | 100.00 | 100.00 | 100.00
Authors | 92.72 | 89.47 | 91.07
Affiliation | 95.83 | 92.00 | 93.87
Email | 100.00 | 100.00 | 100.00
Abstract | 96.55 | 93.33 | 94.92
References | 97.44 | 88.05 | 92.51
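
The three measures are straightforward to compute. The Python sketch below follows the
definitions given in this section; the function names and the parameter name missed (the
text's tn, the correct results that were not found) are hypothetical. As a check, it
recomputes the F-measure of the Authors row from the tabulated precision and recall.

```python
def recall(tp, missed):
    """R = tp / (tp + tn), where tn counts correct results that were not found."""
    return tp / (tp + missed)


def precision(tp, fp):
    """P = tp / (tp + fp), where fp counts results found that are not correct."""
    return tp / (tp + fp)


def f_measure(p, r):
    """F = 2 * P * R / (P + R); p and r must be on the same scale."""
    return 2 * p * r / (p + r)


# Authors row of the results table: P = 92.72, R = 89.47.
authors_f = round(f_measure(92.72, 89.47), 2)
print(authors_f)  # 91.07
```

For some rows the recomputed value can differ from the table by 0.01, because the tabulated
precision and recall are themselves already rounded. The functions do not guard against
empty result sets, where a denominator would be zero.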

CHAPTER 4: CONCLUSION AND FUTURE WORK

4.1 Conclusion
The aim of this work is to find and build a knowledge model for text documents and to
extract the relevant knowledge components from text to populate that model, with a view to
building a smarter search and query system. To that end, this report surveys the field of
information extraction from text, together with related methods, systems, and applications
such as keyphrase extraction, metadata extraction, and the extraction of entities and the
relations between them. The main research contribution of the report is an approach to
automatically extracting metadata elements from information technology papers published in
conference proceedings and journals, based on patterns built from the elements neighbouring
the extracted component (prefixes and suffixes). The results of the report can be summarized
as follows:
- Basic knowledge of information extraction from text.
- Related research and application problems of information extraction from text.
- Methods for extracting keyphrases, entities, and relations between entities, and methods
  for extracting metadata from scientific papers.
- A proposed metadata extraction method based on building rules and patterns combined with
  dictionaries and with prefix and suffix information.
- A data set of information technology papers collected from journals and digital libraries
  such as ACM, IEEE, Springer, and CiteSeer for the experiments. The results obtained are
  fully comparable with those of machine learning methods (the detailed experimental results
  and their evaluation are in Section 3.6 of Chapter 3).
- Two papers published at international conferences (one at ICEMT 2010, organized by IEEE,
  and one at IT@EDU 2010) [44][45].


4.2 Future Work
- Study improvements to the methods for extracting keyphrases, entities, and relations from
  documents.
- Build a knowledge model for text documents with the following main components: metadata,
  keyphrases, entities, and relationships.
- Build a measure for the knowledge model of text documents.
- Apply the model to build an intelligent document query system (search, question answering).


REFERENCES


[1] Line Eikvil. Information Extraction from World Wide Web - A Survey. Norwegian
Computing Center, PB, Citeseer. July 1999.
[2] Jim Cowie and Yorick Wilks. Information Extraction, 1996.
[3] Alexander Yates. Information Extraction from the Web: Techniques and Applications.
Phd thesis, University of Washington, 2007.
[4] Kamal Nigam, Google Pittsburg. Machine Learning for Information Extraction: An
Overview, 2007. (Slides)
[5] Dr Diana Maynard, Computer Science Department,University of Sheffield.
http://gate.ac.uk/g8/page/print/2/demos/talks/maynard_diana_01.wmv. (Slides&video)
[6] Eleni Mangina *, John Kilbride. Evaluation of keyphrase extraction algorithm and
tiling process for a document/resource recommender within e-learning environments. Edu
Elsevier. 2008.
[7] Yi-fang Brook Wu, Quanzhi Li. Document keyphrases as subject metadata:
incorporating document key concepts in search results. Inf Retrieval -Springer. 2008.
[8] Mo Chen, Jian-Tao Sun, Hua-Jun Zeng, Kwok-Yan Lam. A Practical System of
Keyphrase Extraction for Web Pages. ACM SIGIR_2005.
[9] Raymond J. Mooney and Razvan Bunescu. Mining knowledge Using Information
Extraction. ACM SIGKDD_2005.
[10] K. Seymore, A. McCallum, R. Rosenfeld, Learning hidden Markov model structure
for information extraction, In: AAAI, Workshop on Machine Learning for Information
Extraction, 1999.
[11] Su Nam Kim-University of Melbourne, Min-Yen Kan-National University of
Singapore, Re-examining Automatic Keyphrase Extraction Approaches in Scientific
Articles, Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP
2009, Singapore, 6 August 2009, c2009 ACL and AFNLP, page 9-16.


[12] Niraj Kumar & Kannan Srinathan, Automatic Keyphrase Extraction from Scientific
Documents Using N-gram Filtration Technique, Proceeding of the eighth ACM
symposium on Document engineering. Information extraction in documents, 2008, page
199-208.
[13] Jiabing Wang et al, Ensemble Learning for Keyphrases Extraction from Scientific
Document, Book-Advances in Neural Networks - ISNN 2006, Publisher Springer
Berlin/Heidelberg 2006, page.1267-1272.
[14] Yi-fang Brook Wu, Quanzhi Li, Razvan Stefan Bot, Xin Chen, Domain-specific
Keyphrase Extraction. CIKM05, October 31-November 5, 2005, Bremen, Germany,
ACM-2005.
[15] P.D. Turney, Learning algorithms for keyphrase extraction, Information Retrieval,
vol. 2, no. 4, pp. 303- 336, 2000.
[16] P.D. Turney, Learning to Extract Keyphrases from Text. National Research Council,
Institute for Information Technology, Technical Report ERB-1057, 1999.
[17] I.H. Witten, G.W. Paynter, E. Frank, C. Gutwin and C.G. Nevill-Manning. KEA:
Practical automatic Keyphrase Extraction. The proceedings of Digital Libraries '99: The
Fourth ACM Conference on Digital Libraries, pp. 254-255, 1999.
[18] Web link for KEA5.0 source code: http://www.nzdl.org./Kea/download.html
[19] Teuvo Kohonen, et al. Self-Organizing Maps, Third edition, Springer, 2002.
[20] A. Rauber, D. Merkl, and M. Dittenbach: The Growing Hierarchical Self-Organizing
Map: Exploratory Analysis of High-Dimensional Data in: IEEE Transactions on Neural
Networks, Vol. 13, No 6, pp. 1331-1341, IEEE, November 2002.
[21] Michael Dittenbach, Andreas Rauber, Dieter Merkl, Uncovering Hierarchical Structure in
Data Using the Growing Hierarchical Self-Organizing Map, Institute of Software Technology,
Vienna University of Technology, Vienna, Austria, 24 July 2002.
[22] Hoang Kiem, Huynh Ngoc Tin. Organization, management and knowledge discovery from the
English, Vietnamese text collection. Proceedings JCIS2003-USA (7th Joint Conference on
Information Sciences, September 2003, North Carolina, USA), pages 1613-1616.
[23] Đỗ Phúc, Hoàng Kiếm. Extracting main ideas from Vietnamese text to support content
summarization (in Vietnamese). Journal of Research and Development Works in
Telecommunications and Information Technology, no. 13, 2004.
[24] Đồng Thị Bích Thủy, Hồ Bảo Quốc. Applying natural language processing in an information
retrieval system for Vietnamese text (in Vietnamese). University of Science, 2003.
[25] Huỳnh Ngọc Tín. Content management and knowledge discovery on Vietnamese text (in
Vietnamese). Master's thesis, University of Science, VNU-HCM, 2003.
[26] Nguyễn Tuấn Đăng. Mining Vietnamese text data with SOM (Self-Organizing Map) (in
Vietnamese). Master's thesis, Faculty of Information Technology, University of Science,
VNU-HCM, 2002.
[27] Dinh Dien, Hoang Kiem, Nguyen Van Toan. Vietnamese Word Segmentation.
Proceedings of the NLPRS2001, Tokyo, Japan, 27-30 November 2001, p.749-756.
[28] Scott Miller, Heidi Fox, et al. A Novel use of statistical parsing to extract
information from Text, In 6th Applied Natural Language Processing Conference, 2000.
[29] Zhou GuoDong, Su Jian, et al. Exploring Various Knowledge in Relation Extraction.
Proceedings of the 43rd Annual Meeting of ACL, pages 427-434, Association for
Computational Linguistics, 2005.
[30] Dmitry Zelenko, Chinatsu Aone, Anthony Richardella. Kernel Methods for Relation
Extraction. Journal of Machine Learning Research 3, pages 1083-1106, 2003.
[31] Razvan C. Bunescu, Raymond J. Mooney. Subsequence Kernels for Relation
Extraction. In Advances in Neural Information Processing Systems, 2006.
[32] Brill, E. Transformation-based error-driven learning and natural language processing:
a case study in part-of-speech tagging. Computational Linguistics, 21(4), 543-565, 1995.

[33] D. Bainbridge, J. Thompson, and I. Witten, Assembling and enriching digital library
collections, In Proc. Joint Conference on Digital Libraries, pages 323-334, 2003.
[34] D. Bainbridge, K. J. Don, G. R. Buchanan, I. H. Witten, S. Jones, M. Jones, and M.
I. Barr, Dynamic digital library construction and configuration, In Proc. European
Conference on Digital Libraries, pages 1-16, 2004.
[35] http://www.nlv.gov.vn/nlv/index.php/en/2008060697/DUBLIN-CORE/XML-Metadata-va-Dublin-Core-Metadata.html
[36] H. Han, C.L. Giles, E. Manavoglu, H. Zha, Z. Zhang, E.A. Fox, Automatic
document metadata extraction using support vector machines, In: Proceedings of the 3rd
ACM/IEEE-CS Joint Conference on Digital Libraries, International Conference on Digital
Libraries, pages 37-48. IEEE Computer Society Press, Washington, DC, 2003.
[37] K. Nakagawa, A. Nomura, and M. Suzuki, Extraction of Logical Structure from
Articles in Mathematics, MKM, LNCS 3119, pages 276-289, Springer Berlin Heidelberg, 2004.
[38] F. Peng, A. McCallum, Accurate Information Extraction from Research Papers using
Conditional Random Fields, Information Processing and Management: an International
Journal, pages 963-979, 2006.
[39] H. Alani, S. Kim, D. E. Millard, M. J. Weal, P. H. Lewis, W. Hall and N. R Shadbolt,
Automatic Extraction of Knowledge from Web Documents, In: 2nd International
Semantic Web Conference - Workshop on Human Language Technology for the Semantic
Web and Web Services, October 20-23, Sanibel Island, Florida, USA, 2003.
[40] J. Greenberg, K. Spurgin, A. Crystal, Final Report for the Automatic Metadata
Generation Applications (AMeGA) Project, UNC School of Information and Library
Science. http://ils.unc.edu/mrc/amega/, 2005. Last visited date 30/04/2010.
[41] P. Flynn, L. Zhou, K. Maly, S. Zeil, and M. Zubair, Automated Template-Based
Metadata Extraction Architecture, ICADL 2007, LNCS 4822, pages 327-336,
Springer-Verlag Berlin Heidelberg, 2007.
[42] S. Marinai, Metadata Extraction from PDF Papers for Digital Library Ingest, 10th
International Conference on Document Analysis and Recognition. ICDAR-IEEE, pages
251-255, 2009.
[43] B. A. Ojokoh, O. S. Adewale and S. O. Falaki, Automated document metadata
extraction. Journal of Information Science, pages 563-570, 2009.
[44] Tin Huynh, Kiem Hoang. Automatic Metadata Extraction from sciencetific papers.
Proceeding of IT@EDU, Phan Thiet, VietNam, 2010.
[45] Tin Huynh, Kiem Hoang. GATE Framework Based Metadata Extraction from
Scientific Papers, Proceeding of ICEMT Egypt, IEEE, 2010.
