Abstract
Text mining is a branch of data mining that aims to find and extract information contained in text [2]. With the rapid growth of textual data, text mining has found more and more practical applications, such as spam filtering, résumé matching, sentiment analysis, and document classification. This report introduces text mining and the basic theory behind extracting information from text.
1. INTRODUCTION
Text databases are growing rapidly and attracting research interest because of the fast increase in the amount of information available in digital form, such as electronic documents, e-mail, and web pages. Most of the information held by governments, industry, businesses, and schools has been digitized and stored in databases of this kind.
The data stored in a text database is semistructured data: it is neither completely unstructured nor completely structured [1]. For example, a document may contain a few structured fields such as title, author names, publication date, and category, but also a large amount of unstructured text such as the abstract or the body of the document.
The problem, then, is how to search such a data source and extract knowledge from it. The techniques that address this problem are called "Text Mining" techniques, or text data mining [4].
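$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$
$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$
$$F\text{-score} = \frac{\text{recall} \times \text{precision}}{(\text{recall} + \text{precision})/2}$$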
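The set-based retrieval measures precision, recall, and F-score can be illustrated with a short sketch. It assumes the definitions given in [1], where {Relevant} is the set of documents relevant to a query and {Retrieved} is the set the system returned. The function name `precision_recall_f` and the document identifiers are hypothetical, chosen only for illustration.

```python
def precision_recall_f(relevant, retrieved):
    """Set-based retrieval measures; both arguments are sets of document ids."""
    hits = len(relevant & retrieved)  # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        # Convention assumed here: no overlap at all gives an F-score of 0.
        return precision, recall, 0.0
    # Harmonic mean of recall and precision.
    f_score = (recall * precision) / ((recall + precision) / 2)
    return precision, recall, f_score


# Made-up example: 4 relevant documents, 3 retrieved, 2 in common.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d3", "d4", "d5"}
p, r, f = precision_recall_f(relevant, retrieved)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

Because the F-score is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is often preferred over a plain average of the two.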
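$$TF(d,t) = \begin{cases} 0 & \text{if } freq(d,t) = 0 \\ 1 + \log(1 + \log(freq(d,t))) & \text{otherwise} \end{cases}$$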
Besides term frequency, another fairly important measure is the inverse document frequency (IDF), which represents the importance of a term t [3]. If a term appears in many documents, its importance is scaled down because its power to discriminate between documents is reduced [1]. For example, the term "database system" is likely to be less important if it appears in most of the papers at conferences on databases. The value IDF(t) is defined by the following formula:
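$$IDF(t) = \log \frac{1 + |d|}{|d_t|}$$
where d is the document collection and d_t is the set of documents that contain the term t.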
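As a minimal sketch of the IDF measure, assuming the definition from [1], IDF(t) = log((1 + |d|) / |d_t|), with the natural logarithm (the base is an assumption here). The helper name `idf` and the toy collection are hypothetical. Each document is reduced to its set of terms, since IDF depends only on whether a term occurs in a document.

```python
import math


def idf(term, documents):
    """IDF(t) = log((1 + |d|) / |d_t|) over a list of term sets."""
    n_docs = len(documents)                                    # |d|
    n_containing = sum(1 for doc in documents if term in doc)  # |d_t|
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in the collection")
    return math.log((1 + n_docs) / n_containing)


# Toy collection imitating papers from a database conference.
docs = [
    {"database", "system", "query"},
    {"database", "system", "index"},
    {"database", "system", "mining"},
    {"text", "mining", "database"},
]

for term in ("database", "mining", "text"):
    print(term, round(idf(term, docs), 3))
```

In this collection "database" occurs in every document and gets the smallest IDF, while "text" occurs in only one and gets the largest, which matches the "database system" example in the text.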
$$TF\text{-}IDF(d,t) = TF(d,t) \times IDF(t)$$
$$sim(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1|\,|v_2|}$$
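The sketch below shows how term frequency and IDF can be combined into TF-IDF vectors and compared with cosine similarity. It rests on several assumptions: the damped TF variant and the IDF definition are taken from [1], natural logarithms are used, and tokenization is plain whitespace splitting. All function names and the three sample sentences are hypothetical.

```python
import math
from collections import Counter


def tf(freq):
    """Damped term frequency (assumed variant from [1])."""
    return 0.0 if freq == 0 else 1 + math.log(1 + math.log(freq))


def idf(term, docs):
    """IDF(t) = log((1 + |d|) / |d_t|); term must occur in at least one doc."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log((1 + len(docs)) / containing)


def tfidf_vector(doc, docs, vocab):
    """One TF-IDF weight per vocabulary term, in vocabulary order."""
    counts = Counter(doc)  # missing terms count as 0
    return [tf(counts[t]) * idf(t, docs) for t in vocab]


def cosine(v1, v2):
    """sim(v1, v2) = (v1 . v2) / (|v1| |v2|)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0


docs = [
    "text mining extracts information from text".split(),
    "data mining finds patterns in data".split(),
    "spam filtering is a text mining application".split(),
]
vocab = sorted({t for doc in docs for t in doc})
vectors = [tfidf_vector(doc, docs, vocab) for doc in docs]

# Documents 0 and 2 share "text" and "mining"; documents 1 and 2 share only "mining".
print(cosine(vectors[0], vectors[2]), cosine(vectors[1], vectors[2]))
```

The first similarity comes out larger than the second because "mining" occurs in every document and therefore carries little weight, while "text" is rarer and contributes more to the dot product.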
4. CONCLUSION
In general, text mining techniques are more complex than traditional data mining techniques because they must operate on text data, which is inherently unstructured and fuzzy. In practice, however, users still prefer text-based data storage systems and use them more and more. It is therefore reasonable to believe that products built on text mining may have many times the commercial value of other, traditional data mining products [4].
5. REFERENCES
[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann Publishers, 2006.
[2]
[3] R. Feldman and J. Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge: Cambridge University Press, 2007.
[4]