$$d_1 = [(\langle 2\rangle\langle b_0,1\rangle\langle b_1,1\rangle)_{t_1=\mathrm{cat}},\ (\langle 1\rangle\langle b_0,1\rangle)_{t_2=\mathrm{hat}}]^T$$

$$d_2 = [(\langle 1\rangle\langle b_0,2\rangle)_{t_1=\mathrm{cat}},\ (\langle 1\rangle\langle b_1,1\rangle)_{t_3=\mathrm{fine}},\ (\langle 1\rangle\langle b_1,1\rangle)_{t_4=\mathrm{pet}}]^T$$

$$d_3 = [(\langle 1\rangle\langle b_0,1\rangle)_{t_1=\mathrm{cat}},\ (\langle 1\rangle\langle b_1,1\rangle)_{t_4=\mathrm{pet}},\ (\langle 1\rangle\langle b_0,1\rangle)_{t_5=\mathrm{dog}},\ (\langle 1\rangle\langle b_0,1\rangle)_{t_6=\mathrm{make}},\ (\langle 1\rangle\langle b_1,1\rangle)_{t_7=\mathrm{good}}]^T \qquad (2)$$
2.2 Calculate Term Weights
Term weights are used to compute the degree of similarity between each document in the collection and the user query. Let t be an index term in document d and let B be the chosen number of bins. A weight $w_{d,t,b} > 0$ is associated with the number of occurrences of term t inside document d in location bin b, where $b \in \{0, \ldots, B-1\}$. For each term t inside document d, the weight vector is $w_{d,t} = [w_{d,t,0}, \ldots, w_{d,t,B-1}]^T$. This weight vector $w_{d,t}$ quantifies the importance of index term t for describing the semantic content of document d. Using this definition, and assuming the weight value equals the term count, the weight vectors for the terms CAT and HAT in $d_1$ are $w_{d_1,t_1} = [\langle b_0, 1\rangle, \langle b_1, 1\rangle]^T$ and $w_{d_1,t_2} = [\langle b_0, 1\rangle]^T$, respectively. This is only the basic definition of a term weight; in practice, the weight value is not just the term count but is modified as explained next.
FDS requires the weights to be applied before performing the Fourier transform. FDS [1] implements the BD-ACI-BCA scheme [6] for calculating the similarity measure between documents and queries; the document-term weight is denoted by $w_{d,t}$ (in position xx-Axx-xxx) and the query-term weight by $w_{q,t}$ (in position xx-xxx-Bxx), as shown in the equations below:

$$w_{d,t} = 1 + \log_e f_{d,t} \qquad (3)$$

$$w_{q,t} = (1 + \log_e f_{q,t}) \log_e\!\left(1 + \frac{f_m}{f_t}\right) \qquad (4)$$

where $f_{d,t}$ and $f_{q,t}$ are the counts of term t in document d and in query q, respectively, $f_t$ is the number of documents containing term t, and $f_m$ is the largest $f_t$.
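As a concrete illustration, the two weighting formulas (3) and (4) can be sketched in a few lines of Python (the function names are ours, not part of FDS or [6]):

```python
import math

def doc_term_weight(f_dt):
    # Eq. (3): w_{d,t} = 1 + ln(f_{d,t}), for f_{d,t} >= 1
    return 1.0 + math.log(f_dt)

def query_term_weight(f_qt, f_t, f_m):
    # Eq. (4): w_{q,t} = (1 + ln(f_{q,t})) * ln(1 + f_m / f_t);
    # rarer terms (smaller f_t) receive a larger weight
    return (1.0 + math.log(f_qt)) * math.log(1.0 + f_m / f_t)
```

Note that a term occurring once in a document gets weight 1, and a query term present in every document ($f_t = f_m$) still gets the nonzero weight $\ln 2$.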
With Oracle Text as the preprocessing tool, the value of $f_t$, and hence the value of $f_m$, can be retrieved from the token count column of Table 1(a). With some SQL scripting, the value of $f_{d,t}$ is obtained from Table 1(b).
In document collections, document length often varies. For example, a short document that is totally relevant to the query and a long document that is only partially relevant may contain the same number of query-term occurrences. These two documents should not be ranked equally; the shorter one should be ranked higher. Normalization of term weights is therefore used to remove the advantage that long documents would otherwise have over short documents in retrieval. Document length normalization penalizes the term weights of a document in accordance with its length.
The BD-ACI-BCA scheme [6] normalizes the document-term weight $w_{d,t}$ by its document length, or document vector norm, $W_d$ (in position xx-xxI-xxx), and $w_{d,t}$ in (3) becomes

$$w_{d,t} = \frac{1 + \log_e f_{d,t}}{(1-s) + s \dfrac{W_d}{\mathrm{av}_{d \in D} W_d}} \qquad (5)$$

where s is the slope factor (set to 0.7) and $\mathrm{av}_{d \in D} W_d$ is the average document vector norm over the document collection D. Because FDS captures the location of terms within a document through the bin value b, $w_{d,t}$ must take the bin value into account, which modifies (5) into

$$w_{d,t,b} = \frac{1 + \log_e f_{d,t,b}}{(1-s) + s \dfrac{W_d}{\mathrm{av}_{d \in D} W_d}} \qquad (6)$$

where $f_{d,t,b}$ is the count of term t in bin b of document d and $b \in \{0, 1, \ldots, B-1\}$.
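A minimal sketch of (6), assuming the document vector norm $W_d$ and its collection average are precomputed (names are ours):

```python
import math

def binned_weight(f_dtb, W_d, avg_W, s=0.7):
    # Eq. (6): length-normalised weight of term t in bin b of document d
    if f_dtb == 0:
        return 0.0  # term absent from this bin
    return (1.0 + math.log(f_dtb)) / ((1.0 - s) + s * W_d / avg_W)
```

For a document of average length ($W_d = \mathrm{av}\,W_d$) the denominator is 1, so the weight reduces to the plain $1 + \log_e f_{d,t,b}$ of (3); longer documents are penalized.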
2.3 Perform Fourier Transform

The number sequence $w_{d,t,0}, \ldots, w_{d,t,B-1}$ is still information in the time or spatial domain and needs to be transformed into the frequency domain. The Fourier transform defines the relationship between a signal in the time or spatial domain and its representation in the frequency domain, known as the (Fourier) spectrum. A spectrum is made up of a number of frequency components, each with a real and an imaginary part; equivalently, each frequency component has a magnitude and an associated phase representing the same information. The discrete form of the Fourier transform is [1]:

$$v_{d,t,\omega} = \sum_{b=0}^{B-1} w_{d,t,b} \left( \cos\frac{2\pi\omega b}{B} - i \sin\frac{2\pi\omega b}{B} \right) \qquad (7)$$

where $v_{d,t,\omega}$ is the projection of the term signal $w_{d,t}$ onto a sinusoidal wave of frequency $\omega$. The spectral component number $\omega$ is an element of the set $\{0, \ldots, B-1\}$.
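Equation (7) is the standard discrete Fourier transform of the length-B term signal; a direct transcription follows (in practice one would call an FFT library instead):

```python
import cmath

def term_spectrum(w):
    # Eq. (7): v_{d,t,omega} = sum_b w_b (cos(2*pi*omega*b/B) - i sin(2*pi*omega*b/B))
    #                        = sum_b w_b * exp(-2*pi*i*omega*b/B)
    B = len(w)
    return [sum(w[b] * cmath.exp(-2j * cmath.pi * omega * b / B)
                for b in range(B))
            for omega in range(B)]
```

For the signal (1, 1, 0, 0) of term A in the Section 3 example, component 0 is 2 (the total count) and component 1 is $1 - i$.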
The discrete Fourier transform thus produces the following mapping [1]:

$$\left[ w_{d,t,b} \right] \mapsto \left[ v_{d,t,b} \right] = \left[ H_{d,t,b} \exp(i\phi_{d,t,b}) \right] \qquad (8)$$

where $v_{d,t,b}$ is the b-th frequency component of term t in document d, $H_{d,t,b}$ and $\phi_{d,t,b}$ are the magnitude and phase of frequency component $v_{d,t,b}$, respectively, and i is $\sqrt{-1}$.
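The magnitude-phase form in (8) falls out of a complex frequency component directly; in Python, for instance:

```python
import cmath

def magnitude_phase(v):
    # Eq. (8): v = H * exp(i * phi); the polar form of v is (H, phi)
    H, phi = cmath.polar(v)
    return H, phi
```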
2.4 Calculate Magnitude Value

A relevant document should have large magnitudes; therefore more weight must be given to documents with more occurrences of the query terms. To calculate the effect of the query terms on a document, Sum Magnitudes $H^m_{d,b}$ takes into account only the magnitude part, $H_{d,t,b}$, of the frequency component $v_{d,t,b}$, while Sum Vectors $H^v_{d,b}$ also considers the phase part, $\phi_{d,t,b}$, of $v_{d,t,b}$.

Let T be the set of query terms; then the magnitude values using Sum Magnitudes, $H^m_{d,b}$, and Sum Vectors, $H^v_{d,b}$, are:
$$H^m_{d,b} = \sum_{t \in T} w_{q,t} H_{d,t,b} \qquad (9)$$

$$H^v_{d,b} = \left[ \left( \sum_{t \in T} w_{q,t} H_{d,t,b} \cos\phi_{d,t,b} \right)^{\!2} + \left( \sum_{t \in T} w_{q,t} H_{d,t,b} \sin\phi_{d,t,b} \right)^{\!2} \right]^{\frac{1}{2}} \qquad (10)$$
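Both magnitude combinations can be sketched directly from (9) and (10). Here `V` maps each query term to its complex spectrum and `wq` to its query weight; these container names are our own:

```python
def sum_magnitudes(wq, V, b):
    # Eq. (9): sum the weighted magnitudes, discarding phase
    return sum(wq[t] * abs(V[t][b]) for t in wq)

def sum_vectors(wq, V, b):
    # Eq. (10): sum the weighted complex components, then take the
    # magnitude; components with aligned phases reinforce each other
    return abs(sum(wq[t] * V[t][b] for t in wq))
```

With two terms whose components are 1 and i (equal magnitude, phases 90 degrees apart), Sum Magnitudes gives 2 while Sum Vectors gives only $\sqrt{2}$: phase disagreement lowers the latter.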
2.5 Calculate Phase Precision Value

In a relevant document, the phases of the terms matching query terms should be similar. There are three ways to examine the phase information of a term signal: Non-Zero phase precision $\bar\phi^n_{d,b}$, Zero phase precision $\bar\phi^z_{d,b}$, and No phase precision $\bar\phi^1_{d,b}$.
$$\bar\phi^z_{d,b} = \left[ \left( \sum_{t \in T;\, H_{d,t,b} \neq 0} \frac{\cos\phi_{d,t,b}}{\#(T)} \right)^{\!2} + \left( \sum_{t \in T;\, H_{d,t,b} \neq 0} \frac{\sin\phi_{d,t,b}}{\#(T)} \right)^{\!2} \right]^{\frac{1}{2}} \qquad (11)$$

$$\bar\phi^n_{d,b} = \left[ \left( \sum_{t \in T;\, H_{d,t,b} \neq 0} \frac{\cos\phi_{d,t,b}}{\#(\tilde T_{d,b})} \right)^{\!2} + \left( \sum_{t \in T;\, H_{d,t,b} \neq 0} \frac{\sin\phi_{d,t,b}}{\#(\tilde T_{d,b})} \right)^{\!2} \right]^{\frac{1}{2}} \qquad (12)$$

$$\bar\phi^1_{d,b} = 1 \qquad (13)$$
Zero phase precision $\bar\phi^z_{d,b}$ (11) only includes the phases of frequency components with nonzero magnitude, because a term with zero magnitude does not occur and its phase value can be left out. For each frequency component, the phase values of the document terms matching query terms are summed and averaged over the total number of query terms, $\#(T)$.

Non-Zero phase precision $\bar\phi^n_{d,b}$ (12) is similar to Zero phase precision, but instead of averaging over the total number of query terms $\#(T)$, it averages over the number of matching query terms with nonzero magnitude, $\#(\tilde T_{d,b})$. Note that $\tilde T_{d,b}$ is the set of terms matching query terms that have nonzero magnitude for frequency component b in document d.

The last method, No phase precision $\bar\phi^1_{d,b}$ (13), ignores any phase information. It is best used when the phase has already been taken into account in creating the magnitude vector, that is, when Sum Vectors is selected as the method to calculate document magnitudes. Under that assumption, the precision value is always set to one.
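Phase precisions (11) and (12) differ only in the averaging denominator. A sketch, where `phases` holds the $\phi_{d,t,b}$ of the query terms with nonzero magnitude in bin b:

```python
import cmath

def zero_phase_precision(phases, num_query_terms):
    # Eq. (11): average the unit phasors over ALL query terms #(T)
    z = sum(cmath.exp(1j * p) for p in phases) / num_query_terms
    return abs(z)

def nonzero_phase_precision(phases):
    # Eq. (12): average over only the matching terms present in this bin
    if not phases:
        return 0.0
    return zero_phase_precision(phases, len(phases))
```

Identical phases give precision 1 under (12); under (11), query terms missing from the bin still drag the value below 1.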
2.6 Calculate Score Value

After the magnitude and phase precision of the frequency components in each document have been obtained, the next step is to combine them into a document score vector by multiplying their values. The score value of each frequency component, $s_{d,b}$, is:

$$s_{d,b} = H_{d,b}\, \bar\phi_{d,b} \qquad (14)$$

Here, $H_{d,b}$ means that either Sum Magnitudes, $H^m_{d,b}$, or Sum Vectors, $H^v_{d,b}$, can be selected to calculate the magnitude value, while the phase precision value $\bar\phi_{d,b}$ is one of Non-Zero phase precision $\bar\phi^n_{d,b}$, Zero phase precision $\bar\phi^z_{d,b}$, or No phase precision $\bar\phi^1_{d,b}$.
To get the document score $S_d$, the frequency component scores in the document score vector are summed. Four methods are selected from [1] to perform this summation.

The first method, called Sum All Components, combines $s_{d,b}$ over all frequency components of document d. However, the Nyquist-Shannon sampling theorem [7] states that the highest frequency component to be found in a real signal is equal to half the sampling rate. This implies that if there are B frequency components of a term signal, then analyzing the signal only requires examining frequency components 1 to $\frac{B}{2}+1$ [1]. The zeroth component (DC component) is always the largest of all components and can therefore be ignored. Under this assumption, only half of the frequency component scores in the document score vector are needed and the document score becomes:

$$S_d = \sum_{b=1}^{\frac{B}{2}+1} s_{d,b} \qquad (15)$$
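Combining (14) and (15), and assuming the per-component magnitudes `H` and phase precisions `Phi` are already computed (taking the upper limit B/2 + 1 literally as written):

```python
def document_score(H, Phi, B):
    # Eq. (14): s_{d,b} = H_{d,b} * Phi_{d,b}
    # Eq. (15): sum components b = 1 .. B/2 + 1, skipping the DC term b = 0
    return sum(H[b] * Phi[b] for b in range(1, B // 2 + 2))
```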
If any elements of the document score vector contain high values, resulting from either the magnitude or the phase precision value, then that document should be considered more relevant to the query terms. The idea of the next three methods is to calculate the document score from the sum of only the two greatest frequency component scores in the document score vector.

Sum Largest Score Vector Components selects the two largest of the frequency component scores in the document score vector, because, based on the magnitude and phase precision information represented in the frequency components, the query terms inside that document should occur often and appear together. The condition is stated as:

$$s_{d,b_1}, s_{d,b_2} \leftarrow \max_{b = b_1, b_2} (s_{d,b}) \qquad (16)$$

Sum Largest Phase Precision Components selects the scores of the two frequency components in the document score vector that have the largest phase precision values, because a frequency component with a larger phase precision value indicates term signals whose positions are more similar to the query terms. The condition is stated as:

$$\bar\phi_{d,b_1}, \bar\phi_{d,b_2} \leftarrow \max_{b = b_1, b_2} (\bar\phi_{d,b}) \qquad (17)$$

Sum Largest Magnitude Components selects the scores of the two frequency components in the document score vector that have the largest magnitude values, because a document whose frequency components have larger magnitudes contains query terms that appear often. The condition is stated as:

$$H_{d,b_1}, H_{d,b_2} \leftarrow \max_{b = b_1, b_2} (H_{d,b}) \qquad (18)$$
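All three selection rules (16)-(18) pick the two frequency components that maximize some key vector: the scores themselves, the phase precisions, or the magnitudes. A generic sketch (helper name is ours):

```python
def two_largest_components(key_values):
    # Return indices b1, b2 of the two largest entries of the key vector
    # (scores for Eq. 16, phase precisions for Eq. 17, magnitudes for
    # Eq. 18); the document score is then s[b1] + s[b2].
    order = sorted(range(len(key_values)),
                   key=lambda b: key_values[b], reverse=True)
    return order[0], order[1]
```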
3 VSM vs FDS methods

Assume there are two documents, $d_1 = \{AABB\}$ and $d_2 = \{ABBA\}$, plus a query $q = \{AB\}$. Let $w_{q,t} = (1, 1)^T$; then for B = 4, $w_{d_1,t_A} = (1, 1, 0, 0)^T$, $w_{d_1,t_B} = (0, 0, 1, 1)^T$ and $w_{d_2,t_A} = (1, 0, 0, 1)^T$, $w_{d_2,t_B} = (0, 1, 1, 0)^T$.
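The binned count vectors above follow from mapping token position p of an n-token document to bin $\lfloor pB/n \rfloor$; a sketch of that binning (the helper is our own, chosen to be consistent with the $d_1$ and $d_2$ vectors):

```python
def bin_counts(tokens, term, B):
    # Count occurrences of `term` per location bin: the token at
    # position p of an n-token document falls into bin floor(p * B / n)
    n = len(tokens)
    counts = [0] * B
    for p, tok in enumerate(tokens):
        if tok == term:
            counts[p * B // n] += 1
    return counts
```

For example, `bin_counts(list("AABB"), "A", 4)` reproduces (1, 1, 0, 0).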
The calculation of the document scores for $d_1 = \{AABB\}$ and $d_2 = \{ABBA\}$ is shown in Table 2. Using Sum All Components (15), $S_{d_1,q} = 2.0048$ and $S_{d_2,q} = 2.6489$. Document $d_2$ is more similar to query q than $d_1$ because there are two occurrences of the phrase $\{AB\}$ in $d_2$, even though the second occurrence is in reverse order.
To calculate similarity using VSM, the cosine similarity measure [2] between a document and a query is a dot product defined as

$$S_{d,q} = \sum_{t \in T} w_{q,t} w_{d,t} \qquad (19)$$

With VSM, $w_{d_1,t} = (w_{d_1,t_A}, w_{d_1,t_B})^T = (2, 2)^T$ and $w_{d_2,t} = (2, 2)^T$. Calculating with VSM using (19), $S_{d,q}$ becomes:

$$S_{d_1,q} = S_{d_2,q} \qquad (20)$$

$$\begin{pmatrix}1\\1\end{pmatrix} \cdot \begin{pmatrix}2\\2\end{pmatrix} = \begin{pmatrix}1\\1\end{pmatrix} \cdot \begin{pmatrix}2\\2\end{pmatrix}$$
Using VSM, the similarity $S_{d_1,q}$ is equal to $S_{d_2,q}$, while using FDS the document scores differ slightly. Hence FDS gives a higher score to the document whose terms are arranged more like the query phrase.

Table 2. Calculation of document scores; each term cell is (magnitude, phase) of a frequency component

term            b = 0          b = 1          b = 2          b = 3
d1, tA          (2.00, 0.00)   (1.94, -0.25)  (1.76, -0.50)  (1.46, -0.75)
d1, tB          (2.00, 0.00)   (1.94, -1.25)  (1.76, -2.50)  (1.46, 2.53)
d2, tA          (2.00, 0.00)   (1.46, -0.75)  (0.14, -1.50)  (1.26, 0.89)
d2, tB          (2.00, 0.00)   (1.94, -0.75)  (1.76, -1.50)  (1.46, -2.25)
H^v_{d1,b}      4.0000         3.4012         1.8966         0.2070
phi^z_{d1,b}    0.5000         0.4388         0.2702         0.0354
H^v_{d2,b}      4.0000         3.4012         1.8966         0.2070
phi^z_{d2,b}    0.5000         0.5000         0.5000         0.0000
s_{d1,b}        2.0000         1.4924         0.5124         0.0073
s_{d2,b}        2.0000         1.7006         0.9483         0.0073
Next, a theorem shows that the VSM method is a special case of the FDS method with bin count B = 1.

Theorem 1. VSM is a special case of FDS where B = 1.
Proof. The first step is to gather the document terms into bins. With B = 1, the whole document is considered to be in one bin, therefore:

$$f_{d,t,b} = f_{d,t,0} = f_{d,t} \qquad (21)$$
When the document weighting is performed, the weight of each bin $w_{d,t,b}$ in this case is:

$$w_{d,t,0} = \frac{1 + \log_e f_{d,t,0}}{(1-s) + s \dfrac{W_d}{\mathrm{av}_{d \in D} W_d}} \quad \text{by (6)}$$
$$= \frac{1 + \log_e f_{d,t}}{(1-s) + s \dfrac{W_d}{\mathrm{av}_{d \in D} W_d}} \quad \text{by (21)}$$
$$= w_{d,t} \quad \text{by (5)} \qquad (22)$$

Then the Fourier transform with b = 0 gives

$$v_{d,t,0} = \sum_{b=0}^{0} w_{d,t,0}\, (\cos 0 - i \sin 0) \quad \text{by (7)}$$
$$= w_{d,t,0} = w_{d,t} \quad \text{by (22)} \qquad (23)$$

Equation (23) shows that the Fourier transform of a signal of length one is equal to itself. The mapping in (8) gives $v_{d,t,0} = H_{d,t,0} \exp(i\phi_{d,t,0})$ with $H_{d,t,0} = w_{d,t}$ and $\phi_{d,t,0} = 0$.
The magnitude value $H_{d,0}$ of frequency component $\omega = 0$ is then

$$H^m_{d,0} = \sum_{t \in T} w_{q,t} H_{d,t,0} \quad \text{by (9)}$$
$$= \sum_{t \in T} w_{q,t} w_{d,t} \qquad (24)$$

$$H^v_{d,0} = \left[ \left( \sum_{t \in T} w_{q,t} H_{d,t,0} \cos 0 \right)^{\!2} + \left( \sum_{t \in T} w_{q,t} H_{d,t,0} \sin 0 \right)^{\!2} \right]^{\frac{1}{2}} \quad \text{by (10)}$$
$$= \sum_{t \in T} w_{q,t} w_{d,t} \qquad (25)$$
Calculated using (11) and (12), and since the value of (13) is always equal to 1, the phase precision value of frequency component $\omega = 0$ is $\bar\phi_{d,0} = 1$. Because there is only one bin, the only applicable method to calculate the document score is (15):

$$S_d = s_{d,0} \quad \text{by (15)}$$
$$= H_{d,0}\, \bar\phi_{d,0} \quad \text{by (14)}$$
$$= \sum_{t \in T} w_{q,t} w_{d,t} \quad \text{by (24), (25)} \qquad (26)$$

The document score calculated with B = 1 bin in (26) equals the similarity measure between document and query given by VSM in (19). Therefore, VSM is a special case of the FDS method where B = 1.
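The key step (23) of the proof, that a length-one DFT is the identity, can also be checked numerically with the same direct transform as in (7) (a throwaway sketch):

```python
import cmath

def dft(w):
    # Direct DFT of Eq. (7)
    B = len(w)
    return [sum(w[b] * cmath.exp(-2j * cmath.pi * omega * b / B)
                for b in range(B))
            for omega in range(B)]

# A length-1 signal transforms to itself, so with B = 1 the FDS score
# collapses to the VSM dot product of Eq. (19)
spectrum = dft([2.0])
```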
4 Experiments
Before the documents were indexed, preprocessing was per-
formed using Oracle Text to extract terms and create an in-
verted index to enable quick retrieval of term vectors. This
preprocessing consists of removing stop words and stemming
using Porter stemming algorithm [8]. The stop word list con-
tained about 400 common English words including stop word
list provided by Oracle Text.
The experiments use eight bins, since [1] showed that eight bins provide the best precision. The document sets used here are part of the TREC English document collection [5]. The documents are articles from the Associated Press Newswire (1988), disk 2, from which N = 500 documents were selected. The queries correspond to the ad-hoc tasks of TREC-1 (queries 51 to 80). For the query terms, only the terms appearing in the title section of queries 51 to 80 were selected; the titles consist of only a few terms, similar to the queries a user issues when searching for articles. The precision for relevant documents is shown at 5, 10, 15, and 20 documents retrieved (i.e., the value at 5 is the ratio of relevant documents among the 5 retrieved documents). Since the TREC English document collection is provided with sets of queries and relevance lists for testing purposes, relevant and non-relevant documents can be distinguished.

Table 3. Method configuration for experiments, fds-x-y.bn

Table 4. Deriving VSM precision from FDS methods
The FDS methods performed were combinations of the magnitude, phase precision, and component summation methods shown in Table 3. Note that the x value refers to the combination of magnitude and phase calculation methods, the y value refers to the document score calculation method, and n is the number of bins used for the document score calculation.

In Section 3 it is proven that VSM is a special case of the FDS method where B = 1 and the only way to calculate the document score is Sum All Components. To measure the capability of VSM expressed as FDS, and to show that the use of spatial information improves searching, experiments with the various scenarios fds-x-1.b1 are reported in Table 4. To compare the capability of VSM and FDS, the average value of the methods fds-x-1.b1 is taken as the value of vsm-x-1.b1.
The results in Table 5 show that FDS with certain combinations does boost the precision of document search compared to VSM. Over half of the methods appearing in the top 20 results use Sum Vectors, $H^v_{d,b}$, to calculate the magnitude value (x = 1, fds-1-y.bn, or x = 5, fds-5-y.bn). The combination FDS methods that beat VSM mostly use Sum Largest Score Vector Components (y = 4, fds-x-4.bn) for the summation of frequency components. Even though the Nyquist-Shannon sampling theorem [7] states that the highest frequency component to be found in a real signal is equal to half the sampling rate, Table 5 also shows that the precision of fds-x-y.b5 is almost the same as that of fds-x-y.b8, with fds-x-y.b8 giving somewhat better results in these experiments.
5 Conclusions

The results, both analytical and experimental, show that FDS is a superior method because it makes use of the spatial information within a document rather than only the count of each query term. Analytically, it is proven that the existing vector space similarity methods are a special case of FDS. This paper also shows that FDS improves the accuracy of search results even on a small data set.
6 Future Work

The Web, which can be considered a huge database of documents, has become so popular that its content has grown to more than a billion documents. Internet search engines have become an essential tool for locating resources and information on the Web. But there is a characteristic that distinguishes documents in traditional IR systems from those on the Web: the hyperlink. FDS is a ranking method based on document content. To improve the quality of search results, a technique that exploits the additional information inherent in the hyperlink structure of Web documents is necessary. The analysis of the hyperlink structure could determine a popularity score for Web documents; the content score, computed with the FDS method, would then be combined with the popularity score to determine an overall score for each relevant document.
Table 5. Best 20 document retrieval methods ordered by precision

References

[1] Laurence A. F. Park, Kotagiri Ramamohanarao, and Marimuthu Palaniswami. Fourier domain scoring: A novel document ranking method. IEEE Transactions on Knowledge and Data Engineering, 16(5):529-539, 2004.

[2] Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto. Modern Information Retrieval. ACM Press / Addison-Wesley, 1999.

[3] Oracle Technology Network. Oracle Text. http://www.oracle.com/technology/products/text/index.html.

[4] Laurence A. F. Park, Marimuthu Palaniswami, and Kotagiri Ramamohanarao. Internet document filtering using Fourier domain scoring. In Principles of Data Mining and Knowledge Discovery, pages 362-373. Springer-Verlag, 2001.

[5] National Institute of Standards and Technology. Text Retrieval Conference: Data - English Documents. http://trec.nist.gov/data/docs_eng.html.

[6] Justin Zobel and Alistair Moffat. Exploring the similarity space. SIGIR Forum, 32(1):18-34, 1998.

[7] Wikipedia. Nyquist-Shannon Sampling Theorem. http://en.wikipedia.org/wiki/Nyquist-Shannon_sampling_theorem.

[8] William B. Frakes and Ricardo Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice Hall, 1992.