$$d_1 = [(\langle 2\rangle\langle b_0,1\rangle\langle b_1,1\rangle)_{t_1=\mathrm{cat}},\ (\langle 1\rangle\langle b_0,1\rangle)_{t_2=\mathrm{hat}}]^T$$

$$d_2 = [(\langle 1\rangle\langle b_0,2\rangle)_{t_1=\mathrm{cat}},\ (\langle 1\rangle\langle b_1,1\rangle)_{t_3=\mathrm{fine}},\ (\langle 1\rangle\langle b_1,1\rangle)_{t_4=\mathrm{pet}}]^T$$

$$d_3 = [(\langle 1\rangle\langle b_0,1\rangle)_{t_1=\mathrm{cat}},\ (\langle 1\rangle\langle b_1,1\rangle)_{t_4=\mathrm{pet}},\ (\langle 1\rangle\langle b_0,1\rangle)_{t_5=\mathrm{dog}},\ (\langle 1\rangle\langle b_0,1\rangle)_{t_6=\mathrm{make}},\ (\langle 1\rangle\langle b_1,1\rangle)_{t_7=\mathrm{good}}]^T \qquad (2)$$
2.2 Calculate Term Weights
Term weights are used to compute the degree of similarity between each document in the collection and the user query. Let t be an index term in document d and let B be the chosen number of bins. A weight $w_{d,t,b} > 0$ is associated with the number of occurrences of term t inside document d in location bin b, where $b \in \{0, \ldots, B-1\}$. For each term t inside document d, the weight vector is $w_{d,t} = [w_{d,t,0}, \ldots, w_{d,t,B-1}]^T$. This weight vector $w_{d,t}$ quantifies the importance of index term t for describing the semantic content of document d. Using this definition, and assuming the weight value equals the term count, the weight vectors for the terms CAT and HAT in $d_1$ are $w_{d_1,t_1} = [\langle b_0, 1\rangle, \langle b_1, 1\rangle]^T$ and $w_{d_1,t_2} = [\langle b_0, 1\rangle]^T$, respectively. This is only the basic definition of a term weight; in practice, the weight value is not just the term count but is modified as explained next.
FDS requires the weights to be applied before performing the Fourier transform. FDS [1] implements the BD-ACI-BCA scheme [6] for calculating the similarity measure between documents and queries; the document-term weight is denoted by $w_{d,t}$ (in position xx-Axx-xxx) and the query-term weight by $w_{q,t}$ (in position xx-xxx-Bxx), as shown in the equations below:

$$w_{d,t} = 1 + \log_e f_{d,t} \qquad (3)$$

$$w_{q,t} = (1 + \log_e f_{q,t}) \log_e\!\left(1 + \frac{f_m}{f_t}\right) \qquad (4)$$

where $f_{d,t}$ and $f_{q,t}$ are the counts of term t in document d and in query q, respectively, $f_t$ is the number of documents containing term t, and $f_m$ is the largest $f_t$.
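As a concrete illustration, the two weighting formulas (3) and (4) can be sketched in a few lines of Python (the function names are ours, not part of FDS or [6]):

```python
import math

def doc_term_weight(f_dt):
    # Eq. (3): w_{d,t} = 1 + ln(f_{d,t}), for f_{d,t} >= 1
    return 1.0 + math.log(f_dt)

def query_term_weight(f_qt, f_t, f_m):
    # Eq. (4): w_{q,t} = (1 + ln(f_{q,t})) * ln(1 + f_m / f_t);
    # rarer terms (smaller f_t) receive a larger weight
    return (1.0 + math.log(f_qt)) * math.log(1.0 + f_m / f_t)
```

Note that a term occurring once in a document gets weight 1, and a query term present in every document ($f_t = f_m$) still gets the nonzero weight $\ln 2$.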
With Oracle Text as the preprocessing tool, the value of $f_t$, and hence the value of $f_m$, can be retrieved from the token count column of Table 1(a). With some SQL scripting, the value of $f_{d,t}$ is obtained from Table 1(b).
In document collections, document length often varies. For example, a short document that is totally relevant to the query and a long document that is only partially relevant may contain the same number of query-term occurrences. These two documents should not be ranked equally; the shorter one should be ranked higher. Normalization of term weights is therefore used to remove the advantage that long documents would otherwise have over short documents in retrieval. Document length normalization penalizes the term weights of a document in accordance with its length.
The BD-ACI-BCA scheme [6] normalizes the document-term weight $w_{d,t}$ by its document length, or document vector norm, $W_d$ (in position xx-xxI-xxx), and $w_{d,t}$ in (3) becomes

$$w_{d,t} = \frac{1 + \log_e f_{d,t}}{(1-s) + s \dfrac{W_d}{\mathrm{av}_{d \in D} W_d}} \qquad (5)$$

where s is the slope factor (set to 0.7) and $\mathrm{av}_{d \in D} W_d$ is the average document vector norm over the document collection D. Because FDS captures the location of terms within a document through the bin value b, $w_{d,t}$ must take the bin value into account, which modifies (5) into

$$w_{d,t,b} = \frac{1 + \log_e f_{d,t,b}}{(1-s) + s \dfrac{W_d}{\mathrm{av}_{d \in D} W_d}} \qquad (6)$$

where $f_{d,t,b}$ is the count of term t in bin b of document d and $b \in \{0, 1, \ldots, B-1\}$.
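A minimal sketch of (6), assuming the document vector norm $W_d$ and its collection average are precomputed (names are ours):

```python
import math

def binned_weight(f_dtb, W_d, avg_W, s=0.7):
    # Eq. (6): length-normalised weight of term t in bin b of document d
    if f_dtb == 0:
        return 0.0  # term absent from this bin
    return (1.0 + math.log(f_dtb)) / ((1.0 - s) + s * W_d / avg_W)
```

For a document of average length ($W_d = \mathrm{av}\,W_d$) the denominator is 1, so the weight reduces to the plain $1 + \log_e f_{d,t,b}$ of (3); longer documents are penalized.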
2.3 Perform Fourier Transform

The number sequence $w_{d,t,0}, \ldots, w_{d,t,B-1}$ is still information in the time or spatial domain and needs to be transformed into the frequency domain. The Fourier transform defines the relationship between a signal in the time or spatial domain and its representation in the frequency domain, known as the (Fourier) spectrum. A spectrum is made up of a number of frequency components, each with a real and an imaginary part; equivalently, each frequency component has a magnitude and an associated phase representing the same information. The discrete form of the Fourier transform is [1]:

$$v_{d,t,\omega} = \sum_{b=0}^{B-1} w_{d,t,b} \left( \cos\frac{2\pi\omega b}{B} - i \sin\frac{2\pi\omega b}{B} \right) \qquad (7)$$

where $v_{d,t,\omega}$ is the projection of the term signal $w_{d,t}$ onto a sinusoidal wave of frequency $\omega$. The spectral component number $\omega$ is an element of the set $\{0, \ldots, B-1\}$.
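Equation (7) is the standard discrete Fourier transform of the length-B term signal; a direct transcription follows (in practice one would call an FFT library instead):

```python
import cmath

def term_spectrum(w):
    # Eq. (7): v_{d,t,omega} = sum_b w_b (cos(2*pi*omega*b/B) - i sin(2*pi*omega*b/B))
    #                        = sum_b w_b * exp(-2*pi*i*omega*b/B)
    B = len(w)
    return [sum(w[b] * cmath.exp(-2j * cmath.pi * omega * b / B)
                for b in range(B))
            for omega in range(B)]
```

For the signal (1, 1, 0, 0) of term A in the Section 3 example, component 0 is 2 (the total count) and component 1 is $1 - i$.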
The discrete Fourier transform thus produces the following mapping [1]:

$$\left[ w_{d,t,b} \right] \mapsto \left[ v_{d,t,b} \right] = \left[ H_{d,t,b} \exp(i\phi_{d,t,b}) \right] \qquad (8)$$

where $v_{d,t,b}$ is the b-th frequency component of term t in document d, $H_{d,t,b}$ and $\phi_{d,t,b}$ are the magnitude and phase of frequency component $v_{d,t,b}$, respectively, and i is $\sqrt{-1}$.
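The magnitude-phase form in (8) falls out of a complex frequency component directly; in Python, for instance:

```python
import cmath

def magnitude_phase(v):
    # Eq. (8): v = H * exp(i * phi); the polar form of v is (H, phi)
    H, phi = cmath.polar(v)
    return H, phi
```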
2.4 Calculate Magnitude Value

A relevant document should have large magnitudes; therefore more weight must be given to documents with more occurrences of the query terms. To calculate the effect of the query terms on a document, Sum Magnitudes $H^m_{d,b}$ takes into account only the magnitude part, $H_{d,t,b}$, of the frequency component $v_{d,t,b}$, while Sum Vectors $H^v_{d,b}$ also considers the phase part, $\phi_{d,t,b}$, of $v_{d,t,b}$.

Let T be the set of query terms; then the magnitude values using Sum Magnitudes, $H^m_{d,b}$, and Sum Vectors, $H^v_{d,b}$, are:
$$H^m_{d,b} = \sum_{t \in T} w_{q,t} H_{d,t,b} \qquad (9)$$

$$H^v_{d,b} = \left[ \left( \sum_{t \in T} w_{q,t} H_{d,t,b} \cos\phi_{d,t,b} \right)^{\!2} + \left( \sum_{t \in T} w_{q,t} H_{d,t,b} \sin\phi_{d,t,b} \right)^{\!2} \right]^{\frac{1}{2}} \qquad (10)$$
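Both magnitude combinations can be sketched directly from (9) and (10). Here `V` maps each query term to its complex spectrum and `wq` to its query weight; these container names are our own:

```python
def sum_magnitudes(wq, V, b):
    # Eq. (9): sum the weighted magnitudes, discarding phase
    return sum(wq[t] * abs(V[t][b]) for t in wq)

def sum_vectors(wq, V, b):
    # Eq. (10): sum the weighted complex components, then take the
    # magnitude; components with aligned phases reinforce each other
    return abs(sum(wq[t] * V[t][b] for t in wq))
```

With two terms whose components are 1 and i (equal magnitude, phases 90 degrees apart), Sum Magnitudes gives 2 while Sum Vectors gives only $\sqrt{2}$: phase disagreement lowers the latter.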
2.5 Calculate Phase Precision Value

In a relevant document, the phases of the terms matching query terms should be similar. There are three ways to examine the phase information of a term signal: Non-Zero phase precision $\bar\phi^n_{d,b}$, Zero phase precision $\bar\phi^z_{d,b}$, and No phase precision $\bar\phi^1_{d,b}$.
$$\bar\phi^z_{d,b} = \left[ \left( \sum_{t \in T;\, H_{d,t,b} \neq 0} \frac{\cos\phi_{d,t,b}}{\#(T)} \right)^{\!2} + \left( \sum_{t \in T;\, H_{d,t,b} \neq 0} \frac{\sin\phi_{d,t,b}}{\#(T)} \right)^{\!2} \right]^{\frac{1}{2}} \qquad (11)$$

$$\bar\phi^n_{d,b} = \left[ \left( \sum_{t \in T;\, H_{d,t,b} \neq 0} \frac{\cos\phi_{d,t,b}}{\#(\tilde T_{d,b})} \right)^{\!2} + \left( \sum_{t \in T;\, H_{d,t,b} \neq 0} \frac{\sin\phi_{d,t,b}}{\#(\tilde T_{d,b})} \right)^{\!2} \right]^{\frac{1}{2}} \qquad (12)$$

$$\bar\phi^1_{d,b} = 1 \qquad (13)$$
Zero phase precision $\bar\phi^z_{d,b}$ (11) only includes the phases of frequency components with nonzero magnitude, because a term with zero magnitude does not occur and its phase value can be left out. For each frequency component, the phase values of the document terms matching query terms are summed and averaged over the total number of query terms, $\#(T)$.

Non-Zero phase precision $\bar\phi^n_{d,b}$ (12) is similar to Zero phase precision, but instead of averaging over the total number of query terms $\#(T)$, it averages over the number of matching query terms with nonzero magnitude, $\#(\tilde T_{d,b})$. Note that $\tilde T_{d,b}$ is the set of terms matching query terms that have nonzero magnitude for frequency component b in document d.

The last method, No phase precision $\bar\phi^1_{d,b}$ (13), ignores any phase information. It is best used when the phase has already been taken into account in creating the magnitude vector, that is, when Sum Vectors is selected as the method to calculate document magnitudes. Under that assumption, the precision value is always set to one.
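Phase precisions (11) and (12) differ only in the averaging denominator. A sketch, where `phases` holds the $\phi_{d,t,b}$ of the query terms with nonzero magnitude in bin b:

```python
import cmath

def zero_phase_precision(phases, num_query_terms):
    # Eq. (11): average the unit phasors over ALL query terms #(T)
    z = sum(cmath.exp(1j * p) for p in phases) / num_query_terms
    return abs(z)

def nonzero_phase_precision(phases):
    # Eq. (12): average over only the matching terms present in this bin
    if not phases:
        return 0.0
    return zero_phase_precision(phases, len(phases))
```

Identical phases give precision 1 under (12); under (11), query terms missing from the bin still drag the value below 1.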
2.6 Calculate Score Value

After the magnitude and phase precision of the frequency components in each document have been obtained, the next step is to combine them into a document score vector by multiplying their values. The score value of each frequency component, $s_{d,b}$, is:

$$s_{d,b} = H_{d,b}\, \bar\phi_{d,b} \qquad (14)$$

Here, $H_{d,b}$ means that either Sum Magnitudes, $H^m_{d,b}$, or Sum Vectors, $H^v_{d,b}$, can be selected to calculate the magnitude value, while the phase precision value $\bar\phi_{d,b}$ is one of Non-Zero phase precision $\bar\phi^n_{d,b}$, Zero phase precision $\bar\phi^z_{d,b}$, or No phase precision $\bar\phi^1_{d,b}$.
To get the document score $S_d$, the frequency component scores in the document score vector are summed. Four methods are selected from [1] to perform this summation.

The first method, called Sum All Components, combines $s_{d,b}$ over all frequency components of document d. However, the Nyquist-Shannon sampling theorem [7] states that the highest frequency component to be found in a real signal is equal to half the sampling rate. This implies that if there are B frequency components of a term signal, then analyzing the signal only requires examining frequency components 1 to $\frac{B}{2}+1$ [1]. The zeroth component (DC component) is always the largest of all components and can therefore be ignored. Under this assumption, only half of the frequency component scores in the document score vector are needed and the document score becomes:

$$S_d = \sum_{b=1}^{\frac{B}{2}+1} s_{d,b} \qquad (15)$$
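Combining (14) and (15), and assuming the per-component magnitudes `H` and phase precisions `Phi` are already computed (taking the upper limit B/2 + 1 literally as written):

```python
def document_score(H, Phi, B):
    # Eq. (14): s_{d,b} = H_{d,b} * Phi_{d,b}
    # Eq. (15): sum components b = 1 .. B/2 + 1, skipping the DC term b = 0
    return sum(H[b] * Phi[b] for b in range(1, B // 2 + 2))
```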
If any elements of the document score vector contain high values, resulting from either the magnitude or the phase precision value, then that document should be considered more relevant to the query terms. The idea of the next three methods is to calculate the document score from the sum of only the two greatest frequency component scores in the document score vector.

Sum Largest Score Vector Components selects the two largest of the frequency component scores in the document score vector, because, based on the magnitude and phase precision information represented in the frequency components, the query terms inside that document should occur often and appear together. The condition is stated as:

$$s_{d,b_1}, s_{d,b_2} \leftarrow \max_{b = b_1, b_2} (s_{d,b}) \qquad (16)$$

Sum Largest Phase Precision Components selects the scores of the two frequency components in the document score vector that have the largest phase precision values, because a frequency component with a larger phase precision value indicates term signals whose positions are more similar to the query terms. The condition is stated as:

$$\bar\phi_{d,b_1}, \bar\phi_{d,b_2} \leftarrow \max_{b = b_1, b_2} (\bar\phi_{d,b}) \qquad (17)$$

Sum Largest Magnitude Components selects the scores of the two frequency components in the document score vector that have the largest magnitude values, because a document whose frequency components have larger magnitudes contains query terms that appear often. The condition is stated as:

$$H_{d,b_1}, H_{d,b_2} \leftarrow \max_{b = b_1, b_2} (H_{d,b}) \qquad (18)$$
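All three selection rules (16)-(18) pick the two frequency components that maximize some key vector: the scores themselves, the phase precisions, or the magnitudes. A generic sketch (helper name is ours):

```python
def two_largest_components(key_values):
    # Return indices b1, b2 of the two largest entries of the key vector
    # (scores for Eq. 16, phase precisions for Eq. 17, magnitudes for
    # Eq. 18); the document score is then s[b1] + s[b2].
    order = sorted(range(len(key_values)),
                   key=lambda b: key_values[b], reverse=True)
    return order[0], order[1]
```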
3 VSM vs FDS methods

Assume there are two documents, $d_1 = \{AABB\}$ and $d_2 = \{ABBA\}$, plus a query $q = \{AB\}$. Let $w_{q,t} = (1, 1)^T$; then for B = 4, $w_{d_1,t_A} = (1, 1, 0, 0)^T$, $w_{d_1,t_B} = (0, 0, 1, 1)^T$ and $w_{d_2,t_A} = (1, 0, 0, 1)^T$, $w_{d_2,t_B} = (0, 1, 1, 0)^T$.
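The binned count vectors above follow from mapping token position p of an n-token document to bin $\lfloor pB/n \rfloor$; a sketch of that binning (the helper is our own, chosen to be consistent with the $d_1$ and $d_2$ vectors):

```python
def bin_counts(tokens, term, B):
    # Count occurrences of `term` per location bin: the token at
    # position p of an n-token document falls into bin floor(p * B / n)
    n = len(tokens)
    counts = [0] * B
    for p, tok in enumerate(tokens):
        if tok == term:
            counts[p * B // n] += 1
    return counts
```

For example, `bin_counts(list("AABB"), "A", 4)` reproduces (1, 1, 0, 0).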
The calculation of the document scores for $d_1 = \{AABB\}$ and $d_2 = \{ABBA\}$ is shown in Table 2. Using Sum All Components (15), $S_{d_1,q} = 2.0048$ and $S_{d_2,q} = 2.6489$. Document $d_2$ is more similar to query q than $d_1$ because there are two occurrences of the phrase $\{AB\}$ in $d_2$, even though the second occurrence is in reverse order.
To calculate similarity using VSM, the cosine similarity measure [2] between a document and a query is a dot product defined as

$$S_{d,q} = \sum_{t \in T} w_{q,t} w_{d,t} \qquad (19)$$

With VSM, $w_{d_1,t} = (w_{d_1,t_A}, w_{d_1,t_B})^T = (2, 2)^T$ and $w_{d_2,t} = (2, 2)^T$. Calculating with VSM using (19), $S_{d,q}$ becomes:

$$S_{d_1,q} = S_{d_2,q} \qquad (20)$$

$$\begin{pmatrix}1\\1\end{pmatrix} \cdot \begin{pmatrix}2\\2\end{pmatrix} = \begin{pmatrix}1\\1\end{pmatrix} \cdot \begin{pmatrix}2\\2\end{pmatrix}$$
Using VSM, the similarity $S_{d_1,q}$ is equal to $S_{d_2,q}$, while using FDS the document scores differ slightly. Hence FDS gives a higher score to the document whose terms are arranged more like the query phrase.

Table 2. Calculation of document scores; each term cell is (magnitude, phase) of a frequency component

term            b = 0          b = 1          b = 2          b = 3
d1, tA          (2.00, 0.00)   (1.94, -0.25)  (1.76, -0.50)  (1.46, -0.75)
d1, tB          (2.00, 0.00)   (1.94, -1.25)  (1.76, -2.50)  (1.46, 2.53)
d2, tA          (2.00, 0.00)   (1.46, -0.75)  (0.14, -1.50)  (1.26, 0.89)
d2, tB          (2.00, 0.00)   (1.94, -0.75)  (1.76, -1.50)  (1.46, -2.25)
H^v_{d1,b}      4.0000         3.4012         1.8966         0.2070
phi^z_{d1,b}    0.5000         0.4388         0.2702         0.0354
H^v_{d2,b}      4.0000         3.4012         1.8966         0.2070
phi^z_{d2,b}    0.5000         0.5000         0.5000         0.0000
s_{d1,b}        2.0000         1.4924         0.5124         0.0073
s_{d2,b}        2.0000         1.7006         0.9483         0.0073
Next, a theorem shows that the VSM method is a special case of the FDS method with bin count B = 1.

Theorem 1. VSM is a special case of FDS where B = 1.
Proof. The first step is to gather the document terms into bins. With B = 1, the whole document is considered to be in one bin, therefore:

$$f_{d,t,b} = f_{d,t,0} = f_{d,t} \qquad (21)$$
When the document weighting is performed, the weight of each bin $w_{d,t,b}$ in this case is:

$$w_{d,t,0} = \frac{1 + \log_e f_{d,t,0}}{(1-s) + s \dfrac{W_d}{\mathrm{av}_{d \in D} W_d}} \quad \text{by (6)}$$
$$= \frac{1 + \log_e f_{d,t}}{(1-s) + s \dfrac{W_d}{\mathrm{av}_{d \in D} W_d}} \quad \text{by (21)}$$
$$= w_{d,t} \quad \text{by (5)} \qquad (22)$$

Then the Fourier transform with b = 0 gives

$$v_{d,t,0} = \sum_{b=0}^{0} w_{d,t,0}\, (\cos 0 - i \sin 0) \quad \text{by (7)}$$
$$= w_{d,t,0} = w_{d,t} \quad \text{by (22)} \qquad (23)$$

Equation (23) shows that the Fourier transform of a signal of length one is equal to itself. The mapping in (8) gives $v_{d,t,0} = H_{d,t,0} \exp(i\phi_{d,t,0})$ with $H_{d,t,0} = w_{d,t}$ and $\phi_{d,t,0} = 0$.
The magnitude value $H_{d,0}$ of frequency component $\omega = 0$ is then

$$H^m_{d,0} = \sum_{t \in T} w_{q,t} H_{d,t,0} \quad \text{by (9)}$$
$$= \sum_{t \in T} w_{q,t} w_{d,t} \qquad (24)$$

$$H^v_{d,0} = \left[ \left( \sum_{t \in T} w_{q,t} H_{d,t,0} \cos 0 \right)^{\!2} + \left( \sum_{t \in T} w_{q,t} H_{d,t,0} \sin 0 \right)^{\!2} \right]^{\frac{1}{2}} \quad \text{by (10)}$$
$$= \sum_{t \in T} w_{q,t} w_{d,t} \qquad (25)$$
Calculated using (11) and (12), and since the value of (13) is always equal to 1, the phase precision value of frequency component $\omega = 0$ is $\bar\phi_{d,0} = 1$. Because there is only one bin, the only applicable method to calculate the document score is (15):

$$S_d = s_{d,0} \quad \text{by (15)}$$
$$= H_{d,0}\, \bar\phi_{d,0} \quad \text{by (14)}$$
$$= \sum_{t \in T} w_{q,t} w_{d,t} \quad \text{by (24), (25)} \qquad (26)$$

The document score calculated with B = 1 bin in (26) equals the similarity measure between document and query given by VSM in (19). Therefore, VSM is a special case of the FDS method where B = 1.
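The key step (23) of the proof, that a length-one DFT is the identity, can also be checked numerically with the same direct transform as in (7) (a throwaway sketch):

```python
import cmath

def dft(w):
    # Direct DFT of Eq. (7)
    B = len(w)
    return [sum(w[b] * cmath.exp(-2j * cmath.pi * omega * b / B)
                for b in range(B))
            for omega in range(B)]

# A length-1 signal transforms to itself, so with B = 1 the FDS score
# collapses to the VSM dot product of Eq. (19)
spectrum = dft([2.0])
```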
4 Experiments
Before the documents were indexed, preprocessing was per-
formed using Oracle Text to extract terms and create an in-
verted index to enable quick retrieval of term vectors. This
preprocessing consists of removing stop words and stemming
using Porter stemming algorithm [8]. The stop word list con-
tained about 400 common English words including stop word
list provided by Oracle Text.
The experiments use eight bins, since [1] showed that eight bins provide the best precision. The document sets used here are part of the TREC English document collection [5]. The documents are articles from the Associated Press Newswire (1988), disk 2, from which N = 500 documents were selected. The queries correspond to the ad-hoc tasks of TREC-1 (queries 51 to 80). For the query terms, only the terms appearing in the title section of queries 51 to 80 were selected; the titles consist of only a few terms, similar to the queries a user issues when searching for articles. The precision for relevant documents is shown at 5, 10, 15, and 20 documents retrieved (i.e., the value at 5 is the ratio of relevant documents among the 5 retrieved documents). Since the TREC English document collection is provided with sets of queries and relevance lists for testing purposes, relevant and non-relevant documents can be distinguished.

Table 3. Method configuration for experiments, fds-x-y.bn

Table 4. Deriving VSM precision from FDS methods
The FDS methods performed were combinations of the magnitude, phase precision, and component summation methods shown in Table 3. Note that the x value refers to the combination of magnitude and phase calculation methods, the y value refers to the document score calculation method, and n is the number of bins used for the document score calculation.

In Section 3 it is proven that VSM is a special case of the FDS method where B = 1 and the only way to calculate the document score is Sum All Components. To measure the capability of VSM expressed as FDS, and to show that the use of spatial information improves searching, experiments with the various scenarios fds-x-1.b1 are reported in Table 4. To compare the capability of VSM and FDS, the average value of the methods fds-x-1.b1 is taken as the value of vsm-x-1.b1.
The results in Table 5 show that FDS with certain combinations does boost the precision of document search compared to VSM. Over half of the methods appearing in the top 20 results use Sum Vectors, $H^v_{d,b}$, to calculate the magnitude value (x = 1, fds-1-y.bn, or x = 5, fds-5-y.bn). The combination FDS methods that beat VSM mostly use Sum Largest Score Vector Components (y = 4, fds-x-4.bn) for the summation of frequency components. Even though the Nyquist-Shannon sampling theorem [7] states that the highest frequency component to be found in a real signal is equal to half the sampling rate, Table 5 also shows that the precision of fds-x-y.b5 is almost the same as that of fds-x-y.b8, with fds-x-y.b8 giving somewhat better results in these experiments.
5 Conclusions

The results, both analytical and experimental, show that FDS is a superior method because it makes use of the spatial information within a document rather than only the count of each query term. Analytically, it is proven that the existing vector space similarity methods are a special case of FDS. This paper also shows that FDS improves the accuracy of search results even on a small data set.
6 Future Work

The Web, which can be considered a huge database of documents, has become so popular that its content has grown to more than a billion documents. Internet search engines have become an essential tool for locating resources and information on the Web. But there is a characteristic that distinguishes documents in traditional IR systems from those on the Web: the hyperlink. FDS is a ranking method based on document content. To improve the quality of search results, a technique that exploits the additional information inherent in the hyperlink structure of Web documents is necessary. The analysis of the hyperlink structure could determine a popularity score for Web documents; the content score, computed with the FDS method, would then be combined with the popularity score to determine an overall score for each relevant document.
Table 5. Best 20 document retrieval methods ordered by precision

References

[1] Laurence A. F. Park, Kotagiri Ramamohanarao, and Marimuthu Palaniswami. Fourier domain scoring: A novel document ranking method. IEEE Transactions on Knowledge and Data Engineering, 16(5):529-539, 2004.

[2] Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto. Modern Information Retrieval. ACM Press / Addison-Wesley, 1999.

[3] Oracle Technology Network. Oracle Text. http://www.oracle.com/technology/products/text/index.html.

[4] Laurence A. F. Park, Marimuthu Palaniswami, and Kotagiri Ramamohanarao. Internet document filtering using Fourier domain scoring. In Principles of Data Mining and Knowledge Discovery, pages 362-373. Springer-Verlag, 2001.

[5] National Institute of Standards and Technology. Text Retrieval Conference: Data - English Documents. http://trec.nist.gov/data/docs_eng.html.

[6] Justin Zobel and Alistair Moffat. Exploring the similarity space. SIGIR Forum, 32(1):18-34, 1998.

[7] Wikipedia. Nyquist-Shannon Sampling Theorem. http://en.wikipedia.org/wiki/Nyquist-Shannon_sampling_theorem.

[8] William B. Frakes and Ricardo Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice Hall, 1992.