providing techniques to find out the form of certain term-weight formulas;

providing techniques to determine the constants embedded in the formulas for computing the weight of a term; and

providing experimental results to illustrate how our techniques can be utilized to "discover" the similarity computation of the WebCrawler search engine.

In a recent paper [3], the problem of discovering the language model for a text database is addressed. A language model describes the words or indexing terms that occur in the database and frequency information indicating how often each term occurs [3]. A method is presented that uses a query-based sampling approach. It is shown that a database selection service (global search engine) can learn the language model of an (uncooperative) database by sampling the contents of the database via the process of running carefully selected queries and retrieving documents. In contrast, the problem we are addressing in this paper is to determine how an (uncooperative) search engine (database) assigns similarities to documents with respect to a query. The techniques we developed for discovering term-weight formulas are also query-based.

The rest of this paper is structured as follows. Section 2 describes the various factors that affect the weight of a term. Section 3 discusses how to determine the constants embedded in the formulas for computing the weight of a term. We assume that the system provides only the rank information of those documents retrieved, i.e., the similarity values of the retrieved documents are not given. Section 4 shows how the mathematical expressions for computing the different term weight components may be derived from the results of probe queries. Experimental results are presented in Section 5. We

document frequency of term $t$. However, variations of this idf definition are not uncommon and frequently involve some adjustable constants.

Document length normalization: A longer document tends to have more terms and/or terms with higher term frequencies. This means that document retrieval would be biased in favor of long documents. To compensate for the document length, term weights are frequently normalized or divided by a factor involving the document length. Although less common, this document length normalization factor may also involve certain constants.

Note that the query term weights may also be computed using a query length normalization factor. However, as the query length normalization factor is contained in each query term weight, it can be factored out. Thus, with respect to a query, the relative ranking of any two documents is unchanged whether the query term weights contain the query length normalization factor or not. We shall not discuss the discovery of constants embedded in the query length normalization factor.

Suppose that the term frequency weight of a term $t$ in a query $Q$ contains the unknown constants $k_1, \ldots, k_p$; the term frequency weight of a term $t$ in a document $\vec{d}$ has the unknown constants $a_1, \ldots, a_r$; the idf weight of term $t$ involves the unknown constants $b_1, \ldots, b_s$; and the document length normalization factor makes use of the unknown constants $c_1, \ldots, c_h$. We shall denote the term frequency weight of a term $t$ in query $Q$ by $qtf_t(Q, K)$, the term frequency weight of $t$ in $\vec{d}$ by $dtf_t(\vec{d}, A)$, the idf weight of $t$ by $idf_t(B)$ and the document length normalization factor by $norm(\vec{d}, C)$, where $K = \{k_1, \ldots, k_p\}$, $A = \{a_1, \ldots, a_r\}$, $B = \{b_1, \ldots, b_s\}$ and $C = \{c_1, \ldots, c_h\}$. Below, we give some examples of term frequency weight formulas. More examples of formulas of the term weight components can be found in [7, 10, 19].

Example 1. (i) Term frequency weight of a term $t$ in a query $Q$, $qtf_t(Q, K)$:

1. $k_1 + k_2 \cdot tf_t(Q)$ ($k_2 > 0$)

2. $(tf_t(Q))^{k_1}$ ($k_1 > 0$)

3. $k_1 + \log(tf_t(Q) + k_2)$

where $tf_t(Q)$ is the term frequency of term $t$ in $Q$, and $k_1$ and $k_2$ are constants.

We were not previously aware of systems using formula 2. In our experiment, we actually discovered the use of formula 2 in WebCrawler.

(ii) Term frequency weight for a term in a document $\vec{d}$, $dtf_t(\vec{d}, A)$:

1. $a_1 + a_2 \frac{tf_t(\vec{d})}{max\_tf(\vec{d})}$ [17]

2. $a_1 + a_2 \frac{tf_t(\vec{d})}{tf_t(\vec{d}) + a_3 + a_4 \frac{dl(\vec{d})}{avg\_dl}}$ [2]

3. $a_1 + a_2 \frac{a_3 + \log tf_t(\vec{d})}{a_4 + \log max\_tf(\vec{d})}$ [19]

4. $a_1 + [tf_t(\vec{d})]^{a_2}$

where $tf_t(\vec{d})$ is the term frequency of a term $t$ in a document $\vec{d}$, $max\_tf(\vec{d})$ is the maximum frequency of all terms in $\vec{d}$, $dl(\vec{d})$ is the total number of occurrences of all terms in document $\vec{d}$, $avg\_dl$ is the average number of terms in a document of the database, each $a_i$ is a constant ($i = 1, 2, 3, 4$) and $a_2 > 0$.

We assume that the expressions for $qtf_t(Q, K)$, $dtf_t(\vec{d}, A)$ and $idf_t(B)$ have the following property.

Property 1. (Monotonicity)

(i) For a given query $Q$, $qtf_t(Q, K)$ is a strictly increasing function of $tf_t(Q)$, the term frequency of term $t$ in $Q$ (i.e., if $t_1$ and $t_2$ are two terms such that $tf_{t_1}(Q) < tf_{t_2}(Q)$, then $qtf_{t_1}(Q, K) < qtf_{t_2}(Q, K)$).

(ii) For a given document $\vec{d}$, $dtf_t(\vec{d}, A)$ is a strictly increasing function of $tf_t(\vec{d})$, the term frequency of $t$ in $\vec{d}$.

(iii) $idf_t(B)$ is a strictly decreasing function of $df_t$, the document frequency of term $t$.

The above monotonicity property is reasonable and is satisfied by almost all $qtf_t(Q, K)$, $dtf_t(\vec{d}, A)$ and $idf_t(B)$ expressions. It simply says that if the term frequency (document frequency) of a term $t$ is strictly higher than that of another term, the term frequency weight (idf weight) of $t$ must be higher (lower). It can be easily verified that all the examples of term frequency weights given in Example 1 also satisfy the monotonicity property.

Another reasonable assumption we make is the following query term independence property.

Property 2. (Query term independence) For a given query $Q$, two query terms in $Q$ have the same query term frequency weight if they have the same term frequency in $Q$, and the query term frequency weight $qtf_t(Q, K)$ of a term $t$ in $Q$ is independent of the term frequencies of other query terms present in $Q$.

Property 2 essentially says that for a term $t$ in $Q$, $qtf_t(Q, K)$ depends on its term frequency $tf_t(Q)$ and not on the term $t$ itself, and its value is unaffected by the occurrences of other terms in $Q$. It can be easily checked that all examples of query term frequency weight given in Example 1(i) have the above property.

3. DETERMINATION OF UNKNOWN CONSTANTS IN TERM WEIGHT FORMULAS
As indicated in the previous section, the formulas for the term weight components, $qtf_t(Q, K)$, $dtf_t(\vec{d}, A)$, $idf_t(B)$ and $norm(\vec{d}, C)$, may contain adjustable constants. In this section, we discuss how to find out these constants once we know the mathematical expressions of these formulas (but not the embedded constants).

After a user has submitted a query $Q$ to a search engine, the retrieved documents are usually ranked in non-increasing order of their similarity values to $Q$ before they are presented to the user. If the search engine also provides the similarities of retrieved documents, we can readily form equations involving the unknown constants. When a sufficient number of such equations are obtained, these unknown constants can be solved for (either analytically or using numerical methods). However, the document similarities may not be available or they are transformed from the dot-product similarities by some unknown function. In this case, we do not know precisely how the similarities of different documents to a query or the similarities of the same document to different queries are related. Forming equations involving the embedded unknown constants (in $K$, $A$, $B$ and $C$) becomes a non-trivial task. We describe how this can be done in the following subsection when we know only the rank order of retrieved documents.

We assume that the document frequencies of the terms in a search engine are obtainable. Note that many search engines on the Internet provide the document frequency information to the user. Alta Vista and Hotbot, for example, give the document frequency of any term $t$ in the retrieval result in
the form of the number of hits if term $t$ is used as a single-term query.

Setting up Equations in the Unknown Constants
As discussed earlier, if the similarities of retrieved documents are not provided, it is difficult to form equations in the constants embedded in the term weight formulas from only the rank order of the documents. We show how this can be overcome in this subsection.

A key step in our procedure to discover the embedded constants is to construct a query $Q$ such that the similarities of $Q$ with two documents are exactly (or approximately) the same. This will allow us to form an equation relating their similarities. We now sketch this key step.

Let $Q_t$ be a single-term query containing a single occurrence of term $t$ and $Q_{t_1,t_2}$ a two-term query involving terms $t_1$ and $t_2$ which may have multiple occurrences in the query. The similarity of a document $\vec{d}$ to a query $Q$ is denoted by $sim(Q, \vec{d})$.

The idea of our approach is this. First, we find two terms $t_1$ and $t_2$ and two documents $\vec{d}_1$ and $\vec{d}_2$ such that document $\vec{d}_1$ contains $t_1$ but not $t_2$ and document $\vec{d}_2$ contains $t_2$ but not $t_1$. As $\vec{d}_1$ has $t_1$ and $\vec{d}_2$ does not, with respect to the single-term query $Q_{t_1}$, the rank of $\vec{d}_1$ must be higher than

is repeated until the difference between $U$ and $R$ is less than some pre-determined positive number. The details of the algorithm are given in [12].

Using the above steps, we can form as many equations involving the embedded constants as we want. When enough equations are constructed, it is possible to solve for the sets of constants $K$, $A$, $B$ and $C$. We have developed a systematic procedure which facilitates the solution of the unknown constants [12].

4. DERIVATION OF TERM WEIGHT FORMULAS
The possible formulas for determining the term weight components are theoretically infinite. However, from the retrieval results of probe queries, we can often derive the mathematical formula of a term weight component approximately (sometimes precisely), or rule out certain groups of mathematical expressions the formula can take. To illustrate the idea of our techniques, we show how this can be done for the term frequency weight formulas.

Throughout this section, a query having $f_j$ instances of term $t_j$, $j = 1, \ldots, n$, will be denoted by $Q_{t_1,\ldots,t_n}(f_1, \ldots, f_n)$. $Q_{t_1,t_2}(v_1, v_2)$, for example, denotes a query with term frequency of $t_j$ equal to $v_j$ ($j = 1, 2$) (i.e., $tf_{t_j}(Q_{t_1,t_2}(v_1, v_2)) = v_j$, $j = 1, 2$).
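The probe-query notation $Q_{t_1,\ldots,t_n}(f_1, \ldots, f_n)$ can be made concrete with a small Python sketch. It builds such a query by repeating each term $t_j$ exactly $f_j$ times, and then checks that the three candidate forms of $qtf_t(Q, K)$ from Example 1(i) are strictly increasing in the term frequency, as Property 1(i) requires. The helper names (`probe_query`, `tf`, `QTF_FORMULAS`, `is_strictly_increasing`), the whitespace-separated query encoding and all constant values are assumptions made for illustration only. They are not taken from any search engine or from the procedure in [12].

```python
import math
from collections import Counter


def probe_query(term_freqs):
    """Encode Q_{t1,...,tn}(f1,...,fn): term t_j is repeated f_j times.

    Assumption: the engine accepts a whitespace-separated bag of words.
    """
    words = []
    for term, freq in term_freqs.items():
        words.extend([term] * freq)
    return " ".join(words)


def tf(term, query):
    """tf_t(Q): occurrences of `term` in the whitespace-tokenised query."""
    return Counter(query.split())[term]


# Candidate forms of qtf_t(Q, K) from Example 1(i).
# The constant values are illustrative placeholders only.
QTF_FORMULAS = {
    "linear": lambda f, k1=0.5, k2=0.5: k1 + k2 * f,         # k1 + k2 * tf
    "power": lambda f, k1=0.5: f ** k1,                      # tf ^ k1
    "log": lambda f, k1=1.0, k2=0.0: k1 + math.log(f + k2),  # k1 + log(tf + k2)
}


def is_strictly_increasing(weight, max_tf=10):
    """Check Property 1(i) on the term frequencies 1..max_tf."""
    values = [weight(f) for f in range(1, max_tf + 1)]
    return all(a < b for a, b in zip(values, values[1:]))


q = probe_query({"search": 3, "engine": 1})  # Q_{search,engine}(3, 1)
print(q, tf("search", q), tf("engine", q))
for name, weight in QTF_FORMULAS.items():
    print(name, is_strictly_increasing(weight))
```

A check over a finite range of term frequencies only illustrates the property for the chosen constants. It does not prove monotonicity for every admissible value of $k_1$ and $k_2$.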