You are on page 1of 2

BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI

HYDERABAD CAMPUS
FIRST SEMESTER 2016 2017
INFORMATION RETRIEVAL (CS F469)
TEST 1 Regular
Date: 09.09.2016 Weightage: 20 %( 60 M)
Duration: 60min. Type: Closed Book
Instructions: Answer all parts of the question together. Your answers should be brief.
Q1. Boolean retrieval [3+4+3 = 10M]
A. Consider the fragment of a positional index of three terms given below which has the following
format: word: doc#: <posn, posn,...>; doc#: <posn, posn,...>; .
KUMARAMANGALAM: 7: <1,6,8>; 8: <6>; 9: <2,15>; 10: <1>.
VIT: 4: <3>; 7: <14>.
BITS: 7: <2,23,56>; 8: <12,16,21>; 9: <13>; 11: <21,25>.
The /n operator, word1 /n word2 finds occurrences of word1 within n words of word2 (on
either side), where n is a positive integer argument. Thus n = 1 demands that word1 be adjacent
to word2.
i.Identify the set of documents that satisfy the query: KUMARAMANGALAM /2 BITS.
ii.Given the query KUMARAMANGALAM /n BITS identify the set of values for n where
documents {7,9} are returned as the answer.
iii.Identify the set of values for n for which the query BITS /n BITS returns a non-empty set of
documents as the answer.

B. Assume that our search engine lets us enter a query, which is a set of words, and returns the set
of documents that contain all the words in the query.
Imagine that we configure the system in four different modes, and for each mode we ask the same
query.
Mode 1: We dont remove stopwords and we dont stem neither documents nor queries. Let A1 be
the set of returned documents.
Mode 2: We dont remove stopwords, but we stem both documents and queries. Let A2 be the set
of returned documents.
Mode 3: We remove stopwords, but dont stem. Let A3 be the set of returned documents.
Mode 4: We remove stopwords, and then we stem both documents and queries. Let A4 be the set
of returned documents.
Identify the relations among A1, A2, A3, and A4? For example, is A1 = A2? Is A2 a subset of
A4?, etc.

C. In a corpus of size 3,00, 000 documents we have the following term frequencies for some of the
terms:
ShivKera RuskinBond ChetanBhagat VikramSeth RabindranathTagore KiranDesai
24,000 1,000 10,000 4,000 13,000 7,000
Propose an evaluation plan for the following query: (ShivKera AND RuskinBond) AND
(ChetanBhagat AND VikramSeth) OR (RabindranathTagore AND KiranDesai) in order to
minimize the list processing time. Justify your answer.

Q2. Dictionaries and tolerant retrieval [3+10=13 M]


A. If there are N terms in the inverted index
i. Theoretically how many terms will be in a bi-word dictionary?
ii. While designing a bi-word index for a search engine do you think that practically so many
terms exist? Why or why not?
iii. Since maintaining the bi-word index for all terms in the inverted index is expensive suggest
how you will use this while designing a search engine.
B. Using permuterm index and 3-gram index, show how we can answer wildcard query CS*46*
on strings CSF469, CSF 469, CS 469, CF 469, and CSF 46. Note that in the second, third
and fourth string there is a space.

Q3. Vector Space Model


A. Explain the notation ddd.qqq. Why there is a need to distinguish and treat the document and
query in different notations? [3 M]

B. Give an example scenario that illustrates when the retrieval system is likely to fail to accurately
retrieve the top k documents for a query. [3 M]

C. Describe the effect of adding new documents or changing existing documents within the VSM.
[Hint: What values have to be recomputed?] [3 M]

D. Euclidean distance is a measure that may be used to compute the similarity between two
vectors. Given a query q and documents d1, . . . , dn, we may rank the documents d1, . . . , dn
in the increasing order of Euclidean distance from q. Show that if q and the document vectors
di are all normalized to unit vectors, then the rank ordering produced by Euclidean distance is
identical to that produced by cosine similarity. [5 M]

E. Consider the following collection of just two documents: [2+3+4=9M]


d1 State space search is a classical artificial intelligence paradigm, with an initial state and a
goal state and..
d2 NASA will search for its lost Martian space probe which never made it to martian orbit 10
months after launch...
i. Represent each document as a vector: extract all unique words from the collection for your full
vocabulary, alphabetize, remove stopwords, i.e., words in the set {a an and the of in to it its is
for from which that}, and represent the vectors using only term frequencies.
ii. Represent the query state space as a vector using term frequencies and then calculate the
cosine similarity of the query with each document.
iii. If the query is seek missing Mars spacecraft what will the cosine similarity be to both
documents? Which of the preprocessing techniques learnt in the course would you apply to do
so that this query has a chance of finding d2?

Q4. Probabilistic IR
A. What are the differences between standard vector space tf-idf weighting and the Binary
Independence Model of probabilistic retrieval model (in the case where no document relevance
is available)? [3 M]
B. Given the following term incidence matrix as shown in Table 1 below, rank order the
documents using the probabilistic retrieval model where relevance estimates are not given for
a query containing terms {T2, T5, T6}. [11 M]
T1 T2 T3 T4 T5 T6

D1 1 0 0 1 1 0

D2 0 1 0 1 1 0

D3 1 0 1 0 1 1

D4 1 0 1 0 1 1

Table 1

******************************* ALL THE BEST *********************************

You might also like