
BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI

HYDERABAD CAMPUS
FIRST SEMESTER 2016-2017
INFORMATION RETRIEVAL (CS F469) - COMPREHENSIVE MAKEUP EXAM

Date: 10.01.2017 Weightage: 35% (105 M)


Duration: 3 Hours. Type: Closed Book
Instructions:
Answer all parts of the question together. Your answers should be brief.
Q1. Boolean Retrieval
A. If the query is friends AND romans AND (NOT countrymen), how could we use the frequency of
countrymen? [3 M]
B. Can you process the query with only one traversal if all posting lists are in main memory?
[3 M]
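Note: the single-traversal idea in part B can be illustrated with a minimal sketch. It assumes each postings list is a sorted Python list of docIDs held in memory; the function name and the sample postings are hypothetical and are not part of the exam data.

```python
def and_and_not(p1, p2, p_not):
    """Return docIDs in (p1 AND p2 AND NOT p_not).

    All three inputs are assumed to be sorted in ascending order.
    Each list is walked once, so the work is linear in their total length.
    """
    result = []
    i = j = k = 0
    while i < len(p1) and j < len(p2):
        if p1[i] < p2[j]:
            i += 1
        elif p1[i] > p2[j]:
            j += 1
        else:
            doc = p1[i]
            # Advance the negated list to the first docID >= doc.
            while k < len(p_not) and p_not[k] < doc:
                k += 1
            # Keep doc only if it is absent from the negated list.
            if k == len(p_not) or p_not[k] != doc:
                result.append(doc)
            i += 1
            j += 1
    return result


# Hypothetical postings lists, for illustration only.
friends = [1, 3, 5, 8, 13]
romans = [1, 2, 3, 8, 21]
countrymen = [3, 9, 21]
print(and_and_not(friends, romans, countrymen))  # [1, 8]
```

The pointer into the negated list only ever moves forward, which is what makes one pass sufficient.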
Q2. Dictionaries and tolerant retrieval
A. Write down the entries in the permuterm index dictionary that are generated by the term good.
[2 M]
B. If you wanted to search for s*n*g in a permuterm wildcard index, what key(s) would you do
the lookup on? What problem arises when a term like this, with more than one
wildcard operator, is used? [3 M]
C. What is the potential problem of using stop words in combination with positional indexing?
How could you solve this problem? [2 M]
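Note: the permuterm mechanics behind parts A and B can be sketched as follows. This is a minimal illustration that assumes `$` as the end-of-term marker and the usual strategy for multi-wildcard patterns (look up on the outer parts, then post-filter); the function names and the small vocabulary are hypothetical.

```python
from fnmatch import fnmatchcase


def permuterm_rotations(term):
    """All rotations of term + '$', i.e. the permuterm keys for one term."""
    augmented = term + "$"
    return [augmented[i:] + augmented[:i] for i in range(len(augmented))]


def wildcard_lookup_key(pattern):
    """Prefix key for a pattern containing at least one '*'.

    For X*...*Z the lookup uses Z$X*, ignoring whatever sits between
    the first and last wildcard; that part is checked afterwards.
    """
    if "*" not in pattern:
        raise ValueError("pattern must contain at least one '*'")
    first, *_middle, last = pattern.split("*")
    return last + "$" + first + "*"


def candidates(vocabulary, key):
    """Terms having some rotation that starts with the key's prefix."""
    prefix = key.rstrip("*")
    return [t for t in vocabulary
            if any(rot.startswith(prefix) for rot in permuterm_rotations(t))]


# Hypothetical vocabulary, for illustration only.
VOCABULARY = ["sing", "song", "string", "slug", "sag", "ring"]

print(permuterm_rotations("good"))
key = wildcard_lookup_key("s*n*g")
found = candidates(VOCABULARY, key)
# Post-filter: the prefix lookup alone also returns terms without an 'n'.
final = [t for t in found if fnmatchcase(t, "s*n*g")]
print(key, found, final)
```

The gap between `found` and `final` shows why patterns with several wildcards need an extra filtering step after the index lookup.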

Q3. Vector space model


You are hired by YouTube to implement a search engine using the probabilistic retrieval model, where
you are also given the documents that are relevant and non-relevant for queries. Answer questions
A-D. [4 x 2 = 8 M]
A. Do you think the inverted index discussed in class is used while computing the rank of the
document? If no, why? If yes, how is it used?
B. If you are asked to build the inverted index discussed in class for videos, what problems do
you foresee in the preprocessing phase? Think in terms of
semantics.
C. In addition to the ranking used in the probabilistic retrieval model, you are asked to consider other
factors of videos such as the number of likes, the sentiment of the comments, the
freshness of the video (how recently it was created), etc. Devise a modified similarity
model that takes all the above-mentioned factors into consideration while computing the score.
D. What additional information has to be stored in the inverted index to compute your new score?

Q4. Cross Language Information Retrieval (CLIR) / Machine Translation


Using the phrase-aligned sentences (f, e) below, answer questions A-H.

A. Construct the phrase alignment matrix with English words as rows and Hindi words as
columns. [4 M]
B. Assuming that the alignment matrix from question A is the intersection of P(f|e) and P(e|f),
identify whether the following phrase pairs are consistent with the alignment [2 M]
i.
ii. ( )
C. Which is the longest phrase pair that is consistent with the alignment? [2 M]
D. Compute the reordering distance between the 2 phrase pairs given in question B. [4 M]
E. In the mathematical model of phrase-based translation, why is the reordering distance not
used directly, but instead an exponentially decaying cost function $d = \alpha^{|start_i - end_{i-1} - 1|}$? [2 M]
F. In the exponentially decaying cost function $d = \alpha^{|start_i - end_{i-1} - 1|}$, what should the value of $\alpha$ be if
the movement of phrases has to be penalized? [2 M]
G. If a spurious phrase pair occurs only once in the whole parallel corpus, what will be the values of
(f,e) and (e,f)? [1 M]
H. If a spurious phrase translation pair occurs only once, how will you compute the phrase
translation values? Show with the help of an example. [3 M]
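Note: the reordering quantities in parts D-F can be sketched in a few lines. This assumes the common phrase-based formulation in which the reordering distance is start_i - end_(i-1) - 1 and the cost is a base raised to its absolute value; the base is called `alpha` here, and both the default value 0.5 and the word positions are hypothetical.

```python
def reordering_distance(start_i, end_prev):
    """Distance between the start of phrase i and the end of phrase i-1.

    Positions are foreign-side word indices; 0 means monotone order.
    """
    return start_i - end_prev - 1


def distortion_cost(start_i, end_prev, alpha=0.5):
    """Exponentially decaying cost alpha ** |distance| (alpha is assumed)."""
    return alpha ** abs(reordering_distance(start_i, end_prev))


# Hypothetical positions, for illustration only.
print(reordering_distance(4, 3), distortion_cost(4, 3))  # 0 1.0  (monotone)
print(reordering_distance(6, 3), distortion_cost(6, 3))  # 2 0.25 (skip forward)
print(reordering_distance(1, 5), distortion_cost(1, 5))  # -5 0.03125 (jump back)
```

With a base between 0 and 1 the cost behaves like a probability-style factor that shrinks as the jump grows, in either direction.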

Q5. Recommender systems


A. In the latent factor model for recommender systems, what is the use of regularization and how is
it modeled? [3M]
B. Given the following P^T and Q matrices for a latent factor model, compute the rating for user2,
item3. [3M]

P^T =  1.18  -0.73   0.72
      -0.79   1.02   0.26
       0.72   0.08  -0.06
       0.46  -0.26   0.14

Q   = -0.47   1.03   0.73  -0.23   1.01   1.16
       1.28  -0.10   0.63   1.10  -0.37   0.07
       0.59   0.69   0.91  -0.03   0.52   0.79

C. Find the CUR-decomposition of the matrix if the two random rows are both Jack and the two
columns are Star Wars and Sky Fall. [Note: You will only show the C and R matrices
construction] [4 M]

        Matrix  Alien  Star Wars  Sky Fall  Titanic
Joe       1       1       1          0         0
Jim       3       3       3          0         0
John      4       4       4          0         0
Jack      5       5       5          0         0
Jill      0       0       0          4         4
Jenny     0       0       0          5         5
Jane      0       0       0          2         2
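Note: the construction of C and R in part C can be sketched as below. It assumes the usual textbook convention in which a row or column is picked with probability proportional to its squared norm, and each picked row or column is divided by sqrt(r * p), where r is the number of picks and p its selection probability. Whether this is exactly the convention expected in the exam is an assumption, and the helper name is hypothetical.

```python
from math import sqrt

COLUMNS = ["Matrix", "Alien", "Star Wars", "Sky Fall", "Titanic"]
ROWS = ["Joe", "Jim", "John", "Jack", "Jill", "Jenny", "Jane"]
M = [
    [1, 1, 1, 0, 0],  # Joe
    [3, 3, 3, 0, 0],  # Jim
    [4, 4, 4, 0, 0],  # John
    [5, 5, 5, 0, 0],  # Jack
    [0, 0, 0, 4, 4],  # Jill
    [0, 0, 0, 5, 5],  # Jenny
    [0, 0, 0, 2, 2],  # Jane
]


def build_c_and_r(matrix, row_picks, col_picks):
    """Scaled C (picked columns) and R (picked rows); U is not computed."""
    frob_sq = sum(x * x for row in matrix for x in row)

    scaled_cols = []
    for j in col_picks:
        p = sum(row[j] ** 2 for row in matrix) / frob_sq
        scale = sqrt(len(col_picks) * p)
        scaled_cols.append([row[j] / scale for row in matrix])
    # Transpose so that C has one row per original row of the matrix.
    C = [list(vals) for vals in zip(*scaled_cols)]

    R = []
    for i in row_picks:
        p = sum(x * x for x in matrix[i]) / frob_sq
        scale = sqrt(len(row_picks) * p)
        R.append([x / scale for x in matrix[i]])
    return C, R


jack = ROWS.index("Jack")
C, R = build_c_and_r(
    M,
    row_picks=[jack, jack],
    col_picks=[COLUMNS.index("Star Wars"), COLUMNS.index("Sky Fall")],
)
for row in C:
    print([round(x, 2) for x in row])
for row in R:
    print([round(x, 2) for x in row])
```

Because Jack is picked twice, R simply contains the same scaled row twice.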

D. Below is a table giving the profile of three items. [4+4=8M]


A   1 0 1 0 1   2
B   1 1 0 0 1   6
C   0 1 0 1 0   2
The first five attributes are Boolean, and the last is an integer "rating." Assume that the scale
factor for the rating is a. Compute, as a function of a, the cosine distances between each pair of
profiles and find which of the following is FALSE?
i. For a = 0.5, B is closer to C than A.
ii. For a = 2, A is closer to C than B.
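Note: the pairwise comparison in part D can be explored numerically with a short sketch. It assumes the scale factor a multiplies only the last (rating) component and reports the cosine together with the corresponding angle; the helper names are hypothetical.

```python
from math import acos, degrees, sqrt

# Five Boolean attributes followed by the integer rating, as in the table.
PROFILES = {
    "A": [1, 0, 1, 0, 1, 2],
    "B": [1, 1, 0, 0, 1, 6],
    "C": [0, 1, 0, 1, 0, 2],
}


def scaled(profile, a):
    """Copy of the profile with the rating component multiplied by a."""
    return profile[:-1] + [a * profile[-1]]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))


def pairwise_cosines(a):
    """Cosine of the angle between each pair of profiles for a given a."""
    names = sorted(PROFILES)
    return {
        (p, q): cosine(scaled(PROFILES[p], a), scaled(PROFILES[q], a))
        for idx, p in enumerate(names)
        for q in names[idx + 1:]
    }


for a in (0.5, 2):
    for pair, cos in pairwise_cosines(a).items():
        angle = degrees(acos(min(1.0, cos)))
        print(a, pair, round(cos, 4), round(angle, 1))
```

A larger cosine (smaller angle) means the two profiles are closer, so the printed values can be compared directly against statements i and ii.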

E. You are hired by Bing to work on its search engine to use the concept of collaborative filtering to
recommend documents to a query. [Hint: Here the query is considered as an active user to
whom you will recommend items]. Answer questions i-iv [1+1+2+6 = 10 M]
i. What do you mean by neighboring users in this scenario?
ii. What do you mean by the items in this scenario?
iii. What is the rating in this scenario?
iv. Briefly sketch the algorithm, preferably with some formulas. Assume that r(Q, D) is a retrieval
function that can give you a positive similarity value for any query and document.
[Hint: map the given problem to the user-item matrix and find analogies to the problem]

F. Given the following SVD for M, where the columns of M are the ratings of Matrix, Alien, Star
Wars, Casablanca and Titanic, answer questions i to v.
Suppose Leslie assigns rating 3 to Alien and rating 4 to Titanic. [2+2+2+3+3=12 M]
i. Show how we can represent Leslie as a vector.
ii. What is her representation in movie space?
iii. Find the representation of Leslie in concept space.
iv. What does that representation predict about how well Leslie would like the other movies
appearing in our example data?
v. The steps above show how to guess the movies a person would most like. How would you use a similar technique to
guess the people that would most like a given movie, if all you had were the ratings of that
movie by a few people?
Q6. Link Analysis
For the web graph given in Figure 1, answer questions A-F. [4+4+4+2+3=17M]
A. Write the flow equations for calculating the page rank for all the pages in the web graph.
B. Using the power iteration method, what will be the page rank of all the pages after 2 iterations?
C. Show the transition matrix A that will be used by the PageRank algorithm, assuming that with
probability β a random surfer will follow the links on the current page, and with probability
(1 - β) he/she will transition to any of the (three) pages with uniform probability, where β is
set to 0.5.

Figure 1
D. Suppose we set β to 0. What will be the page ranks associated with the three pages?
[Note: You need not compute the page rank just a 2-line justification is expected]
E. Show the working of HITS algorithm in vector notation for two iterations on the web graph
in Figure 1.

F. Does the web graph in Figure 2 have spider traps and dead ends? [2 M]

Figure 2
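Note: the matrix in part C and the power iteration in part B can be sketched as follows. Since Figure 1 is not reproduced here, the three-page graph below is purely hypothetical and is not the exam's graph. The link-following probability is called `beta` (a name assumed here), and the matrix is built as beta * M + (1 - beta) / N with M column-stochastic.

```python
def google_matrix(links, n, beta):
    """A[dst][src] = beta / outdegree(src) for each link, plus teleport mass.

    links maps each page to the list of pages it points to; every page
    is assumed to have at least one out-link (no dead ends).
    """
    A = [[(1 - beta) / n for _ in range(n)] for _ in range(n)]
    for src, targets in links.items():
        for dst in targets:
            A[dst][src] += beta / len(targets)
    return A


def power_iterate(A, iterations):
    """Start from the uniform vector and apply r <- A r repeatedly."""
    n = len(A)
    r = [1 / n] * n
    for _ in range(iterations):
        r = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
    return r


# Hypothetical graph (NOT Figure 1): 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
LINKS = {0: [1, 2], 1: [2], 2: [0]}

A_half = google_matrix(LINKS, n=3, beta=0.5)
print(power_iterate(A_half, iterations=2))

# With beta = 0 every entry of A is 1/3, so the link structure has no effect.
A_zero = google_matrix(LINKS, n=3, beta=0.0)
print(power_iterate(A_zero, iterations=2))
```

Each column of the resulting matrix sums to 1, so the rank vector stays a probability distribution after every iteration.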

Q7. Multimedia Information Retrieval (MIR)


A. Given the following grid representing the boundary of a shape, what is the Freeman chain
code starting from the arrow shown in the grid? [2 M]
B. Given the color histograms for the query and the three images named a, b and c, with each
histogram having four colors (red, blue, purple and yellow), where the first bin shows the number
of red pixels, the second bin blue, the third bin purple and the fourth bin yellow,
compute the Bray-Curtis dissimilarity and the squared chord distance, and rank the images based on both
distances. [4 M]

Bray Curtis dissimilarity

Squared chord

C. Using the concepts learnt in this course, suggest an application of your choice that could be
useful and ease the life of the common man. Also show the architecture of your proposed system.
[3 M]
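Note: the two histogram distances named in part B can be sketched as below, using their standard definitions: Bray-Curtis is the sum of absolute bin differences divided by the sum of all bin values, and squared chord is the sum of squared differences of the square roots of the bins. The histograms in the example are hypothetical, since the exam's histograms are not reproduced here.

```python
from math import sqrt


def bray_curtis(q, h):
    """Sum of |q_i - h_i| divided by the sum of (q_i + h_i)."""
    return sum(abs(a - b) for a, b in zip(q, h)) / sum(a + b for a, b in zip(q, h))


def squared_chord(q, h):
    """Sum of (sqrt(q_i) - sqrt(h_i)) ** 2 over all bins."""
    return sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(q, h))


def rank_images(query, images, distance):
    """Image names ordered from most to least similar to the query."""
    return sorted(images, key=lambda name: distance(query, images[name]))


# Hypothetical histograms (red, blue, purple, yellow), for illustration only.
QUERY = [16, 4, 0, 9]
IMAGES = {
    "a": [16, 4, 0, 9],
    "b": [9, 4, 1, 16],
    "c": [0, 25, 4, 0],
}

print(rank_images(QUERY, IMAGES, bray_curtis))
print(rank_images(QUERY, IMAGES, squared_chord))
```

Both measures are 0 for identical histograms, so smaller values rank higher.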

************************* That's all folks ********************************
