
Stanford University Computer Science Department Solved CS347 Spring 2001 Mid-term

Question 1: (4 points) Shown below is a portion of the positional index in the format term: doc1: position1, position2; doc2: position1, position2; etc.

angels: 2: 36,174,252,651; 4: 12,22,102,432; 7: 17;
fools: 2: 1,17,74,222; 4: 8,78,108,458; 7: 3,13,23,193;
fear: 2: 87,704,722,901; 4: 13,43,113,433; 7: 18,328,528;
in: 2: 3,37,76,444,851; 4: 10,20,110,470,500; 7: 5,15,25,195;
rush: 2: 2,66,194,321,702; 4: 9,69,149,429,569; 7: 4,14,404;
to: 2: 47,86,234,999; 4: 14,24,774,944; 7: 199,319,599,709;
tread: 2: 57,94,333; 4: 15,35,155; 7: 20,320;
where: 2: 67,124,393,1001; 4: 11,41,101,421,431; 7: 16,36,736;
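A phrase query over a positional index like this one can be checked mechanically: a document matches when the phrase's terms occur at consecutive positions. The following is a minimal sketch, not the course's reference implementation. The names `INDEX` and `phrase_docs` are illustrative, and the postings are transcribed from the index in this question.

```python
# Positional index transcribed from Question 1: term -> {doc_id: positions}.
INDEX = {
    "angels": {2: [36, 174, 252, 651], 4: [12, 22, 102, 432], 7: [17]},
    "fools": {2: [1, 17, 74, 222], 4: [8, 78, 108, 458], 7: [3, 13, 23, 193]},
    "fear": {2: [87, 704, 722, 901], 4: [13, 43, 113, 433], 7: [18, 328, 528]},
    "in": {2: [3, 37, 76, 444, 851], 4: [10, 20, 110, 470, 500], 7: [5, 15, 25, 195]},
    "rush": {2: [2, 66, 194, 321, 702], 4: [9, 69, 149, 429, 569], 7: [4, 14, 404]},
    "to": {2: [47, 86, 234, 999], 4: [14, 24, 774, 944], 7: [199, 319, 599, 709]},
    "tread": {2: [57, 94, 333], 4: [15, 35, 155], 7: [20, 320]},
    "where": {2: [67, 124, 393, 1001], 4: [11, 41, 101, 421, 431], 7: [16, 36, 736]},
}


def phrase_docs(index, phrase):
    """Return the sorted doc IDs in which the words of `phrase` are adjacent."""
    terms = phrase.split()
    # Only documents that contain every term can possibly match.
    docs = set(index[terms[0]])
    for term in terms[1:]:
        docs &= set(index[term])
    hits = []
    for doc in sorted(docs):
        later = [set(index[term][doc]) for term in terms[1:]]
        # The phrase starts at p if the k-th following term sits at p + k.
        if any(all(p + k + 1 in positions for k, positions in enumerate(later))
               for p in index[terms[0]][doc]):
            hits.append(doc)
    return hits


first = phrase_docs(INDEX, "fools rush in")
second = phrase_docs(INDEX, "angels fear to tread")
print(first, sorted(set(first) & set(second)))
```

Sets are used for the position lookups to keep the sketch short; a production merge would walk the sorted position lists in step instead.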

Which document(s), if any, meet each of the following queries, where each expression within quotes is a phrase query?

(i) "fools rush in"
Answer: All three documents (2, 4, and 7) satisfy the query. (2 points for a right answer, 1 point for a partially correct answer such as 2 and 4.)

(ii) "fools rush in" AND "angels fear to tread"
Answer: Only document 4.

Question 2: (1+1+3 points)

(i) How many postings entries are generated in the bigram index for the following text (underlined portion only): Far from the madding crowds ignoble strife?
Answer: There are 43 bigrams in the sentence. 39 of them are distinct, so the bigram index would consist of 39 entries. We accepted either 39 or 43 as the answer. (Those who explicitly said "F" was different from "f" also got credit for 40.)

(ii) For this bigram index, how would the wild-card query madd*ing be expressed as an AND query?
Answer: $m AND ma AND ad AND dd AND in AND ng AND g$. No other answers were accepted.
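The expansion of a wild-card query into bigrams can be sketched as follows. This is an illustrative helper, not part of the exam. It assumes a single `*` in the pattern and uses `$` as the word-boundary marker, as in the answer above; the function names are hypothetical.

```python
def bigrams(s):
    """All overlapping 2-character substrings of s, in order."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def wildcard_to_and_query(pattern, boundary="$"):
    """Turn a single-* wildcard pattern into an AND query over bigrams."""
    prefix, suffix = pattern.split("*")  # assumes exactly one '*'
    # Bigrams never span the '*', so each side is expanded separately.
    grams = bigrams(boundary + prefix) + bigrams(suffix + boundary)
    unique = list(dict.fromkeys(grams))  # drop repeats, keep first-seen order
    return " AND ".join(unique)


print(wildcard_to_and_query("madd*ing"))
```

The AND query only narrows the candidate terms. Each candidate still has to be checked against the original pattern, because a term can contain all seven bigrams without matching madd*ing.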

(iii) Consider a corpus with 10 million documents, with an average of 1000 words each. The average word length across the corpus is 4 characters, not counting spaces or punctuation. Estimate the number of trigram occurrences for this corpus.
Answer: Consider one document: there are on average 1,000 words of an average of 4 characters each, with around 1,000 word boundaries (spaces between words and beginning/end of the document). So each document contains roughly 1,000 x 4 + 1,000 = 5,000 trigrams (not necessarily distinct). Thus the corpus contains 5,000 x 10M = 50 billion trigrams (not necessarily distinct). We also accepted 40 billion trigrams, because some students assumed "hello world" would not generate the "o$w" trigram. We did not accept 60 billion, because double counting that occurrence "o$w" is incorrect (-1 pt for 60 billion).
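The three figures discussed in this answer differ only in how many trigrams are charged to each word boundary. A small sketch of the arithmetic follows; the variable and function names are illustrative, and the model (one trigram per character, plus a per-boundary count) is the approximation used in the answer.

```python
DOCS = 10_000_000          # documents in the corpus
WORDS_PER_DOC = 1_000      # average words per document
CHARS_PER_WORD = 4         # average characters per word
BOUNDARIES_PER_DOC = 1_000  # roughly one boundary per word


def corpus_trigrams(trigrams_per_boundary):
    """Estimated trigram occurrences, charging a fixed count to each boundary."""
    per_doc = (WORDS_PER_DOC * CHARS_PER_WORD
               + trigrams_per_boundary * BOUNDARIES_PER_DOC)
    return per_doc * DOCS


print(corpus_trigrams(1))  # accepted answer: boundary trigram counted once
print(corpus_trigrams(0))  # also accepted: no trigram spans a boundary
print(corpus_trigrams(2))  # rejected: boundary trigram counted twice
```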

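Question 3: (5 points) Recall the estimate of the total size of the postings entries (45 MBytes using γ codes) from Lecture 1 using Zipf's law. Using the same parameters (1 million documents, 500,000 terms), re-compute this estimate if we were to omit from indexing the 1% of the most frequently occurring terms.
Answer: This problem can be solved using the equations given in the Homework 1 solution. The top 1% is 5,000 terms, which ends inside block i = 13 (term ranks 4,096 to 8,191), so keeping all of block 13 gives an upper bound:

\[ 2n \sum_{i=13}^{19} i \;-\; n \sum_{i=13}^{19} 1 \;=\; 217 \times n \text{ bits} \;=\; 217 \text{ Mb} \;=\; 27.125 \text{ MB} \]

This is the upper bound. An approximate lower bound, which drops block 13 entirely, is

\[ 2n \sum_{i=14}^{19} i \;-\; n \sum_{i=14}^{19} 1 \;=\; 192 \times n \text{ bits} \;=\; 192 \text{ Mb} \;=\; 24 \text{ MB} \]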
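The two bounds can be reproduced numerically. The sketch below assumes the block model implied by the sums above: block i covers term ranks in [2^(i-1), 2^i), holds about n postings in total, and each gap costs 2i - 1 bits in a γ code. The function names are illustrative.

```python
def block_bits(n_docs, i):
    """Bits for block i: about n_docs postings at (2*i - 1) bits per gap."""
    return n_docs * (2 * i - 1)


def estimate_megabytes(n_docs, first_block, last_block):
    """Total postings size in MB for blocks first_block..last_block."""
    bits = sum(block_bits(n_docs, i)
               for i in range(first_block, last_block + 1))
    return bits / 8 / 1_000_000


N = 1_000_000  # documents; 2**19 is the first power of two above 500,000 terms

print(estimate_megabytes(N, 1, 19))   # all terms: the Lecture 1 figure, about 45 MB
print(estimate_megabytes(N, 13, 19))  # upper bound after dropping the top 1%
print(estimate_megabytes(N, 14, 19))  # approximate lower bound
```

Running the model over all 19 blocks gives 45.125 MB, which agrees with the 45 MBytes quoted in the question and is a useful sanity check on the per-block cost.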

Yes, it's i=13 and i=14, because (referring to the HW solution) k ranges over [2^(i-1), 2^i). We gave credit for those who used i=12 and i=13 though. (-1 pt for minor error, -2 pt for substantial error, -3 pt for describing how to solve without properly solving.)

Question 4: (5 points) Mark each of the following assertions as True or False.

Assertion (T/F)
(i) The optimal order for query processing in an AND query is always realized by starting with the term occurring in the fewest documents. (False)
(ii) The γ code for 17 is 111000001. (False)
(iii) The base of the logarithm used in the tfidf formula makes no difference to the cosine distance between two documents (provided they both use the same base). (True)
(iv) If we were to take a document and double its length by repeating every occurrence of every word, then the normalized tfidf values for all terms in this document remain unchanged. (True)
(v) The tf x idf value for a term t in a document D is non-zero if and only if there is at least one occurrence of t in D. (False)
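Assertion (ii) can be checked with a few lines of code. The sketch below assumes the γ code convention used in this course: the offset is the binary representation of the number without its leading 1, preceded by the offset's length in unary (that many 1s followed by a 0). The function name is illustrative.

```python
def gamma_encode(n):
    """Gamma code of a positive integer: unary(len(offset)) + offset."""
    if n < 1:
        raise ValueError("gamma codes are defined for positive integers")
    offset = bin(n)[3:]               # strip '0b' and the leading 1 bit
    length = "1" * len(offset) + "0"  # unary code for the offset length
    return length + offset


print(gamma_encode(17))
```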

Explanation:
(i) Counterexample: for the query {x AND y AND z}, suppose ((x AND y) AND z) is the processing order based on frequency counts. It is possible that (x AND y) is non-empty whereas (y AND z) is in fact empty. Then, using the order ((y AND z) AND x), we can altogether avoid reading the postings list for x.
(ii) The correct code is 111100001.
(iii) A different base for the logarithm only changes the length of the document vectors, not the angle between them. Therefore the cosine distance, which is computed after normalizing the vectors, remains unchanged.
(iv) Clearly, idfs do not change, as they depend only on the number of documents containing each word. Also, since every occurrence of every word is repeated, every tf is scaled by the same factor, so tf after normalization does not change. Hence, the normalized tfidf values remain unchanged.
(v) It is true that if the tfidf value is non-zero, then there is at least one occurrence of t in D. However, the converse is false, since idf can be 0 if t occurs in every document in the corpus.

Question 5: (3+1+2 points)
(i) Consider the following postings list augmented with skip pointers at a uniform skip size of 4.

3 5 9 15 24 39 60 68 75 81 84 89 92 96 97 100 115

The entries in the list are NOT gap encoded; they directly represent document IDs. At some stage in processing an AND query, we need to merge the entries in this postings list with the entries in the candidate list (3, 5, 89, 95, 97, 99, 100, 101). We define the following four operations:

Compare(A,B): compare entry A in postings list with entry B in the candidate list
Output(X): output value X as a result of the merge process
LookAheadTo(X): peek ahead to take a look at the target X of a skip pointer
SkipTo(X): follow a skip pointer to reach entry X

a) In performing this merge, list the instances of Output, LookAheadTo, and SkipTo operations in the order in which they would be executed. (Note: Do not list instances of Compare.)
Answer: This question has a number of possible solutions, depending on the details of the merging algorithm. For example, at skip points on the postings list, one could first look ahead and then compare with the current postings entry, or do the operations in the reverse order. Similarly, one could either discard or remember the last lookahead value (and the associated pointer) between successive skip points. Full points were awarded for any consistent order of operations accompanied by clearly stated assumptions, if any. Below are two possible answers:

Case 1: The last lookahead value is discarded; at skip points, look ahead before operating on the current postings entry.
LookAheadTo(24) Output(3) Output(5) LookAheadTo(75) SkipTo(75) LookAheadTo(92) Output(89) LookAheadTo(115) Output(97) Output(100)

Case 2: The last lookahead value and pointer are stored; at skip points, look ahead before operating on the current postings entry.
LookAheadTo(24) Output(3) Output(5) SkipTo(24) LookAheadTo(75) SkipTo(75) LookAheadTo(92) Output(89) SkipTo(92) LookAheadTo(115) Output(97) Output(100)

Common error: Computing the union, instead of intersecting the postings and candidate lists.

b) How many Compare operations would be executed? (Note: Do not list the actual instances.)
Answer: Here the answer is obviously a function of the assumptions made for Part (a).
For instance, for Case 1, there are 19 compares:
(24 3) (3 3) (5 5) (9 89) (15 89) (24 89) (75 89) (92 89) (81 89) (84 89) (89 89) (115 95) (92 95) (96 95) (96 97) (97 97) (100 99) (100 100) (115 101)
For Case 2, there are 21 compares:
(24 3) (3 3) (24 5) (5 5) (24 89) (75 89) (92 89) (81 89) (84 89) (89 89) (92 95) (115 95) (96 95) (115 97) (96 97) (97 97) (115 99) (100 99) (115 100) (100 100) (115 101)

Common error: For Case 2, not counting the compares between the last lookahead value and the current value in the candidate list.

(ii) Suppose we wish to gap-encode a postings list containing skip pointers. One approach is to store all entries in the list as gaps, except for targets of the skip pointers, which will be stored as absolute values. For example, in the postings list of Part (i), 24, 75, 92, and 115 will be stored as is, whereas the remaining entries will be gap-encoded, yielding (3 2 4 6 24 15 21 .... 4 1 3 115). Suppose (7 8 9 39 28 60 130 70 10 215) represents a gap-encoded postings list with skip pointers at skip size 3. What will be the result of merging this postings list with the candidate list (9 24 127 135 210)?
Answer: (24 127 210)
(7 8 9 39 28 60 130 70 10 215) decodes to (7 15 24 39 67 127 130 200 210 215). Intersecting this with the candidate list, we get (24 127 210). 1 point was awarded as long as the postings list was decoded correctly, even if the final answer was incorrect.
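The decoding rule in Part (ii) can be sketched in a few lines. This is an illustrative implementation under the scheme described above: entries at positions that are multiples of the skip size (the first entry and every skip target) are absolute document IDs, and all other entries are gaps from the previous document ID. The function names are hypothetical, and the intersection is a plain two-pointer merge that does not use the skip pointers.

```python
def decode_with_skips(encoded, skip_size):
    """Decode a gap-encoded postings list whose skip targets are absolute."""
    docs = []
    for pos, value in enumerate(encoded):
        if pos % skip_size == 0:
            docs.append(value)             # first entry or skip target
        else:
            docs.append(docs[-1] + value)  # gap from the previous doc ID
    return docs


def intersect(a, b):
    """Two-pointer intersection of two sorted lists of doc IDs."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


postings = decode_with_skips([7, 8, 9, 39, 28, 60, 130, 70, 10, 215], 3)
print(postings)
print(intersect(postings, [9, 24, 127, 135, 210]))
```

Storing skip targets as absolute values is what makes skipping possible: after a jump, the decoder has no running sum of the skipped gaps, so it needs a value that can be read on its own.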

Question 6: (5 points)
(i) Represent the following simplified graph of the web as a Markov chain by providing the corresponding transition probability matrix. Assume teleportation to a random page (including the start page) occurs with 50% probability.
Answer:

\[
0.5 \times
\begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0.5 & 0.5 & 0 \end{pmatrix}
+ 0.5 \times
\begin{pmatrix} 0.333 & 0.333 & 0.333 \\ 0.333 & 0.333 & 0.333 \\ 0.333 & 0.333 & 0.333 \end{pmatrix}
=
\begin{pmatrix} 0.167 & 0.667 & 0.167 \\ 0.167 & 0.167 & 0.667 \\ 0.417 & 0.417 & 0.167 \end{pmatrix}
\]

(This was worth 3 pts; -1 pt for each error.)

(ii) Using the initial probability vector [0 1 0], carry forward the Markov chain 1 time step. (I.e., give the probability vector for time t = 1.)
Answer:

\[
\begin{pmatrix} 0 & 1 & 0 \end{pmatrix}
\times
\begin{pmatrix} 0.167 & 0.667 & 0.167 \\ 0.167 & 0.167 & 0.667 \\ 0.417 & 0.417 & 0.167 \end{pmatrix}
=
\begin{pmatrix} 0.167 & 0.167 & 0.667 \end{pmatrix}
\]

(This was worth 2 pts. Mostly all or nothing.)
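Both parts can be verified with a short computation. The sketch below assumes the link matrix `A` (page 1 links to page 2, page 2 to page 3, page 3 to pages 1 and 2) as read from the answer to part (i), since the graph itself is not reproduced here. The names `with_teleport` and `step` are illustrative.

```python
# Link-following probabilities, row = current page, column = next page.
# Assumption: read off the first matrix in the answer to part (i).
A = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]


def with_teleport(links, alpha):
    """Mix link-following with uniform teleportation of probability alpha."""
    n = len(links)
    return [[(1 - alpha) * p + alpha / n for p in row] for row in links]


def step(x, P):
    """One time step of the chain: the row vector x times the matrix P."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]


P = with_teleport(A, 0.5)
x1 = step([0.0, 1.0, 0.0], P)
print([[round(p, 3) for p in row] for row in P])
print([round(p, 3) for p in x1])
```

Starting from [0 1 0] simply selects the second row of the transition matrix, which is why the answer to part (ii) can be read off directly from part (i).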
