Search Engines: Information Retrieval in Practice

Search Engines
Information Retrieval in Practice
All slides © Addison Wesley, 2008
Web Crawler
Finds and downloads web pages automatically
  provides the collection for searching
Web is huge and constantly growing
Web is not under the control of search engine providers
Web pages are constantly changing
Crawlers also used for other types of data
Retrieving Web Pages
Every page has a unique uniform resource locator (URL)
Web pages are stored on web servers that use HTTP to exchange information with client software
  e.g., [figure: example URL]
Retrieving Web Pages
Web crawler client program connects to a domain name system (DNS) server
DNS server translates the hostname into an internet protocol (IP) address
Crawler then attempts to connect to server host using specific port
After connection, crawler sends an HTTP request to the web server to request a page
  usually a GET request
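The steps above can be sketched in a few lines. This is a minimal illustration, not a production client: the URL is hypothetical, the helper name build_get_request is an assumption, and the request is only built, not sent. In a real crawler the DNS lookup and connection would follow (for example with socket.getaddrinfo and socket.create_connection).

```python
from urllib.parse import urlsplit

def build_get_request(url):
    """Return (host, port, raw HTTP/1.1 GET request bytes) for a URL."""
    parts = urlsplit(url)
    host = parts.hostname
    # Default ports: 80 for http, 443 for https, unless the URL names one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return host, port, request.encode("ascii")

# Hypothetical URL used only for illustration.
host, port, raw = build_get_request("http://www.example.com/people.html")
```

The host is what the DNS server would translate to an IP address; the port and the GET line correspond to the last two steps on the slide.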
Crawling the Web
Web Crawler
Starts with a set of seeds, which are a set of URLs given to it as parameters
Seeds are added to a URL request queue
Crawler starts fetching pages from the request queue
Downloaded pages are parsed to find link tags that might contain other useful URLs to fetch
New URLs added to the crawler's request queue, or frontier
Continue until no more new URLs or disk full
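The loop described above can be sketched as follows. This is a single-threaded illustration under stated assumptions: the fetch function is passed in (here a dictionary lookup over a hypothetical three-page site stands in for HTTP), and max_pages stands in for the "disk full" stopping condition.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, fetch, max_pages=100):
    frontier = deque(seeds)      # the URL request queue
    seen = set(seeds)            # URLs already queued, to avoid repeats
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:         # fetch failed or page missing
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

# Hypothetical in-memory "web" used instead of real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a>',
    "http://example.com/b.html": '<a href="/c.html">C</a>',
}
pages = crawl(["http://example.com/"], FAKE_WEB.get)
```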
Web Crawling
Web crawlers spend a lot of time waiting for responses to requests
To reduce this inefficiency, web crawlers use threads and fetch hundreds of pages at once
Crawlers could potentially flood sites with requests for pages
To avoid this problem, web crawlers use politeness policies
  e.g., delay between requests to same web server
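One way to implement such a delay is to record, per host, the earliest time at which the next request is allowed. The class name, the 5-second delay, and the explicit now argument (used instead of a real clock so the behaviour is easy to follow) are assumptions made for this sketch.

```python
from urllib.parse import urlsplit

class PolitenessScheduler:
    def __init__(self, delay_seconds=5.0):
        self.delay = delay_seconds
        self.next_allowed = {}   # host -> earliest time of the next request

    def wait_time(self, url, now):
        """Seconds a thread should still wait before requesting this URL."""
        host = urlsplit(url).hostname
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_request(self, url, now):
        """Note that a request was just sent to this URL's host."""
        host = urlsplit(url).hostname
        self.next_allowed[host] = now + self.delay

sched = PolitenessScheduler(delay_seconds=5.0)
sched.record_request("http://example.com/a", now=100.0)
same_host_wait = sched.wait_time("http://example.com/b", now=102.0)
other_host_wait = sched.wait_time("http://example.org/", now=102.0)
```

Requests to other hosts are not delayed, which is why many threads can still keep the crawler busy while each individual server sees only occasional requests.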
Controlling Crawling
Even crawling a site slowly will anger some web server administrators, who object to any copying of their data
Robots.txt file can be used to control crawlers
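A crawler can check these rules with Python's standard urllib.robotparser module. The robots.txt content, crawler names, and URLs below are hypothetical and chosen only to show one blocked directory and one crawler that is given full access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone is kept out of two directories,
# except a crawler named FavoredCrawler, which may fetch anything.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /confidential/

User-agent: FavoredCrawler
Disallow:
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

generic_private = rp.can_fetch("SomeCrawler", "http://example.com/private/x.html")
generic_public = rp.can_fetch("SomeCrawler", "http://example.com/index.html")
favored_private = rp.can_fetch("FavoredCrawler", "http://example.com/private/x.html")
```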
Simple Crawler Thread
Freshness
Web pages are constantly being added, deleted, and modified
Web crawler must continually revisit pages it has already crawled to see if they have changed in order to maintain the freshness of the document collection
  stale copies no longer reflect the real contents of the web pages
Freshness
HTTP protocol has a special request type called HEAD that makes it easy to check for page changes
  returns information about page, not page itself
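A sketch of how a crawler might use HEAD: build the request, then compare the Last-Modified header of the response with the time of the last crawl. The URL, the dates, and the helper names are hypothetical, and the request is only constructed here rather than sent (urllib.request.urlopen would send it).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request

def head_request(url):
    """Build (but do not send) a HEAD request for the URL."""
    return Request(url, method="HEAD")

def changed_since(last_modified_header, last_crawled):
    """True if the Last-Modified header is later than our last crawl."""
    modified = parsedate_to_datetime(last_modified_header)
    return modified > last_crawled

req = head_request("http://www.example.com/index.html")
last_crawl = datetime(2008, 1, 10, tzinfo=timezone.utc)
stale = changed_since("Thu, 17 Jan 2008 12:00:00 GMT", last_crawl)
fresh = changed_since("Thu, 03 Jan 2008 12:00:00 GMT", last_crawl)
```

Not every server sends a reliable Last-Modified header, so a real crawler would treat a missing header as "unknown" rather than "unchanged".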
Freshness
Not possible to constantly check all pages
  must check important pages and pages that change frequently
Freshness is the proportion of pages that are fresh
Optimizing for this metric can lead to bad decisions, such as not crawling popular sites
Age is a better metric
Freshness vs. Age
Age
Expected age of a page t days after it was last crawled: [equation not recovered]
Web page updates follow the Poisson distribution on average
  time until the next update is governed by an exponential distribution
Age
Older a page gets, the more it costs not to crawl it
  e.g., expected age with mean change frequency λ = 1/7 (one change per week)
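The example can be worked numerically. The sketch assumes the usual definition of expected age, Age(λ, t) = ∫ from 0 to t of P(page changed at time x) · (t − x) dx, with the exponential density λe^(−λx) for the time of the first change; this formula is an assumption here, since the slide's equation is not available. Integrating gives the closed form t − (1 − e^(−λt)) / λ, which the code checks against a simple numerical integration.

```python
import math

def expected_age(lam, t):
    """Closed form of the integral of lam*exp(-lam*x)*(t - x) for x in [0, t]."""
    return t - (1.0 - math.exp(-lam * t)) / lam

def expected_age_numeric(lam, t, steps=100_000):
    """Midpoint-rule approximation of the same integral, as a cross-check."""
    dx = t / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * dx
        total += lam * math.exp(-lam * x) * (t - x) * dx
    return total

lam = 1 / 7                               # one change per week on average
age_after_week = expected_age(lam, 7)     # equals 7/e, about 2.6 days
age_after_two_weeks = expected_age(lam, 14)
```

Under these assumptions the expected age grows faster the longer the page is left uncrawled, which is the point of the slide.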
Focused Crawling
Attempts to download only those pages that are about a particular topic
  used by vertical search applications
Rely on the fact that pages about a topic tend to have links to other pages on the same topic
  popular pages for a topic are typically used as seeds
Crawler uses text classifier to decide whether a page is on topic
Deep Web
Sites that are difficult for a crawler to find are collectively referred to as the deep (or hidden) Web
  much larger than conventional Web
Three broad categories:
  private sites
    no incoming links, or may require log in with a valid account
  form results
    sites that can be reached only after entering some data into a form
  scripted pages
    pages that use JavaScript, Flash, or another client-side language to generate links
Sitemaps
Sitemaps contain lists of URLs and data about those URLs, such as modification time and modification frequency
Generated by web server administrators
Tells crawler about pages it might not otherwise find
Gives crawler a hint about when to check a page for changes
Sitemap Example
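As an illustration, a small sitemap in the sitemaps.org XML format can be read with the standard library. The sitemap content and URLs below are hypothetical, and parse_sitemap is a name chosen for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap with two URLs.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2008-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>http://www.example.com/items?item=truck</loc>
    <changefreq>weekly</changefreq>
  </url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text):
    """Return one dict per <url> entry; missing fields are None."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
        })
    return entries

entries = parse_sitemap(SITEMAP_XML)
```

The second URL is the kind of form-result page a crawler might not otherwise find, and changefreq is the hint about when to check again.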
Distributed Crawling
Three reasons to use multiple computers for crawling
  Helps to put the crawler closer to the sites it crawls
  Reduces the number of sites the crawler has to remember
  Reduces computing resources required
Distributed crawler uses a hash function to assign URLs to crawling computers
  hash function should be computed on the host part of each URL
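A minimal sketch of this assignment, assuming MD5 as the hash function (any stable hash would do) and a fixed number of crawling computers. Hashing only the host keeps all URLs of one site on the same machine, so politeness delays can be enforced locally.

```python
import hashlib
from urllib.parse import urlsplit

def assign_crawler(url, num_crawlers):
    """Map a URL to a crawler index using a hash of its host part only."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers

# Hypothetical URLs: the first two share a host and must share a crawler.
m1 = assign_crawler("http://www.example.com/a.html", 16)
m2 = assign_crawler("http://www.example.com/deep/b.html?x=1", 16)
m3 = assign_crawler("http://www.example.org/", 16)
```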
Desktop Crawls
Used for desktop search and enterprise search
Differences to web crawling:
  Much easier to find the data
  Responding quickly to updates is more important
  Must be conservative in terms of disk and CPU usage
  Many different document formats
  Data privacy very important
Document Feeds
Many documents are published
  created at a fixed time and rarely updated again
  e.g., news articles, blog posts, press releases, email
Published documents from a single source can be ordered in a sequence called a document feed
  new documents found by examining the end of the feed
Document Feeds
Two types:
  A push feed alerts the subscriber to new documents
  A pull feed requires the subscriber to check periodically for new documents
Most common format for pull feeds is called RSS
  Really Simple Syndication, RDF Site Summary, Rich Site Summary, or ...
RSS Example
RSS Example
RSS
ttl tag (time to live)
  amount of time (in minutes) contents should be cached
RSS feeds are accessed like web pages
  using HTTP GET requests to web servers that host them
Easy for crawlers to parse
Easy to find new information
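To show how little parsing is involved, here is a sketch that reads the ttl value and the items from an RSS 2.0 feed. The feed content is hypothetical and parse_feed is a name chosen for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed with a 60-minute time to live.
RSS_XML = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://www.example.com/</link>
    <description>Hypothetical feed</description>
    <ttl>60</ttl>
    <item>
      <title>First story</title>
      <link>http://www.example.com/1.html</link>
      <pubDate>Tue, 15 Jan 2008 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>http://www.example.com/2.html</link>
      <pubDate>Tue, 15 Jan 2008 10:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

def parse_feed(xml_text):
    """Return (ttl in minutes, list of (title, link) pairs)."""
    channel = ET.fromstring(xml_text).find("channel")
    ttl_minutes = int(channel.findtext("ttl", default="0"))
    items = [(item.findtext("title"), item.findtext("link"))
             for item in channel.findall("item")]
    return ttl_minutes, items

ttl_minutes, items = parse_feed(RSS_XML)
```

A crawler could use ttl_minutes as the minimum interval before fetching the feed again.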
Conversion
Text is stored in hundreds of incompatible file formats
  e.g., raw text, RTF, HTML, XML, Microsoft Word, ODF, PDF
Other types of files also important
  e.g., PowerPoint, Excel
Typically use a conversion tool
  converts the document content into a tagged text format such as HTML or XML
  retains some of the important formatting information
Character Encoding
A character encoding is a mapping between bits and glyphs
  i.e., getting from bits in a file to characters on a screen
  Can be a major source of incompatibility
ASCII is basic character encoding scheme for English
  encodes 128 letters, numbers, special characters, and control characters in 7 bits, extended with an extra bit for storage in bytes
Character Encoding
Other languages can have many more glyphs
  e.g., Chinese has more than 40,000 characters, with over 3,000 in common use
Many languages have multiple encoding schemes
  e.g., CJK (Chinese-Japanese-Korean) family of East Asian languages, Hindi, Arabic
  must specify encoding
  can't have multiple languages in one file
Unicode developed to address encoding problems
Unicode
Single mapping from numbers to glyphs that attempts to include all glyphs in common use in all known languages
Unicode is a mapping between numbers and glyphs
  does not uniquely specify bits to glyph mapping!
  e.g., UTF-8, UTF-16, UTF-32
Unicode
Proliferation of encodings comes from a need for compatibility and to save space
  UTF-8 uses one byte for English (ASCII), as many as 4 bytes for some traditional Chinese characters
  variable length encoding, more difficult to do string operations
  UTF-32 uses 4 bytes for every character
Many applications use UTF-32 for internal text encoding (fast random lookup) and UTF-8 for disk storage (less space)
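The space and lookup trade-off can be seen on a short mixed string. The sample text is an arbitrary choice for illustration; the two Chinese characters in it happen to need 3 bytes each in UTF-8.

```python
text = "search 搜索"                 # 7 ASCII characters + 2 Chinese characters

utf8 = text.encode("utf-8")          # variable length: 1 byte or 3 bytes here
utf32 = text.encode("utf-32-be")     # fixed length: 4 bytes per character

def char_at_utf32(data, n):
    """Random lookup in UTF-32: character n always starts at byte 4*n."""
    return data[4 * n: 4 * n + 4].decode("utf-32-be")

eighth_char = char_at_utf32(utf32, 7)
```

In UTF-8 the same lookup requires scanning from the start, because the byte offset of character n depends on the characters before it.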
Unicode
e.g., Greek letter pi (π) is Unicode symbol number 960
  In binary, 00000011 11000000 (3C0 in hexadecimal)
  Final UTF-8 encoding is 11001111 10000000 (CF80 in hexadecimal)
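The pi example can be reproduced by hand. For code points from 128 to 2047, UTF-8 uses two bytes of the form 110xxxxx 10xxxxxx, where the x bits are the 11 low bits of the code point. The helper name utf8_two_byte is chosen for this sketch and only covers that two-byte range.

```python
def utf8_two_byte(code_point):
    """Encode a code point in the range 0x80..0x7FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the two-byte UTF-8 range")
    first = 0b11000000 | (code_point >> 6)          # 110xxxxx: top 5 bits
    second = 0b10000000 | (code_point & 0b111111)   # 10xxxxxx: low 6 bits
    return bytes([first, second])

pi = "\u03c0"                       # Greek small letter pi
code_point = ord(pi)                # 960, or 0x3C0
encoded = utf8_two_byte(code_point)
```

The result matches Python's built-in encoder, pi.encode("utf-8").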
Storing the Documents
Many reasons to store converted document text
  saves crawling time when page is not updated
  provides efficient access to text for snippet generation, information extraction, etc.
Database systems can provide document storage for some applications
  web search engines use customized document storage systems
Storing the Documents
Requirements for document storage system:
  Random access
    request the content of a document based on its URL
    hash function based on URL is typical
  Compression and large files
    reducing storage requirements and efficient access
  Update
    handling large volumes of new and modified documents
    adding new anchor text
Large Files
Store many documents in large files, rather than each document in a file
  avoids overhead in opening and closing files
  reduces seek time relative to read time
Compound documents formats
  used to store multiple documents in a file
  e.g., TREC Web
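A toy version of this idea: documents are appended to one large file and a hash table keyed by URL records where each one starts. The class name is hypothetical, and an in-memory io.BytesIO buffer stands in for the large file on disk.

```python
import io

class DocumentStore:
    def __init__(self):
        self.data = io.BytesIO()   # stands in for one large file on disk
        self.index = {}            # URL -> (offset, length); dict is a hash table

    def add(self, url, text):
        payload = text.encode("utf-8")
        offset = self.data.seek(0, io.SEEK_END)   # always append at the end
        self.data.write(payload)
        self.index[url] = (offset, len(payload))

    def get(self, url):
        offset, length = self.index[url]
        self.data.seek(offset)                    # one seek, then one read
        return self.data.read(length).decode("utf-8")

store = DocumentStore()
store.add("http://example.com/a", "<html>page A</html>")
store.add("http://example.com/b", "<html>page B, with π</html>")
page_b = store.get("http://example.com/b")
page_a = store.get("http://example.com/a")
```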
TREC Web Format
Compression
Text is highly redundant (or predictable)
Compression techniques exploit this redundancy to make files smaller without losing any of the content
Compression of indexes covered later
Popular algorithms can compress HTML and XML text by 80%
  e.g., DEFLATE (zip, gzip) and LZW (UNIX compress, PDF)
  may compress large files in blocks to make access faster
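Python's zlib module implements DEFLATE, so both points can be sketched directly. The sample HTML is artificial and far more repetitive than a real page, so the ratio it produces is not representative; the 1024-byte block size is also an arbitrary assumption.

```python
import zlib

html = ("<html><body>"
        + "<p>Tropical fish include fish found in tropical environments.</p>" * 50
        + "</body></html>").encode("utf-8")

# Whole-file compression: smallest output, but everything must be
# decompressed to read any part of it.
compressed = zlib.compress(html)
restored = zlib.decompress(compressed)

def compress_blocks(data, block_size=1024):
    """Compress fixed-size blocks independently so one block can be read alone."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def read_block(blocks, n):
    """Decompress only block n."""
    return zlib.decompress(blocks[n])

blocks = compress_blocks(html)
second_block = read_block(blocks, 1)
```

Block-wise compression usually gives up some compression in exchange for faster access to a single document.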
BigTable
Google's document storage system
  Customized for storing, finding, and updating web pages
  Handles large collection sizes using inexpensive computers
BigTable
No query language, no complex queries to optimize
Only row-level transactions
Tablets are stored in a replicated file system that is accessible by all BigTable servers
Any changes to a BigTable tablet are recorded to a transaction log, which is also stored in a shared file system
If any tablet server crashes, another server can immediately read the tablet data and transaction log from the file system and take over
BigTable
Logically organized into rows
A row stores data for a single web page
Combination of a row key, a column key, and a timestamp point to a single cell in the row
BigTable
BigTable can have a huge number of columns per row
  all rows have the same column groups
  not all rows have the same columns
  important for reducing disk reads to access document data
Rows are partitioned into tablets based on their row keys
  simplifies determining which server is appropriate
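The addressing scheme and the partitioning can be illustrated with a toy in-memory table. This is not BigTable's API: the class, method names, row keys, and tablet boundaries are all hypothetical, and the sketch only shows that a cell is found from (row key, column key, timestamp) and that a sorted list of tablet start keys is enough to pick the tablet for a row.

```python
import bisect

class TinyTable:
    def __init__(self, tablet_start_keys):
        # Sorted first row key of each tablet; the first tablet starts at "".
        self.starts = sorted(tablet_start_keys)
        self.cells = {}   # (row key, column key) -> {timestamp: value}

    def tablet_for(self, row_key):
        """Index of the tablet whose key range contains row_key."""
        return bisect.bisect_right(self.starts, row_key) - 1

    def put(self, row_key, column_key, timestamp, value):
        self.cells.setdefault((row_key, column_key), {})[timestamp] = value

    def get(self, row_key, column_key, timestamp=None):
        """Return one cell; with no timestamp, return the newest version."""
        versions = self.cells[(row_key, column_key)]
        if timestamp is None:
            timestamp = max(versions)
        return versions[timestamp]

table = TinyTable(["", "m", "t"])            # three tablets
table.put("www.example.com", "text", 1, "old page text")
table.put("www.example.com", "text", 2, "new page text")
table.put("www.example.com", "anchor:other.com", 2, "example link")

latest = table.get("www.example.com", "text")
older = table.get("www.example.com", "text", timestamp=1)
tablet = table.tablet_for("www.example.com")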
Detecting Duplicates
Duplicate and near-duplicate documents occur in many situations
  Copies, versions, plagiarism, spam, mirror sites
  30% of the web pages in a large crawl are exact or near duplicates of pages in the other 70%
Duplicates consume significant resources during crawling, indexing, and search
  Little value to most users
Duplicate Detection
Exact duplicate detection is relatively easy
Checksum techniques
  A checksum is a value that is computed based on the content of the document
    e.g., sum of the bytes in the document file
  Possible for files with different text to have same checksum
Functions such as a cyclic redundancy check (CRC) have been developed that consider the positions of the bytes
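The weakness of a plain byte sum is easy to demonstrate: swapping two characters changes the text but not the sum, while a CRC (here zlib.crc32 from the standard library) does change. The two sample strings are arbitrary.

```python
import zlib

def byte_sum(data):
    """Simple checksum: sum of the bytes, ignoring their positions."""
    return sum(data) % 2**32

a = b"Tropical fish"
b = b"Tropical fihs"      # same bytes, two of them swapped

same_sum = byte_sum(a) == byte_sum(b)
same_crc = zlib.crc32(a) == zlib.crc32(b)
```

A CRC can still collide for unrelated documents, so matching checksums are normally treated as a strong hint rather than proof of an exact duplicate.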
Near-Duplicate Detection
More challenging task
  Are web pages with same text content but different advertising or format near-duplicates?
A near-duplicate document is defined using a threshold value for some similarity measure between pairs of documents
  e.g., document D1 is a near-duplicate of document D2 if more than 90% of the words in the documents are the same
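One possible reading of "more than 90% of the words are the same" is the overlap between the two documents' word sets (Jaccard similarity). That choice of measure is an assumption made for this sketch, and the documents are synthetic.

```python
def word_overlap(d1, d2):
    """Shared distinct words divided by all distinct words in either document."""
    w1, w2 = set(d1.lower().split()), set(d2.lower().split())
    if not w1 and not w2:
        return 1.0
    return len(w1 & w2) / len(w1 | w2)

def is_near_duplicate(d1, d2, threshold=0.9):
    return word_overlap(d1, d2) > threshold

# Synthetic documents: doc2 is doc1 plus one extra word; doc3 is half of doc1.
doc1 = " ".join(f"term{i}" for i in range(20))
doc2 = doc1 + " advertisement"
doc3 = " ".join(f"term{i}" for i in range(10))

near = is_near_duplicate(doc1, doc2)   # overlap is 20/21
far = is_near_duplicate(doc1, doc3)    # overlap is 10/20
```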
Near-Duplicate Detection
Search:
  find near-duplicates of a document D
  O(N) comparisons required
Discovery:
  find all pairs of near-duplicate documents in the collection
  O(N^2) comparisons
IR techniques are effective for search scenario
For discovery, other techniques used to generate compact representations
Fingerprints
Fingerprint Example
Simhash
Similarity comparisons using word-based representations more effective at finding near-duplicates
  Problem is efficiency
Simhash combines the advantages of the word-based similarity measures with the efficiency of fingerprints based on hashing
Similarity of two pages as measured by the cosine correlation measure is proportional to the number of bits that are the same in the simhash fingerprints
Simhash
Simhash Example
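A compact sketch of a simhash fingerprint follows. It assumes the commonly described procedure (weight each word by its frequency, hash each word to b bits, add the weight for 1 bits and subtract it for 0 bits, then keep the sign of each total), with MD5 as the word hash and 64-bit fingerprints; these choices are assumptions for illustration and are not taken from the slides.

```python
import hashlib
from collections import Counter

def simhash(text, bits=64):
    """Return a simhash fingerprint; bits must be a multiple of 8, at most 128."""
    weights = Counter(text.lower().split())       # word -> frequency weight
    totals = [0] * bits
    for word, weight in weights.items():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            if (h >> i) & 1:
                totals[i] += weight               # bit is 1: add the weight
            else:
                totals[i] -= weight               # bit is 0: subtract it
    fingerprint = 0
    for i in range(bits):
        if totals[i] > 0:                         # positive total -> bit 1
            fingerprint |= 1 << i
    return fingerprint

def matching_bits(f1, f2, bits=64):
    """Number of bit positions where two fingerprints agree."""
    return bits - bin(f1 ^ f2).count("1")

f1 = simhash("tropical fish include fish found in tropical environments")
f2 = simhash("fish found in tropical environments include tropical fish")
```

Because only word counts are used, the two sample sentences above, which contain the same words in a different order, get the same fingerprint.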
Removing Noise
Many web pages contain text, links, and pictures that are not directly related to the main content of the page
This additional material is mostly noise that could negatively affect the ranking of the page
Techniques have been developed to detect the content blocks in a web page
  Non-content material is either ignored or reduced in importance in the indexing process
Noise Example
Finding Content Blocks
Cumulative distribution of tags in the example web page
  Main text content of the page corresponds to the plateau in the middle of the distribution
Represent a web page as a sequence of bits, where b_n = 1 indicates that the nth token is a tag
Optimization problem where we find values of i and j to maximize both the number of tags below i and above j and the number of non-tag tokens between i and j
  i.e., maximize the sum of these three counts
Other approaches use DOM structure and visual (layout) features
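The optimization can be sketched with a brute-force search over all (i, j) pairs. The objective is written here as the sum over n < i of b_n, plus the sum over i ≤ n ≤ j of (1 − b_n), plus the sum over n > j of b_n; the exact index boundaries are an assumption, and the bit sequence is a made-up example.

```python
def find_content_block(bits):
    """Return (score, i, j) maximizing tags outside [i, j] plus non-tags inside."""
    n = len(bits)
    # prefix[k] = number of tag tokens among the first k tokens
    prefix = [0]
    for b in bits:
        prefix.append(prefix[-1] + b)
    best = (-1, 0, 0)
    for i in range(n):
        for j in range(i, n):
            tags_below = prefix[i]
            tags_above = prefix[n] - prefix[j + 1]
            non_tags_between = (j - i + 1) - (prefix[j + 1] - prefix[i])
            score = tags_below + tags_above + non_tags_between
            if score > best[0]:
                best = (score, i, j)
    return best

# 1 = tag token, 0 = text token. Tags cluster at both ends; the single
# tag at position 5 sits inside the main text.
bits = [1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
score, i, j = find_content_block(bits)
```

For this sequence the best block runs from token 3 to token 8, which corresponds to the plateau in the cumulative tag count.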
