Information Retrieval in Practice
All slides Addison Wesley, 2008
Web Crawler
Finds and downloads web pages automatically
provides the collection for searching
Retrieving Web Pages
Every page has a unique uniform resource locator (URL)
Web pages are stored on web servers that use HTTP to exchange information with client software
e.g.,
Retrieving Web Pages
Web crawler client program connects to a domain name system (DNS) server
DNS server translates the hostname into an internet protocol (IP) address
Crawler then attempts to connect to server host using specific port
After connection, crawler sends an HTTP request to the web server to request a page
usually a GET request
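As a rough illustration of the retrieval steps, the following Python sketch splits a URL into host, port, and path and builds the text of a GET request. The function name `build_get_request` and the example URL are hypothetical, and the DNS lookup and socket connection appear only in comments so that the sketch runs without network access.

```python
from urllib.parse import urlsplit

def build_get_request(url):
    """Return (host, port, request_text) for a plain HTTP GET of `url`."""
    parts = urlsplit(url)
    host = parts.hostname
    # Default ports: 80 for http, 443 for https.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return host, port, request

host, port, request = build_get_request("http://www.example.com/index.html")
# A real crawler would now resolve the host name and connect, e.g.:
#   ip = socket.gethostbyname(host)              # DNS lookup
#   sock = socket.create_connection((ip, port))  # connect to the server port
#   sock.sendall(request.encode("ascii"))        # send the GET request
print(host, port)
print(request)
```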
Crawling the Web
Web Crawler
Starts with a set of seeds, which are a set of URLs given to it as parameters
Seeds are added to a URL request queue
Crawler starts fetching pages from the request queue
Downloaded pages are parsed to find link tags that might contain other useful URLs to fetch
New URLs added to the crawler's request queue, or frontier
Continue until no more new URLs or disk full
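The loop described here can be sketched in a few lines of Python. This is a minimal single-threaded illustration, not the crawler from the slides: the names `crawl`, `LinkParser`, and `FAKE_WEB` are hypothetical, and a small in-memory dictionary stands in for the web so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collects the href values of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, fetch):
    """Breadth-first crawl; `fetch(url)` returns HTML text or None."""
    frontier = deque(seeds)   # the URL request queue
    seen = set(seeds)         # URLs already queued, to avoid refetching
    pages = {}
    while frontier:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages

# Hypothetical three-page site standing in for the web.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a>',
    "http://example.com/b.html": '<a href="/a.html">A</a>',
}
pages = crawl(["http://example.com/"], FAKE_WEB.get)
print(sorted(pages))
```

The `seen` set is what makes the crawl terminate once no new URLs are discovered, even though the pages link back to each other.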
Web Crawling
Web crawlers spend a lot of time waiting for responses to requests
To reduce this inefficiency, web crawlers use threads and fetch hundreds of pages at once
Crawlers could potentially flood sites with requests for pages
To avoid this problem, web crawlers use politeness policies
e.g., delay between requests to same web server
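One simple way to implement such a delay is to remember, per host, the earliest time at which the next request may be sent. The sketch below is an assumption about how this could look; the class name `PolitenessPolicy` and the 5-second delay are illustrative, and time is passed in explicitly so the behavior is easy to check.

```python
from urllib.parse import urlsplit

class PolitenessPolicy:
    """Tracks, per host, the earliest time the next request may be sent."""
    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self.next_allowed = {}   # host -> earliest allowed request time

    def wait_time(self, url, now):
        """Seconds to wait before fetching `url` at time `now`."""
        host = urlsplit(url).hostname
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_request(self, url, now):
        """Note that a request to the URL's host was sent at time `now`."""
        host = urlsplit(url).hostname
        self.next_allowed[host] = now + self.delay

policy = PolitenessPolicy(delay_seconds=5.0)
policy.record_request("http://example.com/a.html", now=100.0)
same_host_wait = policy.wait_time("http://example.com/b.html", now=102.0)
other_host_wait = policy.wait_time("http://example.org/", now=102.0)
print(same_host_wait, other_host_wait)
```

In a threaded crawler, `now` would typically come from `time.monotonic()`, and a thread that has to wait for one host can fetch from a different host in the meantime.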
Controlling Crawling
Even crawling a site slowly will anger some web server administrators, who object to any copying of their data
Robots.txt file can be used to control crawlers
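Python's standard library includes a robots.txt parser, so a crawler can check a URL before fetching it. The robots.txt content, crawler names, and URLs below are hypothetical examples; the file is parsed from a string so no request is made.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone is kept out of two directories,
# except a crawler named FavoredCrawler, which may fetch anything.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /confidential/

User-agent: FavoredCrawler
Disallow:
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

blocked = rp.can_fetch("SomeCrawler", "http://example.com/private/data.html")
allowed = rp.can_fetch("SomeCrawler", "http://example.com/index.html")
favored = rp.can_fetch("FavoredCrawler", "http://example.com/private/data.html")
print(blocked, allowed, favored)
```

Against a live site, `rp.set_url(...)` followed by `rp.read()` would download the file instead of parsing a string.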
Simple Crawler Thread
Freshness
Web pages are constantly being added, deleted, and modified
Web crawler must continually revisit pages it has already crawled to see if they have changed in order to maintain the freshness of the document collection
stale copies no longer reflect the real contents of the web pages
Freshness
HTTP protocol has a special request type called HEAD that makes it easy to check for page changes
returns information about page, not page itself
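A crawler might use the `Last-Modified` header returned by a HEAD request to decide whether to refetch a page. The sketch below builds the request without sending it and applies the comparison to sample headers. The helper names and dates are hypothetical, and treating a missing header as "changed" is an assumption rather than something the slides specify.

```python
from email.utils import parsedate_to_datetime
from urllib.request import Request

def head_request(url):
    """Build (but do not send) a HEAD request for `url`."""
    return Request(url, method="HEAD")

def has_changed(headers, last_crawled):
    """True if the Last-Modified header is later than our last crawl.

    Pages without a Last-Modified header are treated as changed.
    """
    value = headers.get("Last-Modified")
    if value is None:
        return True
    return parsedate_to_datetime(value) > last_crawled

req = head_request("http://www.example.com/index.html")
# Sending it would look like: urllib.request.urlopen(req).headers

last_crawled = parsedate_to_datetime("Mon, 01 Jan 2024 00:00:00 GMT")
sample_headers = {"Last-Modified": "Tue, 02 Jan 2024 12:30:00 GMT"}
print(req.get_method(), has_changed(sample_headers, last_crawled))
```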
Freshness
Not possible to constantly check all pages
must check important pages and pages that change frequently
Freshness vs. Age
Age
Expected age of a page t days after it was last crawled:
Web page updates follow the Poisson distribution on average
time until the next update is governed by an exponential distribution
Age
Older a page gets, the more it costs not to crawl it
e.g., expected age with mean change frequency λ = 1/7 (one change per week)
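Under the exponential assumption, the expected age t days after the last crawl can be written as the integral of λ e^(-λx) (t - x) over x from 0 to t, which has the closed form t - (1 - e^(-λt)) / λ. This expression is a reconstruction from the stated Poisson and exponential assumptions, not a formula copied from the slides. The Python sketch below evaluates the closed form and checks it against a numerical integration; the function names are hypothetical.

```python
import math

def expected_age(lam, t):
    """Closed form of the integral of lam*exp(-lam*x)*(t - x) for x in [0, t]."""
    return t - (1.0 - math.exp(-lam * t)) / lam

def expected_age_numeric(lam, t, steps=100_000):
    """Midpoint-rule approximation of the same integral, as a sanity check."""
    dx = t / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * dx
        total += lam * math.exp(-lam * x) * (t - x) * dx
    return total

lam = 1 / 7  # one change per week on average
for days in (1, 7, 14):
    print(days, round(expected_age(lam, days), 2))
```

With λ = 1/7, a page crawled a week ago has an expected age of 7/e, roughly 2.6 days, and the value grows faster the longer the page goes uncrawled.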
Focused Crawling
Attempts to download only those pages that are about a particular topic
used by vertical search applications
Rely on the fact that pages about a topic tend to have links to other pages on the same topic
popular pages for a topic are typically used as seeds
Crawler uses text classifier to decide whether a page is on topic
Deep Web
Sites that are difficult for a crawler to find are collectively referred to as the deep (or hidden) Web
much larger than conventional Web
Three broad categories:
private sites
no incoming links, or may require log in with a valid account
form results
sites that can be reached only after entering some data into a form
scripted pages
pages that use JavaScript, Flash, or another client-side language to generate links
Sitemaps
Sitemaps contain lists of URLs and data about those URLs, such as modification time and modification frequency
Generated by web server administrators
Tells crawler about pages it might not otherwise find
Gives crawler a hint about when to check a page for changes
Sitemap Example
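The original example on this slide did not survive extraction. As a stand-in, the sketch below parses a small hypothetical sitemap with Python's standard XML library and pulls out the URL, modification time, and change frequency for each entry. The URLs and values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap in the sitemaps.org 0.9 format.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2008-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>http://www.example.com/items?item=truck</loc>
    <changefreq>weekly</changefreq>
  </url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text):
    """Return a list of dicts with loc, lastmod, and changefreq per URL."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=NS),
        })
    return entries

entries = parse_sitemap(SITEMAP_XML)
for entry in entries:
    print(entry)
```

The second entry shows the kind of page a crawler might not otherwise find: a URL that is normally reached only through a form.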
Distributed Crawling
Three reasons to use multiple computers for crawling
Helps to put the crawler closer to the sites it crawls
Reduces the number of sites the crawler has to remember
Reduces computing resources required
Distributed crawler uses a hash function to assign URLs to crawling computers
hash function should be computed on the host part of each URL
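A minimal sketch of host-based assignment follows. The choice of SHA-256 and the function name `assign_machine` are assumptions made for illustration; any hash that is stable across machines would serve. Python's built-in `hash()` is avoided because string hashes are randomized per process.

```python
import hashlib
from urllib.parse import urlsplit

def assign_machine(url, num_machines):
    """Map a URL to a crawler index using a hash of its host part only."""
    host = urlsplit(url).hostname
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines

urls = [
    "http://www.example.com/index.html",
    "http://www.example.com/about/team.html",
    "http://www.example.org/",
]
for u in urls:
    print(assign_machine(u, 4), u)
```

Hashing only the host sends every URL from one site to the same machine, which keeps politeness bookkeeping for that site in one place.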
Desktop Crawls
Used for desktop search and enterprise search
Differences to web crawling:
Much easier to find the data
Responding quickly to updates is more important
Must be conservative in terms of disk and CPU usage
Many different document formats
Data privacy very important
Document Feeds
Many documents are published
created at a fixed time and rarely updated again
e.g., news articles, blog posts, press releases, email
Document Feeds
Two types:
A push feed alerts the subscriber to new documents
A pull feed requires the subscriber to check periodically for new documents
Most common format for pull feeds is called RSS
Really Simple Syndication, RDF Site Summary, Rich Site Summary, or ...
RSS Example
RSS
ttl tag (time to live)
amount of time (in minutes) contents should be cached
RSS feeds are accessed like web pages
using HTTP GET requests to web servers that host them
Easy for crawlers to parse
Easy to find new information
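To show how little work parsing a feed takes, the sketch below reads the `ttl` value and the item list from a small hypothetical RSS 2.0 feed. The feed content and the function name `parse_feed` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed with a 60-minute time to live.
RSS_XML = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://www.example.com/</link>
    <ttl>60</ttl>
    <item>
      <title>First story</title>
      <link>http://www.example.com/news/1</link>
      <pubDate>Tue, 19 Jun 2007 05:17:00 GMT</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>http://www.example.com/news/2</link>
      <pubDate>Wed, 20 Jun 2007 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>
"""

def parse_feed(xml_text):
    """Return (ttl_minutes, [(title, link), ...]) for an RSS 2.0 feed."""
    channel = ET.fromstring(xml_text).find("channel")
    ttl_text = channel.findtext("ttl")
    ttl = int(ttl_text) if ttl_text is not None else None
    items = [(i.findtext("title"), i.findtext("link"))
             for i in channel.findall("item")]
    return ttl, items

ttl, items = parse_feed(RSS_XML)
print(ttl, items)
```

A crawler could use the `ttl` value as a lower bound on how often to poll the feed, and the item links as new URLs for its request queue.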
Conversion
Text is stored in hundreds of incompatible file formats
e.g., raw text, RTF, HTML, XML, Microsoft Word, ODF, PDF
Other types of files also important
e.g., PowerPoint, Excel
Typically use a conversion tool
converts the document content into a tagged text format such as HTML or XML
retains some of the important formatting information
Character Encoding
A character encoding is a mapping between bits and glyphs
i.e., getting from bits in a file to characters on a screen
Can be a major source of incompatibility
ASCII is basic character encoding scheme for English
encodes 128 letters, numbers, special characters, and control characters in 7 bits, extended with an extra bit for storage in bytes
Character Encoding
Other languages can have many more glyphs
e.g., Chinese has more than 40,000 characters, with over 3,000 in common use
Many languages have multiple encoding schemes
e.g., CJK (Chinese-Japanese-Korean) family of East Asian languages, Hindi, Arabic
must specify encoding
can't have multiple languages in one file
Unicode developed to address encoding problems
Unicode
Single mapping from numbers to glyphs that attempts to include all glyphs in common use in all known languages
Unicode is a mapping between numbers and glyphs
does not uniquely specify bits to glyph mapping!
e.g., UTF-8, UTF-16, UTF-32
Unicode
Proliferation of encodings comes from a need for compatibility and to save space
UTF-8 uses one byte for English (ASCII), as many as 4 bytes for some traditional Chinese characters
variable-length encoding, more difficult to do string operations
UTF-32 uses 4 bytes for every character
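The difference between the encodings is easy to observe by encoding a few characters and counting bytes. The sample characters below are chosen for illustration; the last one is a CJK ideograph outside the Basic Multilingual Plane, which is the case that needs 4 bytes in UTF-8.

```python
def byte_lengths(ch):
    """Return the encoded size of `ch` in (UTF-8, UTF-16, UTF-32) bytes."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),   # -le avoids the byte-order mark
        len(ch.encode("utf-32-le")),
    )

samples = ["A", "\u00e9", "\u4e2d", "\U00020000"]
for ch in samples:
    print(f"U+{ord(ch):04X}", byte_lengths(ch))
```

Because UTF-8 and UTF-16 use a varying number of bytes per character, finding the nth character requires scanning from the start of the string, whereas in UTF-32 it is a fixed offset.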
Unicode
Storing the Documents
Many reasons to store converted document text
saves crawling time when page is not updated
provides efficient access to text for snippet generation, information extraction, etc.
Database systems can provide document storage for some applications
web search engines use customized document storage systems
Storing the Documents
Requirements for document storage system:
Random access
request the content of a document based on its URL
hash function based on URL is typical
Compression and large files
reducing storage requirements and efficient access
Update
handling large volumes of new and modified documents
adding new anchor text
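These three requirements can be illustrated together in a toy store. The sketch below is an assumption about one possible design, not a description of any production system: documents are compressed, appended to a single file-like object, and located through an index keyed by a hash of the URL. The class name `DocumentStore` and the use of SHA-256 and zlib are illustrative choices.

```python
import hashlib
import io
import zlib

class DocumentStore:
    """Appends compressed documents to one large file-like object."""
    def __init__(self):
        self.data = io.BytesIO()   # stands in for a large file on disk
        self.index = {}            # URL hash -> (offset, length)

    @staticmethod
    def _key(url):
        return hashlib.sha256(url.encode("utf-8")).hexdigest()

    def put(self, url, text):
        """Store `text` for `url`; a newer version replaces the index entry."""
        blob = zlib.compress(text.encode("utf-8"))
        offset = self.data.seek(0, io.SEEK_END)
        self.data.write(blob)
        self.index[self._key(url)] = (offset, len(blob))

    def get(self, url):
        """Random access: return the latest stored text for `url`, or None."""
        entry = self.index.get(self._key(url))
        if entry is None:
            return None
        offset, length = entry
        self.data.seek(offset)
        return zlib.decompress(self.data.read(length)).decode("utf-8")

store = DocumentStore()
store.put("http://example.com/", "<html>first version</html>")
store.put("http://example.com/a", "<html>page a</html>")
store.put("http://example.com/", "<html>second version</html>")
print(store.get("http://example.com/"))
```

Updates here are append-only, so old versions remain in the file as dead space; a real system would need some form of compaction.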
Large Files
Store many documents in large files, rather than each document in a file
avoids overhead in opening and closing files
reduces seek time relative to read time
Compound document formats
used to store multiple documents in a file
e.g., TREC Web
TREC Web Format
Compression
Text is highly redundant (or predictable)
Compression techniques exploit this redundancy to make files smaller without losing any of the content
Compression of indexes covered later
Popular algorithms can compress HTML and XML text by 80%
e.g., DEFLATE (zip, gzip) and LZW (UNIX compress, PDF)
may compress large files in blocks to make access faster
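Block-wise compression can be sketched with Python's `zlib` module, which implements DEFLATE. The function names and the 4096-byte block size are illustrative assumptions. The point is that reading a byte range only requires decompressing the blocks that overlap it.

```python
import zlib

def compress_blocks(data, block_size):
    """Compress `data` in independent blocks of `block_size` bytes."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def read_byte_range(blocks, block_size, start, end):
    """Return data[start:end] (end > start), touching only the needed blocks."""
    first, last = start // block_size, (end - 1) // block_size
    chunk = b"".join(zlib.decompress(b) for b in blocks[first:last + 1])
    offset = first * block_size
    return chunk[start - offset:end - offset]

# Artificially repetitive sample text, so it compresses better than real pages.
data = ("<p>Tropical fish include fish found in tropical "
        "environments around the world.</p>\n" * 500).encode("utf-8")
blocks = compress_blocks(data, 4096)
compressed_size = sum(len(b) for b in blocks)
print(len(data), compressed_size, len(blocks))
```

Smaller blocks make random access cheaper but give the compressor less context, so the overall compression ratio drops; the block size is a trade-off.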
BigTable
Google's document storage system
Customized for storing, finding, and updating web pages
Handles large collection sizes using inexpensive computers
BigTable
No query language, no complex queries to optimize
Only row-level transactions
Tablets are stored in a replicated file system that is accessible by all BigTable servers
Any changes to a BigTable tablet are recorded to a transaction log, which is also stored in a shared file system
If any tablet server crashes, another server can immediately read the tablet data and transaction log from the file system and take over
BigTable
Logically organized into rows
A row stores data for a single web page
Combination of a row key, a column key, and a timestamp point to a single cell in the row
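The addressing scheme can be modeled with nested dictionaries. This is a toy in-memory illustration of the (row key, column key, timestamp) lookup only, not BigTable's actual API or storage layout; the class name `TinyTable` and the row and column names are hypothetical.

```python
class TinyTable:
    """Toy model: (row key, column key, timestamp) -> cell value."""
    def __init__(self):
        self.rows = {}   # row key -> {column key -> {timestamp -> value}}

    def put(self, row, column, timestamp, value):
        self.rows.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the value at `timestamp`, or the latest one if omitted."""
        versions = self.rows.get(row, {}).get(column, {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)

table = TinyTable()
table.put("www.example.com", "text", 1, "<html>old</html>")
table.put("www.example.com", "text", 2, "<html>new</html>")
table.put("www.example.com", "anchor:other.example", 2, "example link text")
print(table.get("www.example.com", "text"))
print(table.get("www.example.com", "text", timestamp=1))
```

Rows need not share the same columns: a page with no incoming links simply has no anchor columns.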
BigTable
BigTable can have a huge number of columns per row
all rows have the same column groups
not all rows have the same columns
important for reducing disk reads to access document data
Rows are partitioned into tablets based on their row keys
simplifies determining which server is appropriate
Detecting Duplicates
Duplicate and near-duplicate documents occur in many situations
Copies, versions, plagiarism, spam, mirror sites
30% of the web pages in a large crawl are exact or near duplicates of pages in the other 70%
Duplicates consume significant resources during crawling, indexing, and search
Little value to most users
Duplicate Detection
Exact duplicate detection is relatively easy
Checksum techniques
A checksum is a value that is computed based on the content of the document
e.g., sum of the bytes in the document file
Possible for files with different text to have same checksum
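The collision problem is easy to demonstrate: a byte-sum checksum ignores the order of the bytes, so any rearrangement of the same text produces the same value. The sample strings below are chosen for illustration, and CRC-32 (via `zlib.crc32`) is shown only as one example of a checksum that also depends on byte positions.

```python
import zlib

def byte_sum_checksum(data):
    """Sum of the bytes in `data`; the result ignores byte order."""
    return sum(data)

a = b"Tropical fish"
b = b"fish Tropical"   # same bytes as `a`, in a different order

print(byte_sum_checksum(a), byte_sum_checksum(b))
# CRC-32 takes byte positions into account as well as byte values.
print(zlib.crc32(a), zlib.crc32(b))
```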
Near-Duplicate Detection
More challenging task
Are web pages with same text content but different advertising or format near-duplicates?
Near-Duplicate Detection
Search:
find near-duplicates of a document D
O(N) comparisons required
Discovery:
find all pairs of near-duplicate documents in the collection
O(N^2) comparisons
Fingerprints
Fingerprint Example
Simhash
Similarity comparisons using word-based representations more effective at finding near-duplicates
Problem is efficiency
Simhash
Simhash Example
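The slide content for the simhash procedure did not survive extraction, so the following is a sketch of simhash as it is commonly described, not a reproduction of the original example. Each word is weighted by its frequency and hashed to a fixed number of bits; each bit position accumulates the weight positively for a 1 bit and negatively for a 0 bit, and the sign of each total gives one fingerprint bit. The use of SHA-256 as the word hash, the 64-bit width, and the function names are assumptions.

```python
import hashlib
from collections import Counter

def simhash(text, bits=64):
    """Compute a `bits`-bit simhash fingerprint from word frequencies.

    `bits` must be a multiple of 8 and at most 256 for this sketch.
    """
    weights = Counter(text.lower().split())
    totals = [0] * bits
    for word, weight in weights.items():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            if (h >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

doc1 = "tropical fish include fish found in tropical environments"
doc2 = "tropical fish include fish found in tropical waters"
doc3 = "the stock market closed higher on strong earnings reports"
print(hamming_distance(simhash(doc1), simhash(doc2)))
print(hamming_distance(simhash(doc1), simhash(doc3)))
```

Documents whose fingerprints differ in only a few bits are treated as near-duplicate candidates, which replaces expensive word-by-word comparisons with a cheap bit comparison.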
Removing Noise
Many web pages contain text, links, and pictures that are not directly related to the main content of the page
This additional material is mostly noise that could negatively affect the ranking of the page
Techniques have been developed to detect the content blocks in a web page
Non-content material is either ignored or reduced in importance in the indexing process
Noise Example
Finding Content Blocks
Cumulative distribution of tags in the example web page
Main text content of the page corresponds to the plateau in the middle of the distribution
Finding Content Blocks
Represent a web page as a sequence of bits, where b_n = 1 indicates that the nth token is a tag
Optimization problem where we find values of i and j to maximize both the number of tags below i and above j and the number of non-tag tokens between i and j
i.e., maximize
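The formula itself is missing from the extracted text. Reconstructed from the description, for a page of N tokens the quantity to maximize would be $\sum_{n=0}^{i-1} b_n + \sum_{n=i}^{j} (1 - b_n) + \sum_{n=j+1}^{N-1} b_n$. The Python sketch below finds the best i and j by brute force over all pairs, using prefix sums so each candidate is scored in constant time. The function name and the sample bit sequence are hypothetical.

```python
def find_content_block(bits):
    """Return (i, j, score) maximizing tags outside [i, j] plus non-tags inside.

    `bits[n]` is 1 when token n is a tag and 0 otherwise.
    """
    n = len(bits)
    prefix = [0]
    for b in bits:
        prefix.append(prefix[-1] + b)   # prefix[k] = number of tags in bits[:k]
    total_tags = prefix[n]
    best = (0, 0, -1)
    for i in range(n):
        for j in range(i, n):
            tags_inside = prefix[j + 1] - prefix[i]
            non_tags_inside = (j - i + 1) - tags_inside
            score = (total_tags - tags_inside) + non_tags_inside
            if score > best[2]:
                best = (i, j, score)
    return best

# Tags at both ends, mostly text in the middle, with one inline tag.
tokens = [1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
result = find_content_block(tokens)
print(result)  # (3, 8, 10)
```

The chosen block spans tokens 3 through 8: it tolerates the single tag inside the text because excluding it would lose more non-tag tokens than it gains.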
Finding Content Blocks
Other approaches use DOM structure and visual (layout) features