
Search Engines

Information Retrieval in Practice

All slides © Addison Wesley, 2008

Web Crawler
Finds and downloads web pages automatically
provides the collection for searching

Web is huge and constantly growing
Web is not under the control of search engine providers
Web pages are constantly changing
Crawlers also used for other types of data

Retrieving Web Pages
Every page has a unique uniform resource locator (URL)
Web pages are stored on web servers that use HTTP to exchange information with client software

Retrieving Web Pages
Web crawler client program connects to a domain name system (DNS) server
DNS server translates the hostname into an internet protocol (IP) address
Crawler then attempts to connect to server host using specific port
After connection, crawler sends an HTTP request to the web server to request a page
usually a GET request
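
The retrieval steps above (DNS lookup, connection to a port, GET request) can be sketched as follows. This is a minimal illustration, not a production client: the function names build_get_request and fetch are made up for this example, only plain HTTP on port 80 is assumed, and redirects, HTTPS, and error handling are left out.

```python
import socket
from urllib.parse import urlsplit


def build_get_request(url):
    """Return (host, port, request_bytes) for a plain-HTTP GET of `url`."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80          # default HTTP port when none is given
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    return host, port, request


def fetch(url, timeout=10.0):
    """Resolve the host, connect, send the GET, and return the raw response."""
    host, port, request = build_get_request(url)
    # create_connection resolves the hostname (DNS) and opens a TCP socket.
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(request)
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:             # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling fetch on a URL needs network access; build_get_request can be inspected offline to see exactly what the crawler would send.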

Crawling the Web

Web Crawler
Starts with a set of seeds, which are a set of URLs given to it as parameters
Seeds are added to a URL request queue
Crawler starts fetching pages from the request queue
Downloaded pages are parsed to find link tags that might contain other useful URLs to fetch
New URLs added to the crawler's request queue, or frontier
Continue until no more new URLs or disk full
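
The seed, frontier, fetch, and parse cycle described above might look roughly like this single-threaded sketch. The names crawl and LinkExtractor are illustrative, the fetch argument is an assumed callable that returns HTML text (or None on failure), and a max_pages limit stands in for the "disk full" stopping condition.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=1000):
    """Breadth-first crawl starting from `seeds`; returns {url: html}."""
    frontier = deque(seeds)          # the URL request queue
    seen = set(seeds)                # URLs already queued, to avoid repeats
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:             # fetch failed; skip this URL
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)    # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

Passing fetch as a parameter keeps the loop testable: a dictionary lookup can stand in for the network during experiments.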

Web Crawling
Web crawlers spend a lot of time waiting for responses to requests
To reduce this inefficiency, web crawlers use threads and fetch hundreds of pages at once
Crawlers could potentially flood sites with requests for pages
To avoid this problem, web crawlers use politeness policies
e.g., delay between requests to same web server
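
One possible way to implement the per-server delay is to remember, for each host, the earliest time the next request is allowed. The class name PolitenessPolicy, the 5-second default, and the injectable clock are assumptions made for this sketch.

```python
import time
from urllib.parse import urlsplit


class PolitenessPolicy:
    """Tracks when each host may next be contacted."""

    def __init__(self, delay_seconds=5.0, clock=time.monotonic):
        self.delay = delay_seconds
        self.clock = clock               # injectable so tests can fake time
        self.next_allowed = {}           # host -> earliest time of next request

    def wait_time(self, url):
        """Seconds the crawler should still wait before requesting `url`."""
        host = urlsplit(url).hostname
        return max(0.0, self.next_allowed.get(host, 0.0) - self.clock())

    def record_request(self, url):
        """Call right after sending a request to the host of `url`."""
        host = urlsplit(url).hostname
        self.next_allowed[host] = self.clock() + self.delay
```

A crawler thread would check wait_time before each fetch and either sleep or pick a URL from a different host, so that many threads can stay busy without hitting one server repeatedly.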

Controlling Crawling
Even crawling a site slowly will anger some web server administrators, who object to any copying of their data
Robots.txt file can be used to control crawlers
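
A crawler can check robots.txt rules with Python's standard urllib.robotparser module, as in the sketch below. The robots.txt content, the crawler names SomeBot and FavoredCrawler, and the helper make_checker are illustrative assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone is kept out of two directories,
# except a crawler named FavoredCrawler, which may fetch anything.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /confidential/

User-agent: FavoredCrawler
Disallow:
"""


def make_checker(robots_text):
    """Parse robots.txt text and return a parser that answers can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


checker = make_checker(ROBOTS_TXT)
allowed = checker.can_fetch("SomeBot", "http://www.example.com/index.html")
blocked = checker.can_fetch("SomeBot", "http://www.example.com/private/data.html")
```

In a live crawler the text would come from fetching /robots.txt on each host (RobotFileParser can also do this itself via set_url and read), and the result would be cached per host.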

Simple Crawler Thread

Freshness
Web pages are constantly being added, deleted, and modified
Web crawler must continually revisit pages it has already crawled to see if they have changed in order to maintain the freshness of the document collection
stale copies no longer reflect the real contents of the web pages

Freshness
HTTP protocol has a special request type called HEAD that makes it easy to check for page changes
returns information about page, not page itself
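
As a rough sketch of a HEAD-based freshness check, the code below builds a HEAD request and compares the Last-Modified response header with the time of the last crawl. The helper names are hypothetical, the headers are assumed to be already parsed into a dictionary, and treating a missing Last-Modified header as "changed" is a design assumption.

```python
from email.utils import parsedate_to_datetime


def build_head_request(host, path="/"):
    """Return the bytes of a HEAD request; the server replies with headers only."""
    return (
        f"HEAD {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


def changed_since(headers, last_crawl_http_date):
    """True if the page was modified after `last_crawl_http_date`.

    `headers` maps header names to values from the HEAD response;
    dates use the HTTP date format, e.g. "Thu, 03 Apr 2008 05:17:54 GMT".
    """
    last_modified = headers.get("Last-Modified")
    if last_modified is None:
        return True                  # cannot tell, so assume it changed
    return (parsedate_to_datetime(last_modified)
            > parsedate_to_datetime(last_crawl_http_date))
```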

Freshness
Not possible to constantly check all pages
must check important pages and pages that change frequently

Freshness is the proportion of pages that are fresh
Optimizing for this metric can lead to bad decisions, such as not crawling popular sites
Age is a better metric

Freshness vs. Age

Age
Expected age of a page t days after it was last crawled:

Web page updates follow the Poisson distribution on average
time until the next update is governed by an exponential distribution

Age
Older a page gets, the more it costs not to crawl it
e.g., expected age with mean change frequency λ = 1/7 (one change per week)
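
Under the exponential model mentioned above, the expected age is commonly written as Age(λ, t) = ∫ from 0 to t of λ e^(−λx) (t − x) dx, which integrates to t − (1 − e^(−λt)) / λ. The sketch below assumes that form (the formula itself is not preserved in these slides) and checks the closed form against a simple numerical integration; the function names are illustrative.

```python
import math


def expected_age(lam, t):
    """Closed form of the integral of lam*exp(-lam*x)*(t - x) for x in [0, t]."""
    return t - (1.0 - math.exp(-lam * t)) / lam


def expected_age_numeric(lam, t, steps=10000):
    """Midpoint-rule approximation of the same integral, as a cross-check."""
    dx = t / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * dx
        total += lam * math.exp(-lam * x) * (t - x) * dx
    return total


# One change per week on average, page last crawled a week ago.
age_after_week = expected_age(1 / 7, 7)      # equals 7/e, about 2.6 days
```

Because the second derivative of this function with respect to t is positive, the expected age grows faster the longer a page goes uncrawled, which is the cost the slide refers to.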

Focused Crawling
Attempts to download only those pages that are about a particular topic
used by vertical search applications

Rely on the fact that pages about a topic tend to have links to other pages on the same topic
popular pages for a topic are typically used as seeds

Crawler uses text classifier to decide whether a page is on topic

Deep Web
Sites that are difficult for a crawler to find are collectively referred to as the deep (or hidden) Web
much larger than conventional Web

Three broad categories:
private sites
no incoming links, or may require log in with a valid account

form results
sites that can be reached only after entering some data into a form

scripted pages
pages that use JavaScript, Flash, or another client-side language to generate links

Sitemaps
Sitemaps contain lists of URLs and data about those URLs, such as modification time and modification frequency
Generated by web server administrators
Tells crawler about pages it might not otherwise find
Gives crawler a hint about when to check a page for changes
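
To illustrate how a crawler might read a sitemap, the sketch below parses a small file in the sitemaps.org XML format with the standard library. The sample URLs and values are invented for this example, and parse_sitemap is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Invented sample: two URLs, the second without a modification time.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2008-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>http://www.example.com/items?item=truck</loc>
    <changefreq>weekly</changefreq>
  </url>
</urlset>
"""


def parse_sitemap(xml_text):
    """Return one dict per <url> entry; missing fields are None."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=SITEMAP_NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=SITEMAP_NS),
            "changefreq": url.findtext("sm:changefreq", namespaces=SITEMAP_NS),
        })
    return entries
```

The loc values can go straight into the frontier, while lastmod and changefreq can feed whatever revisit schedule the crawler uses.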

Sitemap Example

Distributed Crawling
Three reasons to use multiple computers for crawling
Helps to put the crawler closer to the sites it crawls
Reduces the number of sites the crawler has to remember
Reduces computing resources required

Distributed crawler uses a hash function to assign URLs to crawling computers
hash function should be computed on the host part of each URL
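
A minimal sketch of host-based assignment follows. Using MD5 as the hash and the name assign_crawler are choices made for this example; any stable hash would serve. Hashing only the host sends every URL from one site to the same machine, which keeps politeness bookkeeping local to that machine.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url, num_crawlers):
    """Map a URL to a crawler index in [0, num_crawlers) using its host only."""
    host = urlsplit(url).hostname or ""
    # A stable hash: Python's built-in hash() of str varies between processes.
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers
```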

Desktop Crawls
Used for desktop search and enterprise search
Differences to web crawling:
Much easier to find the data
Responding quickly to updates is more important
Must be conservative in terms of disk and CPU usage
Many different document formats
Data privacy very important

Document Feeds
Many documents are published
created at a fixed time and rarely updated again
e.g., news articles, blog posts, press releases, email

Published documents from a single source can be ordered in a sequence called a document feed
new documents found by examining the end of the feed

Document Feeds
Two types:
A push feed alerts the subscriber to new documents
A pull feed requires the subscriber to check periodically for new documents

Most common format for pull feeds is called RSS
Really Simple Syndication, RDF Site Summary, Rich Site Summary, or ...

RSS Example

RSS Example

RSS
ttl tag (time to live)
amount of time (in minutes) contents should be cached

RSS feeds are accessed like web pages
using HTTP GET requests to web servers that host them

Easy for crawlers to parse
Easy to find new information
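
Because RSS is plain XML, a feed can be read with a few lines of standard-library code. The feed content below is invented for illustration and parse_feed is a hypothetical helper; the ttl value it returns tells the crawler how many minutes to wait before fetching the feed again.

```python
import xml.etree.ElementTree as ET

# Invented RSS 2.0 feed with a 60-minute time to live and two items.
RSS_XML = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://www.example.com/</link>
    <description>Invented feed for illustration</description>
    <ttl>60</ttl>
    <item>
      <title>First story</title>
      <link>http://www.example.com/story1.html</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://www.example.com/story2.html</link>
    </item>
  </channel>
</rss>
"""


def parse_feed(xml_text):
    """Return (ttl_minutes or None, list of {'title', 'link'} dicts)."""
    channel = ET.fromstring(xml_text).find("channel")
    ttl = channel.findtext("ttl")
    items = [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in channel.findall("item")
    ]
    return (int(ttl) if ttl is not None else None), items
```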

Conversion
Text is stored in hundreds of incompatible file formats
e.g., raw text, RTF, HTML, XML, Microsoft Word, ODF, PDF

Other types of files also important
e.g., PowerPoint, Excel

Typically use a conversion tool
converts the document content into a tagged text format such as HTML or XML
retains some of the important formatting information

Character Encoding
A character encoding is a mapping between bits and glyphs
i.e., getting from bits in a file to characters on a screen
Can be a major source of incompatibility

ASCII is basic character encoding scheme for English
encodes 128 letters, numbers, special characters, and control characters in 7 bits, extended with an extra bit for storage in bytes

Character Encoding
Other languages can have many more glyphs
e.g., Chinese has more than 40,000 characters, with over 3,000 in common use

Many languages have multiple encoding schemes
e.g., CJK (Chinese-Japanese-Korean) family of East Asian languages, Hindi, Arabic
must specify encoding
can't have multiple languages in one file

Unicode developed to address encoding problems

Unicode
Single mapping from numbers to glyphs that attempts to include all glyphs in common use in all known languages
Unicode is a mapping between numbers and glyphs
does not uniquely specify bits to glyph mapping!
e.g., UTF-8, UTF-16, UTF-32

Unicode
Proliferation of encodings comes from a need for compatibility and to save space
UTF-8 uses one byte for English (ASCII), as many as 4 bytes for some traditional Chinese characters
variable length encoding, more difficult to do string operations
UTF-32 uses 4 bytes for every character

Many applications use UTF-32 for internal text encoding (fast random lookup) and UTF-8 for disk storage (less space)

Unicode

e.g., Greek letter pi (π) is Unicode symbol number 960
In binary, 00000011 11000000 (3C0 in hexadecimal)
Final encoding is 11001111 10000000 (CF80 in hexadecimal)
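
The pi example can be reproduced by hand. UTF-8 stores code points from 128 to 2047 in two bytes with the bit layout 110xxxxx 10xxxxxx; the sketch below fills in that layout and compares the result with Python's built-in encoder. The function name utf8_encode_two_byte is made up for this illustration and covers only the two-byte range.

```python
def utf8_encode_two_byte(code_point):
    """Encode a code point in the range 0x80..0x7FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte UTF-8 range is handled here")
    first = 0b11000000 | (code_point >> 6)           # 110 + top 5 bits
    second = 0b10000000 | (code_point & 0b00111111)  # 10 + low 6 bits
    return bytes([first, second])


pi_code_point = ord("π")                     # the Unicode number for pi
pi_bytes = utf8_encode_two_byte(pi_code_point)
matches_builtin = pi_bytes == "π".encode("utf-8")
```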

Storing the Documents
Many reasons to store converted document text
saves crawling time when page is not updated
provides efficient access to text for snippet generation, information extraction, etc.

Database systems can provide document storage for some applications
web search engines use customized document storage systems

Storing the Documents
Requirements for document storage system:
Random access
request the content of a document based on its URL
hash function based on URL is typical

Compression and large files
reducing storage requirements and efficient access

Update
handling large volumes of new and modified documents
adding new anchor text

Large Files
Store many documents in large files, rather than each document in a file
avoids overhead in opening and closing files
reduces seek time relative to read time

Compound document formats
used to store multiple documents in a file
e.g., TREC Web

TREC Web Format

Compression
Text is highly redundant (or predictable)
Compression techniques exploit this redundancy to make files smaller without losing any of the content
Compression of indexes covered later
Popular algorithms can compress HTML and XML text by 80%
e.g., DEFLATE (zip, gzip) and LZW (UNIX compress, PDF)
may compress large files in blocks to make access faster
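
Block-wise compression can be sketched with the zlib module, which implements DEFLATE. In this toy version a fixed number of documents is grouped per block and joined with a NUL separator, so the documents are assumed not to contain NUL characters; the function names and the grouping scheme are illustrative only.

```python
import zlib


def compress_in_blocks(documents, docs_per_block=2):
    """Compress documents in groups so one block can be read on its own."""
    blocks = []
    for start in range(0, len(documents), docs_per_block):
        group = documents[start:start + docs_per_block]
        payload = "\x00".join(group).encode("utf-8")
        blocks.append(zlib.compress(payload))
    return blocks


def read_document(blocks, index, docs_per_block=2):
    """Decompress only the block that holds document number `index`."""
    block = zlib.decompress(blocks[index // docs_per_block])
    return block.decode("utf-8").split("\x00")[index % docs_per_block]
```

Larger blocks usually compress better but cost more to decompress for a single lookup, so block size trades storage against access time.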

BigTable
Google's document storage system
Customized for storing, finding, and updating web pages
Handles large collection sizes using inexpensive computers

BigTable
No query language, no complex queries to optimize
Only row-level transactions
Tablets are stored in a replicated file system that is accessible by all BigTable servers
Any changes to a BigTable tablet are recorded to a transaction log, which is also stored in a shared file system
If any tablet server crashes, another server can immediately read the tablet data and transaction log from the file system and take over

BigTable
Logically organized into rows
A row stores data for a single web page

Combination of a row key, a column key, and a timestamp point to a single cell in the row
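
The addressing scheme (row key, column key, timestamp) can be pictured with a small in-memory model. This is only a toy to show how the three keys select one cell; ToyTable and its methods are invented here and are not BigTable's actual interface, and the row and column names are made up.

```python
class ToyTable:
    """Toy model: (row key, column key) -> {timestamp: value}."""

    def __init__(self):
        self.cells = {}

    def put(self, row_key, column_key, timestamp, value):
        """Store one version of a cell."""
        self.cells.setdefault((row_key, column_key), {})[timestamp] = value

    def get(self, row_key, column_key, timestamp=None):
        """Return the cell at `timestamp`, or the newest version if None."""
        versions = self.cells.get((row_key, column_key), {})
        if not versions:
            return None
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)


table = ToyTable()
table.put("www.example.com", "text", 1, "old page text")
table.put("www.example.com", "text", 2, "new page text")
table.put("www.example.com", "anchor:other.com", 1, "example link text")
```

Different rows may hold different columns, since a cell exists only where a value has been stored.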

BigTable
BigTable can have a huge number of columns per row
all rows have the same column groups
not all rows have the same columns
important for reducing disk reads to access document data

Rows are partitioned into tablets based on their row keys
simplifies determining which server is appropriate

Detecting Duplicates
Duplicate and near-duplicate documents occur in many situations
Copies, versions, plagiarism, spam, mirror sites
30% of the web pages in a large crawl are exact or near duplicates of pages in the other 70%

Duplicates consume significant resources during crawling, indexing, and search
Little value to most users

Duplicate Detection
Exact duplicate detection is relatively easy
Checksum techniques
A checksum is a value that is computed based on the content of the document
e.g., sum of the bytes in the document file

Possible for files with different text to have same checksum

Functions such as a cyclic redundancy check (CRC) have been developed that consider the positions of the bytes
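
The difference between a plain byte sum and a position-sensitive function can be seen with two strings that contain the same bytes in a different order. The sketch uses zlib.crc32 as the CRC; the sample strings and the name byte_sum are chosen for this illustration.

```python
import zlib


def byte_sum(data):
    """Naive checksum: the sum of all byte values, ignoring their order."""
    return sum(data)


original = b"Tropical fish"
swapped = b"Tropical fihs"       # same bytes, last two letters transposed

same_sum = byte_sum(original) == byte_sum(swapped)
same_crc = zlib.crc32(original) == zlib.crc32(swapped)
```

Here the byte sums collide because addition ignores order, while the CRC values differ because the CRC depends on where each byte sits in the file.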

Near-Duplicate Detection
More challenging task
Are web pages with same text content but different advertising or format near-duplicates?

A near-duplicate document is defined using a threshold value for some similarity measure between pairs of documents
e.g., document D1 is a near-duplicate of document D2 if more than 90% of the words in the documents are the same

Near-Duplicate Detection
Search:
find near-duplicates of a document D
O(N) comparisons required

Discovery:
find all pairs of near-duplicate documents in the collection
O(N^2) comparisons

IR techniques are effective for search scenario
For discovery, other techniques used to generate compact representations

Fingerprints

Fingerprint Example

Simhash
Similarity comparisons using word-based representations more effective at finding near-duplicates
Problem is efficiency

Simhash combines the advantages of the word-based similarity measures with the efficiency of fingerprints based on hashing
Similarity of two pages as measured by the cosine correlation measure is proportional to the number of bits that are the same in the simhash fingerprints
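
A common formulation of simhash weights each word by its frequency, hashes each word to a fixed number of bits, adds the weight where the hash bit is 1 and subtracts it where the bit is 0, and keeps the sign of each total as one fingerprint bit. The sketch below follows that outline under several assumptions of its own: whitespace tokenization, term frequency as the weight, MD5 as the word hash, and 64-bit fingerprints.

```python
import hashlib
from collections import Counter


def simhash(text, bits=64):
    """Return a `bits`-bit fingerprint (bits must be a multiple of 8, at most 128)."""
    weights = Counter(text.lower().split())      # word -> frequency
    totals = [0] * bits
    for word, weight in weights.items():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            if (h >> i) & 1:
                totals[i] += weight              # hash bit is 1: add weight
            else:
                totals[i] -= weight              # hash bit is 0: subtract
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:                            # positive total -> bit set
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two pages would then be flagged as near-duplicates when the Hamming distance between their fingerprints falls below some chosen threshold.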

Simhash

Simhash Example

Removing Noise
Many web pages contain text, links, and pictures that are not directly related to the main content of the page
This additional material is mostly noise that could negatively affect the ranking of the page
Techniques have been developed to detect the content blocks in a web page
Non-content material is either ignored or reduced in importance in the indexing process

Noise Example

Finding Content Blocks
Cumulative distribution of tags in the example web page

Main text content of the page corresponds to the plateau in the middle of the distribution

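Finding Content Blocks
Represent a web page as a sequence of bits, where b_n = 1 indicates that the nth token is a tag
Optimization problem where we find values of i and j to maximize both the number of tags below i and above j and the number of non-tag tokens between i and j
i.e., maximize \sum_{n=0}^{i-1} b_n + \sum_{n=i}^{j} (1 - b_n) + \sum_{n=j+1}^{N-1} b_n, where N is the number of tokens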
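
The objective described above (tags before i, non-tag tokens from i to j, tags after j) can be maximized by brute force over all pairs of i and j. The sketch below does this with prefix sums; find_content_block is a hypothetical name, and the quadratic search is chosen for clarity rather than speed.

```python
def find_content_block(bits):
    """Return (i, j) maximizing tags outside [i, j] plus non-tags inside it.

    `bits[n]` is 1 if token n is a tag and 0 if it is a text token.
    """
    n = len(bits)
    if n == 0:
        raise ValueError("need at least one token")
    # tags[k] = number of tags among the first k tokens
    tags = [0] * (n + 1)
    for k, b in enumerate(bits):
        tags[k + 1] = tags[k] + b
    best_score, best = -1, (0, 0)
    for i in range(n):
        for j in range(i, n):
            inside_tags = tags[j + 1] - tags[i]
            score = (tags[i]                         # tags before i
                     + (j + 1 - i) - inside_tags     # non-tags in i..j
                     + tags[n] - tags[j + 1])        # tags after j
            if score > best_score:
                best_score, best = score, (i, j)
    return best


# Tags at both ends, mostly text in the middle, one stray tag at position 5.
tokens = [1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
block = find_content_block(tokens)
```

For this sequence the search selects the span from position 3 to position 8, accepting the single tag inside the text run because splitting the span would lose more text tokens than it gains.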

Finding Content Blocks
Other approaches use DOM structure and visual (layout) features
