
TERM PAPER
OF
INT-301: WEB PROGRAMMING

Topic: SEARCH ENGINE

Submitted by:
AVINASH MANHAS
Roll No.: RE2801B46
Reg. No.: 10809450
Course: B.Tech-M.Tech (IT)

Submitted to:
Mr. SHAKUN GARG


Search Engine

WHAT ARE SEARCH ENGINES?

Search engines are huge databases of web page files that have been assembled automatically by machine.

There are two types of search engines:

1. Individual. Individual search engines compile their own searchable databases on the web.
2. Meta. Meta searchers do not compile databases. Instead, they search the databases of multiple individual engines simultaneously.

Abstract

In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype, with a full-text and hyperlink database of at least 24 million pages, is available at http://google.stanford.edu/

To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advances in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date.

Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses the question of how to build a practical large-scale system which can exploit the additional information present in hypertext. We also look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.

Keywords: World Wide Web, Search Engines, Information Retrieval, PageRank, Google
CONTENTS:

1. Introduction
   Web Search Engines -- Scaling Up: 1994 - 2000
   a. Google: Scaling with the Web
2. History
3. Design Goals
   Improved Search Quality
   a. Academic Search Engine Research
   b. High Quality Search
   c. Scalable Architecture
   d. A Research Tool
4. System Features
   a. PageRank: Bringing Order to the Web
   b. Description of PageRank Calculation
   c. Intuitive Justification
   d. Other Features
   How Search Engines Work
5. PROS AND CONS OF SEARCH ENGINES
6. Search Engine Features
7. Which is the Best Search Engine?
   a. Example - Search Terms Which Yield Too Many Matches
   b. Always Search As Specifically As Possible
   c. Think About Search Terms
8. Results and Performance
   a. Storage Requirements
   b. System Performance
   c. Search Performance
9. Conclusions
   a. Future Work
10. References


Introduction

When we want to find something on the web we look to a search engine, such as those in Figure 1. Sites like Google, MSN and Yahoo! let you search for web sites that contain information pertinent to topics of interest to you. Potential visitors looking for your site are going to do the same thing. This makes it imperative that your site get ranked high enough for important keywords that visitors can find it. Knowing what keywords are important means knowing what visitors are looking for when they find your site.

Figure 1. Search engines are the most common tool for promoting a web site.

The web creates new challenges for information retrieval. The amount of information on the web is growing rapidly, as well as the number of new users inexperienced in the art of web research. People are likely to surf the web using its link graph, often starting with high quality human maintained indices such as Yahoo! or with search engines. Human maintained lists cover popular topics effectively but are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics. Automated search engines that rely on keyword matching usually return too many low quality matches. To make matters worse, some advertisers attempt to gain people's attention by taking measures meant to mislead automated search engines. We have built a large-scale search engine which addresses many of the problems of existing systems. It makes especially heavy use of the additional structure present in hypertext to provide much higher quality search results. We chose our system name, Google, because it is a common spelling of googol, or 10^100, and fits well with our goal of building very large-scale search engines.

Web Search Engines -- Scaling Up: 1994 - 2000

Search engine technology has had to scale dramatically to keep up with the growth of the web. In 1994, one of the first web search engines, the World Wide Web Worm (WWWW) [McBryan 94], had an index of 110,000 web pages and web accessible documents. As of November 1997, the top search engines claim to index from 2 million (WebCrawler) to 100 million web documents (from Search Engine Watch). It is foreseeable that by the year 2000, a comprehensive index of the Web will contain over a billion documents. At the same time, the number of queries search engines handle has grown incredibly too. In March and April 1994, the World Wide Web
Worm received an average of about 1500 queries per day. In November 1997, AltaVista claimed it handled roughly 20 million queries per day. With the increasing number of users on the web, and automated systems which query search engines, it is likely that top search engines will handle hundreds of millions of queries per day by the year 2000. The goal of our system is to address many of the problems, both in quality and scalability, introduced by scaling search engine technology to such extraordinary numbers.

Google: Scaling with the Web

Creating a search engine which scales even to today's web presents many challenges. Fast crawling technology is needed to gather the web documents and keep them up to date. Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second.

These tasks are becoming increasingly difficult as the Web grows. However, hardware performance and cost have improved dramatically to partially offset the difficulty. There are, however, several notable exceptions to this progress, such as disk seek time and operating system robustness. In designing Google, we have considered both the rate of growth of the Web and technological changes. Google is designed to scale well to extremely large data sets. It makes efficient use of storage space to store the index. Its data structures are optimized for fast and efficient access. Further, we expect that the cost to index and store text or HTML will eventually decline relative to the amount that will be available. This will result in favorable scaling properties for centralized systems like Google.

History

During the early development of the web, there was a list of webservers edited by Tim Berners-Lee and hosted on the CERN webserver. One historical snapshot from 1992 remains. As more webservers went online, the central list could not keep up. On the NCSA site new servers were announced under the title "What's New!"

The very first tool used for searching on the Internet was Archie. The name stands for "archive" without the "v." It was created in 1990 by Alan Emtage, Bill Heelan and J. Peter Deutsch, computer science students at McGill University in Montreal. The program downloaded the directory listings of all the files located on public anonymous FTP (File Transfer Protocol) sites, creating a searchable database of file names; however, Archie did not index the contents of these sites, since the amount of data was so limited it could be readily searched manually.

The rise of Gopher (created in 1991 by Mark McCahill at the University of Minnesota) led to two new search programs, Veronica and Jughead. Like Archie, they searched the file names and titles stored in Gopher index systems. Veronica (Very Easy Rodent-Oriented Net-wide Index to Computerized Archives) provided a keyword search of most Gopher menu titles in the entire Gopher listings. Jughead (Jonzy's Universal Gopher Hierarchy Excavation And Display) was a tool for obtaining menu information from specific Gopher servers. While the name of the search engine "Archie" was not a reference to the Archie comic book series, "Veronica" and "Jughead" are characters in the series, thus referencing their predecessor.

In the summer of 1993, no search engine existed yet for the web, though numerous specialized catalogues were maintained by hand. Oscar Nierstrasz at the University of Geneva wrote a series of Perl scripts that would periodically mirror these pages and rewrite them into a standard format, which formed the basis for W3Catalog, the web's first primitive search engine, released on September 2, 1993.

In June 1993, Matthew Gray, then at MIT, produced what was probably the first web robot, the Perl-based World Wide Web Wanderer, and used it to generate an index called "Wandex". The purpose of the Wanderer was to measure the size of the World Wide Web, which it did until late 1995. The web's second search engine, Aliweb, appeared in November 1993. Aliweb did not use a web robot, but instead depended on being notified by website administrators of the existence at each site of an index file in a particular format.

JumpStation (released in December 1993) used a web robot to find web pages and to build its index, and used a web form as the interface to its query program. It was thus the first WWW resource-discovery tool to combine the three essential features of a web search engine (crawling, indexing, and searching) as described below. Because of the limited resources available on the platform on which it ran, its indexing and hence searching were limited to the titles and headings found in the web pages the crawler encountered.

One of the first "full text" crawler-based search engines was WebCrawler, which came out in 1994. Unlike its predecessors, it let users search for any word in any webpage, which has become the standard for all major search engines since. It was also the first one to be widely known by the public. Also in 1994, Lycos (which started at Carnegie Mellon University) was launched and became a major commercial endeavor.

Soon after, many search engines appeared and vied for popularity. These included Magellan, Excite, Infoseek, Inktomi, Northern Light, and AltaVista. Yahoo! was among the most popular ways for people to find web pages of interest, but its search function operated on its web directory, rather than full-text copies of web pages. Information seekers could also browse the directory instead of doing a keyword-based search.

In 1996, Netscape was looking to give a single search engine an exclusive deal to be their featured search engine. There was so much interest that instead a deal was struck with Netscape by five of the major search engines, where for $5 million per year each search engine would be in a rotation on the Netscape search engine page. The five engines were Yahoo!, Magellan, Lycos, Infoseek, and Excite.

Search engines were also known as some of the brightest stars in the Internet investing frenzy that occurred in the late 1990s. Several companies entered the market spectacularly, receiving record gains during their initial public offerings. Some have taken down their public search engine and are marketing enterprise-only editions, such as Northern Light. Many search engine companies were caught up in the dot-com bubble, a speculation-driven market boom that peaked in 1999 and ended in 2001.

Around 2000, the Google search engine rose to prominence. The company achieved better results for many searches with an innovation called PageRank. This iterative algorithm ranks web pages based on the number and PageRank of other web sites and pages that link there, on the premise that good or desirable pages are linked to more than others. Google also maintained a minimalist interface to its search engine. In contrast, many of its competitors embedded a search engine in a web portal.

By 2000, Yahoo! was providing search services based on Inktomi's search engine. Yahoo! acquired Inktomi in 2002, and Overture (which owned AlltheWeb and AltaVista) in 2003. Yahoo! switched to Google's search engine until 2004, when it launched its own search engine based on the combined technologies of its acquisitions.

Microsoft first launched MSN Search in the fall of 1998 using search results from Inktomi. In early 1999 the site began to display listings from Looksmart blended with results from Inktomi, except for a short time in 1999 when results from AltaVista were used instead. In 2004, Microsoft began a transition to its own search technology, powered by its own web crawler (called msnbot).

Microsoft's rebranded search engine, Bing, was launched on June 1, 2009. On July 29, 2009, Yahoo! and Microsoft finalized a deal in which Yahoo! Search would be powered by Microsoft Bing technology.

Design Goals

Improved Search Quality

Our main goal is to improve the quality of web search engines. In 1994, some people believed that a complete search index would make it possible to find anything easily. According to Best of the Web 1994 -- Navigators, "The best navigation service should make it easy to find almost anything on the Web (once all the data is entered)."

However, the Web of 1997 is quite different. Anyone who has used a search engine recently can readily testify that the completeness of the index is not the only factor in the quality of search results. "Junk results" often wash out any results that a user is interested in. In fact, as of November 1997, only one of the top four commercial search engines finds itself (returns its own search page in response to its name in the top ten results). One of the main causes of this problem is that the number of documents in the indices has been increasing by many orders of magnitude, but the user's ability to look at documents has not. People are still only willing to look at the first few tens of results. Because of this, as the collection size grows, we need tools that have very high precision (number of relevant documents returned, say in the top tens of results). Indeed, we want our notion of "relevant" to only include the very best documents, since there may be tens of thousands of slightly relevant documents. This very high precision is important even at the expense of recall (the total number of relevant documents the system is able to return). There is quite a bit of recent optimism that the use of more hypertextual information can help improve search and other applications [Marchiori 97] [Spertus 97] [Weiss 96] [Kleinberg 98]. In particular, link structure and link text provide a lot of information for making relevance judgments and quality filtering. Google makes use of both link structure and anchor text.
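The distinction drawn above between precision and recall can be made concrete with a small worked example. The counts below are invented for illustration: a query for which the whole collection contains 1,000 relevant documents, and an engine that shows 10 results of which 8 are relevant.

```python
# Toy precision/recall computation with hypothetical counts.
relevant_returned = 8     # relevant documents among the results shown
results_returned = 10     # e.g. the first page of results
total_relevant = 1000     # all relevant documents in the collection

precision = relevant_returned / results_returned  # 8/10 = 0.8: high precision
recall = relevant_returned / total_relevant       # 8/1000 = 0.008: tiny recall
```

As the text argues, for web search the first number matters far more than the second, because users rarely look past the first few tens of results.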
Academic Search Engine Research

Aside from tremendous growth, the Web has also become increasingly commercial over time. In 1993, 1.5% of web servers were on .com domains. This number grew to over 60% in 1997. At the same time, search engines have migrated from the academic domain to the commercial. Up until now most search engine development has gone on at companies with little publication of technical details. This causes search engine technology to remain largely a black art and to be advertising oriented. With Google, we have a strong goal to push more development and understanding into the academic realm.

Another important design goal was to build systems that reasonable numbers of people can actually use. Usage was important to us because we think some of the most interesting research will involve leveraging the vast amount of usage data that is available from modern web systems. For example, there are many tens of millions of searches performed every day. However, it is very difficult to get this data, mainly because it is considered commercially valuable.

Our final design goal was to build an architecture that can support novel research activities on large-scale web data. To support novel research uses, Google stores all of the actual documents it crawls in compressed form. One of our main goals in designing Google was to set up an environment where other researchers can come in quickly, process large chunks of the web, and produce interesting results that would have been very difficult to produce otherwise. In the short time the system has been up, there have already been several papers using databases generated by Google, and many others are underway. Another goal we have is to set up a Spacelab-like environment where researchers or even students can propose and do interesting experiments on our large-scale web data.

System Features

The Google search engine has two important features that help it produce high precision results. First, it makes use of the link structure of the Web to calculate a quality ranking for each web page. This ranking is called PageRank and is described in detail in [Page 98]. Second, Google utilizes link text to improve search results.

PageRank: Bringing Order to the Web

The citation (link) graph of the web is an important resource that has largely gone unused in existing web search engines. We have created maps containing as many as 518 million of these hyperlinks, a significant sample of the total. These maps allow rapid calculation of a web page's "PageRank", an objective measure of its citation importance that corresponds well with people's subjective idea of importance. Because of this correspondence, PageRank is an excellent way to prioritize the results of web keyword searches. For most popular subjects, a simple text matching search that is restricted to web page titles performs admirably when PageRank prioritizes the results (demo available at google.stanford.edu). For the type of full text searches in the main Google system, PageRank also helps a great deal.
Description of PageRank Calculation

Academic citation literature has been applied to the web, largely by counting citations or backlinks to a given page. This gives some approximation of a page's importance or quality. PageRank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page. PageRank is defined as follows:

We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also, C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.

PageRank, or PR(A), can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. Also, a PageRank for 26 million web pages can be computed in a few hours on a medium size workstation. There are many other details which are beyond the scope of this paper.

Intuitive Justification

PageRank can be thought of as a model of user behavior. We assume there is a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back", but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. And the damping factor d is the probability at each page that the "random surfer" will get bored and request another random page. One important variation is to only add the damping factor d to a single page, or a group of pages. This allows for personalization and can make it nearly impossible to deliberately mislead the system in order to get a higher ranking. We have several other extensions to PageRank.

Another intuitive justification is that a page can have a high PageRank if there are many pages that point to it, or if there are some pages that point to it and have a high PageRank. Intuitively, pages that are well cited from many places around the web are worth looking at. Also, pages that have perhaps only one citation from something like the Yahoo! homepage are also generally worth looking at. If a page was not high quality, or was a broken link, it is quite likely that Yahoo's homepage would not link to it. PageRank handles both these cases and everything in between by recursively propagating weights through the link structure of the web.

Other Features

Aside from PageRank and the use of anchor text, Google has several other features. First, it has location information for all hits and so it makes extensive use of proximity in search. Second, Google keeps track of some visual presentation details such as font size of words. Words in a larger or bolder font are weighted higher than other words. Third, full raw HTML of pages is available in a repository.
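The simple iterative algorithm mentioned in the PageRank calculation above can be sketched in a few lines. The snippet below is a minimal illustration of the printed PR(A) formula on an invented four-page link graph; it is not Google's implementation, and it assumes every page has at least one outgoing link.

```python
# Iterative PageRank following PR(A) = (1-d) + d * sum(PR(T)/C(T))
# over the pages T that link to A. The link graph is hypothetical.
def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # initial guess for every page
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # sum PR(T)/C(T) over every page T that links to `page`;
            # C(T) is the number of links going out of T
            incoming = sum(pr[t] / len(links[t])
                           for t in pages if page in links[t])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical graph: A and B link to each other; C and D both link to A.
graph = {"A": ["B"], "B": ["A"], "C": ["A"], "D": ["A"]}
ranks = pagerank(graph)
# A has the most backlinks, so it converges to the highest PageRank,
# while C and D, with no backlinks at all, settle at 1 - d = 0.15.
```

This mirrors the "random surfer" reading: pages with no citations receive only the bored-surfer probability, while well-cited pages accumulate weight propagated through the link structure.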
How search engines work

A search engine operates in the following order:

1. Web crawling
2. Indexing
3. Searching

Web search engines work by storing information about many web pages, which they retrieve from the HTML itself. These pages are retrieved by a Web crawler (sometimes also known as a spider) -- an automated Web browser which follows every link on the site. Exclusions can be made by the use of robots.txt. The contents of each page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). Data about web pages are stored in an index database for use in later queries. A query can be a single word. The purpose of an index is to allow information to be found as quickly as possible. Some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others, such as AltaVista, store every word of every page they find. This cached page always holds the actual search text, since it is the one that was actually indexed, so it can be very useful when the content of the current page has been updated and the search terms are no longer in it. This problem might be considered a mild form of linkrot, and Google's handling of it increases usability by satisfying user expectations that the search terms will be on the returned webpage. This satisfies the principle of least astonishment, since the user normally expects the search terms to be on the returned pages. Increased search relevance makes these cached pages very useful, even beyond the fact that they may contain data that may no longer be available elsewhere.

When a user enters a query into a search engine (typically by using keywords), the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. The index is built from the information stored with the data and the method by which the information is indexed. Unfortunately, there are currently no known public search engines that allow documents to be searched by date. Most search engines support the use of the Boolean operators AND, OR and NOT to further specify the search query. Boolean operators are for literal searches that allow the user to refine and extend the terms of the search. The engine looks for the words or phrases exactly as entered. Some search engines provide an advanced feature called proximity search, which allows users to define the distance between keywords. There is also concept-based searching, where the research involves using statistical analysis on pages containing the words or phrases you search for. As well, natural language queries allow the user to type a question in the same form one would ask it to a human. A site like this would be ask.com.

The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of web pages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. Most search engines employ methods to rank the results to provide the "best" results first. How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another.
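The three stages listed at the start of this section can be sketched in miniature. The snippet below skips real crawling (the "pages" are hard-coded stand-ins for fetched documents), builds an inverted index mapping each word to the pages containing it, and answers a query by intersecting posting lists, which is the Boolean AND described above. This is a toy sketch, not any engine's actual design.

```python
# Toy crawl -> index -> search pipeline over in-memory pages (no real HTTP).
pages = {  # stands in for what a crawler would have fetched
    "a.html": "search engines index the web",
    "b.html": "the web grows every day",
    "c.html": "engines answer queries about the web",
}

# Indexing: build an inverted index from each word to the pages it occurs in.
index = {}
for url, text in pages.items():
    for word in set(text.split()):
        index.setdefault(word, set()).add(url)

# Searching: a Boolean AND query is the intersection of the posting lists.
def search(query):
    postings = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(search("web engines"))  # pages containing both words
```

A real engine adds ranking on top of this retrieval step, which is where measures such as PageRank come in.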
The methods also change over time as Internet usage changes and new techniques evolve. There are two main types of search engine that have evolved: one is a system of predefined and hierarchically ordered keywords that humans have programmed extensively. The other is a system that generates an "inverted index" by analyzing texts it locates. This second form relies much more heavily on the computer itself to do the bulk of the work.

Most Web search engines are commercial ventures supported by advertising revenue and, as a result, some employ the practice of allowing advertisers to pay money to have their listings ranked higher in search results. Those search engines which do not accept money for their search engine results make money by running search-related ads alongside the regular search engine results. The search engines make money every time someone clicks on one of these ads.

PROS AND CONS OF SEARCH ENGINES

PROS:

Search engines provide access to a fairly large portion of the publicly available pages on the Web, which itself is growing exponentially.

Search engines are the best means devised yet for searching the web. Stranded in the middle of this global electronic library of information without either a card catalog or any recognizable structure, how else are you going to find what you're looking for?

CONS:

On the down side, the sheer number of words indexed by search engines increases the likelihood that they will return hundreds of thousands of responses to simple search requests. Remember, they will return lengthy documents in which your keyword appears only once. Additionally, many of these responses will be irrelevant to your search.

ARE SEARCH ENGINES ALL THE SAME?

Search engines use selected software programs to search their indexes for matching keywords and phrases, presenting their findings to you in some kind of relevance ranking. Although software programs may be similar, no two search engines are exactly the same in terms of size, speed and content; no two search engines use exactly the same ranking schemes, and not every search engine offers you exactly the same search options. Therefore, your search is going to be different on every engine you use. The difference may not be a lot, but it could be significant. Recent estimates put search engine overlap at approximately 60 percent and unique content at around 40 percent.

WHEN DO WE USE SEARCH ENGINES?

Search engines are best at finding unique keywords, phrases, quotes, and information buried in the full text of web pages. Because they index word by word, search engines are also useful in retrieving tons of documents. If you want a wide range of responses to specific queries, use a search engine.

NOTE: Today, the line between search engines and subject directories is blurring. Search engines no longer limit themselves to a search mechanism alone. Across the Web, they are partnering with subject directories, or creating their own directories, and returning results gathered from a variety of other guides and services as well.

Search Engine Features

Web location services typically specialize in one of the following: their search tools (how you specify a search and how the results are presented), the size of their database, or their catalog service. Most engines deliver too many matches in a casual search, so the overriding factor in their usefulness is the quality of their search tools. Every search engine I used had a nice GUI interface that allowed one to type words into their form, such as "(burger NOT cheeseburger) OR (pizza AND pepperoni)." They also allowed one to form Boolean searches (except Hotbot as of 7/1/96, which promises to install this feature later), i.e. they allowed the user to specify combinations of words. In Alta Vista and Lycos, one does this by adding a "+" or a "-" sign before each word, or in Alta Vista you can choose to use the very strict syntax of the Boolean "advanced search." This advanced search was by far the hardest to use, but also the one most completely in the user's control (except for OpenText). In most other engines, you just use the words AND, NOT, and OR to get Boolean logic.

By far the best service for carefully specifying a search was Open Text. This form has great menus, making a complex Boolean search fast and easy. Best of all, this service permits you to specify that you want to search only titles or URLs. But then there's Alta Vista's little known "keyword" search syntax, now as powerful as Open Text, but not as easy to use. You can constrain a search to phrases in anchors, pages from a specific host, image titles, links, text, document titles, or URLs using this feature with the syntax keyword:search-word. There is an additional set of keywords just for searching Usenet.

Which is the Best Search Engine?

To decide which search engine I would choose as the best, I decided that nothing but useful results would count. Previous articles have emphasized quantified measures for speed and database sizes, but I found these had little relevance for the best performance in actual searches. By now, all engines have great hardware and fast net links, and none show any significant delay time to work on your search or return the results. Instead, I just came up with a few topics that represented, I felt, tough but typical problems encountered by people who work on the net: First, I tried a search with "background noise", a topic where a lot of closely related but unwanted information exists. Next, I tried a search for something very obscure. Finally, I tried a search for keywords which overlapped with a very, very popular search keyword.

Example - Search Terms Which Yield Too Many Matches

For the first type of search, I wanted to find a copy of Wusage to download, free software that lets you keep track of how often your server or a specific page is accessed, a common tool for HTML developers. This site is hard to find because output files are produced by the program on every machine running it that have the string "wusage" in their title and text. When I simply typed "wusage" into search page forms, Infoseek and Lycos were the only engines to find the free version of the software I wanted. (Note I gave no credit for finding the version for sale. A careful search
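The "+"/"-" prefix convention described above for Alta Vista and Lycos can be mimicked with a toy matcher. This is a simplified sketch of the idea (a "+" term must appear, a "-" term must not), not either engine's actual query handling; the document terms below are invented.

```python
# Toy interpretation of "+word" (required) and "-word" (excluded).
# Terms without a prefix are ignored here; a real engine would use
# them for ranking rather than filtering.
def matches(query, document_words):
    for term in query.split():
        if term.startswith("+") and term[1:] not in document_words:
            return False  # a required word is missing
        if term.startswith("-") and term[1:] in document_words:
            return False  # an excluded word is present
    return True

doc = {"wusage", "free", "download"}  # hypothetical page terms
print(matches("+wusage -fee", doc))   # page has wusage and lacks fee
print(matches("+wusage +fee", doc))   # fails: "fee" is required
```

The same filtering logic underlies the worded Boolean forms (AND, NOT) offered by the other engines discussed above.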
of the sale version's page, did not produce
any links to the free version's download to find the site at all, and HotBot found only
site.) Infoseek's summaries were very poor, 10 matches for statistics of a server in
however, and all matches had to be checked. Omaha.

Always Search As Specifically As Possible

Most engines failed to find their quarry because the search was too broad. After all, how is the engine supposed to know I want the free version? After spending a long time finding out the exact name of what I wanted, "wusage 3.2", Infoseek, Excite, Magellan, and Lycos all found the site I was interested in. Alta Vista, HotBot, and OpenText yielded nothing of interest on their first page. Magellan came out the clear winner on this search, as its site summary was by far the best. Infoseek and Excite performed well, but Lycos listed a much older version of wusage first.

Think About Search Terms

It eventually occurred to me to search for "wusage AND free" to find the free copy of wusage. In some sense, Lycos was the winner this time because the free version was the first match listed; however, its summary was not very useful. While it did a better job than Infoseek, it didn't tell me whether each site was relevant or not. Magellan's response was very good, as it included a link leading to the software on the first page of matches, again with an excellent summary. Yahoo and Alta Vista also found it, but all these engines rated the fee version higher than the free version. OpenText did very well here, but only in advanced search mode, where it was possible to specify that wusage must be in the title and "free" could be anywhere in the text. Wusage 3.2 was listed as the second of only two entries - no digging here! Excite failed to find the site at all, and HotBot found only 10 matches for statistics of a server in Omaha.

Curiously, a search for "download wusage" did not improve the results over the single-word searches for any of the search engines! (It may be time for rudimentary standardized categories to be used on the Web: e.g. this is a download archive, this is an information-only site, this is an authoritative site, etc.) The lesson here may just be "if at first you don't succeed..."

Results and Performance

The most important measure of a search engine is the quality of its search results. While a complete user evaluation is beyond the scope of this paper, our own experience with Google has shown it to produce better results than the major commercial search engines for most searches. As an example which illustrates the use of PageRank, anchor text, and proximity, Figure 4 shows Google's results for a search on "bill clinton". These results demonstrate some of Google's features. The results are clustered by server, which helps considerably when sifting through result sets. A number of results are from the whitehouse.gov domain, which is what one may reasonably expect from such a search. Currently, most major commercial search engines do not return any results from whitehouse.gov, much less the right ones. Notice that there is no title for the first result; this is because it was not crawled. Instead, Google relied on anchor text to determine that it was a good answer to the query. Similarly, the fifth result is an email address which, of course, is not crawlable; it, too, is a result of anchor text.
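The PageRank computation that these results lean on is not spelled out in this excerpt, but its standard form is a power iteration over the link graph: each page divides its rank among its outlinks, with a damping factor. The sketch below uses an invented four-page graph and the commonly cited damping value of 0.85; the dangling-page handling is one common convention, not necessarily the paper's:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: page -> list of outlinks."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a base "teleport" share, then link shares.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Toy web: every other page links to A, so A should rank highest.
toy_web = {
    "A": ["B"],
    "B": ["A"],
    "C": ["A"],
    "D": ["A", "B"],
}
ranks = pagerank(toy_web)
print(max(ranks, key=ranks.get))  # A
```

Because rank mass is conserved each iteration, the values form a probability distribution, which is what lets them be reported as the percentages mentioned below.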
All of the results are reasonably high quality pages and, at last check, none were broken links. This is largely because they all have high PageRank. The PageRanks are the percentages in red along with bar graphs. Finally, there are no results about a Bill other than Clinton or about a Clinton other than Bill. This is because we place heavy importance on the proximity of word occurrences. Of course a true test of the quality of a search engine would involve an extensive user study or results analysis which we do not have room for here. Instead, we invite the reader to try Google for themselves at http://google.stanford.edu.

Storage Requirements

Aside from search quality, Google is designed to scale cost-effectively to the size of the Web as it grows. One aspect of this is to use storage efficiently. Table 1 has a breakdown of some statistics and storage requirements of Google. Due to compression, the total size of the repository is about 53 GB, just over one third of the total data it stores. At current disk prices this makes the repository a relatively cheap source of useful data. More importantly, the total of all the data used by the search engine requires a comparable amount of storage, about 55 GB. Furthermore, most queries can be answered using just the short inverted index. With better encoding and compression of the Document Index, a high quality web search engine may fit onto a 7 GB drive of a new PC.

Storage Statistics
  Total Size of Fetched Pages                 147.8 GB
  Compressed Repository                        53.5 GB
  Short Inverted Index                          4.1 GB
  Full Inverted Index                          37.2 GB
  Lexicon                                       293 MB
  Temporary Anchor Data (not in total)          6.6 GB
  Document Index Incl. Variable Width Data      9.7 GB
  Links Database                                3.9 GB
  Total Without Repository                     55.2 GB
  Total With Repository                       108.7 GB

Web Page Statistics
  Number of Web Pages Fetched     24 million
  Number of Urls Seen             76.5 million
  Number of Email Addresses       1.7 million
  Number of 404's                 1.6 million

Table 1. Statistics
crashed, or any number of other problems which stopped the system. In total it took roughly 9 days to download the 26 million pages (including errors). However, once the system was running smoothly, it ran much faster, downloading the last 11 million pages in just 63 hours, averaging just over 4 million pages per day, or 48.5 pages per second. We ran the indexer and the crawler simultaneously. The indexer ran just faster than the crawlers. This is largely because we spent just enough time optimizing the indexer so that it would not be a bottleneck. These optimizations included bulk updates to the document index and placement of critical data structures on the local disk. The indexer runs at roughly 54 pages per second. The sorters can be run completely in parallel; using four machines, the whole process of sorting takes about 24 hours.

Search Performance

Improving the performance of search was not the major focus of our research up to this point. The current version of Google answers most queries in between 1 and 10 seconds. This time is mostly dominated by disk IO over NFS (since disks are spread over a number of machines). Furthermore, Google does not yet have common optimizations such as query caching and subindices on common terms. We intend to speed up Google considerably through distribution and hardware, software, and algorithmic improvements. Our target is to be able to handle several hundred queries per second.

CODE FOR SEARCH ENGINE

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>search engine</title>
  <meta http-equiv="pragma" content="no-cache" />
  <meta name="title" content="www.cybwll.ch searchengine" />
  <meta name="description" content="www.cybwell.ch" />
  <meta name="robots" content="index, follow" />
  <meta name="revisit-after" content="3 days" />
  <meta name="author" content="www.guide-bleu.ch" />
  <meta name="publisher" content="CYBWELL MEDIA GmbH, Steinhausen, Switzerland" />
  <meta name="copyright" content="www.cybwell.ch" />
  <meta name="keywords" content="" />
  <link href="/cms/_styles/layout.css" rel="stylesheet" type="text/css" media="all" />
  <link href="/cms/_styles/general.css" rel="stylesheet" type="text/css" media="all" />
  <link href="/globalfiles/css/styles.css" rel="stylesheet" type="text/css" media="all" />
</head>
<body bgcolor="#FFFFFF">

<p>&nbsp;</p>
<table border="0" cellspacing="0" width="970" cellpadding="0">
  <tr>
    <td align="center" colspan="7" width="970"><img border="0" src="/images/cybwell.gif" alt="" />
      <p>&nbsp;</p>
    </td>
  </tr>
  <tr>
    <td colspan="7" width="970" align="center">
      <form method="post" action="default.asp">
        <input class="forms" name="q" size="22" id="layout1" />&nbsp;
        <input class="forms" type="submit" value="search" name="search" /><br />
      </form>
    </td>
  </tr>
  <tr>
    <td width="8" valign="top" align="left" bgcolor="#ECECEA">&nbsp;</td>
    <td width="718" valign="top" align="left" bgcolor="#ECECEA"></td>
    <td width="8" valign="top" align="left" bgcolor="#ECECEA">&nbsp;</td>
    <td width="18" valign="top" align="left"></td>
    <td width="8" valign="top" align="left" bgcolor="#ECECEA">&nbsp;</td>
    <td width="222" valign="top" align="left" bgcolor="#ECECEA">
      <hr /><h3>Related Searches</h3>
    </td>
    <td width="8" valign="top" align="left" bgcolor="#ECECEA">&nbsp;</td>
  </tr>
</table>

</body>
</html>
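The form above posts the query field q to default.asp, whose server-side source is not included here. Purely as an illustration of what such a handler does (written in Python rather than ASP; handle_query, SearchHandler, and FAKE_INDEX are invented names, and the lookup table is a stand-in for a real index), a minimal back end might look like:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

# Invented stand-in for a real inverted-index lookup.
FAKE_INDEX = {
    "search engine": ["http://google.stanford.edu/"],
}

def handle_query(body: str) -> str:
    """Parse the POSTed form body and render a tiny HTML result page."""
    params = parse_qs(body)
    query = params.get("q", [""])[0].strip().lower()
    hits = FAKE_INDEX.get(query, [])
    items = "".join(f"<li>{url}</li>" for url in hits) or "<li>no results</li>"
    return f"<html><body><ul>{items}</ul></body></html>"

class SearchHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        page = handle_query(body)
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(page.encode("utf-8"))
```

Calling HTTPServer(("localhost", 8000), SearchHandler).serve_forever() would serve it locally, with the form's action pointed at that address instead of default.asp.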
Conclusions

Google is designed to be a scalable search engine. The primary goal is to provide high quality search results over a rapidly growing World Wide Web. Google employs a number of techniques to improve search quality, including PageRank, anchor text, and proximity information. Furthermore, Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.

Future Work

A large-scale web search engine is a complex system and much remains to be done. Our immediate goals are to improve search efficiency and to scale to approximately 100 million web pages. Some simple improvements to efficiency include query caching, smart disk allocation, and subindices. Another area which requires much research is updates: we must have smart algorithms to decide which old web pages should be recrawled and which new ones should be crawled, and work toward this goal has already begun. One promising area of research is using proxy caches to build search databases, since they are demand driven. We are planning to add simple features supported by commercial search engines like boolean operators, negation, and stemming. However, other features are just starting to be explored, such as relevance feedback and clustering (Google currently supports a simple hostname-based clustering). We also plan to support user context (like the user's location) and result summarization. We are also working to extend the use of link structure and link text. Simple experiments indicate PageRank can be personalized by increasing the weight of a user's home page or bookmarks. As for link text, we are experimenting with using text surrounding links in addition to the link text itself. A Web search engine is a very rich environment for research ideas. We have far too many to list here, so we do not expect this Future Work section to become much shorter in the near future.

High Quality Search

The biggest problem facing users of web search engines today is the quality of the results they get back. While the results are often amusing and expand users' horizons, they are often frustrating and consume precious time. For example, the top result for a search for "Bill Clinton" on one of the most popular commercial search engines was the Bill Clinton Joke of the Day: April 14, 1997. Google is designed to provide higher quality search so that, as the Web continues to grow rapidly, information can be found easily. To accomplish this, Google makes heavy use of hypertextual information consisting of link structure and link (anchor) text. Google also uses proximity and font information. While evaluation of a search engine is difficult, we have subjectively found that Google returns higher quality search results than current commercial search engines. The analysis of link structure via PageRank allows Google to evaluate the quality of web pages. The use of link text as a description of what the link points to helps the search engine return relevant (and to some degree high quality) results. Finally, the use of proximity information helps increase relevance a great deal for many queries.

Scalable Architecture

Aside from the quality of search, Google is designed to scale. It must be efficient in both space and time, and constant factors are very important when dealing with the entire Web. In implementing Google, we have seen bottlenecks in CPU, memory access, memory capacity, disk seeks, disk throughput, disk capacity, and network IO. Google has evolved to overcome a number of these bottlenecks during various operations. Google's major data structures make efficient use of available storage space. Furthermore, the crawling, indexing, and sorting operations are efficient enough to be able to build an index of a substantial portion of the web - 24 million pages - in less than one week. We expect to be able to build an index of 100 million pages in less than a month.

A Research Tool

In addition to being a high quality search engine, Google is a research tool. The data Google has collected has already resulted in many other papers submitted to conferences and many more on the way. Recent research such as [Abiteboul 97] has shown a number of limitations to queries about the Web that may be answered without having the Web available locally. This means that Google (or a similar system) is not only a valuable research tool but a necessary one for a wide range of applications. We hope Google will be a resource for searchers and researchers all around the world and will spark the next generation of search engine technology.

REFERENCES

• Google Search Engine, http://google.stanford.edu/
• Search Engine Watch, http://www.searchenginewatch.com/
• http://www.monash.com/spidap3.html
• http://www.webreference.com/
• http://uk.searchengine.com/
• http://en.wikipedia.org/wiki/Web_search_engine
• http://searchengine.com/
• http://www.searchenginecommando.com/articles/titles/16.html
• http://www.searchengineguide.com/