INT-301: WEB PROGRAMMING

Submitted by:
Reg. no. 10809450

Contents

1. Introduction
   Web Search Engines -- Scaling Up: 1994–2000
   a. Google: Scaling with the Web
2. History
   b. Always Search As Specifically As Possible
   c. Think About Search Terms
   c. Intuitive Justification
   d. Other Features
These tasks are becoming increasingly difficult as the Web grows. However, hardware performance and cost have improved dramatically to partially offset the difficulty. There are, however, several notable exceptions to this progress, such as disk seek time and operating system robustness. In designing Google, we have considered both the rate of growth of the Web and technological changes. Google is designed to scale well to extremely large data sets. It makes efficient use of storage space to store the index. Its data structures are optimized for fast and efficient access. Further, we expect that the cost to index and store text or HTML will eventually decline.

History

The rise of Gopher (created in 1991 by Mark McCahill at the University of Minnesota) led to two new search programs, Veronica and Jughead. Like Archie, they searched the file names and titles stored in Gopher index systems. Veronica (Very Easy Rodent-Oriented Net-wide Index to Computerized Archives) provided a keyword search of most Gopher menu titles in the entire Gopher listings. Jughead (Jonzy's Universal Gopher Hierarchy Excavation And Display) was a tool for obtaining menu information from specific Gopher servers. While the name of the search engine "Archie" was not a reference to the Archie comic book series, "Veronica" and "Jughead" are characters in the series, thus referencing their predecessor.

In the summer of 1993, no search engine yet existed for the web, though numerous specialized catalogues were maintained by hand. Oscar Nierstrasz at the University of Geneva wrote a series of Perl scripts that would periodically mirror these pages and rewrite them into a standard format. This formed the basis for W3Catalog, the web's first primitive search engine, released on September 2, 1993.

In June 1993, Matthew Gray, then at MIT, produced what was probably the first web robot, the Perl-based World Wide Web Wanderer, and used it to generate an index called "Wandex". The purpose of the Wanderer was to measure the size of the World Wide Web, which it did until late 1995. The web's second search engine, Aliweb, appeared in November 1993. Aliweb did not use a web robot, but instead depended on being notified by website administrators of the existence at each site of an index file in a particular format.

JumpStation (released in December 1993) used a web robot to find web pages and to build its index, and used a web form as the interface to its query program. It was thus the first WWW resource-discovery tool to combine the three essential features of a web search engine (crawling, indexing, and searching) described below. Because of the limited resources available on the platform on which it ran, its indexing, and hence its searching, were limited to the titles and headings found in the web pages the crawler encountered.

One of the first "full text" crawler-based search engines was WebCrawler, which came out in 1994. Unlike its predecessors, it let users search for any word in any webpage, which has become the standard for all major search engines since. It was also the first one to be widely known by the public. Also in 1994, Lycos (which started at Carnegie Mellon University) was launched and became a major commercial endeavor.

Soon after, many search engines appeared and vied for popularity. These included Magellan, Excite, Infoseek, Inktomi, Northern Light, and AltaVista. Yahoo! was among the most popular ways for people to find web pages of interest, but its search function operated on its web directory rather than on full-text copies of web pages. Information seekers could also browse the directory instead of doing a keyword-based search.

In 1996, Netscape was looking to give a single search engine an exclusive deal to be its featured search engine. There was so much interest that instead a deal was struck with Netscape by five of the major search engines, whereby for $5 million per year each search engine would be in a rotation on the Netscape search engine page. The five engines were Yahoo!, Magellan, Lycos, Infoseek, and Excite.

Search engines were also known as some of the brightest stars in the Internet investing frenzy that occurred in the late 1990s. Several companies entered the market spectacularly, receiving record gains during their initial public offerings. Some have taken down their public search engine and are marketing enterprise-only editions, such as Northern Light. Many search engine companies were caught up in the dot-com bubble, a speculation-driven market boom that peaked in 1999 and ended in 2001.

Around 2000, the Google search engine rose to prominence. The company achieved better results for many searches with an innovation called PageRank. This iterative algorithm ranks web pages based on the number and PageRank of other web sites and pages that link there, on the premise that good or desirable pages are linked to more than others. Google also maintained a minimalist interface to its search engine. In contrast, many of its competitors embedded a search engine in a web portal.

By 2000, Yahoo! was providing search services based on Inktomi's search engine. Yahoo! acquired Inktomi in 2002, and Overture (which owned AlltheWeb and AltaVista) in 2003. Yahoo! switched to Google's search engine until 2004, when it launched its own search engine based on the combined technologies of its acquisitions.

Microsoft first launched MSN Search in the fall of 1998 using search results from Inktomi. In early 1999 the site began to display listings from LookSmart blended with results from Inktomi, except for a short time in 1999 when results from AltaVista were used instead. In 2004, Microsoft began a transition to its own search technology, powered by its own web crawler (called msnbot).

Microsoft's rebranded search engine, Bing, was launched on June 1, 2009. On July 29, 2009, Yahoo! and Microsoft finalized a deal in which Yahoo! Search would be powered by Microsoft Bing technology.

Design Goals

Improved Search Quality

Our main goal is to improve the quality of web search engines. In 1994, some people believed that a complete search index would make it possible to find anything easily. According to Best of the Web 1994 -- Navigators, "The best navigation service should make it easy to find almost anything on the Web (once all the data is entered)." However, the Web of 1997 is quite different. Anyone who has used a search engine recently can readily testify that the completeness of the index is not the only factor in the quality of search results. "Junk results" often wash out any results that a user is interested in. In fact, as of November 1997, only one of the top four commercial search engines finds itself (returns its own search page in response to its name in the top ten results). One of the main causes of this problem is that the number of documents in the indices has been increasing by many orders of magnitude, but the user's ability to look at documents has not. People are still only willing to look at the first few tens of results. Because of this, as the collection size grows, we need tools that have very high precision (number of relevant documents returned, say, in the top tens of results). Indeed, we want our notion of "relevant" to include only the very best documents, since there may be tens of thousands of slightly relevant documents. This very high precision is important even at the expense of recall (the total number of relevant documents the system is able to return). There is quite a bit of recent optimism that the use of more hypertextual information can help improve search and other applications [Marchiori 97] [Spertus 97] [Weiss 96] [Kleinberg 98]. In particular, link structure and link text provide a lot of information for making relevance judgments and quality filtering. Google makes use of both link structure and anchor text.
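The precision/recall trade-off described above can be made concrete with a short sketch. Everything in it is illustrative: the document ids, the ranked result list, and the relevance judgments are hypothetical, not data from any real engine.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top_k = ranked[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def recall(ranked, relevant):
    """Fraction of all relevant documents that were returned at all."""
    return sum(1 for doc in ranked if doc in relevant) / len(relevant)

# Hypothetical engine output: 10 ranked doc ids; 4 docs are truly relevant.
ranked = [3, 17, 5, 42, 8, 21, 9, 13, 2, 30]
relevant = {3, 5, 42, 99}  # doc 99 was never returned

print(precision_at_k(ranked, relevant, 4))  # 3 of the top 4 are relevant -> 0.75
print(recall(ranked, relevant))             # 3 of the 4 relevant docs returned -> 0.75
```

Here high precision in the top few results matters more to the user than total recall: a user who only inspects the first page never sees document 99 anyway.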
Academic Search Engine Research

Aside from tremendous growth, the Web has also become increasingly commercial over time. In 1993, 1.5% of web servers were on .com domains. This number grew to over 60% in 1997. At the same time, search engines have migrated from the academic domain to the commercial. Up until now most search engine development has gone on at companies with little publication of technical details. This causes search engine technology to remain largely a black art and to be advertising oriented. With Google, we have a strong goal to push more development and understanding into the academic realm.

Another important design goal was to build systems that reasonable numbers of people can actually use. Usage was important to us because we think some of the most interesting research will involve leveraging the vast amount of usage data that is available from modern web systems. For example, there are many tens of millions of searches performed every day. However, it is very difficult to get this data, mainly because it is considered commercially valuable.

Our final design goal was to build an architecture that can support novel research activities on large-scale web data. To support novel research uses, Google stores all of the actual documents it crawls in compressed form. One of our main goals in designing Google was to set up an environment where other researchers can come in quickly, process large chunks of the web, and produce interesting results that would have been very difficult to produce otherwise. In the short time the system has been up, there have already been several papers using databases generated by Google, and many others are underway. Another goal we have is to set up a Spacelab-like environment where researchers or even students can propose and do interesting experiments on our large-scale web data.

System Features

The Google search engine has two important features that help it produce high precision results. First, it makes use of the link structure of the Web to calculate a quality ranking for each web page. This ranking is called PageRank and is described in detail in [Page 98]. Second, Google utilizes link text to improve search results.

PageRank: Bringing Order to the Web

The citation (link) graph of the web is an important resource that has largely gone unused in existing web search engines. We have created maps containing as many as 518 million of these hyperlinks, a significant sample of the total. These maps allow rapid calculation of a web page's "PageRank", an objective measure of its citation importance that corresponds well with people's subjective idea of importance. Because of this correspondence, PageRank is an excellent way to prioritize the results of web keyword searches. For most popular subjects, a simple text matching search that is restricted to web page titles performs admirably when PageRank prioritizes the results (demo available at google.stanford.edu). For the type of full text searches in the main Google system, PageRank also helps a great deal.
Description of PageRank Calculation

Academic citation literature has been applied to the web, largely by counting citations or backlinks to a given page. This gives some approximation of a page's importance or quality. PageRank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page. PageRank is defined as follows:

We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also, C(A) is defined as the number of links going out of page A. The PageRank of page A is given as follows:

PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.

PageRank, or PR(A), can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. Also, a PageRank for 26 million web pages can be computed in a few hours on a medium size workstation. There are many other details which are beyond the scope of this paper.

Intuitive Justification

PageRank can be thought of as a model of user behavior. We assume there is a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back", but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. And the damping factor d is the probability at each page that the "random surfer" will get bored and request another random page. One important variation is to add the damping factor d only to a single page, or a group of pages. This allows for personalization and can make it nearly impossible to deliberately mislead the system in order to get a higher ranking. We have several other extensions to PageRank.

Another intuitive justification is that a page can have a high PageRank if there are many pages that point to it, or if there are some pages that point to it and have a high PageRank. Intuitively, pages that are well cited from many places around the web are worth looking at. Also, pages that have perhaps only one citation from something like the Yahoo! homepage are also generally worth looking at. If a page was not high quality, or was a broken link, it is quite likely that Yahoo!'s homepage would not link to it. PageRank handles both these cases and everything in between by recursively propagating weights through the link structure of the web.

Other Features

Aside from PageRank and the use of anchor text, Google has several other features. First, it has location information for all hits, and so it makes extensive use of proximity in search. Second, Google keeps track of some visual presentation details, such as the font size of words. Words in a larger or bolder font are weighted higher than other words. Third, the full raw HTML of pages is available in a repository.
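The PR(A) formula and the "simple iterative algorithm" mentioned above can be sketched in a few lines. This is an illustrative implementation of the published formula, not Google's production code: the three-page example web and the choice of 50 iterations are assumptions for the demo, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages T linking to A.

    `links` maps each page to the list of pages it links to; C(T) is the
    out-degree len(links[T]).
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # start every page with PageRank 1
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # sum PR(T)/C(T) over every page T that links to `page`
            incoming = sum(pr[t] / len(links[t]) for t in pages if page in links[t])
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Tiny hypothetical three-page web: A and B link to each other, C links to A.
web = {"A": ["B"], "B": ["A"], "C": ["A"]}
ranks = pagerank(web)
```

Nothing links to C, so its rank settles at the minimum value 1 - d = 0.15, while A, which is cited by both other pages, ends up with the highest rank. (At web scale this O(n^2) inner loop would of course be replaced by a sparse link matrix.)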
How Search Engines Work

A search engine operates in the following order:

1. Web crawling
2. Indexing
3. Searching

Web search engines work by storing information about many web pages, which they retrieve from the HTML itself. These pages are retrieved by a Web crawler (sometimes also known as a spider) -- an automated Web browser which follows every link on the site. Exclusions can be made by the use of robots.txt. The contents of each page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). Data about web pages are stored in an index database for use in later queries. A query can be a single word. The purpose of an index is to allow information to be found as quickly as possible. Some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others, such as AltaVista, store every word of every page they find. This cached page always holds the actual search text, since it is the one that was actually indexed, so it can be very useful when the content of the current page has been updated and the search terms are no longer in it. This problem might be considered a mild form of linkrot, and Google's handling of it increases usability by satisfying user expectations that the search terms will be on the returned webpage. This satisfies the principle of least astonishment, since the user normally expects the search terms to be on the returned pages. Increased search relevance makes these cached pages very useful, even beyond the fact that they may contain data that may no longer be available elsewhere.

When a user enters a query into a search engine (typically by using keywords), the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. The index is built from the information stored with the data and the method by which the information is indexed. Unfortunately, there are currently no known public search engines that allow documents to be searched by date. Most search engines support the use of the boolean operators AND, OR and NOT to further specify the search query. Boolean operators are for literal searches that allow the user to refine and extend the terms of the search. The engine looks for the words or phrases exactly as entered. Some search engines provide an advanced feature called proximity search, which allows users to define the distance between keywords. There is also concept-based searching, where the research involves using statistical analysis on pages containing the words or phrases you search for. As well, natural language queries allow the user to type a question in the same form one would ask it to a human; an example of such a site is ask.com.

The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of web pages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. Most search engines employ methods to rank the results to provide the "best" results first. How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another. The methods also change over time as Internet usage changes and new techniques evolve. There are two main types of search engine that have evolved: one is a system of predefined and hierarchically ordered keywords that humans have programmed extensively. The other is a system that generates an "inverted index" by analyzing texts it locates. This second form relies much more heavily on the computer itself to do the bulk of the work.

Most Web search engines are commercial ventures supported by advertising revenue and, as a result, some employ the practice of allowing advertisers to pay money to have their listings ranked higher in search results. Those search engines which do not accept money for their search engine results make money by running search-related ads alongside the regular search engine results. The search engines make money every time someone clicks on one of these ads.

PROS AND CONS OF SEARCH ENGINES

PROS:
Search engines provide access to a fairly large portion of the publicly available pages on the Web, which itself is growing exponentially.

Search engines are the best means devised yet for searching the web. Stranded in the middle of this global electronic library of information without either a card catalog or any recognizable structure, how else are you going to find what you're looking for?

CONS:
On the down side, the sheer number of words indexed by search engines increases the likelihood that they will return hundreds of thousands of responses to simple search requests. Remember, they will return lengthy documents in which your keyword appears only once. Additionally, many of these responses will be irrelevant to your search.

ARE SEARCH ENGINES ALL THE SAME?

Search engines use selected software programs to search their indexes for matching keywords and phrases, presenting their findings to you in some kind of relevance ranking. Although software programs may be similar, no two search engines are exactly the same in terms of size, speed and content; no two search engines use exactly the same ranking schemes, and not every search engine offers you exactly the same search options. Therefore, your search is going to be different on every engine you use. The difference may not be a lot, but it could be significant. Recent estimates put search engine overlap at approximately 60 percent and unique content at around 40 percent.

WHEN DO WE USE SEARCH ENGINES?

Search engines are best at finding unique keywords, phrases, quotes, and information buried in the full text of web pages. Because they index word by word, search engines are also useful in retrieving tons of documents. If you want a wide range of responses to specific queries, use a search engine. There is an additional set of keywords just for searching Usenet.

NOTE: Today, the line between search engines and subject directories is blurring. Search engines no longer limit themselves to a search mechanism alone. Across the Web, they are partnering with subject directories, or creating their own directories, and returning results gathered from a variety of other guides and services as well.
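The "inverted index" and boolean AND retrieval described above can be illustrated with a minimal sketch. The three sample documents and the `search_and` helper are hypothetical; real engines add tokenization, stemming, positions (for proximity search), and ranking on top of this structure.

```python
# Minimal inverted index: map each word to the set of documents containing it,
# then answer a boolean AND query by intersecting the posting sets.
docs = {
    1: "web search engines crawl the web",
    2: "an index maps words to pages",
    3: "search engines rank pages",
}

inverted = {}  # word -> set of doc ids ("postings")
for doc_id, text in docs.items():
    for word in text.split():
        inverted.setdefault(word, set()).add(doc_id)

def search_and(*words):
    """Boolean AND: return ids of documents containing every query word."""
    sets = [inverted.get(w, set()) for w in words]
    return set.intersection(*sets) if sets else set()

print(search_and("search", "engines"))  # matches docs 1 and 3
```

The key property is that query time depends on the size of the posting sets, not on the number of documents scanned, which is why the index "allows information to be found as quickly as possible."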
Storage Statistics

  Lexicon                                     293 MB
  Temporary Anchor Data (not in total)        6.6 GB
  Document Index Incl. Variable Width Data    9.7 GB
  Links Database                              3.9 GB
  Total Without Repository                   55.2 GB
  Total With Repository                     108.7 GB

System Performance

It is important for a search engine to crawl and index efficiently. This way information can be kept up to date and major changes to the system can be tested relatively quickly. For Google, the major operations are Crawling, Indexing, and Sorting. It is difficult to measure how long crawling took overall because disks filled up, name servers
crashed, or any number of other problems stopped the system. In total it took roughly 9 days to download the 26 million pages (including errors). However, once the system was running smoothly, it ran much faster, downloading the last 11 million pages in just 63 hours, averaging just over 4 million pages per day, or 48.5 pages per second. We ran the indexer and the crawler simultaneously. The indexer ran just faster than the crawlers. This is largely because we spent just enough time optimizing the indexer so that it would not be a bottleneck. These optimizations included bulk updates to the document index and placement of critical data structures on the local disk. The indexer runs at roughly 54 pages per second. The sorters can be run completely in parallel; using four machines, the whole process of sorting takes about 24 hours.

Search Performance

Improving the performance of search was not the major focus of our research up to this point. The current version of Google answers most queries in between 1 and 10 seconds. This time is mostly dominated by disk IO over NFS (since disks are spread over a number of machines). Furthermore, Google does not have any optimizations such as query caching, subindices on common terms, and other common optimizations. We intend to speed up Google considerably through distribution and hardware, software, and algorithmic improvements. Our target is to be able to handle several hundred queries per second.

CODE FOR SEARCH ENGINE

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <title>search engine</title>
  <meta http-equiv="pragma" content="no-cache" />
  <meta name="title" content="www.cybwll.ch searchengine" />
  <meta name="description" content="www.cybwell.ch" />
  <meta name="robots" content="index, follow" />
  <meta name="revisit-after" content="3 days" />
  <meta name="author" content="www.guide-bleu.ch" />
  <meta name="publisher" content="CYBWELL MEDIA GmbH, Steinhausen, Switzerland" />
  <meta name="copyright" content="www.cybwell.ch" />
  <meta name="keywords" content="" />
  <link href="/cms/_styles/layout.css" rel="stylesheet" type="text/css" media="all" />
  <link href="/cms/_styles/general.css" rel="stylesheet" type="text/css" media="all" />
  <link href="/globalfiles/css/styles.css" rel="stylesheet" type="text/css" media="all" />
</head>
<body bgcolor="#FFFFFF">
  <p> </p>
  <table border="0" cellspacing="0" width="970" cellpadding="0">
    <tr>
      <td align="center" colspan="7" width="970"><img border="0" src="/images/cybwell.gif">
        <p> </p>
      </td>
    </tr>
    <tr>
      <td colspan="7" width="970" align="center">
        <FORM method="POST" action="default.asp">
          <INPUT class="forms" name="q" size="22" id="layout1"> <INPUT class="forms" type="submit" value="search" name="search"><br>
        </FORM>
      </td>
    </tr>
    <tr>
      <td width="8" valign="top" align="left" bgcolor="#ECECEA" bordercolor="#ECECEA"> </td>
      <td width="718" valign="top" align="left" bgcolor="#ECECEA" bordercolor="#ECECEA"> </td>
      <td width="8" valign="top" align="left" bgcolor="#ECECEA" bordercolor="#ECECEA"> </td>
      <td width="18" valign="top" align="left"></td>
      <td width="8" valign="top" align="left" bgcolor="#ECECEA"> </td>
      <td width="222" valign="top" align="left" bgcolor="#ECECEA" bordercolor="#ECECEA">
        <HR><H3>Related Searches</H3>
      </td>
      <td width="8" valign="top" align="left" bgcolor="#ECECEA" bordercolor="#ECECEA"> </td>
    </tr>
  </table>
</body>
</html>

Conclusions

Google is designed to be a scalable search engine. The primary goal is to provide high quality search results over a rapidly growing World Wide Web. Google employs a number of techniques to improve search quality, including PageRank, anchor text, and proximity information. Furthermore, Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.

Future Work

A large-scale web search engine is a complex system and much remains to be done. Our immediate goals are to improve search efficiency and to scale to approximately 100 million web pages. Some simple improvements to efficiency include query caching, smart disk allocation, and subindices. Another area which requires much research is updates. We must have smart algorithms to decide what old web pages should be recrawled and what new ones should be crawled. Some work toward this goal has already been done. One promising area of research is using proxy caches to build search databases, since they are demand driven. We are planning to add simple features supported by commercial search engines, like boolean operators, negation, and stemming. However, other features are just starting to be explored, such as relevance feedback and clustering (Google currently supports a simple hostname-based clustering). We also plan to support user context (like the user's location) and result summarization. We are also working to extend the use of link structure and link text. Simple experiments indicate PageRank can be personalized by increasing the weight of a user's home page or bookmarks. As for link text, we are experimenting with using text surrounding links in addition to the link text itself. A Web search engine is a very rich environment for research ideas. We have far too many to list here, so we do not
expect this Future Work section to become much shorter in the near future.

High Quality Search

The biggest problem facing users of web search engines today is the quality of the results they get back. While the results are often amusing and expand users' horizons, they are often frustrating and consume precious time. For example, the top result for a search for "Bill Clinton" on one of the most popular commercial search engines was the Bill Clinton Joke of the Day: April 14, 1997. Google is designed to provide higher quality search so that, as the Web continues to grow rapidly, information can be found easily. In order to accomplish this, Google makes heavy use of hypertextual information consisting of link structure and link (anchor) text. Google also uses proximity and font information. While evaluation of a search engine is difficult, we have subjectively found that Google returns higher quality search results than current commercial search engines. The analysis of link structure via PageRank allows Google to evaluate the quality of web pages. The use of link text as a description of what the link points to helps the search engine return relevant (and to some degree high quality) results. Finally, the use of proximity information helps increase relevance a great deal for many queries.

Scalable Architecture

Aside from the quality of search, Google is designed to scale. It must be efficient in both space and time, and constant factors are very important when dealing with the entire Web. In implementing Google, we have seen bottlenecks in CPU, memory access, memory capacity, disk seeks, disk throughput, disk capacity, and network IO. Google has evolved to overcome a number of these bottlenecks during various operations. Google's major data structures make efficient use of available storage space. Furthermore, the crawling, indexing, and sorting operations are efficient enough to be able to build an index of a substantial portion of the web -- 24 million pages -- in less than one week. We expect to be able to build an index of 100 million pages in less than a month.

A Research Tool

In addition to being a high quality search engine, Google is a research tool. The data Google has collected has already resulted in many other papers submitted to conferences, and many more are on the way. Recent research such as [Abiteboul 97] has shown a number of limitations to queries about the Web that may be answered without having the Web available locally. This means that Google (or a similar system) is not only a valuable research tool but a necessary one for a wide range of applications. We hope Google will be a resource for searchers and researchers all around the world and will spark the next generation of search engine technology.

REFERENCES

• Google Search Engine: http://google.stanford.edu/
• Search Engine Watch: http://www.searchenginewatch.com/
• http://www.monash.com/spidap3.html
• http://www.webreference.com
• http://uk.searchengine.com/
• http://en.wikipedia.org/wiki/Web_search_engine
• http://searchengine.com/
• http://www.searchenginecommando.com/articles/titles/16.html
• http://www.searchengineguide.com