
A

Seminar Report
On
Working of web search engine
Submitted in partial fulfillment of the requirement
For the award of the degree
Of
Bachelor of Engineering
In
Information Technology

Submitted by:                                        Guide:

Sachin Sharma                                        Dr. K.R. Chowdhary
B.E. Final Year                                      Professor, CSE Dept.

Department of Computer Science and Engineering


M.B.M. Engineering College, Faculty of Engineering,
Jai Narain Vyas University
Jodhpur (Rajasthan) – 342001

Session 2008-09

CANDIDATE’S DECLARATION

I hereby declare that the work which is being presented in the Seminar entitled
“Working of Web Search Engine”, in partial fulfillment of the requirement for the
award of the degree of Bachelor of Engineering in Information Technology, submitted in
the Department of Computer Science and Engineering, M.B.M. Engineering College,
Jodhpur (Rajasthan), is an authentic record of my own work carried out during the
period from February 2009 to May 2009 under the supervision of Dr. K.R. Chowdhary,
Professor, Department of Computer Science and Engineering, M.B.M. Engineering
College, Jodhpur (Rajasthan).

The matter embodied in this project has not been submitted by me for the award of any
other degree. I also declare that the matter of the seminar is not reproduced ‘as it is’ from
any source.

Date:

Place: Jodhpur (SACHIN SHARMA)

CERTIFICATE

This is to certify that the above statement made by the candidate is correct to the best of
my knowledge.

Dr. K.R. Chowdhary

Professor
Department of Computer Science and Engineering
M.B.M. Engineering College,
Jodhpur (Rajasthan) – 342001

Contents
1. Introduction
2. Types of search engine
3. General system architecture of web search engine
   3.1. Web crawling
        3.1.1. Types of crawling
               3.1.1.1. Focused crawling
               3.1.1.2. Distributed crawling
        3.1.2. Robot exclusion protocol
        3.1.3. Resource constraints
   3.2. Web indexing
        3.2.1. Index design factors
        3.2.2. Index data structures
        3.2.3. Types of indexing
               3.2.3.1. Inverted index
               3.2.3.2. Forward index
        3.2.4. Latent Semantic Indexing (LSI)
               3.2.4.1. What is LSI
               3.2.4.2. How LSI works
               3.2.4.3. Singular Value Decomposition (SVD)
               3.2.4.4. Stemming
               3.2.4.5. The term-document matrix
        3.2.5. Challenges in parallelism
4. Meta search engine
5. Search engine optimization
   5.1. Page Rank
   5.2. The ranking algorithm simplified
   5.3. Damping factor
   5.4. Uses of Page Rank
6. Marketing of search engines
7. Summary

Abstract

Exploring the content of web pages for automatic indexing is of fundamental importance
for efficient e-commerce and other applications of the Web. It enables users, including
customers and businesses, to locate the best sources for their use. Today’s search
engines use one of two approaches to indexing web pages. They either:

 Analyze the frequency of the words (after filtering out common or meaningless
words) appearing in the entire text of the target web page or in a part of it
(typically a title, an abstract or the first 300 words), or

 Use sophisticated algorithms to take into account associations of words in
the indexed web page.

In both cases only words appearing in the web page in question are used in the analysis.
Often, to increase the relevance of the selected terms to the potential searches, the
indexing is refined by human processing.

To identify so-called “authority” or “expert” pages, some search engines use the
structure of the links between pages to identify pages that are often referenced by other
pages. The approach used in the Google search engine assigns each page a score that
depends on the frequency with which the page is visited by web surfers.

1. Introduction

A search engine is an information retrieval system designed to help find information
stored on a computer system. Search engines help to minimize the time required to find
information and the amount of information which must be consulted, akin to other
techniques for managing information overload. The most public, visible form of a search
engine is a Web search engine which searches for information on the World Wide Web.
Engineering a web search engine is a challenging task. Search engines index
tens to hundreds of millions of web pages involving a comparable number of distinct
terms. They answer tens of millions of queries every day. Despite the importance of
large-scale search engines on the web, very little academic research has been
conducted on them. Furthermore, due to rapid advances in technology and web
proliferation, creating a web search engine today is very different from what it was three
years ago.

There are differences in the ways various search engines work, but they all perform
three basic tasks:

 They search the Internet or select pieces of the Internet based on
important words.

 They keep an index of the words they find, and where they find them.

 They allow users to look for words or combinations of words found in that
index.

The most important measures for a search engine are its search performance, the quality
of its results, and its ability to crawl and index the web efficiently. The primary goal is to
provide high quality search results over a rapidly growing World Wide Web. Some of the
efficient and recommended search engines are Google, Yahoo and Teoma, which
share some common features and are standardized to some extent.


2. Types of search engine

Search engines provide an interface to a group of items that enables users to specify
criteria about an item of interest and have the engine find the matching items. The
criteria are referred to as a search query. In the case of text search engines, the search
query is typically expressed as a set of words that identify the desired concept that one
or more documents may contain. There are several styles of search query syntax that
vary in strictness. Whereas some text search engines require users to enter two or three
words separated by white space, other search engines may enable users to specify entire
documents, pictures, sounds, and various forms of natural language. Some search
engines apply improvements to search queries to increase the likelihood of providing a
quality set of items through a process known as query expansion.

3. General system architecture of web search engine

This section provides an overview of how the whole system of a search engine works.
The major functions of a search engine, namely crawling, indexing and searching, are
covered in detail in the later sub-sections.

Before a search engine can tell you where a file or document is, it must be found.
To find information on the hundreds of millions of Web pages that exist, a typical search
engine employs special software robots, called spiders, to build lists of the words found
on Websites. When a spider is building its lists, the process is called Web crawling. A
Web crawler is a program, which automatically traverses the web by downloading
documents and following links from page to page. They are mainly used by web search
engines to gather data for indexing. Other possible applications include page validation,
structural analysis and visualization; update notification, mirroring and personal web
assistants/agents etc. Web crawlers are also known as spiders, robots, worms etc.
Crawlers are automated programs that follow the links found on the web pages. There
is a URL Server that sends lists of URLs to be fetched to the crawlers. The web pages
that are fetched are then sent to the store server. The store server then compresses
and stores the web pages into a repository. Every web page has an associated ID
number called a doc ID, which is assigned whenever a new URL is parsed out of a web
page. The indexer and the sorter perform the indexing function.

The indexer performs a number of functions. It reads the repository,
uncompresses the documents, and parses them. Each document is converted into a set
of word occurrences called hits. The hits record the word, position in document, an
approximation of font size, and capitalization. The indexer distributes these hits into a
set of "barrels", creating a partially sorted forward index.

The indexer performs another important function. It parses out all the links in
every web page and stores important information about them in an anchors file. This file
contains enough information to determine where each link points from and to, and the
text of the link. The URL Resolver reads the anchors file and converts relative URLs into
absolute URLs and in turn into doc IDs. It puts the anchor text into the forward index,
associated with the doc ID that the anchor points to.

It also generates a database of links, which are pairs of doc IDs. The links
database is used to compute Page Ranks for all the documents. The sorter takes the
barrels, which are sorted by doc ID and resorts them by word ID to generate the
inverted index. This is done in place so that little temporary space is needed for this
operation. The sorter also produces a list of word IDs and offsets into the inverted index.
A program called Dump Lexicon takes this list together with the lexicon produced by the
indexer and generates a new lexicon to be used by the searcher.

A lexicon lists all the terms occurring in the index along with some term-level
statistics (e.g., the total number of documents in which a term occurs) that are used by the
ranking algorithms. The searcher is run by a web server and uses the lexicon built by
Dump Lexicon together with the inverted index and the Page Ranks to answer queries.

3.1. Web crawling

Web crawlers are an essential component of search engines; running a web crawler is a
challenging task. There are tricky performance and reliability issues and, even more
importantly, there are social issues. Crawling is the most fragile application since it
involves interacting with hundreds of thousands of web servers and various name
servers, which are all beyond the control of the system. Web crawling speed is
governed not only by the speed of one’s own Internet connection, but also by the speed
of the sites that are to be crawled. Especially if one is crawling a site from multiple
servers, the total crawling time can be significantly reduced if many downloads are
done in parallel. Despite the numerous applications for Web crawlers, at the core they
are all fundamentally the same. Following is the process by which Web crawlers work
(a minimal code sketch follows the list):

 Download the Web page.

 Parse through the downloaded page and retrieve all the links.

 For each link retrieved, repeat the process.

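In code, this loop takes only a few lines. Below is a minimal Python sketch using only the
standard library; the regular-expression link extraction, the breadth-first queue and the
page limit are illustrative simplifications, and the politeness rules, robots.txt checks and
richer error handling that a real crawler needs are omitted.

import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
from urllib.error import URLError

LINK_RE = re.compile(r'href="([^"#]+)"')   # crude link extraction, for illustration only

def crawl(start_url, max_pages=100):
    """Breadth-first crawl: download a page, extract its links, repeat for each new link."""
    seen = {start_url}
    frontier = deque([start_url])
    while frontier and len(seen) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")  # 1. download
        except (URLError, ValueError):
            continue                                    # skip pages that cannot be fetched
        for href in LINK_RE.findall(html):              # 2. parse out all the links
            link = urljoin(url, href)                   #    resolve relative URLs
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)                   # 3. repeat the process for each new link
    return seen
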
The Web crawler can be used for crawling through a whole site on the Inter-/Intranet.
You specify a start URL and the crawler follows all links found in that HTML page. This
usually leads to more links, which will be followed again, and so on. A site can be seen
as a tree structure: the root is the start URL, all links in that root HTML page are direct
children of the root, and subsequent links are children of the previous ones.

3.1.1. Types of crawling

Crawlers are basically of two types.

3.1.1.1. Focused crawling

A general purpose Web crawler gathers as many pages as it can from a particular set of
URLs, whereas a focused crawler is designed to gather documents only on a specific
topic, thus reducing the amount of network traffic and downloads. The goal of the
focused crawler is to selectively seek out pages that are relevant to a pre-defined set of
topics. The topics are specified not using keywords, but using exemplary documents.

Rather than collecting and indexing all accessible web documents to be able to
answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to
find the links that are likely to be most relevant for the crawl, and avoids irrelevant
regions of the web. This leads to significant savings in hardware and network resources,
and helps keep the crawl more up-to-date. The focused crawler has three main
components: a classifier, which makes relevance judgments on crawled pages to
decide on link expansion; a distiller, which determines a measure of centrality of crawled
pages to determine visit priorities; and a crawler with dynamically reconfigurable priority
controls, which is governed by the classifier and distiller.

The most crucial evaluation of focused crawling is to measure the harvest ratio,
which is the rate at which relevant pages are acquired and irrelevant pages are
effectively filtered off from the crawl. This harvest ratio must be high, otherwise the
focused crawler would spend a lot of time merely eliminating irrelevant pages, and it
may be better to use an ordinary crawler instead.

3.1.1.2. Distributed crawling

Indexing the web is a challenge due to its growing and dynamic nature. As the size of
the Web grows, it has become imperative to parallelize the crawling process in order
to finish downloading the pages in a reasonable amount of time. A single crawling
process, even if multithreading is used, will be insufficient for large-scale engines that
need to fetch large amounts of data rapidly. When a single centralized crawler is used,
all the fetched data passes through a single physical link. Distributing the crawling
activity via multiple processes can help build a scalable, easily configurable and
fault-tolerant system. Splitting the load decreases hardware requirements and
at the same time increases the overall download speed and reliability. Each task is
performed in a fully distributed fashion, that is, no central coordinator exists.
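
One simple way to split the work without a central coordinator is to give each crawler
process a fixed share of the URL space, for example by hashing the host name. The Python
sketch below assumes such a hash-partitioning scheme; production systems use more
elaborate assignment, politeness and fault-tolerance logic.

import hashlib
from urllib.parse import urlparse

def responsible_crawler(url, num_crawlers):
    """Map a URL to the crawler process responsible for fetching it.
    Hashing the host keeps all pages of one site on the same crawler,
    which also makes per-site politeness easier to enforce."""
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers

# A crawler with id k simply skips every discovered URL
# for which responsible_crawler(url, N) != k.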

3.1.2. Robot exclusion protocol

Web sites also often have restricted areas that crawlers should not crawl. To address
these concerns, many Web sites adopted the Robot protocol, which establishes
guidelines that crawlers should follow. Over time, the protocol has become the unwritten
law of the Internet for Web crawlers. The Robot protocol specifies that Web sites
wishing to restrict certain areas or pages from crawling have a file called robots.txt
placed at the root of the Web site. The ethical crawlers will then skip the disallowed
areas. Following is an example robots.txt file and an explanation of its format:

# Robots.txt for http://somehost.com/

User-agent: *

Disallow: /cgi-bin/

Disallow: /registration # Disallow robots on registration page

Disallow: /login

The first line of the sample file has a comment on it, as denoted by the use of a hash (#)
character. Crawlers reading robots.txt files should ignore any comments. The second line
of the sample file specifies the User-agent to which the Disallow rules following it apply.
User-agent is a term used for the programs that access a Web site. Each browser has a
unique User-agent value that it sends along with each request to a Web server.
However, typically Web sites want to disallow all robots (or User-agents) access to
certain areas, so they use a value of asterisk (*) for the User-agent. This specifies that
the rules that follow apply to all User-agents. The lines following the User-agent
line are called disallow statements. The disallow statements define the Web site
paths that crawlers are not allowed to access. For example, the first disallow statement
in the sample file tells crawlers not to crawl any links that begin with “/cgi-bin/”. Thus,
the following URLs are both off limits to crawlers according to that line.

http://somehost.com/cgi-bin

http://somehost.com/cgi-bin/register
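
Python’s standard library includes a parser for this format (urllib.robotparser), so an
ethical crawler does not need to interpret robots.txt by hand. A minimal sketch that feeds
the example file above straight to the parser:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""
# Robots.txt for http://somehost.com/
User-agent: *
Disallow: /cgi-bin/
Disallow: /registration
Disallow: /login
""".splitlines())

# can_fetch() applies the Disallow rules for the given User-agent
print(rp.can_fetch("*", "http://somehost.com/cgi-bin/register"))   # False: under /cgi-bin/
print(rp.can_fetch("*", "http://somehost.com/index.html"))         # True: not disallowed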

3.1.3. Resource Constraints

Crawlers consume resources: network bandwidth to download pages, memory to
maintain private data structures in support of their algorithms, CPU to evaluate and
select URLs, and disk storage to store the text and links of fetched pages as well as
other persistent data.

3.2. Web Indexing

Search engine indexing collects, parses, and stores data to facilitate fast and accurate
information retrieval. Index design incorporates interdisciplinary concepts from
linguistics, cognitive psychology, mathematics, informatics, physics and computer
science. An alternate name for the process in the context of search engines designed to
find web pages on the Internet is Web indexing.

The purpose of storing an index is to optimize speed and performance in finding
relevant documents for a search query. Without an index, the search engine would scan
every document in the corpus, which would require considerable time and computing
power. For example, while an index of 10,000 documents can be queried within
milliseconds, a sequential scan of every word in 10,000 large documents could take
hours. The additional computer storage required to store the index, as well as the
considerable increase in the time required for an update to take place, are traded off for
the time saved during information retrieval.

3.2.1. Index design factors

Major factors in designing a search engine's architecture include:

 Merge factors: How data enters the index, or how words or subject features are
added to the index during text corpus traversal, and whether multiple indexers
can work asynchronously. The indexer must first check whether it is updating old
content or adding new content. Traversal typically correlates to the data
collection policy. Search engine index merging is similar in concept to the SQL
Merge command and other merge algorithms.

 Storage techniques: How to store the index data, that is, whether information
should be data compressed or filtered.

 Index size: How much computer storage is required to support the index.

 Lookup speed: How quickly a word can be found in the inverted index. The
speed of finding an entry in a data structure, compared with how quickly it can be
updated or removed, is a central focus of computer science.

 Maintenance: How the index is maintained over time.

 Fault tolerance: How important it is for the service to be reliable. Issues include
dealing with index corruption, determining whether bad data can be treated in
isolation, dealing with bad hardware, partitioning, and schemes such as hash-based
or composite partitioning, as well as replication.

3.2.2. Index data structures



Search engine architectures vary in the way indexing is performed and in methods of
index storage to meet the various design factors. Types of indices include:

 Suffix tree: It is figuratively structured like a tree, supports linear time lookup.
Built by storing the suffixes of words. The suffix tree is a type of trie. Tries
support extendable hashing, which is important for search engine indexing. Used
for searching for patterns in DNA sequences and clustering. A major drawback is
that the storage of a word in the tree may require more storage than storing the
word itself. An alternate representation is a suffix array, which is considered to
require less virtual memory and supports data compression such as the BWT
algorithm.

 Trie: An ordered tree data structure that is used to store an associative array
where the keys are strings. Regarded as faster than a hash table but less space-
efficient. (A small sketch of a trie follows this list.)

 Inverted index: Stores a list of occurrences of each atomic search criterion,
typically in the form of a hash table or binary tree.

 Citation index: Stores citations or hyperlinks between documents to support
citation analysis, a subject of Bibliometrics.

 Ngram index: Stores sequences of length n of data to support other types of
retrieval or text mining.

 Term document matrix: Used in latent semantic analysis, stores the occurrences
of words in documents in a two-dimensional sparse matrix.
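
As an illustration of the trie mentioned in the list above, the following Python sketch stores
terms in nested dictionaries and attaches postings at the end of each word; lookup time
depends on the length of the key rather than on the number of stored terms. The "$postings"
marker and the example documents are invented for illustration.

def trie_insert(root, word, doc_id):
    """Insert a word into the trie and attach a posting (doc_id) at its end."""
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node.setdefault("$postings", set()).add(doc_id)   # "$postings" marks the end of a word

def trie_lookup(root, word):
    """Return the set of documents containing the word, or an empty set."""
    node = root
    for ch in word:
        if ch not in node:
            return set()
        node = node[ch]
    return node.get("$postings", set())

index = {}
trie_insert(index, "cow", "Doc2")
trie_insert(index, "cow", "Doc3")
print(trie_lookup(index, "cow"))   # {'Doc2', 'Doc3'} (order may vary)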

3.2.3. Types of indexing

Indexing is basically of two types.

3.2.3.1. Inverted Index:

Many search engines incorporate an inverted index when evaluating a search query to
quickly locate documents containing the words in a query and then rank these
documents by relevance. Because the inverted index stores a list of the documents
containing each word, the search engine can use direct access to find the documents
associated with each word in the query in order to retrieve the matching documents
quickly. The following is a simplified illustration of an inverted index:

Word Documents

the Doc1, Doc3, Doc4, Doc5

cow Doc2, Doc3, Doc4

says Doc5

moo Doc7

This index can only determine whether a word exists within a particular document,
since it stores no information regarding the frequency and position of the word; it is
therefore considered to be a Boolean index. Such an index determines which
documents match a query but does not rank matched documents. In some designs the
index includes additional information such as the frequency of each word in each
document or the positions of a word in each document. Position information enables the
search algorithm to identify word proximity to support searching for phrases; frequency
can be used to help in ranking the relevance of documents to the query. Such topics are
the central research focus of information retrieval. The inverted index is a sparse matrix,
since not all words are present in each document. To reduce computer storage memory
requirements, it is stored differently from a two dimensional array. The index is similar to
the term document matrices employed by latent semantic analysis. The inverted index
can be considered a form of a hash table. In some cases the index is a form of a binary
tree, which requires additional storage but may reduce the lookup time. In larger indices
the architecture is typically a distributed hash table.
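
A minimal in-memory version of such an index can be held in a dictionary that maps each
word to the documents, and positions within them, where it occurs. The two example
documents below are invented for illustration; a real engine stores its postings in
compressed form on disk.

from collections import defaultdict

docs = {
    "Doc1": "the cow says moo",
    "Doc2": "the cat and the hat",
}

# word -> {doc_id: [positions]}
inverted = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, word in enumerate(text.split()):
        inverted[word].setdefault(doc_id, []).append(pos)

print(inverted["the"])   # {'Doc1': [0], 'Doc2': [0, 3]}
print(inverted["moo"])   # {'Doc1': [3]}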

The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge
but first deletes the contents of the inverted index. The architecture may be designed to
support incremental indexing, where a merge identifies the document or documents to
be added or updated and then parses each document into words. For technical
accuracy, a merge conflates newly indexed documents, typically residing in virtual
memory, with the index cache residing on one or more computer hard drives.

After parsing, the indexer adds the referenced document to the document list for
the appropriate words. In a larger search engine, the process of finding each word in the
inverted index (in order to report that it occurred within a document) may be too time
consuming, and so this process is commonly split up into two parts, the development of
a forward index and a process which sorts the contents of the forward index into the
inverted index. The inverted index is so named because it is an inversion of the forward
index.

3.2.3.2. Forward Index:

The forward index stores a list of words for each document. The following is a simplified
form of the forward index:

Document    Words

Doc1        the, cow, says, moo
Doc2        the, cat, and, the, hat
Doc3        the, dish, ran, away, with, the, spoon



The rationale behind developing a forward index is that as documents are being parsed, it
is better to immediately store the words per document. The delineation enables
asynchronous system processing, which partially circumvents the inverted index update
bottleneck. The forward index is sorted to transform it to an inverted index. The forward
index is essentially a list of pairs consisting of a document and a word, collated by the
document. Converting the forward index to an inverted index is only a matter of sorting
the pairs by the words. In this regard, the inverted index is a word-sorted forward index.
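
That conversion can be shown directly: flatten the forward index into (word, document)
pairs, sort the pairs by word, and group them. A small Python sketch using the example
documents from the tables above:

from itertools import groupby

forward = {
    "Doc1": ["the", "cow", "says", "moo"],
    "Doc2": ["the", "cat", "and", "the", "hat"],
}

# Flatten the forward index into (word, document) pairs ...
pairs = [(word, doc_id) for doc_id, words in forward.items() for word in words]

# ... and sort by word: the inverted index is a word-sorted forward index.
pairs.sort()
inverted = {word: sorted({doc for _, doc in group})
            for word, group in groupby(pairs, key=lambda p: p[0])}

print(inverted["the"])   # ['Doc1', 'Doc2']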

3.2.4. Latent Semantic Indexing (LSI)

3.2.4.1. What is LSI:

Regular keyword searches approach a document collection with a kind of accountant
mentality: a document contains a given word or it doesn't, without any middle ground.
We create a result set by looking through each document in turn for certain keywords
and phrases, tossing aside any documents that don't contain them, and ordering the
rest based on some ranking system. Each document stands alone in judgement before
the search algorithm - there is no interdependence of any kind between documents,
which are evaluated solely on their contents.

Latent semantic indexing adds an important step to the document indexing
process. In addition to recording which keywords a document contains, the method
examines the document collection as a whole, to see which other documents contain
some of those same words. LSI considers documents that have many words in common
to be semantically close, and ones with few words in common to be semantically
distant. This simple method correlates surprisingly well with how a human being, looking
at content, might classify a document collection. Although the LSI algorithm doesn't
understand anything about what the words mean, the patterns it notices can make it
seem astonishingly intelligent.

When you search an LSI-indexed database, the search engine looks at similarity values
it has calculated for every content word, and returns the documents that it thinks best fit
the query. Because two documents may be semantically very close even if they do not
share a particular keyword, LSI does not require an exact match to return useful results.
Where a plain keyword search will fail if there is no exact match, LSI will often return
relevant documents that don't contain the keyword at all.

To use an earlier example, let's say we use LSI to index our collection of mathematical
articles. If the words n-dimensional, manifold and topology appear together in enough
articles, the search algorithm will notice that the three terms are semantically close. A
search for n-dimensional manifolds will therefore return a set of articles containing that
phrase (the same result we would get with a regular search), but also articles that
contain just the word topology. The search engine understands nothing about
mathematics, but examining a sufficient number of documents teaches it that the three
terms are related. It then uses that information to provide an expanded set of results
with better recall than a plain keyword search.

3.2.4.2. How LSI Works:

Natural language is full of redundancies, and not every word that appears in a
document carries semantic meaning. In fact, the most frequently used words in English
are words that don't carry content at all: functional words, conjunctions, prepositions,
auxiliary verbs and others. The first step in doing LSI is culling all those extraneous
words from a document, leaving only content words likely to have semantic meaning.
There are many ways to define a content word - here is one recipe for generating a list
of content words from a document collection:

 Make a complete list of all the words that appear anywhere in the collection
 Discard articles, prepositions, and conjunctions
 Discard common verbs (know, see, do, be)
 Discard pronouns
 Discard common adjectives (big, late, high)
 Discard frilly words (therefore, thus, however, albeit, etc.)
 Discard any words that appear in every document
 Discard any words that appear in only one document

This process condenses our documents into sets of content words that we can then use
to index our collection.

Using our list of content words and documents, we can now generate a term-
document matrix. This is a fancy name for a very large grid, with documents listed
along the horizontal axis, and content words along the vertical axis. For each content
word in our list, we go across the appropriate row and put an 'X' in the column for any
document where that word appears. If the word does not appear, we leave that column
blank.

Doing this for every word and document in our collection gives us a mostly empty
grid with a sparse scattering of X-es. This grid displays everything that we know about
our document collection. We can list all the content words in any given document by
looking for X-es in the appropriate column, or we can find all the documents containing
a certain content word by looking across the appropriate row.

Notice that our arrangement is binary - a square in our grid either contains an X, or it
doesn't. This big grid is the visual equivalent of a generic keyword search, which looks
for exact matches between documents and keywords. If we replace blanks and X-es
with zeroes and ones, we get a numerical matrix containing the same information.
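
Building that numerical grid from per-document content-word lists takes only a few lines.
A minimal Python sketch, with three tiny invented documents:

content_words = {
    "doc_a": ["star", "earth"],
    "doc_b": ["star", "sun"],
    "doc_c": ["moon", "sun"],
}

terms = sorted({w for words in content_words.values() for w in words})
doc_ids = sorted(content_words)

# Binary term-document matrix: rows are terms, columns are documents.
tdm = [[1 if term in content_words[d] else 0 for d in doc_ids] for term in terms]

for term, row in zip(terms, tdm):
    print(f"{term:8s}", row)
# earth    [1, 0, 0]
# moon     [0, 0, 1]
# star     [1, 1, 0]
# sun      [0, 1, 1]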

The key step in LSI is decomposing this matrix using a technique called singular value
decomposition. The mathematics of this transformation is beyond the scope of this
article.

Imagine that you are curious about what people typically order for breakfast
down at your local diner, and you want to display this information in visual form. You
decide to examine all the breakfast orders from a busy weekend day, and record how
many times the words bacon, eggs and coffee occur in each order.

You can graph the results of your survey by setting up a chart with three orthogonal
axes - one for each keyword. The choice of direction is arbitrary - perhaps a bacon axis
in the x direction, an eggs axis in the y direction, and the all-important coffee axis in the
z direction. To plot a particular breakfast order, you count the occurrence of each
keyword, and then take the appropriate number of steps along the axis for that word.
When you are finished, you get a cloud of points in three-dimensional space,
representing all of that day's breakfast orders.

If you draw a line from the origin of the graph to each of these points, you obtain
a set of vectors in 'bacon-eggs-and-coffee' space. The size and direction of each vector
tells you how many of the three key items were in any particular order, and the set of all
the vectors taken together tells you something about the kind of breakfast people favor
on a Saturday morning.

What your graph shows is called a term space. Each breakfast order forms a vector in
that space, with its direction and magnitude determined by how many times the three
keywords appear in it. Each keyword corresponds to a separate spatial direction,
perpendicular to all the others. Because our example uses three keywords, the resulting
term space has three dimensions, making it possible for us to visualize it. It is easy to
see that this space could have any number of dimensions, depending on how many
keywords we chose to use. If we were to go back through the orders and also record
occurrences of sausage, muffin, and bagel, we would end up with a six-dimensional
term space, and six-dimensional document vectors.

Applying this procedure to a real document collection, where we note each use of a
content word, results in a term space with many thousands of dimensions. Each
document in our collection is a vector with as many components as there are content
words. Although we can't possibly visualize such a space, it is built in the exact same
way as the whimsical breakfast space we just described. Documents in such a space
that have many words in common will have vectors that are near to each other, while
documents with few shared words will have vectors that are far apart.

Latent semantic indexing works by projecting this large, multidimensional space down
into a smaller number of dimensions. In doing so, keywords that are semantically similar
will get squeezed together, and will no longer be completely distinct. This blurring of
boundaries is what allows LSI to go beyond straight keyword matching. To understand
how it takes place, we can use another analogy.

3.2.4.3. Singular Value Decomposition:

Imagine you keep tropical fish, and are proud of your prize aquarium - so proud that you
want to submit a picture of it to Modern Aquaria magazine, for fame and profit. To get
the best possible picture, you will want to choose a good angle from which to take the
photo. You want to make sure that as many of the fish as possible are visible in your
picture, without being hidden by other fish in the foreground. You also won't want the
fish all bunched together in a clump, but rather shot from an angle that shows them
nicely distributed in the water. Since your tank is transparent on all sides, you can take
a variety of pictures from above, below, and from all around the aquarium, and select
the best one.

In mathematical terms, you are looking for an optimal mapping of points in 3-space (the
fish) onto a plane (the film in your camera). 'Optimal' can mean many things - in this
case it means 'aesthetically pleasing'. But now imagine that your goal is to preserve the
relative distance between the fish as much as possible, so that fish on opposite sides of
the tank don't get superimposed in the photograph to look like they are right next to
each other. Here you would be doing exactly what the SVD algorithm tries to do with a
much higher-dimensional space.

Instead of mapping 3-space to 2-space, however, the SVD algorithm goes to much
greater extremes. A typical term space might have tens of thousands of dimensions,
and be projected down into fewer than 150. Nevertheless, the principle is exactly the
same. The SVD algorithm preserves as much information as possible about the relative
distances between the document vectors, while collapsing them down into a much
smaller set of dimensions. In this collapse, information is lost, and content words are
superimposed on one another.
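
The reduction itself is a short piece of linear algebra. The Python sketch below uses NumPy's
SVD routine, keeps only the k largest singular values, and recombines the factors; real
systems keep on the order of a hundred or more dimensions, while k = 2 is used here only
because the toy matrix is tiny.

import numpy as np

# Toy weighted term-document matrix: rows = terms, columns = documents.
A = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])

k = 2                                          # number of dimensions to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Zero out all but the k largest singular values and recombine:
# the result has the same shape as A but far less "noise".
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_k, 3))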

Information loss sounds like a bad thing, but here it is a blessing. What we are losing is
noise from our original term-document matrix, revealing similarities that were latent in
the document collection. Similar things become more similar, while dissimilar things
remain distinct. This reductive mapping is what gives LSI its seemingly intelligent
behavior of being able to correlate semantically related terms. We are really exploiting a
property of natural language, namely that words with similar meaning tend to occur
together.

While a discussion of the mathematics behind singular value decomposition is beyond
the scope of our article, it's worthwhile to follow the process of creating a term-document
matrix in some detail, to get a feel for what goes on behind the scenes. Here
we will process a sample wire story to demonstrate how real-life texts get converted into
the numerical representation we use as input for our SVD algorithm.

The first step in the chain is obtaining a set of documents in electronic form. This can be
the hardest thing about LSI - there are all too many interesting collections not yet
available online. In our experimental database, we download wire stories from an online
newspaper with an AP news feed. A script downloads each day's news stories to a local
disk, where they are stored as text files.

Let's imagine we have downloaded the following sample wire story, and want to
incorporate it in our collection:

O'Neill Criticizes Europe on Grants


PITTSBURGH (AP)

Treasury Secretary Paul O'Neill expressed
irritation on Wednesday that European
countries have refused to go along with a
U.S. proposal to boost the amount of
direct grants rich nations offer poor
countries.
The Bush administration is pushing a plan
to increase the amount of direct grants
the World Bank provides the poorest
nations to 50 percent of assistance,
reducing use of loans to these nations.

The first thing we do is strip all formatting from the article, including capitalization,
punctuation, and extraneous markup (like the dateline). LSI pays no attention to word
order, formatting, or capitalization, so it can safely discard that information. Our cleaned-
up wire story looks like this:

O’Neill criticizes Europe on grants


treasury secretary Paul O’Neill expressed
irritation Wednesday that European
countries have refused to go along with a
us proposal to boost the amount of direct
grants rich nations offer poor countries
the bush administration is pushing a plan
to increase the amount of direct grants
the world bank provides the poorest
nations to 50 percent of assistance
reducing use of loans to these nations

The next thing we want to do is pick out the content words in our article. These are the
words we consider semantically significant - everything else is clutter. We do this by
applying a stop list of commonly used English words that don't carry semantic
meaning. Using a stop list greatly reduces the amount of noise in our collection, as well
as eliminating a large number of words that would make the computation more difficult.
Creating a stop list is something of an art - stop lists depend very much on the nature of
the data collection.
Here is our sample story again, before the stop-list words are removed:

O’Neill criticizes Europe on grants


treasury secretary Paul O’Neill expressed
irritation Wednesday that European
countries have refused to go along with a
US proposal to boost the amount of direct
grants rich nations offer poor countries
the bush administration is pushing a plan
to increase the amount of direct grants
the world bank provides the poorest
nations to 50 percent of assistance
reducing use of loans to these nations

Removing these stop words leaves us with an abbreviated version of the article
containing content words only:

O’Neill criticizes Europe grants treasury


secretary Paul O’Neill expressed
irritation European countries refused US
proposal boost direct grants rich nations
poor countries bush administration
pushing plan increase amount direct
grants world bank poorest nations
assistance loans nations

However, one more important step remains before our document is ready for indexing.
Notice how many of our content words are plural nouns (grants, nations) and
inflected verbs (pushing, refused). It doesn't seem very useful to have each inflected
form of a content word listed separately in our master word list - with all the possible
variants, the list would soon grow unwieldy. More troubling is that LSI might not
recognize that the different variant forms were actually the same word in disguise. We
solve this problem by using a stemmer.

3.2.4.4. Stemming:

While LSI itself knows nothing about language (we saw how it deals exclusively with a
mathematical vector space), some of the preparatory work needed to get documents
ready for indexing is very language-specific. We have already seen the need for a stop
list, which will vary entirely from language to language and to a lesser extent from
document collection to document collection. Stemming is similarly language-specific,
derived from the morphology of the language. For English documents, we use an
algorithm called the Porter stemmer to remove common endings from words, leaving
behind an invariant root form. Here are some examples of words before and after
stemming:

Information -> inform


Presidency -> preside
Presiding -> preside
Happiness -> happy
Happily -> happy
Discouragement -> discourage
Battles -> battle
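
A widely used implementation of the Porter stemmer ships with the NLTK library. The sketch
below assumes NLTK is installed; note that the real algorithm produces invariant root forms
that are not always dictionary words, so its output differs slightly from the idealized table
above (it maps happiness to "happi" rather than "happy", for example).

from nltk.stem import PorterStemmer   # assumes the NLTK package is installed

stemmer = PorterStemmer()
for word in ["information", "presidency", "presiding", "happiness", "battles"]:
    print(f"{word} -> {stemmer.stem(word)}")
# e.g. information -> inform, battles -> battl; the exact root forms differ from
# the simplified table above but are consistent across variants of the same word.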

And here is our sample story as it appears to the stemmer:

O’Neill criticizes Europe grants treasury


secretary Paul O’Neill expressed
irritation European countries refused US
proposal boost direct grants rich nations
poor countries
bush administration pushing plan increase
amount direct grants world bank poorest
nations assistance loans nations

Note that at this point we have reduced the original natural-language news story to a
series of word stems. All of the information carried by punctuation, grammar, and style
is gone - all that remains is word order, and we will be doing away with even that by
transforming our text into a word list. It is striking that so much of the meaning of text
passages inheres in the number and choice of content words, and relatively little in the
way they are arranged. This is very counterintuitive, considering how important
grammar and writing style are to human perceptions of writing.

Having stripped, pruned, and stemmed our text, we are left with a flat list of words:

administrat
amount
assist
bank
boost
bush
countri (2)
direct
europ
express
grant (2)
increas
irritat
loan
nation (3)
O’Neill
Paul
plan
poor (2)
propos
push
refus
rich
secretar
treasuri
US
world

This is the information we will use to generate our term-document matrix, along with a
similar word list for every document in our collection.

3.2.4.5. The Term-Document Matrix:

As we mentioned in our discussion of LSI, the term-document matrix is a large grid
representing every document and content word in a collection. We have looked in detail
at how a document is converted from its original form into a flat list of content words. We
prepare a master word list by generating a similar set of words for every document in
our collection, and discarding any content words that either appear in every document
(such words won't let us discriminate between documents) or in only one document
(such words tell us nothing about relationships across documents). With this master
word list in hand, we are ready to build our TDM.

We generate our TDM by arranging our list of all content words along the vertical axis,
and a similar list of all documents along the horizontal axis. These need not be in any
particular order, as long as we keep track of which column and row corresponds to
which keyword and document. For clarity we will show the keywords as an alphabetized
list.

We fill in the TDM by going through every document and marking the grid square for all
the content words that appear in it. Because any one document will contain only a tiny
subset of our content word vocabulary, our matrix is very sparse (that is, it consists
almost entirely of zeroes).

Here is a fragment of the actual term-document matrix from our wire stories database:

Document    a b c d e f g h i j k l m n o p q
astro       0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
satellite   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
shine       0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
star        0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0
planet      0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
sun         0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
earth       0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0

We can easily see if a given word appears in a given document by looking at the
intersection of the appropriate row and column. In this sample matrix, we have used
ones to represent document/keyword pairs. With such a binary scheme, all we can tell
about any given document/keyword combination is whether the keyword appears in the
document.

This approach will give acceptable results, but we can significantly improve our results
by applying a kind of linguistic favoritism called term weighting to the value we use for
each non-zero term/document pair.

Term weighting is a formalization of two common-sense insights:

1. Content words that appear several times in a document are probably more
meaningful than content words that appear just once.
2. Infrequently used words are likely to be more interesting than common words.

The first of these insights applies to individual documents, and we refer to it as local
weighting. Words that appear multiple times in a document are given a greater local
weight than words that appear once. We use a formula called logarithmic local
weighting to generate our actual value.

The second insight applies to the set of all documents in our collection, and is called
global term weighting. There are many global weighting schemes; all of them reflect
the fact that words that appear in a small handful of documents are likely to be more
significant than words that are distributed widely across our document collection. Our
own indexing system uses a scheme called inverse document frequency to calculate
global weights.

By way of illustration, here are some sample words from our collection, with the number
of documents they appear in, and their corresponding global weights.

word        count   global weight

unit        833     1.44
cost        295     2.47
project     169     3.03
tackle      40      4.47
wrestler    7       6.22

You can see that a word like wrestler, which appears in only seven documents, is
considered twice as significant as a word like project, which appears in over a hundred.

There is a third and final step to weighting, called normalization. This is a scaling step
designed to keep large documents with many keywords from overwhelming smaller
documents in our result set. It is similar to handicapping in golf - smaller documents are
given more importance, and larger documents are penalized, so that every document
has equal significance. These three values multiplied together - the local weight, the
global weight, and the normalization factor - determine the actual numerical value that
appears in each non-zero position of our term/document matrix.
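
The three factors can be combined in a few lines. In the Python sketch below the local
weight is logarithmic (1 plus the log of the raw count), the global weight is an inverse
document frequency of the form ln(N/df), which reproduces the sample global weights above
if the collection holds roughly 3,500 documents, and normalization divides each document
vector by its length. The exact formulas used by any particular indexing system may differ.

import math

def cell_weight(tf, df, num_docs):
    """Weight for one non-zero term/document cell, before normalization."""
    local_weight = 1.0 + math.log(tf)          # repeated words count more, but not linearly
    global_weight = math.log(num_docs / df)    # rare words count more than common ones
    return local_weight * global_weight

def normalize(doc_vector):
    """Scale a document's weights so long documents do not dominate short ones."""
    length = math.sqrt(sum(w * w for w in doc_vector))
    return [w / length for w in doc_vector] if length else doc_vector

# With roughly 3,500 documents, a word in 7 documents gets a global weight of about 6.2
# and a word in 833 documents only about 1.4, as in the table above.
print(round(math.log(3500 / 7), 2), round(math.log(3500 / 833), 2))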

Although this step may appear language-specific, note that we are only looking at word
frequencies within our collection. Unlike the stop list or stemmer, we don't need any
outside source of linguistic information to calculate the various weights. While weighting
isn't critical to understanding or implementing LSI, it does lead to much better results, as
it takes into account the relative importance of potential search terms.

With the weighting step done, we have done everything we need to construct a
finished term-document matrix. The final step will be to run the SVD algorithm itself.
Notice that this critical step will be purely mathematical - although we know that the
matrix and its contents are a shorthand for certain linguistic features of our collection,
the algorithm doesn't know anything about what the numbers mean. This is why we say
LSI is language-agnostic - as long as you can perform the steps needed to generate a
term-document matrix from your data collection, it can be in any language or format
whatsoever.

You may be wondering what the large matrix of numbers we have created has to do
with the term vectors and many-dimensional spaces we discussed in our earlier
explanation of how LSI works. In fact, our matrix is a convenient way to represent
vectors in a high-dimensional space. While we have been thinking of it as a lookup grid
that shows us which terms appear in which documents, we can also think of it in spatial
terms. In this interpretation, every column is a long list of coordinates that gives us the
exact position of one document in a many-dimensional term space. When we applied
term weighting to our matrix in the previous step, we nudged those coordinates around
to make the document's position more accurate.

As the name suggests, singular value decomposition breaks our matrix down into a set
of smaller components. The algorithm alters one of these components (this is where
the number of dimensions gets reduced), and then recombines them into a matrix of
the same shape as our original, so we can again use it as a lookup grid. The matrix we
get back is an approximation of the term-document matrix we provided as input, and
looks much different from the original:

            a       b       c       d       e       f       g       h       i       j
star       -0.006  -0.006  -0.002  -0.002  -0.003  -0.001   0.000   0.007   0.004   0.008
planet, moon, sun, earth, astro, shine: (further rows of the fragment, values omitted)

Notice two interesting features in the processed data:

 The matrix contains far fewer zero values. Each document has a similarity value
for most content words.

 Some of the similarity values are negative. In our original TDM, this would
correspond to a document with fewer than zero occurrences of a word, which is an
impossibility. In the processed matrix, a negative value is indicative of a very
large semantic distance between a term and a document.

This finished matrix is what we use to actually search our collection. Given one or more
terms in a search query, we look up the values for each search term/document
combination, calculate a cumulative score for every document, and rank the documents
by that score, which is a measure of their similarity to the search query. In practice, we
will probably assign an empirically-determined threshold value to serve as a cutoff
between relevant and irrelevant documents, so that the query does not return every
document in our collection.
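
Query processing over the reduced matrix can be sketched as follows. The term list, document
labels and matrix values below are invented stand-ins for the reduced matrix produced by the
SVD step; the threshold plays the role of the empirically determined cutoff mentioned above.

import numpy as np

terms = ["star", "planet", "sun", "earth"]    # row labels (invented for illustration)
doc_ids = ["a", "b", "c", "d"]                # column labels

# A reduced term-document matrix such as the A_k produced by the SVD sketch earlier.
A_k = np.array([
    [0.9, 0.1, 0.7, 0.0],
    [0.2, 0.8, 0.1, 0.6],
    [0.1, 0.7, 0.0, 0.5],
    [0.8, 0.0, 0.6, 0.1],
])

def search(query_words, threshold=0.1):
    """Rank documents by the cumulative similarity of the query terms."""
    rows = [terms.index(w) for w in query_words if w in terms]
    scores = A_k[rows, :].sum(axis=0)         # one cumulative score per document
    ranked = sorted(zip(doc_ids, scores), key=lambda p: p[1], reverse=True)
    return [(d, round(float(s), 2)) for d, s in ranked if s > threshold]

print(search(["star", "earth"]))   # [('a', 1.7), ('c', 1.3)]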

3.2.5. Challenges in parallelism:

A major challenge in the design of search engines is the management of parallel
computing processes. There are many opportunities for race conditions and coherent
faults. For example, a new document is added to the corpus and the index must be
updated, but the index simultaneously needs to continue responding to search queries.
This is a collision between two competing tasks. Consider that authors are producers of
information, and a web crawler is the consumer of this information, grabbing the text
and storing it in a cache (or corpus). The forward index is the consumer of the
information produced by the corpus, and the inverted index is the consumer of
information produced by the forward index. This is commonly referred to as a producer-
consumer model. The indexer is the producer of searchable information and users are
the consumers that need to search. The challenge is magnified when working with
distributed storage and distributed processing. In an effort to scale with larger amounts
of indexed information, the search engine's architecture may involve distributed
computing, where the search engine consists of several machines operating in unison.
This increases the possibilities for incoherency and makes it more difficult to maintain a
fully-synchronized, distributed, parallel architecture.

4. Meta-Search Engine

A meta-search engine is a kind of search engine that does not have its own database
of Web pages. It sends search terms to the databases maintained by other search
engines and gives users the results that come from all the search engines queried.
Few meta-searchers allow you to delve into the largest, most useful search engine
databases. They tend to return results from smaller and/or free search engines and
miscellaneous free directories, often small and highly commercial. The mechanisms and
algorithms that meta-search engines employ vary considerably. The simplest meta-
search engines just pass the queries to other direct search engines. The results are
then simply displayed in different newly opened browser windows as if several different
queries were posed.

Some improved meta-search engines organize the query results in one screen in
different frames, or in one frame but in a sequential order. Some more sophisticated
meta-search engines permit users to choose their favorite direct search engines in the
query input process, while using filters and other algorithms to process the returned
query results before displaying them to the users. Problems often arise in the query-
input process, though. Meta-search engines are useful if the user is looking for a unique
term or phrase, or if he or she simply wants to run a couple of keywords. Some meta-
search engines simply pass search terms along to the underlying direct search engine,
and if a search contains more than one or two words or very complex logic, most of
it will be lost; it will only make sense to the few search engines that support such
logic. Some of the more powerful meta-search engines work on top of direct search
engines like Google, AltaVista and Yahoo.

No two meta-search engines are alike. Some search only the most popular
search engines while others also search lesser-known engines, newsgroups, and other
databases. They also differ in how the results are presented and the quantity of engines
that are used. Some will list results according to search engine or database. Others
return results according to relevance, often concealing which search engine returned
which results. This benefits the user by eliminating duplicate hits and grouping the most
relevant ones at the top of the list.

Search engines frequently have different ways they expect requests submitted. For
example, some search engines allow the usage of the word "AND" while others require
"+" and others require only a space to combine words. The better meta-search engines
try to synthesize requests appropriately when submitting them.

Results can vary between meta-search engines based on a large number of
variables. Still, even the most basic meta-search engine will allow more of the web to be
searched at once than any one stand-alone search engine. On the other hand, the
results are said to be less relevant, since a meta-search engine can't know the internal
“alchemy” a search engine applies to its results (a meta-search engine does not have any
direct access to the search engines' databases). Meta-search engines are sometimes
used in vertical search portals, and to search the deep web. Some examples of meta-
search engines are Dogpile and MetaCrawler.

5. Search engine optimization

Search engine optimization (SEO) is the process of improving the volume and quality of
traffic to a web site from search engines via "natural" ("organic" or "algorithmic") search
results. Usually, the earlier a site is presented in the search results, or the higher it
"ranks," the more searchers will visit that site. SEO can also target different kinds of
search, including image search, local search, and industry-specific vertical search
engines.

As an Internet marketing strategy, SEO considers how search engines work and
what people search for. Optimizing a website primarily involves editing its content and
HTML coding to both increase its relevance to specific keywords and to remove barriers
to the indexing activities of search engines.

In Internet marketing terms, search engine optimization or SEO is the process of
making a website easy to find in search engines for its targeted and relevant keywords.
This can be achieved by optimizing internal and external factors that influence search
engine positioning. The main goal of every professionally implemented SEO campaign
is gaining top positioning for targeted keywords as well as search engine traffic growth.
That is the reason why search engine optimization may increase the number of sales
and conversions many times over. Many third party organizations provide visibility for a
business organization's website on the World Wide Web, through search engine marketing
techniques and by methods of increasing the page rank of the organization's website.

5.1. Page Rank

Page-Rank is a link analysis algorithm used by the Google Internet search engine that
assigns a numerical weighting to each element of a hyperlinked set of documents, such
as the World Wide Web, with the purpose of "measuring" its relative importance within
the set. The algorithm may be applied to any collection of entities with reciprocal
quotations and references. The numerical weight that it assigns to any given element E
is also called the Page-Rank of E and denoted by PR(E).

The name "Page-Rank" is a trademark of Google, and the Page-Rank process
has been patented (U.S. Patent 6,285,999). However, the patent is assigned to
Stanford University and not to Google. Google has exclusive license rights on the patent
from Stanford University. The university received 1.8 million shares of Google in
exchange for use of the patent; the shares were sold in 2005 for $336 million.

As an algorithm, Page-Rank is a probability distribution used to represent the
likelihood that a person randomly clicking on links will arrive at any particular page.
Page-Rank can be calculated for collections of documents of any size. It is assumed in
several research papers that the distribution is evenly divided among all documents in
the collection at the beginning of the computational process. The Page-Rank
computation requires several passes, called "iterations", through the collection to adjust
approximate Page-Rank values to more closely reflect the theoretical true value.

A probability is expressed as a numeric value between 0 and 1. A 0.5 probability
is commonly expressed as a "50% chance" of something happening. Hence, a Page-
Rank of 0.5 means there is a 50% chance that a person clicking on a random link will be
directed to the document with the 0.5 Page-Rank.

5.2. The Ranking algorithm simplified

Assume a small universe of four web pages: A, B, C and D. The initial approximation of
Page-Rank would be evenly divided between these four documents. Hence, each
document would begin with an estimated Page-Rank of 0.25.

In the original form of Page-Rank initial values were simply 1. This meant that the
sum of all pages was the total number of pages on the web. Later versions of Page-
Rank (see the below formulas) would assume a probability distribution between 0 and 1.
Here we're going to simply use a probability distribution hence the initial value of 0.25.

If pages B, C, and D each only link to A, they would each confer their 0.25 Page-Rank
to A, because all links would be pointing to A. In this simplistic system all Page-Rank
would thus gather at A:

PR(A) = PR(B) + PR(C) + PR(D) = 0.25 + 0.25 + 0.25 = 0.75

Again, suppose page B also has a link to page C, and page D has links to all
three pages. The value of the link-votes is divided among all the outbound links on a
page. Thus, page B gives a vote worth 0.125 to page A and a vote worth 0.125 to page
C. Only one third of D's Page-Rank is counted towards A's Page-Rank (approximately 0.083):

PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3 ≈ 0.125 + 0.25 + 0.083 ≈ 0.458

In other words, the Page-Rank conferred by an outbound link is equal to the
document's own Page-Rank score divided by the number of outbound links L( )
(it is assumed that links to specific URLs only count once per document).

In the general case, the Page-Rank value for any page u can be expressed as

PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}

i.e. the Page-Rank value for a page u depends on the Page-Rank value of each
page v in the set B_u (the set of all pages linking to page u), divided by the
number L(v) of outbound links from page v.
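
A minimal Python sketch of this simplified, damping-free update, using the four-page
example above (the link structure is the illustrative one from the example, not real data):

# Minimal sketch of the simplified update PR(u) = sum over v in B_u of PR(v)/L(v).
# The link structure reproduces the small four-page example above.

links = {                      # page -> pages it links to
    "A": [],
    "B": ["A", "C"],
    "C": ["A"],
    "D": ["A", "B", "C"],
}

pr = {page: 0.25 for page in links}   # initial approximation: evenly divided

def simplified_iteration(pr, links):
    """One pass of the damping-free Page-Rank update."""
    new_pr = {page: 0.0 for page in pr}
    for v, outlinks in links.items():
        for u in outlinks:
            new_pr[u] += pr[v] / len(outlinks)   # v passes an equal share to each target
    return new_pr

pr = simplified_iteration(pr, links)
print(pr["A"])   # ~0.458 after one pass, matching the hand calculation above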

5.3. Damping factor

The Page-Rank theory holds that even an imaginary surfer who is randomly clicking on
links will eventually stop clicking. The probability, at any step, that the person will
continue is a damping factor d. Various studies have tested different damping
factors, but it is generally assumed that the damping factor is set to around 0.85. The
damping factor is subtracted from 1 (and in some variations of the algorithm, the result
is divided by the number of documents in the collection) and this term is then added to
the product of the damping factor and the sum of the incoming Page-Rank scores.

That is,

PR(A) = (1 - d) + d \left( \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)} + \cdots \right)

or, with N the number of documents in the collection,

PR(A) = \frac{1 - d}{N} + d \left( \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)} + \cdots \right)

So any page's Page-Rank is derived in large part from the Page-Ranks of other pages.
The damping factor adjusts the derived value downward. Google recalculates Page-
Rank scores each time it crawls the Web and rebuilds its index. As Google increases the
number of documents in its collection, the initial approximation of Page-Rank decreases
for all documents.

The formula uses a model of a random surfer who gets bored after several clicks
and switches to a random page. The Page-Rank value of a page reflects the chance
that the random surfer will land on that page by clicking on a link. It can be understood
as a Markov chain in which the states are pages, and the transitions are all equally
probable and are the links between pages. If a page has no links to other pages, it
becomes a sink and therefore terminates the random surfing process. However, the
solution is quite simple. If the random surfer arrives at a sink page, it picks another URL
at random and continues surfing again.

When calculating Page-Rank, pages with no outbound links are assumed to link
out to all other pages in the collection. Their Page-Rank scores are therefore divided
evenly among all other pages. In other words, to be fair to pages that are not sinks,
these random transitions are added to all nodes in the Web, with a residual probability
usually set to d = 0.85, estimated from the frequency with which an average surfer uses
his or her browser's bookmark feature.
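
The following is a short Python sketch of the computation under these assumptions: a
power-iteration loop with damping factor d, where the score held by sink pages is spread
evenly over all pages. It reuses the same hypothetical four-page graph as the earlier sketch.

# Sketch of Page-Rank with damping factor d and sink handling,
# on the same hypothetical four-page graph as before.

links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
pages = list(links)
N = len(pages)
d = 0.85                              # damping factor

pr = {p: 1.0 / N for p in pages}      # start from the uniform distribution

for _ in range(50):                   # a few dozen iterations are plenty here
    # Page-Rank held by sink pages (no outbound links) is spread over all pages.
    sink_mass = sum(pr[p] for p in pages if not links[p])
    new_pr = {}
    for u in pages:
        incoming = sum(pr[v] / len(links[v]) for v in pages if u in links[v])
        new_pr[u] = (1 - d) / N + d * (incoming + sink_mass / N)
    pr = new_pr

print({p: round(score, 3) for p, score in pr.items()})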

So, the equation is as follows:

PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}

where p_1, p_2, \ldots, p_N are the pages under consideration, M(p_i) is the set of pages that link
to p_i, L(p_j) is the number of outbound links on page p_j, and N is the total number of
pages.

The Page-Rank values are the entries of the dominant eigenvector of the
modified adjacency matrix. This makes Page-Rank a particularly elegant metric: the
eigenvector is

\mathbf{R} = \begin{bmatrix} PR(p_1) \\ PR(p_2) \\ \vdots \\ PR(p_N) \end{bmatrix}

where \mathbf{R} is the solution of the equation

\mathbf{R} = \begin{bmatrix} (1-d)/N \\ (1-d)/N \\ \vdots \\ (1-d)/N \end{bmatrix}
+ d \begin{bmatrix}
\ell(p_1,p_1) & \ell(p_1,p_2) & \cdots & \ell(p_1,p_N) \\
\ell(p_2,p_1) & \ddots & & \vdots \\
\vdots & & \ell(p_i,p_j) & \\
\ell(p_N,p_1) & \cdots & & \ell(p_N,p_N)
\end{bmatrix} \mathbf{R}

where the adjacency function \ell(p_i, p_j) is 0 if page p_j does not link to p_i, and is normalized
such that, for each j,

\sum_{i=1}^{N} \ell(p_i, p_j) = 1,

i.e. the elements of each column sum up to 1.
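
As a quick numerical illustration of this eigenvector view (a sketch only; at web scale
one would use sparse power iteration rather than a dense eigen-decomposition), the
Page-Rank vector can be read off as the dominant eigenvector of the combined matrix
formed above, again on the hypothetical four-page graph:

# Sketch: Page-Rank as the dominant eigenvector of the "Google matrix",
# reusing the hypothetical four-page graph. Not how a real engine does it
# at web scale, where sparse power iteration is used instead.
import numpy as np

links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
pages = list(links)
N = len(pages)
d = 0.85

# Column-stochastic link matrix: column j spreads page j's score over its targets;
# sink columns (no outbound links) are treated as linking to every page.
L = np.zeros((N, N))
for j, pj in enumerate(pages):
    if links[pj]:
        for target in links[pj]:
            L[pages.index(target), j] = 1.0 / len(links[pj])
    else:
        L[:, j] = 1.0 / N

G = (1 - d) / N * np.ones((N, N)) + d * L     # the "Google matrix"
eigvals, eigvecs = np.linalg.eig(G)
principal = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pr = principal / principal.sum()              # normalize to a probability distribution
print(dict(zip(pages, pr.round(3))))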

This is a variant of the eigenvector centrality measure commonly used in network
analysis. Because of the large eigen-gap of the modified adjacency matrix above, the
values of the Page-Rank eigenvector are fast to approximate (only a few iterations are
needed).

As a result of Markov theory, it can be shown that the Page-Rank of a page is the
probability of being at that page after a large number of clicks. This happens to equal
t^{-1}, where t is the expectation of the number of clicks (or random jumps) required to
get from the page back to itself.

The main disadvantage is that it favors older pages, because a new page, even a
very good one, will not have many links unless it is part of an existing site (a site being a
densely connected set of pages, such as Wikipedia). The Google Directory (itself a
derivative of the Open Directory Project) allows users to see results sorted by Page-
Rank within categories. The Google Directory is the only service offered by Google
where Page-Rank directly determines display order. In Google's other search services
(such as its primary Web search) Page-Rank is used to weigh the relevance scores of
pages shown in search results.

Several strategies have been proposed to accelerate the computation of Page-
Rank. Various strategies to manipulate Page-Rank have also been employed in
concerted efforts to improve search result rankings and monetize advertising links.
These strategies have severely impacted the reliability of the Page-Rank concept, which
seeks to determine which documents are actually highly valued by the Web community.

5.4. Uses of Page-Rank

 A version of Page-Rank has recently been proposed as a replacement for the
traditional Institute for Scientific Information (ISI) impact factor, and implemented
at eigenfactor.org. Instead of merely counting total citations to a journal, the
"importance" of each citation is determined in a Page-Rank fashion.

 A similar new use of Page-Rank is to rank academic doctoral programs based on
their records of placing their graduates in faculty positions. In Page-Rank terms,
academic departments link to each other by hiring their faculty from each other
(and from themselves).

 Page-Rank has been used to rank spaces or streets to predict how many people
(pedestrians or vehicles) come to the individual spaces or streets.

 Page-Rank has also been used to automatically rank WordNet synsets according
to how strongly they possess a given semantic property, such as positivity or
negativity.

 A dynamic weighting method similar to Page-Rank has been used to generate
customized reading lists based on the link structure of Wikipedia.

 A Web crawler may use Page-Rank as one of a number of importance metrics it
uses to determine which URL to visit next during a crawl of the web (a simple
sketch of such prioritization follows this list). One of the early working papers
used in the creation of Google, "Efficient crawling through URL ordering",
discusses the use of a number of different importance metrics to determine how
deeply and how much of a site Google will crawl. Page-Rank is presented as one
of these importance metrics, though others are listed, such as the number of
inbound and outbound links for a URL, and the distance from the root directory
on a site to the URL.

 The Page-Rank may also be used as a methodology to measure the apparent
impact of a community, such as the blogosphere, on the overall Web itself. This
approach therefore uses the Page-Rank to measure the distribution of attention,
reflecting the scale-free network paradigm.
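
As referenced in the crawler item above, the following is a minimal sketch of ordering a
crawl frontier by an importance score; the per-URL scores stand in for estimated
Page-Rank values and are purely hypothetical, while a real crawler would combine
several metrics and update them incrementally.

# Minimal sketch of ordering a crawl frontier by an importance score.
# The per-URL scores here stand in for estimated Page-Rank values and are
# purely hypothetical; a real crawler combines several importance metrics.
import heapq

frontier = []   # max-priority queue implemented as a min-heap of (-score, url)

def enqueue(url, score):
    heapq.heappush(frontier, (-score, url))

def next_url():
    """Return the most important URL still waiting to be crawled."""
    neg_score, url = heapq.heappop(frontier)
    return url, -neg_score

enqueue("http://example.com/", 0.9)
enqueue("http://example.com/about", 0.2)
enqueue("http://example.com/popular-page", 0.7)

while frontier:
    url, score = next_url()
    print(f"crawl {url} (importance {score})")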

6. Marketing of search engines

Search engine marketing, or SEM, is a form of Internet marketing that seeks to promote
websites by increasing their visibility in search engine result pages (SERPs) through the
use of paid placement (pay per click, or PPC), contextual advertising, and paid inclusion.
The Search Engine Marketing Professional Organization (SEMPO) also includes search
engine optimization (SEO) within its reporting, but most sources treat SEO as a separate
discipline, with the New York Times, for example, defining SEM as "the practice of
buying paid search listings".

Fig. 1. The advertisement market share of search engines

Search engines have become indispensable to interacting on the Web. In addition to
processing information requests, they are navigational tools that can direct users to
specific Web sites or aid in browsing. Search engines can also facilitate e-commerce
transactions as well as provide access to noncommercial services such as maps, online
auctions, and driving directions.
People use search engines as dictionaries, spell checkers, and thesauruses; as
discussion groups (Google Groups) and social networking forums (Yahoo! Answers);
and even as entertainment (Google-whacking, vanity searching). In this competitive
market, rivals continually strive to improve their information-retrieval capabilities and
increase their financial returns. One innovation is sponsored search, an “economics
meets search” model in which content providers pay search engines for user traffic
going from the search engine to their Web sites. Sponsored search has proven to be a
successful business.
Most Web search engines are commercial ventures supported by advertising
revenue and, as a result, some employ the practice of allowing advertisers to pay
money to have their listings ranked higher in search results. Those search engines
which do not accept money for their search engine results make money by running
search related ads alongside the regular search engine results. The search engines
make money every time someone clicks on one of these ads.
Revenue in the web search portals industry is projected to grow in 2008 by 13.4
percent, with broadband connections expected to rise by 15.1 percent. Between 2008
and 2012, industry revenue is projected to rise by 56 percent as Internet penetration still
has some way to go to reach full saturation in American households. Furthermore,
broadband services are projected to account for an ever increasing share of domestic
Internet users, rising to 118.7 million by 2012, with an increasing share accounted for by
fiber-optic and high speed cable lines.

7. Summary

With the precipitous expansion of the Web, extracting knowledge from the Web is
becoming increasingly important and popular. This is due to the Web's convenience and
richness of information. Today's search engines can cover more than 60% of the
information on the World Wide Web. The future prospects of every aspect of
search engines are very bright; Google, for example, is working on embedding more
intelligence in its search engine.

For all their problems, online search engines have come a long way. Sites like
Google are pioneering the use of sophisticated techniques to help distinguish content
from drivel, and the arms race between search engines and the marketers who want to
manipulate them has spurred innovation. But the challenge of finding relevant content
online remains. Because of the sheer number of documents available, interesting and
relevant results exist for almost any search query. The problem is that those
results are likely to be hidden in a mass of semi-relevant and irrelevant information, with
no easy way to distinguish the good from the bad.

8. References
 Brin, Sergey and Page, Lawrence. "The Anatomy of a Large-Scale Hypertextual
Web Search Engine." Computer Networks and ISDN Systems, April 1998.
 Baldi, Pierre. Modeling the Internet and the Web: Probabilistic Methods and
Algorithms, 2003.
 Chakrabarti, Soumen. Mining the Web: Analysis of Hypertext and Semi-
Structured Data, 2003.
 Jansen, B. J. "The Comparative Effectiveness of Sponsored and Non-sponsored
Links for Web E-commerce Queries." ACM Transactions on the Web, May 2007.
 Del Corso, Gianna M., Gullí, Antonio, and Romani, Francesco. "Fast Page-Rank
Computation via a Sparse Linear System (Extended Abstract)".
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.118.5422
 Deeho Search Engine Optimization (SEO) solutions

9. Bibliography
 Wikipedia.org
 Google Books
 The SEO Books
