
CONTENTS

1. Introduction

2. Search Terminology

3. Search Engine Databases

4. Search Engine Ranking Algorithms

5. Search Tools and Services

5.1. Through Directories


5.2. By using Spiders or Robots
i. Search services
ii. Search sites
a. Search Directories
b. Search Engines

6. How Search Engines Work


6.1. Search engine components
6.2. Keyword Based Searching
6.3. Concept Based Searching

7. Information on Meta Tags

8. What are Meta Search Engines?

9. Conclusion

SEARCH ENGINES
Introduction:

The Web is potentially a terrific place to get information on almost
any topic. Doing research without leaving your desk sounds like a great
idea, but all too often you end up wasting precious time chasing down
useless URLs. Almost everyone agrees that there has got to be a better
way! But for now we’re stuck with making the best use of the search tools
that already exist on the Web.

If you’re more interested in broad, general information, the first place
to go is a Web directory. If you’re after narrow, specific information, a
Web search engine is probably a better choice.

Searching by Means of Subject Directories

Think back to the library card catalogue analogy. In the old card
files, and even in today’s computer terminal library catalogues, you find
information by searching on either the author, the title, or the subject. You
usually choose the subject option when you want to cover a broad range of
information.

Example: You’d like to create your own home page on the Web, but you
don’t know how to write HTML, you’ve never created a graphic file, and
you’re not sure how you’d post a page on the Web even if you knew how
to write one. In short, you need a lot of information on a rather broad
topic--Web publishing.

Your best bet is not a search engine, but a Web directory like Yahoo.
Yahoo is a subject-tree style catalogue that organizes the Web into 14
major topics, including Arts, Business and Economy, Computers and
Internet, Education, Entertainment, Government, Health, News,
Recreation, Reference, Regional, Science, Social Science, Society and
Culture. Under each of these topics is a list of subtopics, and under each of
those is another list, and another, and so on, moving from the more general
to the more specific.

Example: To find out about Web page publishing from Yahoo, select the
Computers and Internet topic, under which you find a subtopic on the
World Wide Web. Click on that and you find another list of subtopics,
several of which are pertinent to your search: Web Page Authoring, CGI
Scripting, Java, HTML, Page Design, and Tutorials. Selecting any of these
subtopics eventually takes you to Web pages that have been posted
precisely for the purpose of giving you the information you need.

Web directories usually come equipped with their own keyword
search engines that allow you to search through their indices for the
information you need.

Important note: More and more search engines are incorporating Web
directories into their sites. These directories interact with the main search
engine on the site in various ways. Excite, Infoseek, Lycos and even
Alta Vista are no longer “just a search engine.” They now characterize
themselves as Web portals or hubs: places where people come on the Web
to get information about a multitude of subjects, and even to chat, send
email and form online communities.

Search Terminology

Here are a few common search-related terms you should know about.

Search tool: Any mechanism for locating information on the Web; usually
refers to a search or metasearch engine, or a directory.

Query: Information entered into a form on a search engine’s Web page that
describes the information being sought. Note that a query is not usually
phrased as a question.

Query syntax: A set of rules describing what constitutes a legal query. On
some search engines, special symbols may be used in a query.

Query semantics: A set of rules that defines the meaning of a query.

Hit: A URL that a search engine returns in response to a query.

Match: A synonym for hit.

Relevancy score: A value that indicates how close a match a URL was to a
query; usually expressed as a value from 1 to 100, with the higher score
meaning more relevant.

Search Engine Servers

Search engines, like all other web sites, are housed on high-speed
computers called WWW servers. These servers are completely dedicated to
providing effective search services 24 hours a day. Search engine servers
are connected to the backbone (the high-speed infrastructure) of the WWW
via extremely fast, expensive telephone lines called T3 lines. Most of
Yahoo’s servers, for example, are located in Santa Clara, California.

Search Engine Databases


Before search engines can function, they need to have a collection of
information (a database, also called an index) to search. No search engine
actually goes out onto the WWW to look for matches when a query is
entered. Think about it: web sites sometimes go offline for maintenance,
and connection speeds vary depending on how busy the web is at any
given time. If a search engine were to initiate its search of the WWW
when its visitor clicked the “Search” button, its search would take weeks,
not seconds!

The solution to this problem is the creation and maintenance of an
enormous database. When a surfer performs a search, the engine searches
its database, not the WWW itself. Ideally, these databases would be
complete and current. But given the additions, deletions and changes to
thousands, even millions, of web pages every minute of every day, no
search engine database meets this lofty goal. If it did, it would simply
be a copy of the whole WWW! Realistically, each database is at the least
a large, varied and significant sampling of quality web sites. At most,
these collections sport an impressive, frequently updated and detailed
majority of the WWW.

Once a database is in place, search engines keep much of these giant
summaries in the memory of their computers, not just on hard drives or
other mechanical storage media. Electronic searches (in memory) are
much faster than mechanical searches (on hard drives) because memory has
no moving parts, whereas a hard drive must physically position its heads
before it can read data. In this manner, a search engine can search
through its database of millions of web site summaries within a few
seconds, delivering very fast results. Most household PCs these days have
around 32 megabytes (MB, millions of bytes) of memory in them. Computers
used as web servers for search engines have gigabytes (GB, billions of
bytes) of memory to allow them to maintain much of their huge databases
in quickly searchable electronic memory.
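
The idea of answering queries from a prebuilt, in-memory database can be
sketched as follows. This is only an illustration: the page summaries, the
example.com URLs and the function names are made up, and a real engine's
database is far larger and more elaborate.

```python
from collections import defaultdict

# Hypothetical page summaries collected ahead of time (not real sites).
PAGES = {
    "http://example.com/karate": "martial arts classes in phoenix",
    "http://example.com/recipes": "easy dinner recipes",
    "http://example.com/judo": "judo and other martial arts",
}

def build_index(pages):
    """Map each word to the set of URLs whose summary contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def lookup(index, word):
    """Answer a one-word query from memory, without touching the Web."""
    return sorted(index.get(word.lower(), set()))

INDEX = build_index(PAGES)
print(lookup(INDEX, "martial"))
```

The expensive work (reading pages) happens once in build_index; each
query afterwards is a dictionary lookup in memory.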

Search Engine Ranking Algorithms


After the database has been created and placed in the search engine
computer’s memory, the engine is finally ready to perform searches and
deliver results. Only now does another component come into play: the
ranking algorithm. All search engines, including directories, score the
relevancy of web pages through these mathematical procedures. Their
purpose is to deliver links to the web pages most relevant to each search
phrase. Understandably, these automatic mechanisms are a source of great
pride and revenue for their inventors.

When a surfer types in a search phrase on a search engine and hits
the “Search” button, the algorithm jumps into action. Say, for example,
that a surfer types in “martial arts in phoenix” as their search phrase. The
algorithm then looks at the first database entry in its memory, searching for
occurrences of the entire search phrase, or for occurrences of the individual
key words “martial”, “arts” or “phoenix” (extremely common words like
“in” are usually ignored).

Each ranking algorithm assigns different weights to different
occurrences of the key words, depending on where and in what form these
matches are found (more on this below). Taking all these factors into
account, these algorithms generate a relevancy score for the first web page
in their memory. They then proceed to do the same for the second, the
third, and so on through the millionth web page. Finally, the relevancy
scores are sorted in order from most relevant to least, and the
corresponding web pages are listed in this order with informative summary
information from the database. Voilà! The surfer (hopefully) gets the
results he or she was looking for.
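
The scoring-and-sorting loop described above might look roughly like the
sketch below. The weights, the phrase bonus, the stop-word list and the
sample database entries are all assumptions chosen for illustration; real
engines use their own, usually undisclosed, formulas.

```python
STOP_WORDS = {"in", "a", "an", "the", "and", "or"}

# Hypothetical database entries: each has a title and a body summary.
DATABASE = [
    {"url": "http://example.com/a", "title": "Phoenix martial arts school",
     "body": "we teach martial arts in phoenix for all ages"},
    {"url": "http://example.com/b", "title": "Arts and crafts",
     "body": "painting and pottery supplies"},
    {"url": "http://example.com/c", "title": "Desert hiking",
     "body": "trails near phoenix"},
]

# Illustrative weights only: a title match counts more than a body match.
TITLE_WEIGHT = 3
BODY_WEIGHT = 1
PHRASE_BONUS = 5

def score(entry, query):
    """Return a relevancy score for one database entry."""
    query = query.lower()
    keywords = [w for w in query.split() if w not in STOP_WORDS]
    title_words = entry["title"].lower().split()
    body_words = entry["body"].lower().split()
    total = 0
    for word in keywords:
        total += TITLE_WEIGHT * title_words.count(word)
        total += BODY_WEIGHT * body_words.count(word)
    # Reward an occurrence of the entire search phrase.
    if query in entry["body"].lower() or query in entry["title"].lower():
        total += PHRASE_BONUS
    return total

def rank(database, query):
    """Score every entry, drop non-matches, sort most relevant first."""
    scored = [(score(entry, query), entry["url"]) for entry in database]
    scored = [pair for pair in scored if pair[0] > 0]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

RESULTS = rank(DATABASE, "martial arts in phoenix")
for points, url in RESULTS:
    print(points, url)
```

With these sample entries, the page that contains the whole phrase and
repeats the key words in its title comes out on top.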

Although all search engines incorporate the basic components
described above, the boundaries among these components are not rigid.
The designs of a search engine’s database and of its ranking algorithm go
hand in hand, and usually it’s difficult to discern where one ends and the
other begins. For example, some search engines might calculate and store
ranking information for obvious web page themes during the creation of
their databases, in order to speed up the job of the ranking algorithm.
Major functional differences are also apparent between deep search
engines and directories, beginning with their distinct approaches to
building databases.

Indexing:
Documents should be indexed to make searching easier and less time
consuming. Indexing is the process of building a document representation
by assigning content descriptors, or terms, to the document. Each document
has objective terms (for example, the author’s name, the document URL, and
the date of publication) and non-objective terms, known as content terms,
which are intended to reflect the information in the document.
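
One way to picture the two kinds of terms is a small record type, as in
the sketch below. The field names and the rule for choosing content terms
(the most frequent words longer than three letters) are assumptions made
for this example, not a description of any particular engine.

```python
from dataclasses import dataclass, field

@dataclass
class IndexedDocument:
    # Objective terms: facts about the document that need no interpretation.
    url: str
    author: str
    published: str
    # Content terms: descriptors chosen to reflect what the document is about.
    content_terms: list = field(default_factory=list)

def index_document(url, author, published, text, max_terms=3):
    """Assign content terms by picking the most frequent longer words."""
    counts = {}
    for word in text.lower().split():
        word = word.strip(".,;:!?")
        if len(word) > 3:
            counts[word] = counts.get(word, 0) + 1
    # Sort by descending frequency, then alphabetically for a stable result.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return IndexedDocument(url, author, published, ranked[:max_terms])

DOC = index_document(
    "http://example.com/spiders", "A. Author", "1999-01-01",
    "Spiders crawl the web. Spiders index pages, and pages link to pages.",
)
print(DOC.content_terms)
```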

SEARCH TOOLS AND SERVICES

The search industry has two ways to find things: through
directories and through spiders. The problem with directories, which store
knowledge in some structure, is that classification is a labor-intensive
activity and there are far more publishers than classifiers on the web. And
if the information you are looking for is not reflected by the classification
structure, then you are out of luck. This happens quite often.

An alternative is intensive automation that involves a spider or robot,
which explores the web and helps find web pages. Spider-based tools can
also test their databases against queries and order the resulting matches,
and they have a user interface for obtaining queries and presenting
results. Search tools employ robots for indexing web documents, and these
tools can be classified into two types: search services and search sites.

Search services:

Search services broadcast user queries to several search engines and
various other information sources simultaneously. They then merge the
results submitted by these sources, check for duplicates, and present them
to the user as an HTML page with clickable URLs.
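
The merge, de-duplicate and present steps could be sketched like this. The
two hit lists are invented stand-ins for what separate engines might
return, and the HTML produced is deliberately bare.

```python
from html import escape

def merge_results(result_lists):
    """Merge several engines' hit lists, keeping the first copy of each URL."""
    seen = set()
    merged = []
    for hits in result_lists:
        for url in hits:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

def render_html(urls):
    """Present the merged hits as a list of clickable links."""
    items = "\n".join(
        '<li><a href="{0}">{0}</a></li>'.format(escape(url, quote=True))
        for url in urls
    )
    return "<ul>\n" + items + "\n</ul>"

# Hypothetical replies from two sources; one URL appears in both.
ENGINE_A = ["http://example.com/1", "http://example.com/2"]
ENGINE_B = ["http://example.com/2", "http://example.com/3"]

MERGED = merge_results([ENGINE_A, ENGINE_B])
PAGE = render_html(MERGED)
print(PAGE)
```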

Search sites:

There are basically two types of search sites on the web: search
directories and search engines.

Search directories contain a list of web sites organized
hierarchically into categories and subcategories. These are created
manually rather than being automated.

Search engines, on the other hand, are huge computer-generated
databases containing information on millions of web sites. They use
spiders to automatically look up web sites and update their databases.

HOW SEARCH ENGINES WORK?

Search engines use software robots to survey the Web and build their
databases. Web documents are retrieved and indexed. When you enter a
query at a search engine web site, your input is checked against the search
engine’s keyword indices. The best matches are then returned to you as
hits.

There are two primary methods of text searching: keyword and concept.

SEARCH ENGINE COMPONENTS:

If you understand how a search tool works, there is a good chance
you will be able to use it more effectively. In this section, we describe how
a search engine works. For the most part, these same ideas apply to
directories; the main difference is that the hierarchical organizational
structure and categorizations for directories need to be in place and
displayed. The references include additional information about how
directories are put together.

To describe how a search engine works, we split up its functions into
a number of components: user interface, searcher, and evaluator.

User interface: The screen in which you type a query and which displays
the search results.

Searcher: The part that searches a database for information to match your
query.

Evaluator: The function that assigns relevancy scores to the information
retrieved.

In addition, a search engine’s database is created using the following:

Gatherer: The component that traverses the Web, collecting information
about pages.

Indexer: The function that categorizes the data obtained by the gatherer.
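
The gatherer's traversal can be illustrated with a toy link graph. In this
sketch the "Web" is just a dictionary of made-up page names, and the
breadth-first order is one reasonable choice among several; a real gatherer
fetches pages over the network and extracts their links.

```python
from collections import deque

# A made-up miniature "Web": each page lists the pages it links to.
FAKE_WEB = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["widget", "home"],
    "widget": [],
    "orphan": ["home"],  # nothing links here, so the gatherer never finds it
}

def gather(start, web):
    """Follow links breadth-first from start, visiting each page once."""
    visited = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        visited.append(page)
        for link in web.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

CRAWLED = gather("home", FAKE_WEB)
print(CRAWLED)
```

The "orphan" page shows why no gatherer sees everything: a page that
nothing links to cannot be reached by following links.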

KEYWORD SEARCHING:

This is the most common form of text search on the Web. Most
search engines do their text query and retrieval using keywords.

Unless the author of the Web document specifies the keywords for
her document (this is possible by using meta tags in HTML), it’s up to
the search engine to determine them. Essentially, this means that search
engines pull out and index words that are believed to be significant.
Words that are mentioned towards the top of a document and words that are
repeated several times throughout the document are more likely to be
deemed important.

Some sites index every word on every page. Others index only part
of the document. For example, Lycos indexes the title, headings,
subheadings and the hyperlinks to other sites, along with the first 20 lines
of text and the 100 words that occur most often.

Infoseek uses a full-text indexing system, picking up every word in
the text except commonly occurring stop words such as “a”, “an”, “the”,
“is”, “and”, “or”, “www”. Hotbot also ignores stop words. Alta Vista
claims to index all words, even common ones such as “a”, “an”, “and” and
“the”. Some search engines distinguish upper case from lower case; others
store all words without reference to capitalization.
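
The two indexing choices just mentioned, whether to drop stop words and
whether to keep capitalization, can be shown with a small function. The
function name and its options are illustrative assumptions; the stop-word
list is the one quoted above.

```python
STOP_WORDS = {"a", "an", "the", "is", "and", "or", "www"}

def index_terms(text, drop_stop_words=True, case_sensitive=False):
    """Return the words an engine would store for this text."""
    terms = []
    for word in text.split():
        if drop_stop_words and word.lower() in STOP_WORDS:
            continue
        terms.append(word if case_sensitive else word.lower())
    return terms

SAMPLE = "The Oracle is a database"
print(index_terms(SAMPLE))
print(index_terms(SAMPLE, drop_stop_words=False, case_sensitive=True))
```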

THE PROBLEM WITH KEYWORD SEARCHING:

Keyword searches have a tough time distinguishing between words
that are spelled the same way but mean something different (e.g., hard
cider, a hard stone, a hard exam, and the hard drive on your computer).
This often results in hits that are completely irrelevant to your query.
Some search engines also have trouble with so-called stemming: if you
enter the word “big”, should they return a hit on the word “bigger”?
What about singular and plural words? What about verb forms that differ
from the word you entered by only an “s” or an “ed”?
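
A deliberately crude stemmer shows both what stemming buys and why it is
tricky. The suffix list and the doubled-consonant rule below are
assumptions invented for this sketch; this is not the algorithm of any
named engine or of a standard stemmer.

```python
SUFFIXES = ("ing", "ed", "er", "es", "s")

def naive_stem(word):
    """Strip one common suffix so related word forms share a stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "bigg" -> "big".
            if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiouls":
                word = word[:-1]
            break
    return word

for w in ("big", "bigger", "searched", "searches"):
    print(w, "->", naive_stem(w))
```

Rules this simple also make mistakes (they would, for instance, shorten
"cider" as if it ended in a comparative "er"), which is one reason engines
differ in how aggressively they stem.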

Search engines also cannot return hits on keywords that mean the
same, but are not actually entered in your query. A query on heart disease
would not return a document that used the word “cardiac” instead of
“heart”.

CONCEPT BASED SEARCHING:

Unlike keyword search systems, concept-based search systems try to
determine what you mean, not just what you say. In the best
circumstances, a concept-based search returns hits on documents that are
“about” the subject/theme you’re exploring, even if the words in the
document don’t precisely match the words you enter into the query.

Excite is currently the best-known general-purpose search engine
site on the Web that relies on concept-based searching. This is also known
as clustering, which essentially means that words are examined in relation
to other words found nearby.

For example, the word heart, when used in the medical/health
context, would be likely to appear with such words as coronary, artery,
lung, stroke, cholesterol, pump, blood, attack, and arteriosclerosis. If the
word heart appears in a document with other words such as flowers,
candy, love, passion, and valentine, a very different context is established,
and the search engine returns hits on the subject of romance.
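
The heart example can be turned into a toy context guesser. The two word
lists come from the paragraph above; the counting rule (pick the context
whose vocabulary overlaps most with the text) is an assumption for
illustration and is much simpler than what a concept-based engine does.

```python
CONTEXTS = {
    "medical": {"coronary", "artery", "lung", "stroke", "cholesterol",
                "pump", "blood", "attack", "arteriosclerosis"},
    "romance": {"flowers", "candy", "love", "passion", "valentine"},
}

def guess_context(text):
    """Return the context whose vocabulary overlaps most with the text."""
    words = set(text.lower().split())
    overlaps = {name: len(words & vocab) for name, vocab in CONTEXTS.items()}
    best = max(overlaps, key=overlaps.get)
    # No overlapping words at all means there is no evidence either way.
    return best if overlaps[best] > 0 else None

print(guess_context("heart attack risk rises with cholesterol in the blood"))
print(guess_context("a heart shaped box of candy and flowers for valentine"))
```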

Warning: This often works better in theory than in practice.
Concept-based indexing is a good idea, but it’s far from perfect. The
results are best when you enter a lot of words, all of which roughly refer
to the concept you’re seeking information about.

Here’s an example of a concept-based query. Jump to Excite and
enter the phrase “Globareena and Consequences” (don’t use the quotation
marks). You will get back a lot of documents about Globareena and
Details online, even if they don’t contain the precise words in your query.
On the keyword search engines, you will also get hits, but they will be
limited to those that do contain the precise words of your query.

Refining Your Search

Most sites offer two different types of searches: “basic” and
“refined”. In a “basic” search, you just enter a keyword without sifting
through any pulldown menus of additional options. Depending on the
engine, though, “basic” searches can be quite complex.

Search refining options differ from one search engine to another, but
some of the possibilities include the ability to search on more than one
word, to give more weight to one search term than you give to another, and
to exclude words that might be likely to muddy the results. You might also
be able to search on proper names, on phrases, and on words that are found
within a certain proximity to other search terms.

Some search engines also allow you to specify what form you’d like
your results to appear in, and whether you wish to restrict your search to
certain areas of the Internet (e.g., Usenet or the Web) or to specific parts
of Web documents (e.g., the title or the URL).

Many, but not all, search engines allow you to use so-called Boolean
operators to refine your search. These are the logical terms AND, OR and
NOT, and the so-called proximal locators, NEAR and FOLLOWED BY.

Boolean AND means that all the terms you specify must appear in
the documents, e.g., “heart” AND “attack”. You might use this if you
wanted to exclude common hits that would be irrelevant to your query.

Boolean OR means that at least one of the terms you specify must
appear in the documents, e.g., bronchitis, acute OR chronic. You might use
this if you didn’t want to rule out too much.

Boolean NOT means that the term that follows it must not appear in
the documents. You might use this if you anticipated results that would
be totally off-base, e.g., nirvana AND Buddhism, NOT Cobain.

Not quite Boolean, + and –: Some search engines use the characters +
and – instead of Boolean operators to include and exclude terms.

NEAR means that terms you enter should be within a certain number
of words of each other. FOLLOWED BY means that one term must
directly follow the other. ADJ, for adjacent, serves the same function. A
search engine that will allow you to search on phrases uses, essentially, the
same method (i.e., determining adjacency of keywords).
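
On a small set of documents, the Boolean operators reduce to set
operations, and the proximity operators to comparisons of word positions.
The four sample documents and the helper names below are invented for this
sketch; real engines work from positional indexes rather than rescanning
the text.

```python
# Hypothetical documents, keyed by an id.
DOCS = {
    "d1": "nirvana is a goal in buddhism",
    "d2": "nirvana was the band of kurt cobain",
    "d3": "acute bronchitis",
    "d4": "chronic bronchitis",
}

def docs_with(word):
    """Ids of all documents containing the word."""
    return {doc_id for doc_id, text in DOCS.items() if word in text.split()}

def near(a, b, max_gap):
    """Ids of documents where a and b occur within max_gap words."""
    hits = set()
    for doc_id, text in DOCS.items():
        words = text.split()
        pos_a = [i for i, w in enumerate(words) if w == a]
        pos_b = [i for i, w in enumerate(words) if w == b]
        if any(abs(i - j) <= max_gap for i in pos_a for j in pos_b):
            hits.add(doc_id)
    return hits

def followed_by(a, b):
    """Ids of documents where b comes directly after a."""
    hits = set()
    for doc_id, text in DOCS.items():
        words = text.split()
        if any(w == a and nxt == b for w, nxt in zip(words, words[1:])):
            hits.add(doc_id)
    return hits

AND_HITS = docs_with("nirvana") & docs_with("buddhism")   # AND: intersection
OR_HITS = docs_with("acute") | docs_with("chronic")       # OR: union
NOT_HITS = docs_with("nirvana") - docs_with("cobain")     # NOT: difference
print(AND_HITS, OR_HITS, NOT_HITS)
```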

Phrases: The ability to query on phrases is very important in a search
engine. Those that allow it usually require that you enclose the phrase in
quotation marks, e.g., “space the final frontier”.

Capitalization: This is essential for searching on proper names of people,
companies or products. Unfortunately, many words in English are used
both as proper and common nouns – Bill, bill, Gates, gates, Oracle, oracle,
Lotus, lotus, Digital, digital – the list is endless.

All the search engines have different methods of refining queries.
The best way to learn them is to read the help files on the search engine
sites and practice!

Popular Search Engines are:

AOL NetFind    www.aol.com
AltaVista      www.altavista.digital.com
Excite         www.excite.com
HotBot         www.hotbot.com
Infoseek       www.infoseek.com
Lycos          www.lycos.com
Magellan       www.mckinley.com
WebCrawler     www.webcrawler.com

INFORMATION ON META TAGS:

Some search engines are now indexing Web documents by the meta
tags in the document’s HTML (at the beginning of the document, in the
so-called “head” section). What this means is that the Web page author can
have some influence over which keywords are used to index the document,
and even over the description of the document that appears when it comes
up as a search engine hit.
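
As a rough illustration of how an indexer might read these tags, the
sketch below uses Python's standard html.parser module on a made-up page.
The class name and the sample page are assumptions; how much weight an
engine gives each tag, if any, is up to that engine.

```python
from html.parser import HTMLParser

class MetaTagReader(HTMLParser):
    """Collect the title and the named meta tags of a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs:
            self.meta[attrs["name"].lower()] = attrs.get("content", "")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# A hypothetical page written for this example.
SAMPLE_PAGE = """<html><head>
<title>Phoenix Martial Arts</title>
<meta name="description" content="Karate and judo classes in Phoenix.">
<meta name="keywords" content="martial arts, karate, judo, phoenix">
</head><body><h1>Welcome</h1></body></html>"""

reader = MetaTagReader()
reader.feed(SAMPLE_PAGE)
KEYWORDS = [k.strip() for k in reader.meta["keywords"].split(",")]
print(reader.title, reader.meta["description"], KEYWORDS)
```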

This is obviously very important if you are trying to draw people to
your website based on how your site ranks in search engines’ hit lists.

There is no perfect way to ensure that you’ll receive a high ranking.
Even if you do get a great ranking, there’s no assurance that you’ll keep it
for long. There is a lot of conflicting information out there on
meta-tagging. If you’re confused it may be because different search engines
look at meta tags in different ways. Some rely heavily on meta tags, others
don’t use them at all.

It seems to be generally agreed that the “title” and the “description”
meta tags are important to write effectively, since several major search
engines use them in their indices. Use relevant keywords in your title, and
vary the titles on the different pages that make up your website, in order to
target as many keywords as possible. As for the “description” meta tag,
some search engines will use it as their short summary of your URL, so
make sure your description is one that will entice surfers to your site.

The “keyword” meta tag, which is essentially made up of a list of
keywords that (supposedly) appear in the document, has been abused by
some webmasters. For example, a recent ploy has been to put the words
“Pamela Anderson” into keyword meta tags, in hopes of luring searchers to
one’s website by using the keywords for one of the most popular searches
on the Web.

The search engines are aware of such deceptive tactics, and have
devised various methods to circumvent them, so be careful. Use keywords
that are appropriate to your subject, and make sure they appear in the top
paragraphs of actual text on your web page. Many search engine
algorithms score the words that appear towards the top of your document
more highly than the words that appear towards the bottom. Words that
appear in HTML header tags (H1, H2, H3, etc) are also given more weight
by some search engines. It sometimes helps to give your page a file name
that makes use of one of your prime keywords, and to include keywords in
the “alt” text of your image tags.

One thing you should not do is use some other company’s
trademarks in your meta tags. Some website owners have been sued for
trademark violations because they’ve used other company names in the
meta tags.

Remember that all the major search engines have slightly different
policies. If you’re designing a website and meta-tagging your documents,
we recommend that you take the time to check out what the major search
engines say in their help files about how they each use meta tags. You
might want to optimize your meta tags for the search engines you believe
are sending the most traffic to your site.

What are “Meta-Search” engines?


In a meta-search engine, you submit keywords in its search box,
and it transmits your search simultaneously to several individual search
engines and their databases of web pages. Within a few seconds, you get
back results from all the search engines queried. Meta-search engines
do not own a database of Web pages; they send your search terms to the
databases maintained by other search engines.
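
The "simultaneously" part can be sketched with a thread pool from Python's
standard library. The two engine functions here are stand-ins that return
canned results; a real meta-search engine would send network requests to
each engine and parse the replies.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for real engines: each answers from its own (fake) database.
def engine_one(query):
    return ["http://example.com/one/" + query]

def engine_two(query):
    return ["http://example.com/two/" + query]

ENGINES = {"EngineOne": engine_one, "EngineTwo": engine_two}

def meta_search(query, engines):
    """Send the same query to every engine at once and collect the replies."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in engines.items()}
        return {name: future.result() for name, future in futures.items()}

REPLIES = meta_search("spiders", ENGINES)
print(REPLIES)
```

Note that meta_search keeps no database of its own; everything it returns
comes from the engines it calls.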

In ordinary (non-“meta”) search engines such as Northern Light,
AltaVista, Google, etc., you submit keywords to their individual database
of web pages, and you get back a different display of documents from each
search engine. Even for very comparable searches, results can differ
widely (about 40%), though they also contain some of the same sites (about
60%).

Some meta-search engine sites offer many useful secondary,
portal-like services and specialized collections of web sites and/or
resources for businesses, web designers, movie-goers, etc. Others offer
what I call “pseudo-meta-searching”: a collection of search boxes for
different search engines, or a drop-down menu that lets you choose which
one among a list of search engines to search. Neither of these types of
services is commented on here.

Limitations of Meta-Search engines

How do you know if your search terms will “work”? As anyone
who does Internet searching knows, search protocol (the way you enter
search keywords) is far from standardized. Almost all engines accept
quotation marks (“ ”) to mark a phrase. A few accept Boolean AND, OR,
and NOT. Fewer accept ( ) to group terms. Some only accept + or -. Some
default to OR, some to AND. Some take * to truncate. Others stem
automatically. And so on.
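
To make the syntax problem concrete, the sketch below writes the same
search in two of the styles just listed. The function names are
hypothetical, and which style a given engine accepts has to be checked in
that engine's help files.

```python
def quote(term):
    """Wrap multi-word terms in quotation marks so they read as a phrase."""
    return '"{}"'.format(term) if " " in term else term

def to_boolean(required, excluded):
    """Render a search for an engine that accepts AND and NOT."""
    query = " AND ".join(quote(t) for t in required)
    for term in excluded:
        query += " NOT " + quote(term)
    return query

def to_plus_minus(required, excluded):
    """Render the same search for an engine that only accepts + and -."""
    parts = ["+" + quote(t) for t in required]
    parts += ["-" + quote(t) for t in excluded]
    return " ".join(parts)

print(to_boolean(["nirvana", "buddhism"], ["cobain"]))
print(to_plus_minus(["nirvana", "buddhism"], ["cobain"]))
```

A meta-search engine that does not translate queries this way simply
forwards the same text to every engine, whether or not each one can
interpret it.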

Three main factors determine the usefulness of any meta-search
engine:

1. The search engines they send your search terms to (size, content,
number of search engines, your ability to choose the search
engines you prefer); all of them search subject directories as well
as search engines and intermix results from all.

2. How they handle your search terms and search syntax (Boolean
operators, phrases, and defaults they impose).

3. How they display results (ranking; aggregated into one list, or
with each search engine’s results reported separately).

Good for simple searches. Meta-search engines are useful if you
are looking for a unique term or phrase (enclose phrases in quotes “ ”), or
if you simply want to test-run a couple of keywords to see if they get what
you want. For such straightforward searches, the unique ranking
algorithm used by Google (based on how many other sites link to a site)
often finds exactly what you want, better than any meta-search engine
(unless you choose one you can limit to Google only).

For more difficult searches, you can search within the results on a
term or phrase you specify.

Use Meta-Search engines – but use them CAUTIOUSLY:

• Most meta-search engines only spend a short time in each database
  and often retrieve only 10% of the results in any of the databases
  queried. This makes their searches usually “quick and dirty,” but
  often good enough to find what you want.

• Most meta-searchers simply pass your search terms along, and if your
  search contains more than one or two words or very complex logic,
  most of that will be lost. It will only make sense to the few search
  engines that support such logic.

• Quantity in results does not equal satisfaction. If you get more
  results than you want, try refining the results by going directly
  to AltaVista Advanced Search, Northern Light, or Infoseek by
  clicking on their link in the results. Choose meta-search
  engines that offer some of these as options.

• Look for meta-search engines that also send your terms to
  selective or odd databases like WebCrawler, Thunderstone,
  Direct Hit, and WhatUseek. One of the advantages of a meta-searcher
  is that it reaches databases like these, which you might otherwise
  overlook and which may have sites missed by the big boys.

Popular Meta Search Engines are:

Meta Search     www.metasearch.com
Meta Crawler    www.metacrawler.com

Conclusion:

Though there are many search engines available on the web, both the
searching methods and the engines have a long way to go before they
retrieve information on relevant topics efficiently. As the technology
advances at an unimaginable pace, it is not unreasonable to expect an
efficient search engine that addresses all of these needs.

The current generation of search tools and services has to
significantly improve its retrieval effectiveness. Otherwise, the web will
continue to evolve towards an information entertainment center for users
with no specific search objectives.

Choosing the right search engine takes patience and experience.
Use meta-search engines; they can reduce your search effort to a great
extent. The good news is that new search engines are evolving every day to
improve retrieval efficiency.
