A search engine is a software program that takes a search phrase as input, matches
it against entries made earlier, and returns a set of links pointing to their locations.
Actually, consumers would really prefer a finding engine rather than a search
engine. Search engines match queries against an index that they create. The
index consists of the words in each document, plus pointers to their locations
within the documents. This is called an inverted file.
A search engine consists of four modules:
A document processor
A query processor
A search and matching function
A ranking capability
While users focus on "search," the search and matching function is only one of
the four modules. Each of these four modules may cause the expected or
unexpected results that consumers get when they use a search engine.
CHAPTER 2
SEARCH ENGINE MODULES
1. Document Processor
The document processor prepares, processes, and inputs the documents, pages,
or sites that users search against. The document processor performs some or all
of the following steps:
This step dramatically affects the nature and quality of the document representation
that the engine will search against. In designing the system, we must define the word
"term." Is it the alphanumeric characters between blank spaces or punctuation? If
so, what about non-compositional phrases (phrases in which the separate words
do not convey the meaning of the phrase, like "skunk works" or "hot dog"), multi-
word proper names, or inter-word symbols such as hyphens or apostrophes that
can denote the difference between "small business men" and "small-business
men"? Each search engine depends on a set of rules that its document processor
must execute to determine what action is to be taken by the "tokenizer," i.e., the
software used to define a term suitable for indexing.
Deleting stop words
This step helps save system resources by eliminating from further processing, as
well as from potential matching, those terms that have little value in finding useful
documents in response to a customer's query. Since stop words may comprise
up to 40 percent of the text words in a document, this step can yield significant
savings. A stop word list typically consists of words such as articles (a, the),
conjunctions (and, but), interjections (oh, but), prepositions (in, over), pronouns
(he, it), and forms of the "to be" verb (is, are).
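The stop-word step above can be sketched in a few lines. The stop list shown is a tiny illustrative sample, not any engine's actual list:

```python
# A minimal sketch of stop-word deletion. Real engines use stop lists of
# several hundred words; this sample list is for illustration only.
STOP_WORDS = {"a", "the", "and", "but", "oh", "in", "over", "he", "it",
              "is", "are"}

def remove_stop_words(tokens):
    """Drop terms that carry little value for retrieval."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the spider is indexing a page and it is fast".split()
print(remove_stop_words(tokens))  # ['spider', 'indexing', 'page', 'fast']
```

Filtering against a set is cheap, which is why this step saves resources: every dropped token is one the indexer never has to store or match.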
Term Stemming
Stemming is the ability of a search engine to search for variations of a word
based on its stem. Stemming removes word suffixes, perhaps recursively in layer
after layer of processing. The process has two goals. In terms of efficiency,
stemming reduces the number of unique words in the index, which in turn
reduces the storage space required for the index and speeds up the search
process. In terms of effectiveness, stemming improves recall by reducing all
forms of the word to a base or stemmed form. For example, if a user asks for
analyze, they may also want documents which contain analysis, analyzing,
analyzer, analyzes, and analyzed. Therefore, the document processor stems
document terms to analy- so that documents that include various forms of
analy- will have an equal likelihood of being retrieved; this would not occur if the
engine indexed only the variant forms separately and required the user to enter all of them.
Of course, stemming does have a downside. It may negatively affect precision in
that all forms of a stem will match, when, in fact, a successful query for the user
would have come from matching only the word form actually used in the query.
Having completed the above steps, the document processor extracts the
remaining entries from the original document. These are then inserted and stored in an
inverted file that lists the index entries along with an indication of their position and
frequency of occurrence. The specific nature of the index entries, however, will
vary based on the decision in Step 2 concerning what constitutes an "indexable
term."
For example, the following paragraph shows the full text sent to a search engine
for processing:
Milosevic's comments, carried by the official news agency Tanjug, cast doubt
over the governments at the talks, which the international community has called
to try to prevent an all-out war in the Serbian province. "President Milosevic said
it was well known that Serbia and Yugoslavia were firmly committed to resolving
problems in Kosovo, which is an integral part of Serbia, peacefully in Serbia with
the participation of the representatives of all ethnic communities," Tanjug said.
Milosevic was speaking during a meeting with British Foreign Secretary Robin
Cook, who delivered an ultimatum to attend negotiations in a week's time on an
autonomy proposal for Kosovo with ethnic Albanian leaders from the province.
Cook earlier told a conference that Milosevic had agreed to study the proposal.
After stop-word removal and stemming, the document processor reduces the paragraph to:
Milosevic comm carri offic new agen Tanjug cast doubt govern talk interna
commun call try prevent all-out war Serb province President Milosevic said well
known Serbia Yugoslavia firm commit resolv problem Kosovo integr part Serbia
peace Serbia particip representa ethnic commun Tanjug said Milosevic speak
meeti British Foreign Secretary Robin Cook deliver ultimat attend negoti week
time autonomy propos Kosovo ethnic Alban lead province Cook earl told
conference Milosevic agree study propos.
More sophisticated document processors will have phrase recognizers, as well as Named Entity
recognizers and categorizers, to ensure that index entries such as Milosevic are
tagged as a Person and entries such as Yugoslavia and Serbia as Countries.
Term weighting
The weighting takes two ideas into account: the term frequency in the given
document (term frequency = TF) and the inverse document frequency of the term
in the whole database (inverse document frequency = IDF). The term frequency
in the given document shows how important the term is in this document. The
document frequency of the term (the percentage of the documents which contain
this term) shows how generally important the term is. A high weight in a TF-IDF
ranking scheme is therefore reached by a high term frequency in the given
document and a low document frequency of the term in the whole database.
Not all terms are good "discriminators"; that is, not all terms single out one
document from another very well. A simple example is the word "the," which
appears in too many documents to help distinguish one from another. A less
obvious example is the word "antibiotic." In a sports database, when we compare
each document to the database as a whole, the term "antibiotic" would probably
be a good discriminator among documents, and therefore would be assigned a
high weight.
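The TF-IDF scheme described above can be sketched as follows. The tiny corpus and the log-based IDF formula are illustrative assumptions; engines vary in the exact weighting formula they use:

```python
import math

# A sketch of TF-IDF weighting: the weight rises with the term's
# frequency in a document and falls as more documents contain it.
def tf_idf(term, doc, corpus):
    tf = doc.count(term)                       # term frequency in this document
    df = sum(1 for d in corpus if term in d)   # documents containing the term
    idf = math.log(len(corpus) / df)           # inverse document frequency
    return tf * idf

corpus = [
    ["the", "match", "ended", "in", "a", "draw"],
    ["the", "team", "won", "the", "match"],
    ["the", "player", "needed", "an", "antibiotic"],
]
# "the" occurs in every document, so its weight is zero; "antibiotic"
# is rare, so it scores higher and discriminates well.
print(tf_idf("the", corpus[0], corpus))         # 0.0
print(tf_idf("antibiotic", corpus[2], corpus))  # about 1.10
```

This matches the "antibiotic in a sports database" example: a term present in every document gets zero weight, while a rare term gets a high one.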
Create index.
The index or inverted file is the internal data structure that stores the index
information and that will be searched for each query. Inverted files range from a
simple listing of every alphanumeric sequence in a set of documents to a more
linguistically complex list of entries, the tf/idf weights, and pointers to where
inside each document the term occurs. The more complete the information in the
index, the better the search results.
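A minimal sketch of such an inverted file, assuming an in-memory Python dictionary representation (real engines use far more compact on-disk structures):

```python
from collections import defaultdict

# A sketch of an inverted file: each term maps to the documents it
# occurs in, with the positions of each occurrence, from which the
# frequency follows directly.
def build_inverted_index(docs):
    index = defaultdict(lambda: defaultdict(list))  # term -> {doc: [pos]}
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term][doc_id].append(pos)
    return index

docs = {
    "d1": ["milosevic", "said", "serbia", "committed"],
    "d2": ["cook", "said", "milosevic", "agreed"],
}
index = build_inverted_index(docs)
print(dict(index["said"]))  # {'d1': [1], 'd2': [1]}
print(len(index["said"]))   # document frequency of "said": 2
```

Storing positions, not just document ids, is what lets an engine later answer phrase and proximity queries against the same structure.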
2. Query Processor
Query processing has six possible steps, though a system can cut these steps
short and proceed to match the query to the inverted file at any of a number of
places during the processing. Document processing shares many steps with
query processing. More steps and more documents make the process more
expensive in terms of computational resources and responsiveness. However,
the longer the wait for results, the higher the quality of the results tends to be.
Thus, search system designers must choose what is most important to
their users: time or quality.
Its steps include tokenizing, parsing, stop-list removal and stemming, query
creation, and term weighting.
Parsing.
Since users may employ special operators in their query, including Boolean,
adjacency, or proximity operators, the system needs to parse the query first into
query terms and operators. These operators may occur in the form of reserved
punctuation (e.g., quotation marks) or reserved terms in specialized format (e.g.,
AND, OR). In the case of an NLP system, the query processor will recognize
the operators implicitly, no matter how they might be expressed in the language
used (e.g., prepositions, conjunctions, ordering).
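A sketch of this parsing step, under an assumed toy syntax (quoted phrases plus the reserved words AND, OR, NOT; no real engine's grammar is implied):

```python
import re

# A sketch of query parsing: split a query into terms and reserved
# operators. Quoted spans become phrases; AND/OR/NOT are operators;
# everything else is a lowercased term.
def parse_query(query):
    tokens = re.findall(r'"[^"]+"|\S+', query)  # quoted phrase or word
    parsed = []
    for tok in tokens:
        if tok in ("AND", "OR", "NOT"):
            parsed.append(("OP", tok))
        elif tok.startswith('"'):
            parsed.append(("PHRASE", tok.strip('"').lower()))
        else:
            parsed.append(("TERM", tok.lower()))
    return parsed

print(parse_query('kosovo AND "robin cook" NOT ultimatum'))
# [('TERM', 'kosovo'), ('OP', 'AND'), ('PHRASE', 'robin cook'),
#  ('OP', 'NOT'), ('TERM', 'ultimatum')]
```

Note how the reserved punctuation (quotation marks) and reserved terms (AND, NOT) are pulled out before any matching happens, exactly as the paragraph above describes.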
At this point, a search engine may take the list of query terms and search them
against the inverted file. In fact, this is the point at which the majority of publicly
available search engines perform the search.
Some search engines will go further and stop-list and stem the query, similar to
the processes described above in the Document Processor section. However,
since most publicly available search engines encourage very short queries, the
engines may drop these two steps.
An NLP system will recognize single terms, phrases, and Named Entities. If it
uses any Boolean logic, it will also recognize the logical operators from Step 2
and create a representation containing logical sets of the terms to be AND'd,
OR'd, or NOT'd.
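The logical sets described above can be sketched with ordinary set operations over an inverted index; the index contents here are invented for illustration:

```python
# A sketch of Boolean retrieval: each term maps to the set of document
# ids containing it, and the parsed query combines those sets.
index = {
    "kosovo":    {"d1", "d2", "d3"},
    "serbia":    {"d1", "d3"},
    "ultimatum": {"d2"},
}
all_docs = {"d1", "d2", "d3"}

def boolean_and(a, b): return a & b          # intersection: AND
def boolean_or(a, b):  return a | b          # union: OR
def boolean_not(a):    return all_docs - a   # complement: NOT

# kosovo AND NOT ultimatum
hits = boolean_and(index["kosovo"], boolean_not(index["ultimatum"]))
print(sorted(hits))  # ['d1', 'd3']
```

AND narrows the result set, OR widens it, and NOT requires knowing the full document set, which is why NOT is the most expensive of the three to evaluate.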
The final step in query processing involves computing weights for the terms in
the query. Sometimes the user controls this step by indicating either how much to
weight each term or simply which term or concept in the query matters most and
must appear in each retrieved document to ensure relevance.
Leaving the weighting up to the user is not common, because research has
shown that users are not particularly good at determining the relative importance
of terms in their queries. They can't make this determination for several reasons.
First, they don't know what else exists in the database, and document terms are
weighted by being compared to the database as a whole. Second, most users
seek information about an unfamiliar subject, so they may not know the correct
terminology. Few search engines implement system-based query weighting, but
some do an implicit weighting by treating the first term(s) in a query as having
higher significance. The engines use this information to provide a list of
documents/pages to the user. After this final step, the expanded, weighted query
is searched against the inverted file of documents.
3. Search and Matching Function
How systems carry out their search and matching functions differs according to
which theoretical model of information retrieval underlies the system's design
philosophy. Searching the inverted file for documents meeting the query
requirements, referred to simply as "matching," is typically a standard binary
search, no matter whether the search ends after the first two, five, or all seven
steps of query processing.
After computing the similarity of each document in the subset of documents, the
system presents an ordered list to the user. The sophistication of the ordering of
the documents again depends on the model the system uses, as well as the
richness of the document and query weighting mechanisms. For example, search
engines that only require the presence of any alpha-numeric string from the
query occurring anywhere, in any order, in a document would produce a very
different ranking than one by a search engine that performed linguistically correct
phrasing for both document and query representation and that utilized the proven
tf/idf weighting scheme.
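One common similarity computation, cosine similarity over weighted term vectors, can be sketched as follows. The source does not name a specific measure, so this choice and the sample weights are illustrative:

```python
import math

# A sketch of the similarity computation: documents and the query are
# weighted term vectors, and the cosine of the angle between them
# orders the result list.
def cosine(a, b):
    terms = set(a) | set(b)
    dot = sum(a.get(t, 0) * b.get(t, 0) for t in terms)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = {"kosovo": 1.0, "autonomy": 1.0}
docs = {
    "d1": {"kosovo": 2.0, "autonomy": 1.0, "serbia": 1.0},
    "d2": {"serbia": 3.0, "milosevic": 1.0},
}
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # ['d1', 'd2']
```

The richer the weights fed into the vectors (for example, the TF-IDF weights computed at indexing time), the more meaningful this ordering becomes.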
4. Ranking
However the search engine determines rank, the ranked results list goes to the
user, who can then simply click and follow the system's internal pointers to the
selected document/page.
More sophisticated systems will go even further at this stage and allow the user
to provide some relevance feedback or to modify their query based on the results
they have seen. If either of these is available, the system will then adjust its
query representation to reflect this value-added feedback and re-run the search
with the improved query to produce either a new set of documents or a simple re-
ranking of documents from the initial search.
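A sketch of such query adjustment, in the spirit of the Rocchio method (an assumption; the text names no specific algorithm): terms from documents the user marked relevant are blended into the query vector before the search is re-run.

```python
# A sketch of relevance feedback: blend the centroid of the documents
# judged relevant into the original query vector. alpha and beta are
# illustrative mixing weights, not values from the source.
def refine_query(query, relevant_docs, alpha=1.0, beta=0.5):
    new_query = {t: alpha * w for t, w in query.items()}
    for doc in relevant_docs:
        for term, weight in doc.items():
            new_query[term] = new_query.get(term, 0.0) + \
                beta * weight / len(relevant_docs)
    return new_query

query = {"kosovo": 1.0}
relevant = [{"kosovo": 1.0, "autonomy": 2.0}]
print(refine_query(query, relevant))
# {'kosovo': 1.5, 'autonomy': 1.0}
```

The refined query now contains "autonomy", a term the user never typed, which is how re-running the search can surface a new set of documents rather than merely re-ranking the old one.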
The software spider often reads and then indexes the entire text of each Web
site it visits into the main database of the search engine it is working for. Recently
many engines such as AltaVista have begun indexing only up to a certain
number of pages of a site, often about 400, and then stopping. Apparently, this is
because the Web has become so large that it is infeasible to index everything.
How many pages the spider will index is not entirely predictable. Therefore,
it's a good idea to specifically submit each important page in your site that you
want to be indexed, such as those that contain important keywords. A software
spider is like an electronic librarian who cuts out the table of contents of each
book in every library in the world, sorts them into a gigantic master index, and
then builds an electronic bibliography that stores information on which texts
reference which other texts. Some software spiders can index over a million
documents a day!
As a software spider visits your site, it notes any links on your page to other sites.
In any search engine's vast database are recorded all the links between sites.
The search engine knows which sites you linked to, and more importantly, which
ones linked to you. Many engines will even use the number of links to your site
as an indication of popularity, and will then boost your ranking based on this
factor.
CHAPTER 4
SEARCH ENGINE OPTIMIZATION
Search engine optimization (sometimes called SEO) is the process of increasing
the position of a website within the search engines' results, using careful analysis
and research techniques. It is the act of making one's website content more
search engine friendly so that it ranks higher. Each major search engine has a
unique way of determining the importance of a given website. Some search
engines focus on the content or verbiage. Some review meta tags to identify who
and what a website's business is. Most engines use a combination of meta tags,
content, link popularity, click popularity, and longevity to determine a site's
ranking. Google bases much of its results on popularity. The procedure of
website optimization ensures that a website has all of the necessary ranking
criteria to appeal to the individual search engines' needs.
To get listed correctly in the search engines, each page of your site that you want
listed needs to be optimized to the best of your ability. Since the keywords that
you decide to target will be used throughout the optimization process, choosing
the right keywords is essential. If you choose the wrong keywords, you will not be
found in the search engines, and if you are not found in the search engines, how
will anyone find your site?
• Try to think like your target audience. What would they search for when
looking for the page you are optimizing? Others will not necessarily use
the same keywords as you. You should try to come up with as many
keyword phrases as you can think of that relate to the page you are
optimizing.
• Check out your competition for ideas. Do a search using keywords that
you already know you want to target, and visit the top sites that come up.
Once on a site, view the source HTML code and look at the keywords in its
meta tags; this should give you many more ideas. Make sure to use only
keywords that relate to your site or page. To view the HTML code, simply
click 'View' at the top of your web browser and then select 'Source' or
'Page Source'.
• You should develop a list of keyword phrases for each page that you
optimize for the search engines.
The title tag of your page is the single most important factor to consider when
optimizing your web page for the search engines. This is because most engines
and directories place a high level of importance on keywords that are found in
your title tag. The title tag is also what the search engines usually use for the title
of your listing in the search results.
Tag limits: Your title tag should be between 50 and 80 characters long, including
spaces. The length that the different search engines accept varies, but as long as
you keep within this limit you should be fine.
Tag tips:
• You should include one or two of your most important keyword phrases in
the title tag, but be careful not to just list keywords. Your title tag should
include your keyword phrases while remaining as close to a readable
sentence as possible.
• Make your title enticing. Don't forget that even if you get that #1 listing in
the search engines, your listing still needs to say something that makes the
surfer want to click through and visit your site.
• Since the length of your title tag could be a little long for some engines,
place the keywords at the beginning of the tag.
• Each page of your site should have its own title tag with its own keywords
that relate to the page it appears on.
The copy on your page is also very important in order to achieve better search
engine listings. Actually, it is very close to being as important as your title tag so
make sure you keep reading. ‘Copy’ means the actual text that a visitor to your
site would read.
• For best results, each page you submit should have at least 200 words of
copy on it. There are some cases where this much text can be difficult to
put on a page, but the search engines really like it, so you should do your
best to increase the amount of copy where you can.
• This text should include your most important keyword phrases, but should
remain logical and readable.
• Be sure to use those phrases that you have used in your other tags during
the optimization process.
• Add additional copy-filled pages to your site. These types of content pages
not only help you in the search engines, but many other sites will link to
them too.
Optimizing your page copy is one of the most important things you could possibly
do to improve your listings in the search engines.
META Tags are hidden text placed in the HEAD section of your HTML page that
are used by most major search engines to index your site based on your
keywords and descriptions. Meta tags were originally created to help search
engines find out important information about your page that they might have had
difficulty determining otherwise. For example, related keywords or a description
of the page itself.
Many people believe that good meta tags are all that is needed to achieve good
listings in the search engines, which is entirely incorrect. While meta tags are
usually part of a well-optimized page, they are not the be-all and end-all of
optimizing your pages.
The search engines now usually look at a combination of all the best search
engine tips to determine your listings, not just your metas - some don't even look
at them at all! What this means is that your page should have a combination of all
the tips implemented on your page - not just meta tags. There are two meta tags
that can help your search engine listings - meta keywords & meta description.
Description Meta:
<META NAME="description" content="This would be your description of what is
on your page. Your most important keyword phrases should appear in this
description.">
Keywords Meta:
<META NAME="keywords" content="keywords phrase 1, keyword phrase 2,
keyword phrase 3, etc.">
The correct placement for both meta tags is between the <HEAD> and </HEAD>
tags within the HTML that makes up your page. Their order does not really matter,
but most people usually place the description first and then the keywords meta.
Tag limits:
Your keywords meta should not exceed 1024 characters including spaces, and
your description meta tag should not exceed 250 characters including spaces.
• Make sure that you accurately describe the content of your page while
trying to entice visitors to click on your listing.
• Include 3-4 of your most important keyword phrases, especially those
used in your title tag and page copy.
• Try to have your most important keywords appear at the beginning of your
description. This often brings better results, and will help avoid having any
search engine cut off your keywords if they limit the length of your
description.
• You should only use those keyword phrases that you also used in the copy
of your page, title tag, meta description, and other tags. Any keywords
phrases that you use that do not appear in your other tags or page copy
are likely to not have enough prominence to help your listings for that
phrase.
• Don't forget plurals. For example, a travel site might have both "Caribbean
vacation" and "Caribbean vacations" in their keyword meta tag to make
sure they show up in both searches.
• Watch out for repeats! You want to include your most important phrases,
but when doing so it can be difficult not to repeat one word many times.
For example, "caribbean vacation" and "caribbean vacations" are two
different phrases, but the word "caribbean" appears twice. This is okay to
do in order to make sure you get the phrases you need in there, but be
careful not to repeat any one word excessively. There is no actual limit, but
we recommend that no one word be repeated in the keyword meta more
than 5 times.
• If your site has content of interest to a specific geographic location be sure
to include the actual location in your keyword meta.
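The limits quoted in this chapter can be checked mechanically. A small sketch follows; the limit values simply restate the numbers given above, and the sample title is invented:

```python
# A sketch that checks the tag limits quoted in this chapter:
# 50-80 characters for the title, up to 250 for the description meta,
# up to 1024 for the keywords meta. Spaces count toward every limit.
LIMITS = {"title": (50, 80), "description": (0, 250), "keywords": (0, 1024)}

def check_tag(name, content):
    low, high = LIMITS[name]
    return low <= len(content) <= high

title = "Caribbean Vacations: Affordable Caribbean Vacation Packages"
print(len(title), check_tag("title", title))  # 59 True
```

Running a check like this over every page before submission catches over-length tags that a search engine would otherwise silently truncate.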
Here's the length of time it currently takes to get listed at each of the major
search engines once you have submitted your web page.
MSN: up to 2 months
Google: up to 4 weeks
AltaVista: up to 1 week
Fast: up to 2 weeks
Excite: up to 6 weeks
Northern Light: up to 4 weeks
AOL: up to 2 months
HotBot: up to 2 months
iWon: up to 2 months
GOOGLE-A SEARCH ENGINE
CHAPTER 5
SPAMMING
There are several things, considered "spamming," that you can do to try to get
your page listed higher on a search engine results page. An example is repeating
a word hundreds of times on a page to increase its frequency and propel the
page higher in the listings. Basically, you should never try to trick a search engine
in any way, or you risk being blacklisted by them. Since the majority of your traffic
will come from search engines, the risk far outweighs the benefits in the long run.
Below is a list of the more common things we recommend that you never do
when trying to achieve better listings.
CONCLUSION
A search engine is the popular term for an information retrieval (IR) system.
While researchers and developers take a broader view of IR systems, consumers
think of them more in terms of what they want the systems to do, namely to
search the Web, or an intranet, or a database. Actually, consumers would really
prefer a finding engine rather than a search engine. Optimization is the act of
making one's website content more search engine friendly so that it ranks higher.