
Proceedings of International Conference on Computing Sciences

WILKES100 ICCS 2013


ISBN: 978-93-5107-172-3
Ontology based web crawler to search document in semantic web
Vishal Jain¹* and Mayank Singh²

¹ Research Scholar, Computer Science and Engineering Department, Lingayas University, Faridabad
² Associate Professor, Krishna Engineering College, Ghaziabad
Abstract
The term Semantic Web (SW), coined by Tim Berners-Lee, covers a vast concept. The Semantic Web is defined as a
collection of information linked in such a way that it can be easily processed by machines; it is information in
machine-readable form. It contains Semantic Web Documents (SWDs), which are written in RDF or OWL and hold
information relevant to a user's query. Crawlers play a vital role in accessing information from SWDs. A crawler is
software that systematically browses documents and extracts information from them for the purpose of indexing. There
is therefore a need for prototype systems that perform Information Retrieval (IR) using crawlers. Several such
prototype systems exist, including OWLIR, SWANGLER and SWOOGLE. This paper outlines SWOOGLE, a crawler-based
indexing and retrieval system for finding SWDs. It describes the relations derived from RDF and OWL and lists
ontologies, thus providing a complete description of the given problem.
© 2013 Elsevier Science. All rights reserved.
Keywords: Semantic Web (SW), Ontology, SWOOGLE, Semantic Web Documents (SWDs)
1. Introduction
The Semantic Web (SW) came into existence because conventional search engines dissatisfy users by retrieving
inadequate and inconsistent results. The documents these engines retrieve vary widely in kind and relevance.
The engines work on predefined standard terms in a centralized environment and so access only standard ontologies.
With the advent of the SW and ontologies, users can develop new facts and use their own keywords and terms in
different environments. With the use of an ontology, users can perform the following tasks:
(a) Users can use Interface Description Languages (IDL) and services for different environments. IDL here means
defining new data objects and their relations.
(b) Users can communicate with different agents using a shared ontology such as FOAF (Friend of a Friend).
The Semantic Web (SW) [1] is a combination of SWDs expressed in ontology languages (RDF, OWL).
An ontology [2] is a categorization of concepts and of the relationships between terms in a hierarchical fashion.
Although SWDs yield relevant information because they are characterized by semantic methods and ideas,
finding the URLs of SWDs is a tedious job.
There is therefore a need for crawler-based prototype systems that focus on extracting metadata for
each SWD. The paper is organized as follows:
Section 2 surveys the literature. Section 3 introduces SWOOGLE, its significance and its architecture. Section 4
explains how SWOOGLE improves on other prototype systems and ontology repositories. It also describes the
types of SWDs and the use of crawlers, and shows how ontologies are found with the Ontology Rank algorithm,
which identifies whether a given document is a Semantic Web Ontology (SWO) or a Semantic Web Database
(SWDB). Section 4.4 illustrates the current status of searching through SWOOGLE with screenshots, and
Section 5 concludes.
* Corresponding author: Vishal Jain
290 Elsevier Publications, 2013
2. Literature Survey
The earliest version of SWOOGLE was ver 1.0, which offered advanced database search queries. Because of its ability to
retrieve SWDs, SWOOGLE has emerged as a semantic search engine through later versions such as SWOOGLE 2005 (ver
2.1) and SWOOGLE 2007 (ver 3.1).
Finding SWDs from input keywords is a very challenging task. Before SWOOGLE existed, documents were retrieved
using conventional Information Retrieval (IR) approaches and traditional search engines. These engines were not
intelligent enough to retrieve only relevant documents, and they retrieved ordinary text documents rather than
markup documents. As a result, they returned large numbers of documents, relevant and irrelevant alike.
Some researchers tried to use Knowledge Management (KM) solutions in complex environments, a major phase in the
emergence of the SW and ontologies. With the SW came SWDs, which combine text documents with
structured documents written in ontology languages. At that time there were no crawlers, and users found ontologies by
combining results retrieved from documents with the help of ontology editors. These editors represent the concepts and
the relationships between terms that match a given query.
Web crawlers then came into use, including the Google Bot crawler and Yahoo. They can retrieve relevant documents
and satisfy user queries, but they cannot deliver ontologies or generate metadata.
3. Outline of SWOOGLE
SWOOGLE [3] is a crawler-based indexing prototype system that retrieves documents based on a set of classes,
properties and methods and produces the URIs that match the query.
3.1. Why SWOOGLE? What is its significance?
The SW is a web of documents that works much like the web of HTML documents. HTML documents nevertheless differ
from SWDs: they are served by conventional search engines, which cannot extract the required information in a short
and simple way. With this in mind, SWOOGLE [3] was developed as a prototype SW search engine for extracting SWDs,
for use by both human users and software agents. With SWOOGLE, users can AEQ RDF and OWL documents, where A
stands for Access, E for Explore and Q for Query. Querying lets users resolve their questions by submitting a query.
3.2. Defining and Analyzing SWOOGLE
SWOOGLE is a crawler-based indexing and retrieval system for the SW. Indexing here means the generation of metadata:
SWOOGLE extracts metadata for each SWD and records the relationships between those documents. Documents are indexed
by an Information Retrieval (IR) system that uses either character N-grams or URIs (Uniform Resource Identifiers) as
keywords to find relevant documents.
It provides a web interface through which a user can pose a query by submitting the URL of an SWD or of a web page directly.
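The idea of using character N-grams as index keywords can be illustrated with a minimal sketch. This is not SWOOGLE's implementation; the function names, the document identifiers and the choice of n = 3 are assumptions made only for illustration.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return the overlapping character n-grams of text, lowercased."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def build_index(docs: dict[str, str], n: int = 3) -> dict[str, set[str]]:
    """Map every n-gram to the ids of the documents that contain it."""
    index: dict[str, set[str]] = {}
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index.setdefault(gram, set()).add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str, n: int = 3) -> list[str]:
    """Rank documents by how many distinct query n-grams they share."""
    hits: Counter = Counter()
    for gram in set(char_ngrams(query, n)):
        for doc_id in index.get(gram, ()):
            hits[doc_id] += 1
    return sorted(hits, key=lambda d: (-hits[d], d))


# Hypothetical documents, keyed by made-up ids.
docs = {"d1": "foaf Person", "d2": "economic crisis"}
index = build_index(docs)
```

A query such as `search(index, "person")` then returns the ids of documents that share trigrams with the query string, ordered by overlap.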
Analysis: -
Swoogle can be analysed in terms of three activities, which are listed below:-
Searching for appropriate ontologies
Searching for data instances
Characterizing the Semantic Web
We discuss them one by one.
(a) Searching for appropriate ontologies: - Conventional search engines often fail to find what is required for a
particular task. Swoogle helps in finding ontologies because it allows users to query for documents.
(b) Finding data instances: - Swoogle allows users to query SWDs with keywords that use classes and properties.
(c) Characterizing the Semantic Web: - The data collected by researchers leads to a characterization of the SW, so that
users can answer questions about ontologies.
3.3. SWOOGLE Architecture
Its architecture comprises four components:
(a) SWDs discovery
(b) Metadata creation
(c) Analysis of data
(d) Interface
All four components work independently and interact with each other through a database.
SWD discovery: - It discovers Semantic Web Documents and keeps up-to-date information about objects.
Metadata creation: - It maintains a cache of SWDs and generates metadata at both the semantic and the syntactic level.
Data analysis: - It uses the cached SWDs and the metadata to produce analyses with the help of an IR analyzer and an
SWD analyzer.
Interface: - It provides data services to the SW community.
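The arrangement described above, in which independent components communicate only through a shared database, can be sketched as follows. This is a toy illustration rather than SWOOGLE's actual design: the table layout, the function names and the crude rule that counts a line ending in "." as one triple are all assumptions.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create the shared store that all components read and write."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE swd (url TEXT PRIMARY KEY, content TEXT, n_triples INTEGER)"
    )
    return db


def discover(db: sqlite3.Connection, url: str, content: str) -> None:
    """Discovery component: record a found document (metadata still empty)."""
    db.execute(
        "INSERT OR REPLACE INTO swd (url, content) VALUES (?, ?)", (url, content)
    )


def create_metadata(db: sqlite3.Connection) -> None:
    """Metadata component: fill in a triple count for unprocessed documents.

    Assumption: content is N-Triples-like, one statement per line ending in '.'.
    """
    rows = db.execute(
        "SELECT url, content FROM swd WHERE n_triples IS NULL"
    ).fetchall()
    for url, content in rows:
        n = sum(1 for line in content.splitlines() if line.strip().endswith("."))
        db.execute("UPDATE swd SET n_triples = ? WHERE url = ?", (n, url))


def analyze(db: sqlite3.Connection) -> int:
    """Analysis component: total number of triples over all stored documents."""
    return db.execute("SELECT COALESCE(SUM(n_triples), 0) FROM swd").fetchone()[0]


def interface(db: sqlite3.Connection, min_triples: int) -> list[str]:
    """Interface component: list documents with at least min_triples triples."""
    rows = db.execute(
        "SELECT url FROM swd WHERE n_triples >= ? ORDER BY url", (min_triples,)
    )
    return [row[0] for row in rows]
```

None of the four functions calls another; each one only reads from or writes to the database, which is the property the architecture description emphasises.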
4. How is SWOOGLE better than other prototype systems and ontology repositories?
Many prototype systems have been designed to answer user queries, such as OWLIR (Ontology Web Language
and Information Retrieval), SWANGLER and SWOOGLE.
OWLIR is a prototype system that takes text documents as input; it does not directly accept
RDF or OWL documents. It annotates text documents with SW markup, produces results and then
indexes them. To find SWDs with OWLIR, a custom indexing system must be built, after which
both structured and text documents can be passed to it. It is therefore an improvement, but not an optimal system.
SWANGLER directly accepts RDF documents encoded in XML and produces documents that
suit the given query. It could be an optimal system, but it fails because of the following problems:
(a) XML namespaces are not recognized by search engines like Google.
(b) The tokenization rules of such search engines are designed for natural languages.
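Problem (b) can be made concrete with a small sketch. The tokenizer below is only a stand-in for the kind of rule a natural-language search engine might apply, not any real engine's tokenizer; the FOAF term URI is used purely as an example input.

```python
import re


def natural_language_tokens(text: str) -> list[str]:
    """Split on every character that is not a letter or digit.

    Assumption: this mimics a simple word tokenizer for natural-language text.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


uri = "http://xmlns.com/foaf/0.1/Person"
tokens = natural_language_tokens(uri)
# The URI falls apart into generic fragments such as "http" and "com",
# and no single token still identifies the term it named.
```

Because the term's identity lies in the URI as a whole, an index built from such fragments cannot distinguish this term from any other document that merely mentions the same words.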
SWOOGLE is regarded as an optimal crawler-based prototype system that maintains interoperability between SWDs.
Because the Semantic Web contains RDF documents, SWOOGLE takes RDF documents directly as input and lists
the ontologies that match the query. It can use either N-grams or URI refs as keywords to find relevant documents.
OWLIR and SWANGLER encode only one triple for each term; if there is more than one triple, the triples are replaced
by a single URI. SWOOGLE can analyse many SWDs containing many triples. It captures more metadata on classes and
properties in order to support a huge collection of documents.
SWOOGLE is therefore better suited to this task than the other prototype systems.
Comparison with Ontology Systems:
SWOOGLE differs from other SW engines and query systems. Ontology-based annotation
systems such as SHOE, CREAM and WEBKB focus on creating metadata for online documents without examining
the whole documents. Their ontology standards differ from those of SWDs. These systems simply store RDF
documents rather than processing and querying them. They are therefore not capable of handling millions of
documents, because their own ontologies are not suitable for SWDs.
4.1. Types of Semantic Web Documents (SWDs)
A Semantic Web Document (SWD) is a document written in an SW language such as OWL or DAML+OIL that is
online and accessible to all web users. SWDs are the only means of information exchange in the SW.
(a) Semantic Web Ontologies (SWOs): - A document is said to be an SWO when a sufficient proportion of its
statements define new classes and properties or inherit the definitions of terms used by other SWDs.
(b) Semantic Web Databases (SWDBs): - A document is said to be an SWDB when it does not define new terms. It
matches a given query against terms that are stored in the database.
4.2. Use of Crawlers in Finding SWDs
The simplest way to find SWDs is to use conventional search engines, but they do not return relevant results.
A set of crawlers, such as the Google Crawler and the Focused Crawler, is therefore used for finding SWDs.
Google Crawler: - It searches for URLs using the Google search engine, restricted to file extensions such as rdf, owl
and daml. Keywords are added to make the search more expressive. Searching for URLs depends on the Google
crawler (GoogleBot), the Google Indexer and the Google Query Processor.
The process is as follows:
Web pages are downloaded by a web crawler named GoogleBot. It is a web crawling robot that
retrieves pages on the web and hands them off to the Google Indexer. GoogleBot runs on many computers
that request and fetch web pages. Each web page has an associated ID number called a docID,
which is assigned when its URL is first entered.
A URL Server sends the list of URLs to be fetched to the crawler. Fetched web pages are sent to the
Store Server, which compresses them and stores them in a repository.
The Google Indexer uncompresses the documents. It removes all bad links in every web page and stores the
important information. It ignores some punctuation marks and converts all letters to lowercase.
After the Indexer, the Google Query Processor retrieves the stored documents and returns search results with
the help of the Doc Server.
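Two steps of the process described above, the assignment of a docID per URL and the indexer's text normalization, can be sketched briefly. This is an illustrative simplification and not Google's implementation: the class and function names are hypothetical, and the sketch drops all ASCII punctuation although the text only says that some punctuation marks are ignored.

```python
import string


class DocIdTable:
    """Assign a stable integer docID to each URL the first time it is seen."""

    def __init__(self) -> None:
        self._ids: dict[str, int] = {}

    def doc_id(self, url: str) -> int:
        # A new URL receives the next unused number; a known URL keeps its id.
        return self._ids.setdefault(url, len(self._ids))


def normalize(text: str) -> list[str]:
    """Strip ASCII punctuation, lowercase, and split on whitespace."""
    stripped = text.translate(str.maketrans("", "", string.punctuation))
    return stripped.lower().split()


table = DocIdTable()
first = table.doc_id("http://example.org/a")   # new URL
second = table.doc_id("http://example.org/b")  # another new URL
again = table.doc_id("http://example.org/a")   # same id as `first`
```

Normalizing in this way is suitable for natural-language text, which is why the same pipeline handles URIs and markup poorly.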
Focused Crawler: - It finds documents within a given website. It uses file extensions such as jpg and html as a
filtering heuristic to reduce complexity.
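The extension-based selection used by these crawlers can be sketched as a simple URL filter. The extension set follows the rdf, owl and daml examples given for the Google Crawler; the function names and the example URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlparse

# Extensions named in the text as indicating candidate SWDs.
SWD_EXTENSIONS = {".rdf", ".owl", ".daml"}


def looks_like_swd(url: str) -> bool:
    """Return True if the URL path ends in one of the SWD extensions."""
    path = urlparse(url).path  # ignores any query string or fragment
    ext = posixpath.splitext(path)[1].lower()
    return ext in SWD_EXTENSIONS


def filter_candidates(urls: list[str]) -> list[str]:
    """Keep only the URLs that look like Semantic Web Documents."""
    return [u for u in urls if looks_like_swd(u)]


candidates = filter_candidates([
    "http://example.org/foaf.rdf",
    "http://example.org/index.html",
    "http://example.org/onto.OWL?x=1",
    "http://example.org/pic.jpg",
])
```

A file extension is only a hint, so a real crawler would still need to parse each candidate to confirm that it is an SWD.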
A JENA2-based SWOOGLE crawler first analyses the content of SWDs and then produces them.

4.3. Finding Ontologies using Ontology Rank Algorithm
To find ontologies, we should be aware of the language features and RDF statistics of SWDs, which are described
below:


(a) Language Features: - These list the features of SWDs and their properties. They include:
Encoding: - Three types of encoding are used in SWDs: RDF/XML, N-Triples and N3.
Language: - The SW languages used, namely OWL, RDF, RDFS and DAML.
OWL Species: - The language species of SWDs written in OWL only. The species are
OWL-LITE, OWL-DL and OWL-FULL.

(b) RDF Statistics: - These describe how SWDs define new classes, properties and individuals. There are three
kinds of node: Class (C), Property (P) and Individual (I).

RDF statistics describe properties of the nodes of the RDF graph of an SWD. A node is a Class if and only
if it is not an empty (anonymous) node and is an instance of rdfs:Class (RDF Schema). A node is a Property if and
only if it is not an empty (anonymous) node and is an instance of rdf:Property. An Individual is a node that is an
instance of any user-defined class.
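The three node definitions above can be sketched over a list of triples. This is a minimal illustration under stated assumptions: triples are plain (subject, predicate, object) string tuples with prefixed names, an "empty" node is taken to mean a blank node written with the "_:" prefix, and a "user-defined class" is taken to mean any type other than rdfs:Class and rdf:Property.

```python
RDF_TYPE = "rdf:type"
RDFS_CLASS = "rdfs:Class"
RDF_PROPERTY = "rdf:Property"

Triple = tuple[str, str, str]


def classify_nodes(triples: list[Triple]) -> tuple[set[str], set[str], set[str]]:
    """Return the sets (classes, properties, individuals) found in the triples."""
    classes: set[str] = set()
    properties: set[str] = set()
    individuals: set[str] = set()
    for subject, predicate, obj in triples:
        if predicate != RDF_TYPE:
            continue  # only typing statements decide the kind of a node
        is_blank = subject.startswith("_:")
        if obj == RDFS_CLASS:
            if not is_blank:
                classes.add(subject)
        elif obj == RDF_PROPERTY:
            if not is_blank:
                properties.add(subject)
        else:
            # Typed with something other than the two built-in kinds.
            individuals.add(subject)
    return classes, properties, individuals


# Hypothetical example graph.
example = [
    ("ex:Person", "rdf:type", "rdfs:Class"),
    ("ex:knows", "rdf:type", "rdf:Property"),
    ("ex:alice", "rdf:type", "ex:Person"),
    ("_:b0", "rdf:type", "rdfs:Class"),
    ("ex:alice", "ex:knows", "ex:bob"),
]
classes, properties, individuals = classify_nodes(example)
```

The sizes of the three resulting sets are the C, P and I counts that the RDF statistics refer to.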

Ontology Rank Algorithm: -

It ranks all the ontologies that SWOOGLE returns while finding SWDs. The rank indicates the extent to which a
particular ontology can be used.

Let gag be an SWD, and let C(gag), P(gag) and I(gag) be the sets of classes, properties and individuals of that SWD.
The ontology ratio of the SWD is then calculated as:
R(gag) = (|C(gag)| + |P(gag)|) / (|C(gag)| + |P(gag)| + |I(gag)|)
If R(gag) = 0, the SWD is a pure SWDB; if R(gag) = 1, it is a pure SWO. Values in between indicate a document
that both defines terms and contains individuals.
(c) Ontology Annotations: - These are the properties that describe an SWD as an ontology, namely label,
comment and version info.
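The ontology ratio defined in this section can be computed directly from the three counts. The sketch below is an illustration only; the function name and the example counts are hypothetical, and the handling of a document with no classes, properties or individuals is an assumption, since the text does not define the ratio for that case.

```python
def ontology_ratio(num_classes: int, num_properties: int, num_individuals: int) -> float:
    """Return (|C| + |P|) / (|C| + |P| + |I|) for one SWD."""
    defined = num_classes + num_properties
    total = defined + num_individuals
    if total == 0:
        # Assumption: the ratio is undefined for a document with no such nodes.
        raise ValueError("ratio is undefined when C, P and I are all empty")
    return defined / total


only_individuals = ontology_ratio(0, 0, 5)  # no new terms defined
only_terms = ontology_ratio(3, 2, 0)        # no individuals present
mixed = ontology_ratio(3, 1, 4)             # both terms and individuals
```

A document containing only individuals yields 0, one containing only class and property definitions yields 1, and a document with both falls strictly between the two.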

4.4. Illustration of SWOOGLE

This section describes the layout of SWOOGLE version 3.1, in use in 2007. It allows users to enter an
arbitrary string in order to find the SWDs relevant to that string.
SWOOGLE analyses whole documents and returns only the relevant parts in ranked order, such as URLs,
terms, descriptions and namespaces of the documents.


Fig 4: SWOOGLE Start-Up Page

We searched for the string "Economic Crisis". SWOOGLE returns the SWDs that match these keywords in ranked order,
with separate documents for the keyword "economic" and for the keyword "crisis".
The result is shown below:


Fig 5: SWOOGLE query result

The screenshot above shows that the first SWD is encoded in N3 and has an ontology ratio of 0.61.
The second document is encoded in RDF/XML and has an ontology ratio of 0.97.
The namespaces related to the second SWD are shown below:
Fig 6: Namespaces about given SWD
The current version of SWOOGLE reports the following statistical information, including the number of SWDs
retrieved, the number of triples generated and other parameters.
These figures indicate that SWOOGLE can handle a huge collection of documents.
Fig 7: SWOOGLE statistical information
5. Conclusions
This paper has presented a way of extracting Semantic Web Documents (SWDs) using a crawler-based
prototype indexing and retrieval system named SWOOGLE. It generates metadata for given SWDs and lists the
ontologies related to given keywords. It is better than other prototype systems such as OWLIR and SWANGLER,
which require a custom indexing module to be built and use their own ontology standards, which are not suitable
for SWDs.
OWLIR and SWANGLER treat markup as structured information and compute results over it. SWOOGLE stores
metadata about RDF documents in its database so that it can retrieve SWDs based on Classes (C), Properties (P)
and Individuals (I). SWOOGLE is designed to work with all SWDBs and is better suited to this task than current web
search engines like Google, because Google works with natural languages only.


Acknowledgement
I, Vishal Jain, give my sincere thanks to Prof. M. N. Hoda, Director, BVICAM, New Delhi, for giving me the
opportunity to pursue a Ph.D. at Lingayas University, Faridabad.

References
[1]. T. Berners-Lee, The Semantic Web, Scientific American, May 2007.
[2]. Berners Lee, J. Lassila, Ontologies in Semantic Web, Scientific American, May (2001) 34-43.
[3]. Tim Finin, Anupam Joshi, Vishal Doshi, Swoogle: A Semantic Web Search and Metadata Engine, In Proceedings of the 13th
International Conference on Information and Knowledge Management, pages 461-468, 2004.
[4]. Gagandeep Singh, Vishal Jain, Information Retrieval (IR) through Semantic Web (SW): An Overview, In Proceedings of
CONFLUENCE 2012 - The Next Generation Information Technology Summit at Amity School of Engineering and Technology, September
2012, pp 23-27.
[5]. M. Preethi, Dr. J. Akilandeswari, Combining Retrieval with Ontology Browsing, International Journal of Internet Computing, Vol. 1,
Issue 1, 2011.
[6]. T. Finin, J. Mayfield, A. Joshi, Information retrieval and the semantic web, IEEE/WIC International Conference on Web Intelligence,
October 2003.
[7]. U. Shah, T. Finin and A. Joshi, Information Retrieval on the semantic web, Scientific American, pages 34-43, 2003.
[8]. Stojanovic, N. Studer, R. Stojanovic, An approach for ranking of query results in the Semantic Web, The Semantic Web ISWC,
2003, pp 500-516.
[9]. Swati Ringe, Nevin Francis, Palanawala, Ontology Based Web Crawler, International Journal of Computer Applications in
Engineering Sciences, ISSN 2231-4946, Vol. II, Issue III, September 2012.
[10]. Goetz Graefe, Query Evaluation techniques for large databases, In Proceedings of ACM COMPUTING SURVEYS, 2003.
