Proceedings of International Conference on Computing Sciences
WILKES100 ICCS 2013
ISBN: 978-93-5107-172-3
Ontology based web crawler to search document in semantic web
Vishal Jain 1* and Mayank Singh 2
1 Research Scholar, Computer Science and Engineering Department, Lingayas University, Faridabad
2 Associate Professor, Krishna Engineering College, Ghaziabad

Abstract
The term Semantic Web (SW), given by Tim Berners-Lee, covers a vast concept. The Semantic Web is defined as a collection of information linked in such a way that it can be easily processed by machines; it is information in machine-readable form. It contains Semantic Web Documents (SWDs) written in the RDF or OWL languages, which hold information relevant to a user's query. Crawlers play a vital role in accessing information from SWDs. A crawler is software that systematically browses documents and extracts information from them for the purpose of indexing. There is therefore a need for prototype systems that perform Information Retrieval (IR) using crawlers. Several such prototype systems exist, including OWLIR, SWANGLER and SWOOGLE. This paper outlines SWOOGLE, a crawler-based indexing and retrieval system for finding SWDs. It describes the relations that are derived from the RDF and OWL languages and lists Ontologies, thus providing a complete description of the given problem.
2013 Elsevier Science. All rights reserved.
Keywords: Semantic Web (SW), Ontology, SWOOGLE, Semantic Web Documents (SWDs)

1. Introduction
The Semantic Web (SW) came into existence because conventional search engines dissatisfy users by retrieving inadequate and inconsistent results. The documents retrieved by conventional search engines are often unrelated to what the user wanted. These engines work on predefined standard terms in a centralized environment and therefore access only standard Ontologies. With the advent of the SW and Ontology, users can develop new facts and use their own keywords and terms in different environments. With the use of an ontology, a user can perform the following tasks:
(a) Users can use Interface Description Languages (IDL) and services for different environments.
IDL means defining new data objects and their relations.
(b) Users can communicate with different agents using a shared ontology such as FOAF (Friend of a Friend).

The Semantic Web (SW) [1] is a collection of SWDs expressed in ontology languages (RDF, OWL). An ontology [2] is a categorization of concepts and of the relationships between terms in a hierarchical fashion. Although SWDs yield relevant information because they are characterized by semantic methods and ideas, finding the URLs of SWDs is a tedious job. There is therefore a need for crawler-based prototype systems that focus on extracting metadata for each SWD.

The paper is organized as follows. Section 2 surveys how documents were retrieved before and after SWOOGLE appeared. Section 3 introduces SWOOGLE, its significance and its architecture. Section 4 explains how SWOOGLE is better than other prototype systems and Ontology repositories. It also describes the types of SWDs and the use of crawlers, and shows how to find Ontologies with the Ontology Rank algorithm, which identifies whether a given document is a Semantic Web Ontology (SWO) or a Semantic Web Database (SWDB). Section 4.4 presents the current status of searching through SWOOGLE by means of screenshots, and Section 5 concludes.

* Corresponding author - Vishal Jain
Elsevier Publications, 2013

2. Literature Survey
The earliest version of SWOOGLE was ver 1.0, which offered an advanced database search query facility. Owing to its capability of retrieving SWDs, SWOOGLE has emerged as a Semantic Search Engine through later versions such as SWOOGLE 2005 (ver 2.1) and SWOOGLE 2007 (ver 3.1). Finding SWDs from input keywords is a very challenging task. Before SWOOGLE existed, documents were retrieved using conventional Information Retrieval (IR) approaches and traditional search engines. These engines are not intelligent enough to retrieve only relevant documents. They retrieve ordinary text documents rather than markup documents.
As a result, a large number of documents is produced, some relevant and some irrelevant. Some researchers tried to use Knowledge Management (KM) solutions in complex environments, which was a major phase in the emergence of the SW and Ontology. With the SW came SWDs, which combine text documents with structured documents written in ontology languages. At that time there were no crawlers, and users found Ontologies by combining the results retrieved from documents with the help of Ontology editors. These editors represent the concepts and the relationships between terms that match a given query. Web crawlers, such as the Google Bot crawler and Yahoo's crawler, came next. They are able to retrieve relevant documents and satisfy a user's query, but they cannot deliver Ontologies or generate metadata.

3. Outline of SWOOGLE
SWOOGLE [3] is a crawler-based indexing prototype system that retrieves documents based on a set of classes, properties and methods, and produces the URIs that match the query.

3.1. Why SWOOGLE? What is its significance?
The SW is a web of documents, much like the web of HTML documents. HTML documents differ from SWDs, however, because they are served by conventional search engines, which cannot extract the required information in a short and simple way. Keeping this in mind, we have developed a prototype SW search engine called SWOOGLE for extracting SWDs; it is used by both users and software agents. With the help of SWOOGLE, we can Access, Explore and Query (AEQ) RDF and OWL documents. Querying lets us resolve our misconceptions by posing a query.

3.2. Defining and Analyzing SWOOGLE
SWOOGLE is a crawler-based indexing and retrieval system for the SW. Indexing means the generation of metadata: SWOOGLE extracts metadata for each SWD and records the relationships between those documents.
Documents are indexed by an Information Retrieval (IR) system that uses either character N-grams or URIs (Uniform Resource Identifiers) as keywords to find relevant documents. SWOOGLE provides a web interface where the user can pose a query by submitting the URL of either an SWD or a web page directly.

Analysis: SWOOGLE can be analyzed in terms of three activities, which are listed below:
- Searching for appropriate Ontologies
- Searching for data instances
- Characterizing the Semantic Web
We discuss them one by one.
(a) Searching appropriate Ontology: Conventional search engines often fail to find the terms required for a particular task. Swoogle helps in finding Ontologies because it allows the user to query for documents.
(b) Finding Data Instance: Swoogle allows the user to query SWDs with keywords that use Classes/Properties.
(c) Characterizing Semantic Web: The data collected by researchers leads to a characterization of the SW. Users can then answer questions about an ontology.

3.3. SWOOGLE Architecture
The architecture has four components:
(a) SWDs discovery
(b) Metadata creation
(c) Analysis of data
(d) Interface
All four components work independently and interact with each other through a database.
SWDs discovery: It discovers Semantic Web Documents and keeps up-to-date information about objects.
Metadata creation: It caches SWDs and generates metadata at both the semantic and the syntactic level.
Data Analysis: It uses the cached SWDs and the metadata to produce analyses with the help of an IR analyzer and an SWD analyzer.
Interface: It provides data services to the SW community.

4. How Swoogle is better than other Prototype Systems and Ontology Repositories?
Many prototype systems have been designed to answer user queries, such as OWLIR (Ontology Web Language and Information Retrieval), SWANGLER and SWOOGLE.
OWLIR is a prototype system that takes text documents as input. It does not directly accept RDF or OWL documents. It annotates text documents with SW markup, produces results and then indexes them. To find SWDs with OWLIR, we have to build a Custom Indexing System, after which we can pass both structured and text documents. OWLIR is therefore an improvement, but not an optimal system.

SWANGLER directly accepts RDF documents encoded in XML and produces documents that suit the given query. It could be an optimal system, but it falls short because of the following problems:
(a) XML namespaces are not understood by search engines such as Google.
(b) Tokenization rules are designed for natural languages.

SWOOGLE is regarded as an optimal crawler-based prototype system that maintains interoperability between SWDs. Because the Semantic Web contains RDF documents, SWOOGLE takes RDF documents directly as input and lists the Ontologies that match the query. It can use either N-grams or URI refs as keywords to find relevant documents. OWLIR and SWANGLER encode only one triple for each term; if there is more than one triple, they are replaced by a single URI. SWOOGLE can analyze a large number of SWDs containing many triples. It captures more metadata on classes and properties in order to support a huge collection of documents. SWOOGLE is therefore better suited to the task than the other prototype systems.

Comparison with Ontology Systems:
SWOOGLE also differs from other SW engines and query systems. Ontology-based annotation systems such as SHOE, CREAM and WEBKB focus on creating metadata for online documents without examining the whole documents. Their ontology standards differ from those of SWDs. These systems simply store RDF documents rather than analyzing and querying them. They are therefore not capable of handling millions of documents, because their own Ontologies are not suitable for SWDs.

4.1. Types of Semantic Web Documents (SWDs)
A Semantic Web Document (SWD) is a document written in an SW language such as OWL or DAML+OIL that is online and easily accessible to all web users. The SWD is the only means of information exchange in the SW.
(a) Semantic Web Ontologies (SWOs): A document is said to be an SWO when a significant portion of its statements define new classes and properties or inherit the definitions of terms used by other SWDs.
(b) Semantic Web Databases (SWDBs): A document is said to be an SWDB when it does not define new terms. It matches a given query against terms that are stored in the database.

4.2. Use of Crawlers in Finding SWDs
The simplest way to find SWDs is to use conventional search engines, but they do not return relevant results. We have developed a set of crawlers, such as the Google Crawler and Focused Crawlers, for finding SWDs.

Google Crawler: It searches for URLs using the Google search engine and restricts the search to extensions such as rdf, owl and daml. To make the search more expressive, we have introduced the use of keywords. Searching for URLs depends on the Google Crawler (GoogleBot), the Google Indexer and the Google Query Processor. The process is as follows:
- Web pages are downloaded by a web crawler named GoogleBot, a web crawling robot that retrieves pages on the web and hands them off to the Google Indexer. GoogleBot runs on many computers that request and fetch web pages.
- Each web page has an associated ID number called a docID. When a URL is entered, it is assigned a docID.
- A URL Server sends the list of URLs to be fetched to the crawler.
- Fetched web pages are sent to the Store Server, which compresses them and stores them in a repository.
- The Google Indexer uncompresses the documents. It removes all bad links in every web page and stores the important information. It ignores some punctuation marks and converts all letters to lowercase.
- After the Indexer, the Google Query Processor retrieves the stored documents and returns search results with the help of the Doc Server.

Focused Crawler: It finds documents within a given website. It applies extension constraints, for example on jpg and html, to reduce complexity. The SWOOGLE crawler is based on JENA2; it first analyzes the content of SWDs and then produces them.
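The crawling steps described in this section can be illustrated with a minimal Python sketch. It is not SWOOGLE's or Google's implementation: the function names (`looks_like_swd`, `assign_doc_ids`, `normalize`) and the example URLs are hypothetical, and the extension set simply reuses the rdf, owl and daml extensions mentioned above.

```python
import re
from urllib.parse import urlparse

# Extensions the text associates with SWDs (used here as a simple filter).
SWD_EXTENSIONS = (".rdf", ".owl", ".daml")


def looks_like_swd(url: str) -> bool:
    """Return True when the URL path ends with an SWD-style extension."""
    path = urlparse(url).path.lower()
    return path.endswith(SWD_EXTENSIONS)


def assign_doc_ids(urls):
    """Give each distinct URL a sequential docID, in first-seen order."""
    doc_ids = {}
    for url in urls:
        if url not in doc_ids:
            doc_ids[url] = len(doc_ids)
    return doc_ids


def normalize(text: str):
    """Lowercase the text and drop punctuation, as the indexer step describes."""
    return re.findall(r"[a-z0-9]+", text.lower())


if __name__ == "__main__":
    # Hypothetical candidate URLs, for illustration only.
    candidates = [
        "http://example.org/onto/people.owl",
        "http://example.org/index.html",
        "http://example.org/data/foaf.rdf",
    ]
    swd_urls = [u for u in candidates if looks_like_swd(u)]
    print(assign_doc_ids(swd_urls))
    print(normalize("Economic Crisis, 2013!"))
```

Filtering by extension is only a cheap heuristic: a document served without one of these extensions would be missed, so a real crawler would also need to inspect the content.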
4.3. Finding Ontologies using Ontology Rank Algorithm
To find Ontologies, we should be aware of the language features and the RDF statistics of SWDs, which are described below:
(a) Language Features: These are the features of SWDs and their properties. They include:
Encoding: Three types of encoding are used in SWDs, namely RDF/XML, N-Triples and N3.
Language: The SW languages used, which are OWL, RDF, RDFS and DAML.
OWL Species: The language species of SWDs written in OWL only. The species are OWL-LITE, OWL-DL and OWL-FULL.
(b) RDF Statistics: These describe how SWDs define new classes, properties and individuals. There are three kinds of node: Class (C), Property (P) and Individual (I).
RDF statistics are properties of the nodes of the RDF graph of an SWD. A node is a Class if and only if it is not an empty (anonymous) node and is an instance of rdfs:Class (RDF Schema). A node is a Property if and only if it is not an empty node and is an instance of rdf:Property. An Individual is a node that is an instance of any user-defined class.
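The three node definitions above can be sketched in Python as follows. This is a simplified illustration rather than SWOOGLE's code: triples are plain string tuples instead of a parsed RDF graph, empty nodes are assumed to be written with the `_:` prefix, and "user-defined class" is approximated as any type outside the rdf:, rdfs: and owl: vocabularies. The function name `classify_nodes` and the `ex:` terms are hypothetical.

```python
RDF_TYPE = "rdf:type"
BUILTIN_PREFIXES = ("rdf:", "rdfs:", "owl:")


def is_empty_node(node: str) -> bool:
    """Assumption: empty (anonymous) nodes are written as '_:label'."""
    return node.startswith("_:")


def classify_nodes(triples):
    """Split typed subjects into classes, properties and individuals."""
    classes, properties, individuals = set(), set(), set()
    for subject, predicate, obj in triples:
        if predicate != RDF_TYPE:
            continue  # only rdf:type statements say what a node is an instance of
        if obj == "rdfs:Class":
            if not is_empty_node(subject):
                classes.add(subject)
        elif obj == "rdf:Property":
            if not is_empty_node(subject):
                properties.add(subject)
        elif not obj.startswith(BUILTIN_PREFIXES):
            # Instance of a class outside the built-in vocabularies.
            individuals.add(subject)
    return classes, properties, individuals


if __name__ == "__main__":
    triples = [
        ("ex:Person", "rdf:type", "rdfs:Class"),
        ("ex:knows", "rdf:type", "rdf:Property"),
        ("ex:alice", "rdf:type", "ex:Person"),
        ("_:b0", "rdf:type", "rdfs:Class"),     # empty node, not counted as a class
        ("ex:alice", "ex:knows", "ex:bob"),     # not a typing statement
    ]
    print(classify_nodes(triples))
```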
Ontology Rank Algorithm:
It ranks all the Ontologies that are returned by SWOOGLE while finding SWDs. The rank indicates the extent to which a particular ontology can be used.
Let gag be an SWD, and let C(gag), P(gag) and I(gag) be the sets of Classes, Properties and Individuals defined in it. The Ontology Ratio of the SWD is then calculated as:
R(gag) = (|C(gag)| + |P(gag)|) / (|C(gag)| + |P(gag)| + |I(gag)|)
If R(gag) = 0, the SWD is a pure SWDB; if R(gag) = 1, it is a pure SWO. Values in between indicate a document that both defines terms and contains individuals.
(c) Ontology Annotations: These are the properties that describe an SWD as an ontology. The properties are label, comment and version info.
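The Ontology Ratio defined above can be computed with a few lines of Python. The sketch below assumes the numbers of classes, properties and individuals have already been counted; the function name `ontology_ratio` is hypothetical, and returning `None` for a document with no typed nodes is a choice made here, since the text does not cover that case.

```python
from typing import Optional


def ontology_ratio(num_classes: int, num_properties: int,
                   num_individuals: int) -> Optional[float]:
    """Share of defined terms (classes and properties) among all typed nodes."""
    defined_terms = num_classes + num_properties
    total = defined_terms + num_individuals
    if total == 0:
        return None  # assumption: the ratio is undefined for an empty document
    return defined_terms / total


if __name__ == "__main__":
    print(ontology_ratio(0, 0, 5))   # only individuals: ratio 0, a pure SWDB
    print(ontology_ratio(3, 1, 4))   # half of the typed nodes are defined terms
```

For example, a document with 3 classes, 1 property and 4 individuals has a ratio of 4 / 8 = 0.5.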
4.4. Illustration of SWOOGLE
This section describes the layout of SWOOGLE version 3.1, released in 2007. It allows users to enter an arbitrary string in order to find the SWDs relevant to that string. SWOOGLE analyzes the whole document and returns only the relevant parts in ranked order, such as URLs, terms, descriptions and namespaces of the documents.
Fig 4: SWOOGLE Start-Up Page
We searched for the string "Economic Crisis". SWOOGLE returns the SWDs that match these keywords in ranked order, with separate documents for the keyword "economic" and for the keyword "crisis". The result is shown below:
Fig 5: SWOOGLE query result
From the above screenshot, we can see that the first SWD is encoded in N3 and has an Ontology Ratio of 0.61. The second document is encoded in RDF/XML and has an Ontology Ratio of 0.97. The namespaces related to the second SWD are shown below:

Fig 6: Namespaces about given SWD

The current version of SWOOGLE returns the following statistical information on the number of SWDs retrieved, the number of triples generated and other parameters. These figures suggest that SWOOGLE can handle a huge collection of documents.

Fig 7: SWOOGLE statistical information

5. Conclusions
This paper has described a way of extracting Semantic Web Documents (SWDs) using a crawler-based prototype indexing and retrieval system named SWOOGLE. It generates metadata for the given SWDs and lists the Ontologies related to the given keywords. It is better than other prototype systems such as OWLIR and SWANGLER, which require a Custom Indexing Module to be built and use their own ontology standards, which are not suitable for SWDs. OWLIR and SWANGLER treat markup as structured information and produce results from it. SWOOGLE stores metadata about RDF documents in its database so that it can retrieve SWDs based on Classes (C), Properties (P) and Individuals (I). SWOOGLE is designed to work with all SWDBs and is better suited to this task than current web search engines such as Google, because Google works with natural languages only.
Acknowledgement
I, Vishal Jain, give my sincere thanks to Prof. M. N. Hoda, Director, BVICAM, New Delhi, for giving me the opportunity to do my Ph.D. at Lingayas University, Faridabad.
References
[1]. T. Berners-Lee, The Semantic Web, Scientific American, May 2007.
[2]. Berners-Lee, J. Lassila, Ontologies in Semantic Web, Scientific American, May (2001) 34-43.
[3]. Tim Finin, Anupam Joshi, Vishal Doshi, Swoogle: A Semantic Web Search and Metadata Engine, In proceedings of the 13th international conference on Information and knowledge management, pages 461-468, 2004.
[4]. Gagandeep Singh, Vishal Jain, Information Retrieval (IR) through Semantic Web (SW): An Overview, In proceedings of CONFLUENCE 2012 - The Next Generation Information Technology Summit at Amity School of Engineering and Technology, September 2012, pp 23-27.
[5]. M. Preethi, Dr. J. Akilandeswari, Combining Retrieval with Ontology Browsing, International Journal of Internet Computing, Vol. 1, Issue 1, 2011.
[6]. T. Finin, J. Mayfield, A. Joshi, Information retrieval and the semantic web, IEEE/WIC International Conference on Web Intelligence, October 2003.
[7]. U. Shah, T. Finin and A. Joshi, Information Retrieval on the semantic web, Scientific American, pages 34-43, 2003.
[8]. Stojanovic, N. Studer, R. Stojanovic, An approach for ranking of query results in the Semantic Web, The Semantic Web ISWC, 2003, pp 500-516.
[9]. Swati Ringe, Nevin Francis, Palanawala, Ontology Based Web Crawler, International Journal of Computer Applications in Engineering Sciences, ISSN 2231-4946, Vol. II, Issue III, September 2012.
[10]. Goetz Graze, Query Evaluation techniques for large databases, In Proceedings of ACM COMPUTING SURVEYS, 2003.