
World Applied Programming, Vol (2), No (1), January 2012. 34-37
ISSN: 2222-2510
2011 WAP journal. www.waprogramming.com

Review of Slug Semantic Web


Priyanka Saxena
Department of Computer Science Engineering
Shobhit University (pursuing M.Tech)
Meerut (U.P.), India
priyanka.saxena111@gmail.com

Masoud Nosrati
Kermanshah University of Medical Science
Kermanshah, Iran
minibigs_m@yahoo.co.uk

Abstract: This study introduces Slug, a web crawler (or scutter) designed for harvesting semantic web content. Slug provides a configurable, modular framework that allows an acceptable level of flexibility in configuring the retrieval, processing, and storage of harvested content. In this framework, an RDF vocabulary is used to describe crawler configurations. Metadata captured about crawling activity supports reporting and analysis of crawl progress, as well as more efficient retrieval through caching of stored data.
Keywords: Semantic web, Slug semantic web, Scutter, RDF.
I. INTRODUCTION

A semantic web crawler differs from a traditional web crawler in the format of the source material it traverses and in how links between information resources are specified. A traditional crawler operates on HTML documents, whereas a semantic web crawler operates on RDF metadata, in which links are expressed using the rdfs:seeAlso relationship [1].
A conventional crawler must extract text from fetched content and then extract links from it, possibly working from invalid markup. A semantic web crawler, by contrast, must carry out additional processing tasks, such as merging information about resources via inverse functional properties, tracking the provenance of data, harvesting schemas and ontologies in addition to source data, and extracting embedded metadata. The literature on crawling the semantic web provides an excellent introduction to these issues and summarizes implementation experience gained while constructing a scutter [2][3].
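To make the inverse-functional-property merging step concrete, the following is a minimal sketch, assuming Apache Jena on the classpath; it is not Slug's own code, and the class and method names are hypothetical. Resources that share a value for an inverse functional property (for example foaf:mbox) are grouped, and the statements of each duplicate are copied onto a canonical resource.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.RDFNode;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.rdf.model.Statement;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IfpSmusher {

    /**
     * Merge resources that share a value for the given inverse functional
     * property (e.g. foaf:mbox): the statements of each duplicate are copied
     * onto the first resource seen with that value.
     */
    public static void smush(Model model, Property inverseFunctionalProperty) {
        Map<RDFNode, Resource> canonicalByValue = new HashMap<>();

        // Materialize the statements first so the model can be modified safely afterwards.
        List<Statement> stmts =
                model.listStatements(null, inverseFunctionalProperty, (RDFNode) null).toList();

        for (Statement stmt : stmts) {
            Resource subject = stmt.getSubject();
            Resource canonical = canonicalByValue.putIfAbsent(stmt.getObject(), subject);

            if (canonical != null && !canonical.equals(subject)) {
                // Copy every statement of the duplicate onto the canonical resource.
                for (Statement dup : subject.listProperties().toList()) {
                    canonical.addProperty(dup.getPredicate(), dup.getObject());
                }
            }
        }
    }
}
```

A full implementation would also record the equivalence (for example with owl:sameAs) or remove the duplicate's statements; the sketch shows only the grouping and copying step.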
Semantic web research focuses on areas such as storage and query performance and federated queries, and it benefits from the availability of a wider variety of data sources. To address these concerns and encourage additional research, this study examines the possibilities of Slug, an open-source, configurable, and modular scutter. The framework emphasizes modularity and ease of use rather than speed, although its performance is currently acceptable [4].
The rest of this paper describes the architecture and features of the framework.
II. SUMMARY OF FEATURES

The current release of the Slug crawling framework includes [5]:

- Multi-threaded retrieval of data via HTTP.
- Support for optimized retrieval of previously fetched content.
- Statistics on crawler state, to facilitate monitoring of the crawler.
- Crawling from a fixed list of starting points and/or refreshing from previously crawled data sources.
- A persistent memory for capturing and persisting crawl-related metadata.
- Configurable crawling, including crawl depth, blacklisting of URLs, and avoidance of crawling loops.
- Automatic processing and traversal of RDF documents.
- Creation of a local cache for retrieved data.
- Storage of retrieved RDF data in persistent models.
- Configurable processing and filtering pipelines, enabling the scutter to be customized for specific tasks.
- Creation of scutter profiles using a custom RDF vocabulary.

III. CRAWLER ARCHITECTURE


Figure 1 briefly presents the key components and relationships in the Slug framework. The design draws on variations of the Master-Slave and Producer-Consumer design patterns.

Figure 1: Slug architecture in brief view. (The diagram shows the Controller, in the master role, managing Tasks and delegating to Client instances in the slave role; each Client carries out tasks and produces Responses, which are consumed by one or more Consumers.)

The controller is responsible for managing a list of tasks that are carried out by a number of client instances. A factory class is used by the controller to create worker instances during application start-up and on demand during processing. Each client runs as a separate thread within the application, coordinated by the single controller instance.
This core controller-client framework resembles the Master-Slave design pattern, in which a central component delegates tasks to a number of slave instances. The main difference is that the controller does not itself aggregate the results of completing each task, nor is this done by the clients. Instead, the processing of task results involves multiple processing stages, following the Producer-Consumer pattern: each client produces results that are handled by consumer instances. Alternate processing behavior is therefore possible without changing the core functionality that deals with multi-threaded retrieval of content. Finally, this delegation model allows multiple consumer instances to be involved in processing a single response generated by each client.
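The following plain-Java sketch illustrates this combination of the Master-Slave and Producer-Consumer patterns. The class and interface names (Controller, Client, Consumer, Task, Response) mirror the terms used above, but the code is hypothetical and greatly simplified compared with Slug's real implementation.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

interface Task { String url(); }
interface Response { Task task(); byte[] body(); }

/** Consumers process a Response; several may be chained for a single Response. */
interface Consumer { void consume(Response response); }

/** Each Client is a worker thread: it takes Tasks and produces Responses. */
class Client implements Runnable {
    private final BlockingQueue<Task> tasks;
    private final List<Consumer> consumers;

    Client(BlockingQueue<Task> tasks, List<Consumer> consumers) {
        this.tasks = tasks;
        this.consumers = consumers;
    }

    @Override
    public void run() {
        try {
            while (true) {
                Task task = tasks.take();          // blocks until work is available
                Response response = fetch(task);   // e.g. an HTTP GET (omitted here)
                for (Consumer c : consumers) {     // delegate all result handling
                    c.consume(response);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();    // allow clean shutdown
        }
    }

    private Response fetch(Task task) {
        return new Response() {
            public Task task() { return task; }
            public byte[] body() { return new byte[0]; } // placeholder only
        };
    }
}

/** The Controller plays the master role: it owns the task queue and the worker pool. */
class Controller {
    private final BlockingQueue<Task> queue = new LinkedBlockingQueue<>();
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    void start(List<Consumer> consumers) {
        for (int i = 0; i < 4; i++) {
            workers.submit(new Client(queue, consumers));
        }
    }

    void submit(Task task) { queue.add(task); }
}
```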
A typical consumer configuration might consist of:

- A response storer that stores retrieved RDF data in a local file system cache, treating the response as a series of bytes.
- An RDF consumer that parses the retrieved data, extracting rdfs:seeAlso links to generate new tasks and adding them to the controller's job queue (sketched below).
- A persistent response store that parses the retrieved data and stores the result in a persistent model.
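As a hedged illustration of the second item, the consumer below builds on the hypothetical Controller, Consumer, Task, and Response types from the architecture sketch above: it parses the fetched bytes as RDF with Apache Jena, extracts rdfs:seeAlso links, and submits each as a new task.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.RDFNode;
import org.apache.jena.rdf.model.StmtIterator;
import org.apache.jena.vocabulary.RDFS;

import java.io.ByteArrayInputStream;

/** Hypothetical RDF consumer: parses a response and queues its rdfs:seeAlso targets. */
class RDFLinkConsumer implements Consumer {
    private final Controller controller;

    RDFLinkConsumer(Controller controller) {
        this.controller = controller;
    }

    @Override
    public void consume(Response response) {
        Model model = ModelFactory.createDefaultModel();
        // Base URI is the fetched URL; RDF/XML is assumed, the common case for scuttered data.
        model.read(new ByteArrayInputStream(response.body()), response.task().url());

        StmtIterator it = model.listStatements(null, RDFS.seeAlso, (RDFNode) null);
        while (it.hasNext()) {
            RDFNode target = it.next().getObject();
            if (target.isURIResource()) {
                String url = target.asResource().getURI();
                controller.submit(() -> url);   // Task is a single-method interface here
            }
        }
    }
}
```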


Another key component is the task filter. Whenever a new task is encountered, the controller passes it through the task filters, which ensure that it meets arbitrary criteria. As with the consumer interface, the framework allows multiple task filters to co-operate in deciding whether to admit a new task to the queue, so complex filtering criteria can be assembled from fine-grained filter implementations.
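A rough sketch of such co-operating filters is shown below; the interface and class names are hypothetical rather than Slug's actual API. Fine-grained filters (a depth limit and a URL blacklist) each vote on a task, and a composite filter admits the task only when every filter accepts it.

```java
import java.util.List;
import java.util.Set;

/** A filter votes on whether a task (identified here by its URL) may be queued. */
interface TaskFilter {
    boolean accept(String url, int depth);
}

/** Rejects tasks that are deeper than the configured crawl depth. */
class DepthFilter implements TaskFilter {
    private final int maxDepth;
    DepthFilter(int maxDepth) { this.maxDepth = maxDepth; }
    public boolean accept(String url, int depth) { return depth <= maxDepth; }
}

/** Rejects tasks whose URL starts with a blacklisted prefix. */
class BlacklistFilter implements TaskFilter {
    private final Set<String> blockedPrefixes;
    BlacklistFilter(Set<String> blockedPrefixes) { this.blockedPrefixes = blockedPrefixes; }
    public boolean accept(String url, int depth) {
        return blockedPrefixes.stream().noneMatch(url::startsWith);
    }
}

/** Combines fine-grained filters: a task is admitted only if all filters accept it. */
class CompositeFilter implements TaskFilter {
    private final List<TaskFilter> filters;
    CompositeFilter(List<TaskFilter> filters) { this.filters = filters; }
    public boolean accept(String url, int depth) {
        return filters.stream().allMatch(f -> f.accept(url, depth));
    }
}
```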
The final aspect of the Slug framework is the memory, which is maintained by the controller. The memory is exposed to and accessible by all framework components, and it provides a persistent RDF model for storing crawl-related metadata. All components in the Slug framework are configured using a custom RDF vocabulary, which allows the framework's capabilities to be extended by integrating custom components [6].
IV. CRAWLER CONFIGURATION

Slug is configured via a proprietary RDF vocabulary in the namespace http://purl.org/NET/schemas/slug/config/. The classes and properties in the configuration vocabulary echo the main application architecture as follows:

- A Scutter resource has properties configuring its initial number of workers, its memory, and the required consumer and task filter components.
- A Memory resource is configured with properties indicating a file name.
- Consumer and Filter components are configured using an impl property that indicates the fully-qualified Java class name of the implementation to instantiate at run-time (see the sketch after this list).
- Component instances are responsible for configuring themselves further from the available metadata: each instance is passed a reference to its peer RDF resource in the configuration file immediately after instantiation.
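As an illustration of how an impl property might be consumed at run-time, the following sketch reads the property with Apache Jena and instantiates the named class by reflection, then hands the component its own configuration resource. The Configurable interface, the property local name, and the file name are assumptions made for the example, not Slug's actual API.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.rdf.model.Statement;

/** Hypothetical contract: components configure themselves from their RDF resource. */
interface Configurable {
    void configure(Resource selfInConfig);
}

public class ComponentLoader {

    // Namespace taken from the paper; the exact property local name is assumed here.
    private static final String CONFIG_NS = "http://purl.org/NET/schemas/slug/config/";

    /** Instantiate the class named by the component's impl property. */
    public static Configurable load(Resource component) throws Exception {
        Property impl = component.getModel().createProperty(CONFIG_NS, "impl");
        Statement stmt = component.getRequiredProperty(impl);
        String className = stmt.getString();   // fully-qualified Java class name

        Configurable instance =
                (Configurable) Class.forName(className).getDeclaredConstructor().newInstance();
        instance.configure(component);         // component reads any extra metadata itself
        return instance;
    }

    public static void main(String[] args) throws Exception {
        Model config = ModelFactory.createDefaultModel();
        config.read("file:scutter-config.rdf"); // hypothetical configuration file name
        // A real loader would walk the Scutter resource's consumer and filter properties here.
    }
}
```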

An RDF vocabulary was chosen as the means of configuring the framework, rather than a custom XML vocabulary, for several reasons.
First, the ability to add arbitrary additional properties makes it easy to extend the configuration for specific component
instances. This flexibility can also be exploited to improve readability of the configuration by using Dublin Core
properties to associate a title and description with each component.
Second, it was desirable to allow different crawling profiles to be specified as alternate configurations of the framework, specialized for different crawling strategies. These configurations allow the basic framework to support a number of different crawling profiles simply by reconfiguring the components for specialized purposes [5].

V. THE SCUTTER VOCABULARY


The persistent memory created by the Slug framework is an initial implementation of the Scutter Vocabulary specification.
A Representation describes a source URL encountered by the crawler and has a source property that indicates the original web resource it describes. Additionally, it may have any number of origin properties identifying the documents that refer to the representation, i.e. the origin of the rdfs:seeAlso link(s) that enabled the scutter to discover and retrieve this URL. Over subsequent runs of the scutter, the origin properties associated with each representation therefore build up a map of the semantic web's link structure.
A representation may be retrieved many times, appearing in the results of multiple fetch properties, and each fetch can be annotated with properties describing its result, including the date of retrieval, the HTTP status code, and the Last-Modified and ETag headers. If a given fetch results in an error, the error is recorded along with a suitable message. The Scutter Vocabulary also allows a representation to be annotated with a skip property, indicating that it should be ignored in subsequent crawls; Slug generates these properties automatically whenever it encounters specific errors, e.g. an unknown host or the HTTP status code 404 Not Found [6].
Over time, the fetch metadata builds up a complete history of crawler activity, including error reports and reasons for blacklisting.
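The sketch below shows how such crawl metadata might be inspected with Apache Jena, listing every representation marked with a skip property. The namespace and property names used here are placeholders; the actual ScutterVocab terms should be taken from the vocabulary specification [6].

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.ResIterator;
import org.apache.jena.rdf.model.Resource;

public class MemoryReport {

    // Placeholder namespace: the real ScutterVocab URIs should be taken from [6].
    private static final String SCUTTER_NS = "http://purl.org/net/scutter/";

    /** Print every representation the scutter has marked to be skipped in future crawls. */
    public static void printSkipped(Model memory) {
        Property skip = memory.createProperty(SCUTTER_NS, "skip");
        Property source = memory.createProperty(SCUTTER_NS, "source");

        ResIterator it = memory.listResourcesWithProperty(skip);
        while (it.hasNext()) {
            Resource representation = it.next();
            Resource url = representation.getPropertyResourceValue(source);
            System.out.println("skipping: " + (url != null ? url.getURI() : representation));
        }
    }
}
```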
VI. FUTURE WORK

Future work is planned on the Slug framework, including:

- Additional consumer implementations for publishing retrieved data to remote data sources.
- Additional task filter implementations to allow white-listing of URLs.
- Improvements to the client implementation to support additional HTTP operations.

REFERENCES
[1] T. Berners-Lee, Y. Chen, L. Chilton, D. Connolly, et al. Tabulator: Exploring and analyzing linked data on the Semantic Web. In Proceedings of the ISWC Workshop on Semantic Web User Interaction, 2006.
[2] L. Ding and T. Finin. Characterizing the Semantic Web on the web. In Proceedings of the International Semantic Web Conference (ISWC), 2006.
[3] T. W. Finin, L. Ding, R. Pan, A. Joshi, et al. Swoogle: Searching for knowledge on the semantic web. In Proceedings of the National Conference on Artificial Intelligence (AAAI), 2005.
[4] D. Huynh, S. Mazzocchi, and D. Karger. Piggy Bank: Experience the Semantic Web inside your web browser. Journal of Web Semantics, 5(1):16-27, 2007.
[5] L. Dodds. Slug: A Semantic Web Crawler. February 2006. Available at: www.ldodds.com/projects/slug/slug-a-semantic-web-crawler.pdf
[6] M. Frederiksen. Scutter Vocab, 2003. Available at: http://rdfweb.org/topic/ScutterVocab
