
Ghent University

Architectural report

Semantic annotation service

Group 7. Authors: Boghaert Michiel, Goossens Sander, Hebben Stan, Heyse Tom, Taelman Stijn, Van Otten Neri, Vandermeiren Maarten, Vandewalle Bram

Contents

Part I: Introduction
Part II: State of the Art

1 Science
  1.1 (Semi-)automatically annotating a web page
    1.1.1 Survey of Semantic Annotations Platform [45]
    1.1.2 SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation [26]
    1.1.3 Automatic Annotation of Content-Rich HTML Documents: Structural and Semantic Analysis [40]
    1.1.4 Conclusion
  1.2 Extracting information using visual representation
    1.2.1 Recognition of Common Areas in a Web Page Using Visual Information: a possible application in a page classification [36]
    1.2.2 Extracting Content Structure for Web Pages based on Visual Representation [21]
    1.2.3 HTML page analysis based on visual cues [52]
    1.2.4 Conclusion
  1.3 Ubiquitous content delivery
    1.3.1 Adapting Content for Wireless Web Services [42]
    1.3.2 Adapting Web Content to Mobile User Agents [37]
    1.3.3 Detecting Web Page Structure for Adaptive Viewing on Small Form Factor Devices [22]
    1.3.4 Annotation-Based Web Content Transcoding [33]
    1.3.5 A Two Layer Approach for Ubiquitous Web Application Development [25]
    1.3.6 Customization for Ubiquitous Web Applications - A Comparison of Approaches [28]
    1.3.7 Conclusion
  1.4 Tolerable Waiting Time
    1.4.1 A Study on Tolerable Waiting Time: How Long Are Web Users Willing to Wait? [41]
    1.4.2 Akamai and JupiterResearch: 4 seconds [8]
    1.4.3 Conclusion
  1.5 Annotations
    1.5.1 Microformats: The Next (Small) Thing on the Semantic Web? [34]
    1.5.2 hGRDDL: Bridging Microformats and RDFa [20]
    1.5.3 Conclusion
  1.6 European Projects
    1.6.1 INSEMTIVES
    1.6.2 QuASAR
    1.6.3 Conclusion
  1.7 Conclusion

2 IPR - Patents
  2.1 Analyzing and annotating web content
    2.1.1 Method for scanning, analyzing and rating digital information content [50]
    2.1.2 Web page annotation systems [38]
    2.1.3 Method for annotating web content in real time [30]
  2.2 Invocation of the engine
    2.2.1 Web page annotating and processing [48]
  2.3 Ubiquitous content delivery
    2.3.1 Web content adaptation process and system [24]
    2.3.2 Web server for adapted web content [23]
    2.3.3 Web content transcoding system and method for small display device [47]
  2.4 Conclusion

3 Standard bodies
  3.1 RDFa (Resource Description Framework - in attributes)
  3.2 Microformats
  3.3 Microdata (HTML5)
  3.4 DAML (DARPA Agent Markup Language)
  3.5 Conclusion

4 Professional organizations
  4.1 Reuters OpenCalais
  4.2 Ontotext KIM Platform
  4.3 Ontoprise GmbH Semantic Contents Analytics
  4.4 iQser GIN Platform
  4.5 Annotea
  4.6 Conclusion

5 Market reports

6 Industry trends
  6.1 Google's Rich Snippets
  6.2 Del eTools - eLearning Annotation Web Service
  6.3 OSA Web Annotation Service
  6.4 Conclusion

7 Conclusion

Part III: Vision

8 Vision
  8.1 Mission Statement
  8.2 Customers and benefits
  8.3 Key factors to judge quality
  8.4 Key features and technology
  8.5 Crucial factors as applicable

Part IV: Scenarios

9 Use cases
  9.1 Use case diagram
    9.1.1 Actors
    9.1.2 Use cases
  9.2 Use case scenarios
    9.2.1 Perform semantic analysis
    9.2.2 Choose performance and accuracy
    9.2.3 Correct annotations
    9.2.4 Add annotated page for machine learning
    9.2.5 Management of rule-based methods
    9.2.6 Add functionality
    9.2.7 Install and run locally

10 Quality attribute scenarios
  10.1 Scalability
  10.2 Performance
  10.3 Modifiability
  10.4 Accuracy
  10.5 Availability
  10.6 Completeness
  10.7 Stability
  10.8 Extensibility

Part V: Architectural design

11 Global overview
  11.1 System Overview

12 Attribute Driven Design - Cycle I
  12.1 Inputs for the system
  12.2 Architectural drivers
  12.3 Architectural pattern
    12.3.1 Chosen pattern
    12.3.2 Instantiation
  12.4 White Box Scenarios
  12.5 Deployment

13 Attribute Driven Design - Cycle II
  13.1 Access Layer
    13.1.1 Inputs for the system
    13.1.2 Architectural drivers
  13.2 Application Layer
    13.2.1 Inputs for the subsystem
    13.2.2 Architectural drivers
    13.2.3 Architectural pattern
  13.3 Persistence Layer
    13.3.1 Inputs for the system
    13.3.2 Architectural drivers
    13.3.3 Architectural pattern
  13.4 White Box Scenarios
  13.5 Deployment

14 Attribute Driven Design - Cycle III
  14.1 Engine
    14.1.1 Inputs for the system
    14.1.2 Architectural drivers
  14.2 Machine Learning
    14.2.1 Inputs for the system
    14.2.2 Architectural drivers
  14.3 White boxes

15 Global system

Part VI: Conclusion

Part I

Introduction

The purpose of this project is to design an automatic semantic annotation engine: an engine that adds semantic annotations to a given web page in a fully automatic way. Annotations are metadata embedded in web pages; this metadata describes the data in the page, giving information about its structural elements and content. Search engines, for example, use annotations in web pages to better understand the content of the pages they index. Many applications can benefit from the project. For example, the engine can be part of a larger system (figure 1) that adapts a web application to the device used to visit it. First, the web pages are annotated by the engine to mark the different structural and content elements. Based on these annotations, those elements can then be reordered according to their function and importance, and adapted so that the web application is displayed optimally on the device. As said before, the focus of the project is the engine itself. It takes a web page as input and annotates it fully automatically. An important factor for the success of the project is the correctness, accuracy and completeness of the semantic annotations: adding the correct annotations in the correct places and not leaving too many annotations out. This document gives an overview of the different stages of the architectural design. We start with a State of the Art analysis, in which we investigate and explore existing technologies and patents, the market, and existing companies and organisations. Next, we capture the requirements by defining the use cases and the quality attributes that are important for the engine. Based on these quality attributes, we design the actual architecture in several iterations using the Attribute Driven Design (ADD) method.
We verify our decisions after each iteration with white box scenarios and deployment diagrams. We finish this document by looking back at what we have accomplished and what remains as future work.
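To make the notion of annotation concrete, the following sketch (our own illustration, not part of the engine's design) shows a fragment of HTML before and after a hypothetical annotation pass, here using HTML5 microdata with the schema.org vocabulary as one possible choice of format:

```python
import re

# Minimal illustration (ours, not the engine's design) of what "adding
# semantic annotations" means: the markup gains machine-readable metadata
# while the visible content stays identical.
original = '<div><span>Tim Berners-Lee</span><span>CERN</span></div>'

annotated = ('<div itemscope itemtype="http://schema.org/Person">'
             '<span itemprop="name">Tim Berners-Lee</span>'
             '<span itemprop="affiliation">CERN</span></div>')

def strip_tags(html):
    """Remove all tags, leaving only the visible text."""
    return re.sub(r'<[^>]+>', '', html)

# The annotation pass changed only metadata, not content:
assert strip_tags(original) == strip_tags(annotated)
print(strip_tags(annotated))
```

A consumer such as a search engine or adaptation proxy can now read off that the div describes a person named "Tim Berners-Lee", which the unannotated markup never stated explicitly.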

Figure 1: The engine as part of a larger system.

Part II

State of the Art

Chapter 1

Science
1.1 (Semi-)automatically annotating a web page

There are many different techniques for identifying the content and structural parts of a web page so that annotations can be added. The following articles each propose one or more such techniques.

1.1.1 Survey of Semantic Annotations Platform [45]

This paper offers the reader an overview of the semi-automatic annotation platforms that existed in 2005. Semi-automatic means that each of these platforms requires human interaction at some stage of the annotation process. Chapter 2 is the most interesting part for the project: it presents three commonly used annotation approaches. The first approach is pattern-based: the annotation engine looks for known or predefined patterns. The second approach is machine learning-based; platforms in this category use two methods, probability and induction. The final approach, called multi-strategy, tries to combine the advantages of both pattern-based and machine learning-based systems. Chapter 3 compares different platforms: Armadillo [27], AeroDAML [35], KIM [44], MnM [51], MUSE [39], Ont-O-Mat [31] and SemTag [26]. Chapter 4 offers a performance evaluation of each platform, and chapter 5 explains the methods used by each platform. Chapter 6 concludes and recapitulates the three categories of annotation approaches. The authors of this paper also wrote a chapter on semantic annotation platforms in the book Web Semantics and Ontology [49]; it contains approximately the same information as this paper. In short, this paper introduces several interesting methods and approaches to adding semantic annotations and compares seven semi-automatic annotation platforms. The ideas behind these platforms may serve as a launchpad for the project.
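As a toy illustration of the pattern-based approach the survey describes, the sketch below scans text for two predefined lexical patterns and labels the matches. The patterns and labels are our own illustrative choices, not taken from any of the surveyed platforms:

```python
import re

# Pattern-based annotation in miniature: scan for predefined patterns and
# attach a semantic label to each match, recording its character span.
PATTERNS = [
    (re.compile(r'\b\d{4}-\d{2}-\d{2}\b'), 'date'),
    (re.compile(r'\b[\w.+-]+@[\w-]+\.[\w.]+\b'), 'email'),
]

def annotate(text):
    """Return (start, end, label, matched_text) tuples, in document order."""
    spans = []
    for regex, label in PATTERNS:
        for m in regex.finditer(text):
            spans.append((m.start(), m.end(), label, m.group()))
    return sorted(spans)

print(annotate('Contact jan@ugent.be before 2010-05-01.'))
```

A real platform would of course use far richer patterns (and, in the multi-strategy approach, combine them with learned models), but the output shape — labelled spans over the source text — is the same.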

1.1.2 SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation [26]

This document introduces a system that processes HTML data and annotates it with semantic data in a fully automatic way, provided the necessary metadata is supplied manually beforehand. The system allows queries to be executed that retrieve data using the tags that have been assigned. All data is pre-calculated and all documents are processed in one large batch operation; this batch processing is necessary to perform proper analysis on the data. The analysis uses the words


before and after the current word as context, and can recognize correlations to detect which parts of the context are significant for extracting the proper meaning of the word. This system is quite different from ours, especially since all documents are processed at the same time and beforehand, and only content-based annotations are added. It therefore has different applications than the project, although some of its algorithms could be interesting to study if we develop an implementation of the system.
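The windowed-context idea can be sketched in a few lines. This is our own simplification of what a SemTag-like disambiguator would feed into its similarity comparison, not the paper's actual code:

```python
# Extract a window of k words before and after an occurrence; a real
# disambiguator would compare this context against reference contexts for
# each candidate meaning of the word.
def context(words, index, k=2):
    before = words[max(0, index - k):index]
    after = words[index + 1:index + 1 + k]
    return before + after

words = 'the jaguar sped down the highway'.split()
print(context(words, words.index('jaguar')))
```

Here the surrounding words ("sped", "down") are exactly the kind of evidence that would let a system decide that this "jaguar" is a car rather than an animal.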

1.1.3 Automatic Annotation of Content-Rich HTML Documents: Structural and Semantic Analysis [40]

This article addresses the problem of automatically annotating an HTML document, namely how to construct a semantic partition tree. The proposed annotation engine works from two observations about semantically related items in HTML documents. The first observation is that related items exhibit consistency in presentation style (font, hyperlinks, ...). The second is that related items exhibit spatial locality: related items are located close to each other in the document. This structural analysis can be exploited to form an idea of which items on a page are related and should remain together. Because this technique can be wrong, the article also proposes semantic analysis techniques as an addition. Lexical association is a linguistic processing tool for identifying small segments of related text (based on common words or synonyms); it can be implemented using WordNet [19]. Concept association uses domain knowledge encoded in an ontology to match partitions with certain concepts. The two association methods can be combined to obtain better results. The methods described are very useful for the project: they give an idea of how to write an algorithm that partitions pages into semantically related pieces. On the other hand, the article does not specify which technologies or which vocabulary are used for the annotations themselves.
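The first observation — presentation-style consistency combined with spatial locality — can be caricatured as grouping adjacent items that share a style. The sketch below uses tag names as a crude stand-in for presentation style; it is only meant to illustrate the partitioning idea, not the paper's algorithm:

```python
from itertools import groupby

# Partition a flat list of (tag, text) pairs into runs with identical tags:
# adjacency stands in for spatial locality, and equal tags stand in for
# consistent presentation style.
items = [('h1', 'News'), ('li', 'Item A'), ('li', 'Item B'),
         ('li', 'Item C'), ('p', 'Footer text')]

partitions = [list(run) for _, run in groupby(items, key=lambda it: it[0])]
print(partitions)
```

The three list items end up in one partition, mirroring the intuition that a run of identically styled, adjacent elements is likely one semantic unit; the paper's lexical and concept association would then refine such partitions.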

1.1.4 Conclusion

For the project it will most likely be important to combine a wide range of techniques and algorithms in order to add useful semantic annotations. Survey of Semantic Annotations Platform [45] describes basic ideas and approaches for adding semantic annotations. It also describes how annotations can be added semi-automatically, a very useful addition we should consider implementing, as it gives developers the opportunity to choose how their web pages are annotated. SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation [26] implements a system slightly different from ours but contains algorithms worth studying. The most useful article in this section is Automatic Annotation of Content-Rich HTML Documents: Structural and Semantic Analysis [40], as it provides a way of writing algorithms that partition pages into components that semantically fit together.

1.2 Extracting information using visual representation

An important technique for identifying the different sections of a web page is to first render the page; this visual representation can then be used to extract information from the page. We have found several articles presenting different techniques for extracting information from a visual representation.


Figure 1.1: The vision-based page segmentation algorithm

1.2.1 Recognition of Common Areas in a Web Page Using Visual Information: a possible application in a page classification [36]

This article focuses on recognizing visually important areas for a better classification of web pages. It describes a representation of a web page in which objects are placed in a well-defined tree hierarchy according to where they belong in the HTML structure of the page. Each object carries information about its position in a browser window. This visual information enables the definition of heuristics for recognizing common areas such as the header, footer, left and right menus, and the center of the page. The crucial difficulty the authors discovered was developing a sufficiently good rendering algorithm, i.e. one imitating the behavior of popular user agents such as Internet Explorer. The article describes in detail how a hierarchical tree can be constructed from an HTML structure. This tree structure could then be used to find key features such as headers and footers, which can in turn be semantically annotated in the HTML code.
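Positional heuristics of the kind the article describes could look roughly like the following sketch. The thresholds, viewport size and labels are assumptions of ours, not the rules from the article:

```python
# Classify a rendered block by its bounding box (x, y, width, height) on an
# assumed 1000x1000 viewport, using simple illustrative positional rules:
# top strip -> header, bottom strip -> footer, narrow left column -> menu.
def classify(box, page_w=1000, page_h=1000):
    x, y, w, h = box
    if y < page_h * 0.1:
        return 'header'
    if y + h > page_h * 0.9:
        return 'footer'
    if x < page_w * 0.2 and w < page_w * 0.3:
        return 'left menu'
    return 'center'

boxes = [(0, 0, 1000, 80),    # full-width strip at the very top
         (0, 100, 150, 700),  # narrow column on the left
         (200, 100, 800, 800),
         (0, 950, 1000, 50)]  # full-width strip at the very bottom
print([classify(b) for b in boxes])
```

The point is that once a page is rendered, common areas fall out of very simple geometry; the hard part, as the article notes, is getting the rendering itself right.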

1.2.2 Extracting Content Structure for Web Pages based on Visual Representation [21]

This article presents a vision-based page segmentation algorithm, illustrated in figure 1.1. The algorithm identifies the logical relationships within web content: based on visual layout information, the content structure can effectively represent the semantic structure of the web page. The presented algorithm is automatic, top-down, tag-tree independent and scalable, and simulates how a user understands the layout structure of a web page from its visual representation. Compared with traditional DOM-based segmentation methods, the scheme uses visual cues to obtain a better partition of the page at the semantic level. It is also independent of the physical realization and works well even when the physical structure differs greatly from the visual presentation. Test results show that human judges rated 86 of the 140 rendered test pages perfect and 50 satisfactory, with only 4 failures; the algorithm can therefore be seen as having a 97% success rate. This paper provides a benchmark of accuracy for a good rendering algorithm, which could, along with other techniques, be implemented in the engine.
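The top-down flavour of such an algorithm can be hinted at with a recursive split on the widest vertical gap between rendered blocks. The coordinates and threshold below are illustrative only; the real algorithm uses much richer visual cues than gap width:

```python
# Recursively split a region at its widest vertical gap until the remaining
# blocks are visually coherent (no gap wider than min_gap).
def segment(blocks, min_gap=50):
    # blocks: list of (top, bottom) extents, sorted by top coordinate
    gaps = [(blocks[i + 1][0] - blocks[i][1], i) for i in range(len(blocks) - 1)]
    if not gaps:
        return [blocks]
    widest, i = max(gaps)
    if widest < min_gap:            # visually coherent: stop splitting
        return [blocks]
    return segment(blocks[:i + 1], min_gap) + segment(blocks[i + 1:], min_gap)

print(segment([(0, 90), (100, 190), (300, 400), (410, 500)]))
```

The four blocks split into two visual groups at the 110-pixel gap, which is the top-down intuition: start from the whole page and keep dividing at the strongest visual separator.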


1.2.3 HTML page analysis based on visual cues [52]

This paper presents a method to extract structure from HTML pages automatically, without any a priori knowledge. Its analysis is based on visual similarity between different web pages and their organization. The method requires a few definitions concerning the comparison of both simple objects and container objects; using these definitions, it tries to detect visual similarity patterns in the HTML document. A large part of the paper describes technical details which we will not repeat here. The test results were quite impressive: 92% of the documents were correctly analysed and 4% missed some apparent structural elements, while parsing failed for the remaining 4% of the test set. The test set was, however, rather small: only 50 web pages. The paper offers a very interesting and promising method of detecting structural elements in a web page, but a few remarks apply with respect to the project. First of all, it is a rather old paper (dating from 2001), so the web pages in its test set differ greatly from current web pages: they were less dynamic, contained little or no CSS, no Flash or other plugins, and had a generally simpler structure. This may make it harder, or even impossible, for us to find such patterns. Secondly, the method only detects structural elements, whereas we may want to add more semantic annotations.

1.2.4 Conclusion

The visual representation of a web page contains many cues. HTML page analysis based on visual cues [52] is an older paper that tries to recognize visually interesting areas using a few predefined examples. Although this method fails for only 4% of the test pages, it is not extensible: as the World Wide Web evolves, it would continuously need updating. It could, however, be used in combination with other, more extensible techniques. The two other articles in this section construct DOM tree structures from the HTML code in order to detect visually interesting areas. Extracting Content Structure for Web Pages based on Visual Representation [21] goes a bit further by implementing a vision-based segmentation algorithm that combines the visual information and the DOM tree to obtain a different hierarchical structure. This method has a 97% success rate, which is a good target for the engine to strive for, though probably not in the initial implementation.

1.3 Ubiquitous content delivery

An important application of semantic annotations is ubiquitous content delivery: delivering content to all kinds of devices. To do so, it is necessary to adapt the content to the device used. We found several articles presenting different ways to adapt content to different devices, one of which uses annotated web pages.

1.3.1 Adapting Content for Wireless Web Services [42]

This article mainly deals with rendering web pages for specific devices by taking specific properties of the device into account. It mentions that rendering based on added annotations produces better results than using only basic HTML tags. Such a dual-purpose page is hard to handle because both the HTML tags and the annotation tags have to be taken into account. Assuming the developer adds these annotations manually, dual-purpose pages can be difficult to maintain. The engine, on the other hand, runs in real time, which allows the annotations to be added when the page is requested: the developer can create his web pages without having to worry about the annotations.


1.3.2 Adapting Web Content to Mobile User Agents [37]

This article (2005) focuses on the concepts of content adaptation, specifically for mobile applications. The emphasis is on reliability and interactive functionality rather than on layout. The adaptation depends on the device, the network, the user preferences, etc., and can be static or dynamic. There are three possibilities (which can be combined): server-side, intermediate and client-side adaptation. Server-side adaptation gives more control to web developers, yields better performance, and uses XSLT (less bandwidth for e.g. multimedia applications). With intermediate adaptation the processing occurs in a proxy; the developer can optionally add metadata or hints to the content, or common adaptation heuristics (e.g. on the DOM tree) can be used. The advantage of client-side adaptation is that no extra client properties need to be sent; the disadvantage is reduced performance (processing time, memory usage, network connection, ...). The authors of the article built a standalone content adaptation proxy application with the following functionality: adaptation of web documents and media files, user session-based caching, state management, navigation generation (a list with links to delivery units (DUs)), error messages, user management and configuration. Their extensible application is not tuned for a specific type of web page and makes efficient use of the services of the device. It is scalable and can handle multiple users in real time. The content is decomposed into DUs with added metadata such as priority or labels. The decomposition happens in two phases: in the adaptation module, the content is divided into perceivable units through semantic interpretation; in the post-processing phase there are compatibility checks for the device, and the data can be divided into extra DUs if necessary.
The adaptation module only accepts XHTML as input (conversion to XHTML can be done with HTML Tidy), and the XHTML is converted to XHTML MP (Mobile Profile) or WML (Wireless Markup Language).

1.3.3 Detecting Web Page Structure for Adaptive Viewing on Small Form Factor Devices [22]

This article describes how to improve the presentation of web pages on a mobile device, including the analysis of existing pages in order to adapt them. This part is very important for the annotation algorithm. The specified algorithm is iterative: each iteration divides blocks into smaller blocks that semantically belong together. The algorithm has three steps. First comes high-level content block detection, which determines which content falls into which high-level block (examples of blocks: header, footer, left/right side bar), based on the position and shape of the region it occupies. Next follows explicit separator detection: the detection of regions separated by, for example, hr, table, td and div elements, or horizontal and vertical lines. Finally, implicit separator detection tries to separate regions divided by white space. This article is the only one that effectively uses high-level content block detection (to detect the header, footer, sidebars, etc.), and the parts on separator detection are very useful for the project's semantic annotation algorithm. On the other hand, there is no information at all on which annotations are added, or in which format.
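Explicit separator detection, the second step, can be caricatured as splitting a page at separator tags. A real implementation would walk the DOM rather than use a regex, and the toy page below is our own:

```python
import re

# Split a page into candidate blocks at explicit separators; here only <hr>
# is treated as a separator, whereas the article also considers table, td,
# div boundaries and drawn lines.
page = '<div>Header</div><hr><div>Main article text</div><hr><div>Footer</div>'
blocks = [b for b in re.split(r'<hr\s*/?>', page) if b]
print(blocks)
```

Each resulting block would then be fed to the next iteration, where implicit (whitespace-based) separators subdivide it further.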


1.3.4 Annotation-Based Web Content Transcoding [33]

This article proposes a system to transcode web content for the different devices accessing it. The most relevant part for the project is the first, where the article proposes a framework of external annotations: existing web pages are associated with content adaptation hints kept in separate annotation files. The authors chose external files because they claim it would be impractical to incorporate meta-information into existing HTML documents. They use RDF (Resource Description Framework) as the syntax of the annotation files, and XPath and XPointer to associate portions of an HTML page with annotating descriptions. An annotation file can refer to more than one HTML document, and vice versa. Finally, they describe a vocabulary for the annotation files, with three kinds of annotations. The first kind specifies a list of alternative representations for an annotated element (e.g. a grayscale image versus the original image). The second, splitting hints, provides hints for determining appropriate page break points, so that a complex HTML page can be divided into multiple pages on clients with smaller displays. The last kind, selection criteria, contains information that helps the transcoding system select, from several alternative representations, the one that best suits the client device: for example the importance of an element, or the resource requirements of an alternative representation. This approach differs somewhat from the partition algorithms in the other articles, but the article contains some solid ideas; a combination of several approaches could be considered to achieve better semantic annotation results.
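An external annotation entry in the spirit of this framework might look as follows. The element and attribute names are invented for illustration and do not reproduce the paper's actual RDF vocabulary; only the idea of pointing into the HTML via XPath is taken from the paper:

```python
import xml.etree.ElementTree as ET

# Build one external annotation entry: the target attribute holds an XPath
# into the (separate) HTML document, and the child element carries a
# splitting hint for the transcoder.
ann = ET.Element('annotation', target='/html/body/div[2]')
hint = ET.SubElement(ann, 'splittingHint')
hint.text = 'page-break-allowed'
print(ET.tostring(ann, encoding='unicode'))
```

Because the annotation lives outside the page, the original HTML stays untouched, which is exactly the trade-off the authors argue for, at the cost of keeping the XPath references in sync when the page changes.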

1.3.5 A Two Layer Approach for Ubiquitous Web Application Development [25]

This article describes a two-step approach to adapting web applications and customizing them to the specifications of the device used, via a web-content adaptation engine. It states that semantic analysis of the structure of a web page is fundamental to adapting the presentation: one must be able to distinguish the main data from the menus, sidebars and other side information. In order to adapt the presentation layer of a web application, the authors consider using microformats, HTML5, RDF or XML to annotate the application, which gives the content adaptation engine the information it needs to adapt the HTML files into a version optimized for the device used. The article also considers using ECMA scripts, a good alternative except that, being client-side, it is not supported on all devices. Further implementation specifics are not given. This article offers a good source of information about the use of HTML5, RDF(a) or XML for this project's purposes.

1.3.6 Customization for Ubiquitous Web Applications - A Comparison of Approaches [28]

This article proposes 10 techniques for customization for ubiquitous web applications. The goal is to minimize the data transfer but to maximize the semantic content left. One of these approaches is the Global Document Annotation (GDA) project. This section can be useful for the project because it handles annotations for video, image, and text compression, hereby simplifying the web application for devices with lesser capabilities, but it does not handle visual reorganization of the web pages. This project proposes the use of semantic annotation in the form of a separate XML le. This section of the article can be useful for the project. There are three kinds of annotations distinguished. Linguistic annotation aims at making text machine readable. Commentary annotation

14

CHAPTER 1. SCIENCE

serves for annotating non-textual content (images, sounds, ...). Multimedia annotation handles videos. Based on this context information, the generic adaptations that are considered are text transcoding, image transcoding, voice transcoding, video transcoding. This all happens dynamically. The information about methods used in this article will certainly be useful for the project. One downside is that this technique does not include visual separation and reorganization, but this can be obtained from several other articles in this section.
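The GDA-style classification maps naturally onto a per-kind transcoding dispatch. The following sketch is purely illustrative: the transcoder functions are invented stand-ins, and the real adaptations (text simplification, image downscaling, video transcoding) are far more involved.

```python
# Stand-in transcoders, one per annotation kind from the GDA scheme.
def transcode_text(item):
    return f"text:{item}"      # e.g. summarisation or simplification

def transcode_media(item):
    return f"media:{item}"     # e.g. image or sound downscaling

def transcode_video(item):
    return f"video:{item}"     # e.g. keyframe extraction

TRANSCODERS = {
    "linguistic": transcode_text,    # machine-readable text
    "commentary": transcode_media,   # non-textual content
    "multimedia": transcode_video,   # video
}

def adapt(annotated_items):
    """Apply the transcoder matching each item's annotation kind."""
    return [TRANSCODERS[kind](item) for kind, item in annotated_items]
```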

1.3.7 Conclusion

In order to have ubiquitous content delivery, the content needs to be adapted for different devices. From Adapting Content for Wireless Web Services [42] we learn that adding annotations to an HTML page produces better rendering results for ubiquitous devices. A disadvantage, however, is that pages using this dual-tag (HTML and annotations) system are a lot harder to handle. Customization for Ubiquitous Web Applications - A Comparison of Approaches [28] describes ten techniques that could be used for customization for ubiquitous devices. These techniques could easily be included in the project. Important ubiquitous devices are mobile devices, due to their widespread popularity. The article Adapting Web Content to Mobile User Agents [37] describes different ways in which adaptation can take place and discusses the advantages and disadvantages of each. Detecting Web Page Structure for Adaptive Viewing on Small Form Factor Devices [22] focuses on the detection of page structure and how pages can be adapted to small devices; this is done using high-level content block and separator detection. In Annotation-Based Web Content Transcoding [33] a different algorithm is described that could possibly be interesting to implement. Although A Two Layer Approach for Ubiquitous Web Application Development [25] is less specific, it does contain descriptions of rather useful technologies relevant to this project.

1.4 Tolerable Waiting Time

The project needs to perform its task within a certain time. These articles will help us determine the amount of time an average web user is willing to wait.

1.4.1 A Study on Tolerable Waiting Time: How Long Are Web Users Willing to Wait? [41]

This paper tries to determine the threshold of tolerable waiting time for web users when retrieving information from the World Wide Web. The results indicate that the waiting time is not affected by the availability or unavailability of pictures on the web pages, and that the inclusion of a feedback bar significantly prolonged the time users were willing to wait. The conclusion of this paper is that most users are willing to wait only 2 seconds for information retrieval from the World Wide Web. Since the paper of Selvidge [46] concludes that tolerance for web delay is not affected by the type of task the user is executing, the project should stay as close to the 2-second limit as possible.

1.4.2 Akamai and JupiterResearch: 4 seconds [8]

This report, commissioned by Akamai through JupiterResearch, examines the reaction of a consumer to a poor online shopping experience. Most of its conclusions are irrelevant to this project, but its findings indicate that 4 seconds is the maximum length of time an average shopper will wait for a web page to load before potentially abandoning a retail site. This report places the threshold for the project at 4 seconds.

1.4.3 Conclusion

It is hard to come up with an exact time period that a user is willing to wait. Both articles report a rather low value: 2 and 4 seconds respectively. If we want to annotate in real time, we have to bear these numbers in mind.
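As a rough illustration of keeping the engine honest about this budget, a wrapper could time each annotation run against the 2-second threshold. The function names and the stand-in annotator below are invented; the project's engine does not exist yet.

```python
import time

# 2-second threshold taken from [41]; [8] would allow up to 4 seconds.
TOLERABLE_WAIT_SECONDS = 2.0

def annotate_within_budget(annotate, page, budget=TOLERABLE_WAIT_SECONDS):
    """Run an annotation function and report whether it met the budget.

    `annotate` is a placeholder for the project's annotation engine.
    Returns (result, elapsed seconds, within-budget flag).
    """
    start = time.monotonic()
    result = annotate(page)
    elapsed = time.monotonic() - start
    return result, elapsed, elapsed <= budget
```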

1.5 Annotations

The last articles deal with annotation standards which might be interesting to include in the project.

1.5.1 Microformats: The Next (Small) Thing on the Semantic Web? [34]

This article focuses on the use of Microformats. More specifically, it examines detailed examples, the general principles by which they are constructed, and the growing community of users behind this alternative to the Semantic Web. The Microformats community is an open wiki, mailing list, and Internet Relay Chat (IRC) channel that has proven remarkably scalable and accommodating. Microformats could be used as a way of adding annotations to a web page, but due to the nature of the standard described in this article, this would be less advisable for the larger-scale project.
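To make the per-format parsing rules concrete, here is a minimal, illustrative hCard reader built on Python's standard html.parser. The handful of property names is an assumption for the example; a real Microformats consumer needs dedicated rules per format, which is exactly the scalability drawback discussed later in section 3.2.

```python
from html.parser import HTMLParser

# A few well-known hCard property names, hard-coded for illustration.
HCARD_PROPS = {"fn", "org", "locality"}

class HCardParser(HTMLParser):
    """Collects text inside elements whose class carries an hCard property."""

    def __init__(self):
        super().__init__()
        self.stack = []    # the microformat property open at each nesting level
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        props = [c for c in classes if c in HCARD_PROPS]
        self.stack.append(props[0] if props else None)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and self.stack[-1]:
            self.fields[self.stack[-1]] = data.strip()

markup = ('<div class="vcard"><span class="fn">Ada Lovelace</span>'
          '<span class="org">Analytical Engines</span></div>')
parser = HCardParser()
parser.feed(markup)
```

After feeding the markup, `parser.fields` holds the extracted contact properties.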

1.5.2 hGRDDL: Bridging Microformats and RDFa [20]

This article proposes hGRDDL, so named because it is meant to be a GRDDL-like transformation (GRDDL is a W3C effort that aims to extract RDF triples from any XML document) with a focus on processing Microformats. The mechanism transforms HTML-embedded structured data such as Microformats into RDFa. This technique has many advantages. Microformats are already widely used; converting them to RDFa allows this data to be preserved, while new deployments can focus on RDFa's greater extensibility and consistency. It would be unwise to neglect Microformats, as developers have used them to add extra information to existing web pages. This article describes how hGRDDL can be used to transform this data into the more useful RDFa standard.
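A toy version of such a Microformats-to-RDFa rewrite can be sketched as a class-to-property mapping. The `v:` prefix and the mapping table below are illustrative assumptions, not the actual hGRDDL transformation rules.

```python
import re

# Invented mapping from Microformat class names to RDFa properties.
CLASS_TO_PROPERTY = {"fn": "v:fn", "org": "v:org", "tel": "v:tel"}

def to_rdfa(html):
    """Add an RDFa property attribute next to known Microformat classes."""
    def rewrite(match):
        cls = match.group(1)
        prop = CLASS_TO_PROPERTY.get(cls)
        if prop is None:
            return match.group(0)   # leave unknown classes untouched
        return f'class="{cls}" property="{prop}"'
    return re.sub(r'class="([^"]+)"', rewrite, html)
```

Applied to `<span class="fn">Ada</span>`, this yields `<span class="fn" property="v:fn">Ada</span>`, preserving the original Microformat while exposing RDFa.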

1.5.3 Conclusion

The first article on Microformats shows that it would be an unwise choice to let the engine use Microformats to add annotations to web pages, due to the limiting nature of the format. The second article focuses on converting Microformats to RDFa using hGRDDL. This is very relevant to the project: if we use this technique, it will allow the engine to adapt the web pages of developers who use Microformats as their annotation of choice.

1.6 European Projects

There are several ongoing European projects that deal with semantic annotations.


1.6.1 INSEMTIVES

Incentives for Semantics (INSEMTIVES) is a five-million-euro European project funded in 2009 under FP7 (the Seventh Framework Programme). Several universities and organisations from over six different European countries are participating. The aim of INSEMTIVES is to bridge the gap between human and computational intelligence in the current semantic content authoring landscape. They are developing methodologies for the creation of semantic metadata for different types of web resources. They investigate applicable social and economic incentives, notably in the areas of ontology engineering and the semantic annotation of media and web services, to motivate user participation in these inherently human-driven tasks. [6]

1.6.2 QuASAR

Quality Assurance of Semantic Annotations for Services (QuASAR) is a project of the School of Computer Science, University of Manchester. The QuASAR project aims to provide a toolkit to assist in the cost-effective creation and evolution of reliable semantic annotations for web services. In particular, they have developed tools to assist human annotators in verifying the annotations they develop before these are deployed into public repositories, and to gain maximum value from manually created annotations by using them as the starting point from which to infer new annotations. Although QuASAR aims at optimizing manually created annotations, the tools they have developed could be very interesting for the project: we might be able to use them to check whether the annotations the engine produces are trustworthy.

1.6.3 Conclusion

A lot of research on the development of the Semantic Web and semantic annotations is being done by larger and smaller projects across Europe. Mostly their scope is a lot broader, and semantic annotation is just one piece of a larger project. But it does show that the Semantic Web is a hot research topic and that we are not the only ones working on semantic annotation. It would be good to keep an eye on some of these projects and what they accomplish, as we might be able to use some of their work.

1.7 Conclusion

Web pages can be annotated using a large variety of techniques. This can be done manually by the developer, semi-automatically, or completely automatically. Each technique has a large number of options, which are described in detail in the different articles. We will most likely not pick one specific technique for the project but combine several approaches to strive for a better annotation result. One very useful way of combining the automatic approach while still keeping the end user in mind is to add semantic annotations based on the visual representation of a web page. Several projects are under way across Europe and may deliver interesting results for this project. Lastly, we examined Microformats as a possible standard. Due to its large community of active users, this technique should not be forgotten, as it can provide valuable semantic information.


Chapter 2

IPR - Patents
2.1 Analyzing and annotating web content

We've found several patents concerning the analysis and annotation of web content.

2.1.1 Method for scanning, analyzing and rating digital information content [50]

This patent describes a method to analyse data and classify it. Data is matched against multiple regular expressions, and those results are processed by a neural network which has been trained on a large dataset. Although the patent describes it as a method to classify pornographic or offensive content, it could be used for other types of content classification as well. The algorithm of matching text against a word database or regular-expression database and processing the result with a neural network is patented. We have to make sure we do not use such an algorithm in the implementation.

2.1.2 Web page annotation systems [38]

This patent describes an annotation technique which relies on a data processing system. This system acts as a proxy and retrieves data from the internet per user request. The system then analyses the web page content to select, by subject matter, at least one product class from a plurality of product classes represented in a product classification database. The annotations that will be available for display are each associated with a display condition dependent on one or more product items in the database. If these conditions are satisfied, the annotation data is included in the document that is supplied to the end user. This patent covers a very specific way of annotating: a proxy server with a product database. Furthermore, it does not annotate any semantic data but only adds extra data related to certain products. Based on these observations, this patent won't pose a problem.

2.1.3 Method for annotating web content in real time [30]

In this patent the inventor describes a method for annotating web content in real time via a client interface. Users can add annotations to (portions of) web pages after retrieving the page, and share them with other users. The intention is to inform other users (e.g. about the reliability of e-commerce pages). Since these annotations are not added automatically and the purpose is different from ours, this patent won't be a problem.


2.2 Invocation of the engine

We found a patent which might be relevant to the way we invoke the engine.

2.2.1 Web page annotating and processing [48]

This patent presents a system and method for associating annotations, modifications and other information with a web page. More specifically, a redirector is patented. If a user tries to access a web document on the internet using a specific URL, the request is intercepted by the redirector. The redirector modifies the requested web document and returns the modified document to the user. These modifications may include annotations, comments on the document, etc. The redirector is implemented as a web service, so no modifications to available browsers or server software are needed. If the engine is invoked by intercepting a user's request and redirecting it to the engine, this patent could be violated. We intend to implement the engine as a web service; on the other hand, the redirection itself is not part of the project. So, depending on how the patent is exactly stated, external applications using the engine could violate this patent, and it is important for the developers of those applications to be aware of it.

2.3 Ubiquitous content delivery

As mentioned before (section 1.3), an important application of semantic annotations is ubiquitous content delivery: annotations in a web page can be used to render the page specifically for the device used. There are several patents about the rendering and adaptation of such web pages which might be relevant for us.

2.3.1 Web content adaptation process and system [24]

This patent covers content adaptation for devices with small displays (e.g. mobile devices). The adaptation process receives information about the display characteristics of the target device and a web page to be transformed. Next, the web page is adapted and stored in a database. Every application which uses the annotations to adapt the annotated web page to the device used would violate this patent. Implementations will likely have to make an arrangement with the inventors.

2.3.2 Web server for adapted web content [23]

This patent describes a web server architecture for delivering web content adapted to mobile devices. A database is used which stores multiple adapted versions of web pages. These adapted versions can be pre-calculated or calculated on the fly if necessary. A method is described to classify web page content and to adapt the page for a certain screen size. The patent covers the server which stores multiple adapted versions of a web page in a database, detects the display characteristics of the requesting device, and transmits the cached or dynamically adapted version of the application back to the mobile device. Although content analysis is not patented on its own, many applications using the system could violate the patent.

2.3.3 Web content transcoding system and method for small display device [47]

This patent describes two methods to generate a version of a large web page adapted to a small screen size. Basically, the HTML is preprocessed, the HTML data is analysed to categorize the content blocks, a processing step is performed on this data (this step differs somewhat between the two versions), and voice mark-up and adapted HTML code are generated. The patent covers only these methods as described in the patent, so it won't cause us any problems. Again, content analysis is not patented; only a specific set of applications of it would infringe the patent.

2.4 Conclusion

There are some existing patents concerning the analysis and annotation of web content. Although they are all focused on a particular problem, they may cause problems for the users of the system. Two of them deal with searching for specific content in a text [50, 38]. The described methods might be interesting for us and we should take these patents into account, but we are not only focusing on the content. The third patent [30] requires the annotations to be added manually, which isn't the purpose of the project. The other patents are less important for us and more for our customers, as they will use the engine and the annotated web pages. Especially for rendering these web pages on a specific device, there are quite a few patents that should be taken into account.


Chapter 3

Standard bodies
The following paragraphs describe a few annotation technologies which might be interesting for the project. We mention the benefits and drawbacks of each proposed technology.

3.1 RDFa (Resource Description Framework - in - attributes)

RDFa [18, 11, 29], developed and proposed by the W3C, is a set of rules that can be used as a module for XHTML2. It reuses attributes from the standard XHTML meta and link elements and applies them to all other XHTML elements, so one can annotate XHTML markup with semantic information. The ultimate goal of RDFa is to make any RDF structure representable in pure XHTML. This allows the author to use a predefined set of rules to mark up just about anything.

The benefits of RDFa:
- Publisher independence: publishers are independent and each website is allowed to use its own standards.
- Self containment: the RDF triples are kept separate from the (X)HTML content.
- Modularity: the modularity of the schema makes attributes reusable.
- No duplication: there is no need to create separate XML and HTML sections with the same content, so there is no duplicated data.
- Modifiability: additional fields can easily be added, and XML transforms can extract the semantics of the data from an HTML file.
- Widely supported: it is developed by the W3C and Google will use RDFa soon, so there will be no lack of support.

But there are some drawbacks too:
- Incompatible tools: some XHTML cleaning tools (to create well-formed content) can break the embedded RDF semantics.
- Complexity: it is more complicated than Microdata and Microformats.
- Usage: will it ever be used worldwide?
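A toy extractor illustrates the attribute-based idea: it reads only `about` and `property` attributes and takes the element's text as the object of a triple. A real RDFa processor handles far more (rel, resource, typeof, CURIE prefix mappings), so this is a sketch, not a conformant parser.

```python
from html.parser import HTMLParser

class RdfaParser(HTMLParser):
    """Collects (subject, predicate, object) triples from about/property."""

    def __init__(self):
        super().__init__()
        self.subject = None
        self.pending = None    # predicate waiting for its text object
        self.triples = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "about" in a:
            self.subject = a["about"]      # new subject for this subtree
        if "property" in a:
            self.pending = a["property"]   # next text node is the object

    def handle_data(self, data):
        if self.pending and self.subject:
            self.triples.append((self.subject, self.pending, data.strip()))
            self.pending = None

p = RdfaParser()
p.feed('<div about="#me"><span property="foaf:name">Ada</span></div>')
```

After feeding the snippet, `p.triples` holds the single extracted RDF triple.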


3.2 Microformats

Microformats [17, 10, 29] is a web-based approach to semantic markup which aims to reuse existing HTML/XHTML tags to convey metadata and other attributes in web pages and other contexts that support (X)HTML. This approach allows software to automatically process information intended for end users (such as contact information, geographic coordinates, calendar events, etc.). The use, adoption and processing of Microformats enables data items to be indexed, searched for, saved or cross-referenced, so that information can be reused or combined.

The benefits of Microformats:
- Basic HTML: there is no need to learn another language if you already know HTML. Microformats use the class attribute of the different HTML tags.
- Compact: Microformats use a very compact syntax.
- Compatibility: Microformats are easy to add to an existing web page and work perfectly with CSS. They allow applications to use already existing technologies instead of converting data to RDF and back.
- Existing technologies: they try to model and encapsulate real, existing technologies like vCard and iCal data.
- Widely supported: there is very wide deployment and adoption by mainstream web designers and developers. The newest browser versions (like Firefox 3 and IE 8) will provide native support for Microformats, instead of relying on plugins like current versions.

Drawbacks of Microformats:
- Scalability: there are scalability issues and only a limited number of Microformats (though the number is still growing).
- Parsing problems: separate parsing rules are required for each Microformat.
- Inefficiency: it is quite inefficient from a parser's point of view (difficult for automated search).
- Namespaces: there is no use of namespaces. You have to make sure the class you define isn't already used for another purpose (e.g. defined in CSS).

3.3 Microdata (HTML5)

Microdata [16, 5] is a proposed feature of HTML5 intended to provide a simple way to embed semantic markup into HTML documents. Microdata can be viewed as an extension of the existing Microformat idea which attempts to address the deficiencies of Microformats without the complexity of systems such as RDFa.

The benefits of Microdata:
- Complexity: Microdata is less complex than RDFa.
- Easy to use: it is easy to add markup to your pages using a few HTML attributes.
- Adoption: it is already adopted by Google.
- Standard: it is very likely to become an officially recommended web standard as part of HTML 5.0.

Drawbacks of Microdata:
- Unfinished: HTML 5.0 is still under construction and is not a standard yet.
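The simplicity of the attribute scheme can be shown with a minimal reader: `itemscope` opens an item and `itemprop` names a property whose value is the element's text. Nested items, URL-valued properties and `itemtype` are deliberately ignored here, so this is an illustration rather than a full Microdata processor.

```python
from html.parser import HTMLParser

class MicrodataParser(HTMLParser):
    """Reads one flat itemscope into a property-to-text dictionary."""

    def __init__(self):
        super().__init__()
        self.item = None
        self.prop = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "itemscope" in a:       # boolean attribute: presence opens an item
            self.item = {}
        if "itemprop" in a:
            self.prop = a["itemprop"]

    def handle_data(self, data):
        if self.item is not None and self.prop:
            self.item[self.prop] = data.strip()
            self.prop = None

m = MicrodataParser()
m.feed('<div itemscope><span itemprop="name">Ada</span></div>')
```

After feeding the snippet, `m.item` contains the extracted name property.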


3.4 DAML (DARPA Agent Markup Language)

DAML [15, 3] is the name of a US funding program that focused on the creation of machine-readable representations for the web. The DAML language was developed as an extension to XML and RDF (Resource Description Framework). It provides a rich set of constructs to create ontologies and to mark up information so that it is machine readable and understandable. Much of the effort put into DAML has now been incorporated into OWL (Web Ontology Language).

The benefits of DAML:
- Machine readable: the DAML language allows machines to make the same sort of simple inferences that human beings do.
- Validation: a validator for DAML exists.
- Extensive: DAML is more extensive than RDFa.

Drawbacks of DAML:
- Complexity: compared to the other standards, DAML is more complex to use.

3.5 Conclusion

The following table gives an overview of the most important characteristics of the different technologies and compares them with each other.

                       RDFa   Microformats   Microdata (HTML 5.0)   DAML
Easy to use            No     Yes            No                     Yes
Extensive              Yes    Yes            Yes                    Yes
Widely supported       Yes    Yes            Yes                    No
Embedded in HTML       No     Yes            Yes                    No
Free to use            Yes    Yes            Yes                    Yes

Table 3.1: Overview of the different technologies and their features.

Technologies for the semantic web are still in development. There isn't one language that is better than the others in every respect. If you are looking for a simple and quick way to add annotations, Microformats are probably the best option. But if you want to be thorough and have the time for it, you should prefer RDFa or Microdata. DAML and Microformats should be your choice if you want a technology that is easy to extend and modify to your needs. In the project we should consider each technology, because they all have their benefits. Due to their incompleteness, Microformats and Microdata aren't really the best choice right now, but in the future they may well become the most used technologies. So we have to consider everything before turning down a technology.


Chapter 4

Professional organizations
4.1 Reuters OpenCalais

OpenCalais is an open web service by Thomson Reuters. This web service creates rich semantic metadata for the content you submit, in a fully automatic way; they also claim the annotation process is done in under a second. Their system relies on methods such as natural language processing (NLP), machine learning and some other, undisclosed methods. OpenCalais is a free service for both non-commercial and commercial use. Basically, what OpenCalais does is try to find certain words in the submitted content which it can link to other web pages, e.g. brands, places, specific terms, etc. It does not add semantic data concerning the layout or structure of the web application. It does offer some extra services though: it can deliver Social Tags and Topics. Social Tags presents a list of tags related to the submitted content along with a relevance indicator. The topic(s) of the submitted content are also delivered, along with a relevance percentage. OpenCalais uses RDF to write down its annotations.

4.2 Ontotext KIM Platform

The KIM Platform, developed by Ontotext, is a semantic search engine. It can analyse text, either from your own documents or from the web, and provides hybrid queries to search the structured data. It is free for non-commercial use, but paid for commercial use. The KIM Platform is the result of the paper KIM - Semantic Annotation Platform [44] and uses RDF tags. It differs from this project in that it doesn't annotate the structure of a web application; the annotating engine is only one part of KIM.

4.3 Ontoprise GmbH Semantic Contents Analytics

Ontoprise GmbH created Semantic Contents Analytics together with IBM. It analyses text from diverse sources (e.g. documents, e-mails, wikis, databases, ...) and classifies or tags documents according to its search results. This function is comparable to this project at a certain level, but Semantic Contents Analytics has quite a few other functions: it tries to extract facts from data, find links between different documents, and even tries to logically derive new information. The analysis can be aided by providing a knowledge model (ontology). It does not pay any attention to the structural elements of the texts it analyses. Semantic Contents Analytics is part of a bigger, commercial Semantic Middleware system.


4.4 iQser GIN Platform

The iQser GIN (Global Information Network) Platform is semantic middleware with quite a lot of functions concerning data integration, process control and information retrieval. The most interesting part is the semantic analysis. They claim their analysis process is fully automated and lightweight, and ensures high semantic validity. It does not require any preceding ontology and adapts to the current information situation. It does not provide semantic annotations concerning the structure and layout of web pages. Since 2009 the SDK of this middleware is available, but you have to pay a license fee to become a developer.

4.5 Annotea

Annotea [2] is an open project of the W3C that enables you to annotate your web pages. The project is meant to support and demonstrate W3C standards, mainly RDF-based annotations (section 3) and XPointer. The annotations are made using RDF, and XPointer is used to define for each annotation to which part of the HTML page it is related. The special thing about this project is the fact that the annotations and their paths are saved separately from the web page itself. Therefore the developer is able to add annotations to web pages without actually having to edit the web page itself. Annotea consists of two parts. First of all, the W3C offers an online editor, namely Amaya [1], where developers can add annotations to their web pages. This editor is an example implementation and can easily be replaced by any other editor with the same capabilities. The second part is the RDF-based metadata server, used for storing and fetching the annotations. As mentioned before, the annotations are kept separately from the web page, so the server stores the files containing the annotations but not the web pages. The editor and the server can communicate with each other: the editor can present a web page using the annotations fetched from the server, and conversely the server saves the annotations made with the editor. The article Adapting Content for Wireless Web Services [42] mentions that it is difficult to edit dual-purpose web pages. That problem is solved here by separating the annotations from the HTML page. This method of external annotation, separated from the web page, is also presented in one of the articles [33] (section 1.3.4), which gives several advantages and disadvantages of the method. This might be interesting for us, because Amaya could be replaced or extended with the engine to add annotations automatically.

4.6 Conclusion

There are a few companies that offer semantic analysis software. Most of them offer a middleware platform that analyses data beforehand and constructs a database containing all the information. None of the above-mentioned organizations offers real-time annotation or adds any information concerning the structure of the web page, and none of them is doing any research in that area (at least we couldn't find any information related to this kind of research on their websites). This project seems to offer a pioneering concept, but we must be aware that these organizations may easily try to adapt their current platforms to incorporate some of our features.


Chapter 5

Market reports
The World Wide Web and its services are an ever-growing market with lots of opportunities. The internet landscape changes rapidly, but it can be stated that the future of the web is semantic. The semantic web has become a hot topic over the past decade and currently a lot of research is going on in this area, on both the business and the academic level, and new technologies are being developed. Major companies already offer semantic web tools or systems using the semantic web: Adobe, IBM, HP, Oracle, etc. [12] Gartner, a leading IT research and advisory company, predicted in a 2007 report that during the next 10 years, web-based technologies will improve the ability to embed semantic structures. They expect that by 2017 the vision of the Semantic Web will coalesce and the majority of web pages will be decorated with some form of semantic hypertext, and that by 2012, 80% of public web sites will use some level of semantic hypertext to create Semantic Web documents. [12] They also state that the grand vision of the semantic web will occur in multiple evolutionary steps, and that small-scale initiatives are often the best starting points. [14] Semantic annotations play a crucial role in the realisation of the semantic web and can be used in a wide variety of other application areas on the internet. With the ability to tag all content on the web, we can describe what each piece of information is about and give semantic meaning to the content item. Search engines will become more effective and users will be able to find the precise information they are hunting for. This will have a big impact on the internet economy, as a majority of searches are in the domain of consumer e-commerce, where a web user is looking for something to buy. Agent-enabled semantic search will have a dramatic impact on the precision of these searches. It will reduce and possibly eliminate the information asymmetry whereby a better informed buyer gets the best value. [32]

There is a need for machine-readable metadata on the web, and nowadays it is often difficult and time consuming to mark up data semantically. This is one of the reasons the semantic web hasn't yet been widely adopted, at least commercially. [13] An engine that does this automatically would solve this problem, and many other applications could use it to speed up and facilitate their services. Adding semantic annotations to web applications could pave the way for the semantic web, as it creates web pages that can be understood by computers, opening up a whole new range of applications and possibilities, such as enhanced ubiquitous content delivery.


Chapter 6

Industry trends
6.1 Google's Rich Snippets

When you search for something on the internet using Google, Google presents you a list of results containing a link to each web page and a short section of information about the page. With Google Rich Snippets [7], Google gives you the opportunity to help decide which information to display in that section. As a developer you have to annotate your web pages manually. You are free to use RDFa or Microformats (section 3), because Google's Rich Snippets recognizes both. Once your web page is annotated, Google can parse it and decide which information is valuable to show in the user's search results. Based on the user's search term, Google decides, using your annotations, which part of your web page is most relevant to show. Annotating web pages manually is a disadvantage for many developers: as noticed in one of the articles [42] (section 1.3.1), it is difficult to edit such dual-purpose web pages. On the other hand, the user isn't restricted to one annotation standard, and can therefore choose (out of two) the standard they are most familiar with.

6.2 Del eTools - eLearning Annotation Web Service

Del eTools - eLearning Annotation Web Service [4] is a web service which offers you the possibility to add semantic annotations directly into the content of web pages. You can not only add these annotations but also share them with other learners. The service offers a web-based user interface where developers can manually add, manage and share the annotation functionality of their web application. Similar to other tools, the user has to add all annotations manually; as noticed in one of the articles [42] (section 1.3.1), it is difficult to edit such dual-purpose web pages.

6.3 OSA Web Annotation Service

The main objective of the OSA project [9] is to give people the possibility to annotate content. This content can be their own web page, but it can also be content from another or unknown author. Adding these annotations can even be done without the author's knowledge. The basic structure is much the same as in Annotea (section 4.5): the annotations and the content are saved separately, which allows people to add annotations to whatever online content they prefer.


An important part of OSA is the annotation service, which consists of a URL interceptor. This interceptor searches for annotations related to the document being retrieved, then combines the annotations and the web page into an annotated web page before it is delivered to the user. It also offers other kinds of features, for example letting you decide from which sources you want to accept annotations. Compared to Annotea (section 4.5), where a web page is linked to one page of annotations, here a web page can have more than one page of annotations. At runtime the annotation service can decide which annotations will be combined with the actual web page. There are no restrictions mentioned on how the annotations of a web page are created, although they aren't created in real time. This might nevertheless be interesting for us: the engine could create these annotations automatically when a web page is requested, and this existing annotation service could afterwards combine the annotations with the web page.
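A minimal sketch of such an interceptor, assuming annotations are stored apart from the page and keyed by URL (the storage shape, the source filtering and the merge strategy are simplified assumptions, not OSA's actual implementation):

```python
# Sketch of an OSA-style interceptor: annotations live apart from the page
# and are merged in on delivery. Storage, URLs and the merge strategy are
# all simplified assumptions made for illustration.

# Separately stored annotations: URL -> list of (source, annotation) pairs.
ANNOTATION_STORE = {
    "http://example.org/page": [
        ("alice", '<meta property="dc:title" content="Example" />'),
        ("spam-source", '<meta property="x:junk" content="ads" />'),
    ],
}

def intercept(url, page_html, accepted_sources):
    """Combine a page with its stored annotations before delivery,
    keeping only annotations from sources the user accepts."""
    selected = [
        annotation
        for source, annotation in ANNOTATION_STORE.get(url, [])
        if source in accepted_sources
    ]
    # Naive merge: prepend the selected annotations to the document.
    return "\n".join(selected) + "\n" + page_html

page = intercept("http://example.org/page",
                 "<html><body>Hi</body></html>", {"alice"})
print(page)
```

Note how filtering by accepted source happens at delivery time, mirroring OSA's runtime decision about which annotations to combine with the page.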

6.4 Conclusion

All of these projects, companies and services try to offer developers a platform to annotate web pages, and all of them do so using one (or more) annotation standards. However, the platforms use different ways to add the annotations to the web pages. In some cases the developer has to edit the web page itself (e.g. Google Rich Snippets, Del eTools); others use separate files to save the annotations (e.g. OSA). More importantly, in all of these projects the developer, or more generally the user, has to add the annotations manually. So, an important goal for our project is to add these annotations automatically and deliver annotated web pages that are at least as accurate as web pages that were annotated manually.


Chapter 7

Conclusion
There are many methods and algorithms to detect the content and structural parts of a web page. We can divide these methods into two major groups. On the one hand, there is a group of algorithms which uses the DOM structure of a web page to analyse the document. On the other hand, many algorithms use a visual representation of the web page and define its different parts based on the rendered page. Both kinds of algorithms are interesting for us, as the engine will likely have to combine several methods.

Important to us is the fact that none of these algorithms seems to be patented; there is no patent which really limits what we try to do. Our customers, on the other hand, should take notice of the various patents, because some applications of annotated web pages are patented in one way or another.

Furthermore, there doesn't seem to be any company or organization that tries to do exactly what we want to do. Several companies and organizations offer an application or service to add annotations to web documents. However, all of these companies developed semi-automatic methods which require interaction from the user or developer. At the moment, nobody offers a completely automatic annotation service or engine. We would be the first to add annotations completely automatically, so today we wouldn't have any real competition.

Market research shows us that the world wide web is evolving towards a more semantic web in which, of course, semantic annotations play a major role. It would be very interesting for us to be part of the semantic web, as adding annotations automatically would be a huge step forward towards it.

To annotate web pages, we can choose between different annotation technologies. However, it is not easy to choose one technology: all of them have their advantages and disadvantages. Moreover, all of them have quite a few applications on the world wide web and are integrated by one or more web companies. Therefore we can't simply ignore a technology.
It is important for us to use technologies that cover as large a part of the web as possible. Interesting in this respect is the possibility to transform annotations from one technology into another. More specifically, there are methods to convert Microformats into RDFa. This would be a benefit for us, because we could cover more technologies.


Part III

Vision


Chapter 8

Vision
8.1 Mission Statement

We are going to create an engine which analyses the contents and structure of web pages. Based on this analysis, the engine will automatically add annotations. Unlike existing engines, this engine will add annotations completely automatically. This will address the lack of semantic information on the web today. This extra information will improve existing applications, but also create endless possibilities for new kinds of applications.

8.2 Customers and benefits

Primary: at first, our main customers will be web application developers. The engine will allow them to easily and automatically add semantic data to their pages without spending extra development time.
Secondary: once semantic annotations become more common on the world wide web, we will have a whole new range of customers who want to make use of the annotations, e.g. search engines, browser developers, adaptation engines, application developers...
Benefits: adding semantic annotations to web applications will create web pages that can be understood by computers. This will pave the way for the semantic web and create a whole new range of applications and possibilities.

8.3 Key factors to judge quality

Scalability: since we opted for a server-side approach, the engine has to be able to handle many requests within a short time period.
Performance: people do not like to spend a few seconds waiting for their page to load. The system will annotate in real time, so it is crucial that the engine can do this very fast.
Modifiability: we need to be able to swap out one algorithm for another at runtime. This means the entire system should be split up into smaller, less complex subsystems. Each of these subsystems can then be edited without affecting other subsystems, and each subsystem can be exchanged for another subsystem with the same function.
Accuracy: the engine needs to replace developers who annotate their pages by hand. To make sure the engine will be used, it must reach an acceptable level of accuracy.


Availability: to offer a certain service level, the servers have to be up and running all the time (e.g. also during updates and modifications we have to try to keep the servers up and running).
Completeness: the added annotations not only have to be accurate but also complete. This means that every place where a developer would expect an annotation should contain the correct annotations.
Stability: the engine annotates in real time. Therefore every error has to be handled correctly and quickly without stopping the engine. An error may not lead to a badly annotated application.
Extensibility: as the world wide web evolves very quickly, the engine has to be extensible in order to handle new technologies used in web applications. We should be able to add new technologies (e.g. a new annotation algorithm) in a short time span, and these new technologies should be available immediately.

8.4 Key features and technology

The engine will add annotations to a given web application, completely automatically: it takes a web application as input and returns an annotated web application as output. The engine can be used as part of a bigger system.

8.5 Crucial factors as applicable

A certain level of correctness (accuracy) will be required for the engine to be usable. The engine also needs to be consistent, which means we can't have two different annotated versions of the same web page; this too is necessary for the engine to be usable. The solution has to be easily extensible for future web features. Documentation about the used annotations and the API will be necessary; this will stimulate third-party software to make use of the annotations and the semantic annotation engine. The documentation will be publicly available.


Part IV

Scenarios


Chapter 9

Use cases
9.1 Use case diagram

Figure 9.1: Use case diagram for the annotation engine

9.1.1 Actors

We have three kinds of actors:
1. User: a user can be anyone who wants to use the engine. This can be a physical person, but also another engine or application. Every user is able to perform a semantic analysis on a selected web application. To do this, the user is able to change some common settings concerning the performance and accuracy of the engine.


2. Administrator: the administrator is an extension of the user, which means he can do everything a normal user can do. In addition, he manages the engine on the server and is able to change all settings related to the engine. He's also able to add additional functionality to the engine, e.g. a new algorithm.
3. Web developer: the web developer is also an extension of the user. Additionally, he can install the engine on his own computer. By doing so, he becomes administrator of his local installation and is able to do everything the administrator can do for the engine on the server.

9.1.2 Use cases

Perform semantic analysis
Performing a semantic analysis of a web page can be done in two ways. First, the engine is implemented as a web service, which means it can be invoked by any user from any place on the internet; a user can even be another engine or application. Secondly, we provide a web application as a user interface to the engine, so any physical user can use the engine through the provided user interface. In both cases the user has to give a web page as input, and the output of the semantic analysis is an annotated web page.

Choose performance and accuracy
The user will be able to trade off performance and accuracy. Concretely, this means the user can choose the algorithm used by the engine. Not every algorithm performs equally fast; most likely, a faster algorithm will be less accurate. When using the engine in real time, a user probably prefers a fast algorithm which may be a little less accurate. On the other hand, when a web developer runs the engine on his own computer, he can decide to choose a slower algorithm which gives a higher accuracy.

Correct annotations
As noted before (section 9.1.2), we offer a user interface to the user. If this user interface is used, the user can choose, before starting the semantic analysis, to view the annotations in an editor afterwards. After the analysis, an editor will open and show the added annotations. The user is able to add other annotations or change the annotations that were already added. Afterwards, the revised page can be submitted to the engine for machine learning. For safety reasons, only an administrator can actually insert annotated pages for machine learning (sections 9.1.2 and 9.2.4). If a user submits an annotated page to the engine, this page will be saved until an administrator confirms that it may be used for machine learning. This option will especially be used by advanced users when they notice the engine isn't accurate enough anymore.
Add annotated pages for machine learning
Another way of dealing with new technologies is the possibility to add already annotated web pages. Using these annotated web pages, the engine can apply machine learning to learn from the annotations and possibly change its behaviour. As already noted (section 9.1.2), any user can submit an annotated web page. The actual selection of which of these web pages will be used for machine learning is, for safety reasons, left to the administrator. The administrator can view the submitted pages and choose which pages should be used for the actual machine learning. If the administrator resubmits those pages, they will be analysed and used to update the algorithms.


Management of rule-based methods
As the world wide web evolves quite fast, it is necessary that the engine can be adapted to deal with new technologies. To do so, the administrator can add, change or remove rules containing information about which structural or content part has to be mapped to which annotation.

Add functionality
The administrator is able to add additional functionality, e.g. a new algorithm, to the engine. This can easily be done, at runtime, using the user interface. Important is the fact that this happens at runtime, so the engine keeps running while the additional functionality is being added.

Install and run locally
A web developer has the possibility to install the engine on his own computer. By doing this, he has his own local engine and becomes administrator of it (also see section 9.1.1), so he can perform semantic analysis on his web applications using his own engine.
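The rule management described above could be sketched as a mutable rule set mapping structural patterns to annotations. The rule shape (regular expressions) and the annotation attribute below are assumptions made for illustration; a real rule format would be specific to the chosen algorithm:

```python
import re

# Minimal sketch of an administrator-editable rule set. Each rule maps a
# hypothetical content pattern to an annotation attribute; both are
# illustrative assumptions, not the engine's actual rule format.
class RuleSet:
    def __init__(self):
        self._rules = {}  # name -> (compiled pattern, annotation attribute)

    def add(self, name, pattern, annotation):
        self._rules[name] = (re.compile(pattern), annotation)

    def remove(self, name):
        self._rules.pop(name, None)

    def annotate(self, html):
        """Wrap every fragment matching a rule with its annotation."""
        for pattern, annotation in self._rules.values():
            html = pattern.sub(
                lambda m, a=annotation: f'<span {a}>{m.group(0)}</span>',
                html)
        return html

rules = RuleSet()
rules.add("price", r"\d+\s?EUR", 'property="gr:hasPrice"')
print(rules.annotate("Ticket: 25 EUR"))
# → Ticket: <span property="gr:hasPrice">25 EUR</span>
```

Adding, changing or removing a rule changes the engine's behaviour without touching the algorithm code, which is what the management use case requires.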

9.2 Use case scenarios

9.2.1 Perform semantic analysis

Name Perform semantic analysis
Textual description When the engine is given a certain web page as input, the engine needs to return an annotated web page.
Priority High - this is essential for the entire project.
Complexity High - this is the most essential part.
Actors Any user who wants to annotate a web application.
Events (triggering, during execution, ...) The user starts the engine and supplies the web page which has to be annotated. During execution the engine could trigger a fatal error, sending a failure notification to the user and a log message to the administrator. The original web page, without any annotations, will then be returned.
Preconditions The engine is running. The given web page is available. The engine has the necessary rights to visit the web page.
Main success scenario (MSS)
1. The user gives the location of the web page to be annotated.
2. The engine tries to read the web page.


3. The engine applies heuristics and algorithms to detect the web page's structure and content.
4. The structural and content parts are annotated.
5. The annotated version is returned to the user.
Extensions on the MSS
2. The web page cannot be read or contains errors. An error message is returned to the user and will be logged internally.
Postconditions after success and failure If the engine is used through the user interface, the annotated web page will be shown in the editor. If the engine is used as a web service by another application, the annotated web page is returned to the user. In case of failure, a message is returned to the end user and the original web application is returned or shown.

Blackbox

Figure 9.2: Blackbox for the use case to perform a semantic analysis.

9.2.2 Choose performance and accuracy

Name Choose performance and accuracy
Description You can choose which algorithm you want to use for performing the annotation of the web page. The accuracy of the annotation will vary depending on the selected algorithm. By selecting a highly accurate algorithm, performance will likely be lower, and vice versa.


Priority Low - this is optional; otherwise the standard algorithm is used.
Complexity High - implementing different algorithms is a complex job. It will take a while to create several algorithms with a high accuracy.
Actors User
Events This use case is triggered when the user decides to choose an algorithm other than the standard one.
Preconditions There are different algorithms for the user to choose from. The algorithms work correctly.
Main Success Scenario (MSS)
1. The user goes to the option pane where he can select the algorithm he prefers.
2. The user selects the algorithm he wants to use.
3. The user saves his selection and can perform the semantic analysis.
Extensions on the MSS
2. The user makes an empty selection. A message is returned to the user that he has to select an algorithm.
3. The user doesn't save his selection. The selection will be discarded and the standard algorithm (set by the administrator) will be used.
Postconditions The user can now perform semantic analysis using the selected algorithm.


Blackbox

Figure 9.3: Blackbox for the use case to choose performance and accuracy.
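The choice in this scenario boils down to selecting from a catalogue of algorithms with a speed/accuracy trade-off, falling back to the administrator's default on an empty or invalid selection. The algorithm names and figures below are made up for illustration:

```python
# Hypothetical algorithm catalogue illustrating the speed/accuracy
# trade-off; names and numbers are invented for the sketch.
ALGORITHMS = {
    "fast-dom":    {"accuracy": 0.80, "avg_seconds": 0.5},
    "visual-deep": {"accuracy": 0.93, "avg_seconds": 3.5},
}
DEFAULT = "fast-dom"  # set by the administrator

def select_algorithm(choice=None):
    """Return the user's choice, falling back to the administrator's
    default when the selection is empty or unknown."""
    if choice in ALGORITHMS:
        return choice
    return DEFAULT

print(select_algorithm("visual-deep"))  # → visual-deep
print(select_algorithm(None))           # → fast-dom (empty selection)
```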

9.2.3 Correct annotations

Name Correct annotations
Description A user will be able to view the result of the annotations added by the engine and edit them if they are incorrect. This will especially be done by advanced users, as the revised pages can be used by the engine for machine learning to make the annotation algorithms more accurate. After a user has corrected a page, he can submit this page as a suggestion for machine learning. The actual selection of the page for machine learning is done by the administrator (section 9.2.4).
Priority Medium - this method should be fairly performant. The user should not have to wait very long before he can view and edit the annotations, but priority still goes to performing semantic analysis.
Complexity Medium - this actually involves a simple viewer/editor. This should not be very complex, although it will still take some time to program.
Actors User
Events This use case is launched when the user annotates a page using the web interface and has selected the option to view the added annotations afterwards. In this way, the user can see which changes have been made and still edit them if they are incorrect.
Preconditions The user selected the option to view the added annotations after the analysis. The user has performed semantic analysis on a page using the web interface of the server.


Main Success Scenario (MSS)
1. The user is redirected to the editor after performing semantic analysis on a page using the web interface of the server.
2. The user sees the resulting document and the added annotations.
3. The user performs modifications to the resulting document as desired.
4. The user saves his changes.
5. The user submits the final page to the engine for machine learning.
6. The user exits this web application.
Extensions on the MSS
4. The user doesn't save his changes. The changes made by the user will not be added to the final page.
5. The user exits without submitting the final page. The user is asked for confirmation. If the user again chooses to exit, the annotated page is thrown away. If the user decides not to exit, the editor will be shown again, containing the final page.
Postconditions The submitted page is saved for later use.

Blackbox

Figure 9.4: Blackbox for the use case to correct annotations.


9.2.4 Add annotated page for machine learning

Name Add annotated page for machine learning
Description The system can use annotated pages to update algorithms, using machine learning to learn from these annotated pages. Users can make changes to annotated pages and submit them (section 9.1.2) for machine learning. Afterwards, an administrator has to confirm which pages will be used for machine learning.
Priority Low - the performance must mainly go to delivering annotated pages in time. This use case is purely for settings, and may be delayed if necessary.
Complexity High - because the machine learning itself is not an easy matter to implement.
Actors Administrator
Events This use case is triggered when the administrator decides, based on the results of annotations, that a certain algorithm isn't accurate enough.
Preconditions There must be correctly annotated pages available.
Main Success Scenario (MSS)
1. The administrator logs on to the web application and gets an overview of the submitted and correctly annotated web pages.
2. The administrator views the content of the pages and selects one or more pages.
3. The selection of pages is committed by the administrator and will be processed by the engine.
4. The administrator logs out.
Extensions on the MSS
3. One or more pages contain errors. The administrator gets a notification about these pages; they aren't processed.
4. The administrator logs out without committing the pages. The adaptations are discarded. The pages aren't used for machine learning at this point.
Postconditions The rules for certain algorithms are updated because the system has learned how it should be done.


Blackbox

Figure 9.5: Blackbox for the use case to add annotated page for machine learning.

9.2.5 Management of rule-based methods

Name Management of rule-based methods
Description For rule-based methods to analyse and annotate web pages, the administrator must be able to edit the rules used by the algorithm.
Priority Medium - this use case is purely for settings and may be delayed, but it is necessary to guarantee a certain level of accuracy.
Complexity High - the format of the rules is customized to the algorithm. Also, the rule database must not be corrupted during the operation.
Actors Administrator
Events This use case is triggered when the administrator decides, based on the results of annotations, that rules used by a certain annotation algorithm must be added, edited or removed.
Preconditions The database of rules is in a valid state.
Main Success Scenario (MSS)


1. The administrator logs on to the web application.
2. Rules are added, modified or deleted.
3. The rules are committed to the database.
4. The administrator logs out.
Extensions on the MSS
3. A conflict arises when committing the rule modifications. The rules are not committed. The conflicting rules are presented (going back to the rule-modification step) and any issues must be resolved by the administrator manually.
4. The administrator logs out without committing the rules. The adaptations are discarded: the database is not modified.
Postconditions The rule set has been modified according to the requested operations. The database of rules is in a valid state.

Blackbox

Figure 9.6: Blackbox for the use case for management of rule-based methods.

9.2.6 Add functionality

Name Add functionality


Textual description At runtime, additional functionality such as a new algorithm can be added. This can be done by simply uploading the code using the user interface. The system itself will make sure the functionality is properly added to all instances of the annotation engine.
Priority Low - this is not essential for the project, but it is required to be able to keep the engine up-to-date.
Complexity Medium - the engine has to keep running while adding the new code to the instances.
Actors Administrator
Events (triggering, during execution, ...) This use case is triggered when the administrator has code which has to be added to the engine.
Preconditions The engine is running. The code is valid.
Main success scenario (MSS)
1. The administrator logs on.
2. The correct configuration page is selected in the user interface.
3. The administrator uploads the additional functionality.
4. The system takes care of distributing the functionality to the different components of the system.
5. The administrator logs out.
Extensions on the MSS
4. The code doesn't have the correct structure (concerning packages, interfaces, etc.) and therefore cannot be handled by the system. An error message is returned to the administrator. The code will not be added to the components.
Postconditions after success and failure The different components notice the changes. The additional functionality will be used by the different components.


Blackbox

Figure 9.7: Blackbox for the use case to add functionality.

9.2.7 Install and run locally

Name Install and run engine locally
Description The web developer can install the engine on his computer (basically anyone can, but we assume this is only beneficial to a web developer). By doing so, the web developer becomes administrator of his own installed engine and can do everything the administrator of the engine on the server can do.
Priority Medium - if web developers use their own installation to perform semantic analysis, the load on the servers will be lower.
Complexity Low - to make it possible to run the engine locally, no or few changes need to be made to the engine.
Actors Web developer
Events This use case is triggered when the web developer decides to run the engine locally.
Preconditions The web developer has downloaded the installation files of the engine. The web developer needs to make sure he has a web server (e.g. Apache) installed on his computer.


Main Success Scenario (MSS)
1. The web developer extracts the installation files and runs the setup.
2. A locally installed web server is selected so the engine will be able to execute.
3. An administrator account is created.
4. Setup ends.
Extensions on the MSS
3. The information for the account to be created is incorrect, e.g. the passwords don't match. The web developer will be notified and needs to re-enter the information.
Postconditions After installation, the web developer is administrator of his installation and can perform semantic analysis, change all settings, etc.

Blackbox

Figure 9.8: Blackbox for the use case to install and run the engine locally.


Chapter 10

Quality attribute scenarios


10.1 Scalability

The system can no longer handle the amount of received requests.
Source of stimulus server
Stimulus requests enter the buffer
Artifact system
Environment overloaded operation mode
Response activate an additional server to handle part of the requests. Requests will then be divided between the old servers and the new server.
Response measure assuming the additional server is part of the cluster, this server can be used immediately.

10.2 Performance

A user wants to perform a semantic analysis on a web application.
Source of stimulus a user
Stimulus the user wants to view or use an annotated web application
Artifact system
Environment normal operation mode
Response the system annotates the given web application
Response measure the system should annotate a web application in under 4 seconds. The latency will be 3500 ms and the jitter 500 ms.

Handling many requests within a short time interval.
Source of stimulus server
Stimulus many simultaneous requests
Artifact system
Environment normal operation mode
Response the system annotates the given web applications of all requests


Response measure when the server is equipped with two AMD Opteron 4176 HE processors and 64 GB of DDR3 RAM, the software should be able to handle 400 requests per second.

10.3 Modifiability

A user wants a specific annotation algorithm.
Source of stimulus a user
Stimulus annotation request
Artifact system
Environment normal operation mode
Response the requested algorithm is loaded and prepared for usage
Response measure the algorithm should be ready instantly and can be changed at runtime.

10.4 Accuracy

The annotations that are added must be correct.
Source of stimulus a user
Stimulus annotation request
Artifact system
Environment normal operation mode
Response an annotated page with a minimum accuracy.
Response measure a series of test pages with a wide variety of characteristics is annotated and verified by a group of testers. They mark the number of improperly or inaccurately annotated items on each page. The average ratio of properly annotated items to the total number of annotated items must be at least 90%. [52, 21]
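This response measure is a simple ratio averaged over the set of test pages. With hypothetical tester counts (invented for illustration) it could be computed as:

```python
def accuracy(properly_annotated, total_annotated):
    """Ratio of properly annotated items to all annotated items on one page."""
    return properly_annotated / total_annotated

# Hypothetical tester counts for three test pages: (proper, total annotated).
pages = [(47, 50), (92, 100), (18, 20)]
average = sum(accuracy(p, t) for p, t in pages) / len(pages)
print(round(average, 3))  # → 0.92
print(average >= 0.90)    # → True: this test set meets the 90% threshold
```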

10.5 Availability

Server crash
Source of stimulus server
Stimulus the server crashes due to a hardware problem, power outage, etc.
Artifact system
Environment normal operation mode
Response the server must restart automatically. If a restart is not possible (e.g. after a hardware crash), the administrator must be notified. If other servers are available, the incoming requests should be automatically redirected while the server is restarting.
Response measure the server should be checked every second to see whether it is still available. The downtime of the server depends on the type of crash. After a software crash, the server should restart within 1 minute. If the error is due to an external factor (e.g. power loss) or a hardware crash (e.g. an I/O failure), the downtime can be several hours or even days [43]. Since this is unacceptable, it is necessary to work with multiple servers.


Add functionality to the engine
Source of stimulus developer
Stimulus the system needs additional functionality
Artifact system
Environment normal operation mode
Response the additional functionality can be added by uploading files using the user interface. This can be done while the engine and server are running. The engine will add the code to each running instance of the annotation engine one by one.
Response measure the additional functionality has to be added without any downtime, meaning that the server and engine have to be available at all times during the updates.

10.6 Completeness

A sufficient number of items must get annotations.
Source of stimulus a user
Stimulus annotation request
Artifact system
Environment normal operation mode
Response an annotated page with a minimum number of annotations.
Response measure a series of test pages with a wide variety of characteristics is annotated and verified by a group of testers. They mark the number of items on each page which should have been annotated. The average ratio of annotated items to the total number of items which should be annotated must be at least 80%.

10.7 Stability

An error occurs in the annotation process
Source of stimulus engine
Stimulus error in the annotation process
Artifact system
Environment normal operation mode
Response the user can choose between restarting the annotation process or stopping it (and returning the original web application).
Response measure the re-annotation should not take more time than the normal time needed to annotate a page, i.e. 4 seconds. If an error keeps occurring during the second attempt, the input should be assumed to be the problem (see the next item).

The input cannot be processed
Source of stimulus web application
Stimulus due to mistakes in the input or unreadable content, the input cannot be annotated
Artifact system


Environment normal operation mode
Response send an error message back to the user, together with the original web application
Response measure sending a message should not affect the other requests.

10.8 Extensibility

The administrator wants to add additional functionality (e.g. a new algorithm, changes to the GUI...)
Source of stimulus administrator
Stimulus the system needs additional functionality
Artifact system
Environment normal operation mode
Response the additional functionality can be added by adding the code using the user interface at runtime. The engine itself makes sure the code is added to all running instances of the annotation engine.
Response measure after adding the code, the additional functionality has to be available immediately.
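One way to meet this scenario is to import uploaded code as a fresh module without restarting the process. The sketch below assumes each uploaded algorithm exposes an `annotate` function; the module name, file-based staging and interface are assumptions for illustration:

```python
import importlib.util
import os
import tempfile

# Code as it might arrive through the upload interface (an assumption):
# a module that exposes an `annotate` function.
UPLOADED_CODE = '''
def annotate(html):
    return "<!-- annotated -->" + html
'''

def load_algorithm(name, source):
    """Write uploaded source to disk and import it as a fresh module,
    while the hosting process keeps running."""
    path = os.path.join(tempfile.mkdtemp(), name + ".py")
    with open(path, "w") as f:
        f.write(source)
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

algo = load_algorithm("new_algo", UPLOADED_CODE)
print(algo.annotate("<html></html>"))  # → <!-- annotated --><html></html>
```

Each running instance of the engine would perform this load, so the new functionality becomes available immediately, without downtime.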


Part V

Architectural design


Chapter 11

Global overview


11.1 System Overview

This project will be implemented as a web service which adds semantic annotations to a given web page. Technically, this means external users and applications can invoke the web service from within external applications to annotate their web pages. As an extra feature, we provide a GUI in the shape of a web application, so users aren't forced to use the engine as a web service but can also use the GUI. If a user accesses the engine through the GUI, we can provide an extra feature there that lets him examine the added annotations and correct or change them if necessary. If the user is an administrator, he can also configure the system using this web application.

Another extra feature is that a user can install the actual annotation engine locally. In that case he automatically becomes administrator of his installation, so he can adapt the engine to his own needs. We offer this as an extra feature to advanced users, but will not take it into account during the architectural design. In a first stage, we will only run the system on our own servers; later, we can consider offering the actual annotation engine for download.

An alternative to a web service would be to offer the engine as a plugin, meaning every web server administrator could download the plugin and use it on his server. We've chosen a web service over a plugin for several reasons. First of all, with a web service we keep control over the engine: we can make sure the engine is installed correctly, runs well and is always up-to-date. If we offered a plugin, these would be the responsibilities of the web server administrators, and it could happen that a user, for example, receives outdated annotations or has to wait very long before the annotation process completes. Moreover, it is not even certain that web server administrators would use the plugin, because we can't force any administrator to install it. Secondly, we believe a web service is much more user-friendly. If we offered a plugin, a user who wants to annotate a page would first have to find a server which makes use of the plugin, or install the plugin on his own system (if no such server can be found). It is also possible that this specific server doesn't offer a useful interface to the plugin (e.g. a GUI), which would make it very difficult for physical users to use the engine. Therefore we offer a web service, installed on our own servers: it can be used by anyone without having to search for a server or install the engine first. On top of that, we offer a GUI which makes the web service accessible to everyone.

One important element to note is that the system needs to be scalable. Because we use a web service, all requests are addressed to our servers, which can be a problem when there are many requests within a short period of time. However, we believe it is possible to design the system so that it remains scalable even when the number of requests at a certain moment is huge.


Chapter 12

Attribute Driven Design - Cycle I



12.1 Inputs for the system

The semantic annotation service is a rather complex system that has to be able to deliver on a variety of use cases. This complex system is the input, and we want to split it into less complex substructures.

12.2 Architectural drivers

Modifiability To simplify the system, we opted to create a series of less complex subsystems which we can modify separately. When needed, we can swap out one subsystem and replace it with another one. The complex system as a whole should be split up into several independent subsystems.

12.3 Architectural pattern

12.3.1 Chosen pattern

Layer pattern Based on this architectural driver, we chose the Layer pattern. This pattern allows us to split the system into different layers. Each layer can then be modified without affecting the other layers.

12.3.2 Instantiation

Figure 12.1: The system divided into subsystems based on the layer pattern.



We have defined 3 layers (figure 12.1): the Access Layer, the Application Layer and the Persistence Layer. We split the system into these layers based on functionality. The Access Layer is the upper layer and offers the services to the outside world. The Application Layer contains all business logic and is the intermediate layer; we expect this layer to be the most CPU-intensive one. The lower layer is the Persistence Layer. A detailed description of each layer follows.

Access Layer
The Access Layer is the interface of the application, responsible for all outside communication. Users can interact with the system through the web service or using the GUI (a web application). This layer collects all requests and passes them on to the Application Layer.

Class name: Access Layer <subsystem>
Responsibilities:
- Offer a Graphical User Interface (GUI) to the application logic
- Offer a web service to the application logic
- Manage requests
Collaborations:
- Application Layer

Application Layer
The intermediate layer is the Application Layer. All requests from the Access Layer are passed to this layer. The system is likely to receive a high load of requests, so the Application Layer should be able to deal with this by applying load balancing and buffering the requests. The Application Layer is responsible for the actual annotation of a web page and for creating new machine learning rules. The Application Layer is connected to the Persistence Layer to be able to request data if necessary. The Application Layer will cache all necessary data and will check at fixed intervals whether the data in the Persistence Layer has been updated in the meanwhile. If so, the Application Layer will request this updated data and add it to the cache.

Class name: Application Layer <subsystem>
Responsibilities:
- Buffer requests
- Load balancing
- Analysis of annotated pages for machine learning
- Annotate given request
- Update the subsystems within the Application Layer
Collaborations:
- Persistence Layer

Persistence Layer
The Persistence Layer takes care of the actual data storage. Requests for data are handled here and executed on the database. If multiple clusters coexist, this layer is responsible for keeping the data synchronized across all clusters. When data is updated in this layer, the Persistence Layers in the other clusters will be notified.

Class name: Persistence Layer <subsystem>
Responsibilities:
- Store data
- Provide data
- Synchronise different Persistence Layer instances
Collaborations:
- Application Layer
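The strict top-down dependency between the three layers can be sketched as follows. This is a minimal illustration of the layering, not the actual implementation: all class and method names are assumptions, and the annotation logic is a stub.

```python
# Minimal sketch of the three-layer split: each layer only talks to the
# layer directly below it.  Names and methods are illustrative assumptions.

class PersistenceLayer:
    """Stores data and serves it back on request."""
    def __init__(self):
        self._store = {}

    def put(self, key, value):
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)


class ApplicationLayer:
    """Holds the business logic; talks only to the Persistence Layer."""
    def __init__(self, persistence):
        self._persistence = persistence

    def annotate(self, url):
        # A real engine would run an annotation algorithm here; we just
        # record that the page was processed and persist the result.
        result = {"url": url, "annotations": []}
        self._persistence.put(url, result)
        return result


class AccessLayer:
    """Entry point for the GUI and the web service; talks only to the
    Application Layer."""
    def __init__(self, application):
        self._application = application

    def handle_request(self, url):
        return self._application.annotate(url)


service = AccessLayer(ApplicationLayer(PersistenceLayer()))
response = service.handle_request("http://example.org/page")
```

Because each layer holds only a reference to the layer below it, a layer can be swapped out (e.g. a different Persistence Layer implementation) without touching the layers above.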



12.4 White Box Scenarios

Figure 12.2: A user performing an annotation request at the point of cycle 1.

Figure 12.3: A user performing page corrections at the point of cycle 1.



Figure 12.4: The admin selects annotated pages for machine learning that are processed by the Application Layer at the point of cycle 1.

Figure 12.5: An admin performing an engine update at the point of cycle 1.



Figure 12.6: An admin managing the rule set at the point of cycle 1.

Figure 12.7: An admin booting a new cluster at the point of cycle 1.



Figure 12.8: An admin tries to login at the point of cycle 1.

12.5 Deployment

Figure 12.9: Deployment of the system in cycle 1.

The Persistence Layers of the different clusters are connected to make sure the data in all clusters is up-to-date and stays up-to-date when one cluster receives a data update. The clusters work independently, so we can easily add multiple clusters to be able to handle many requests.


Chapter 13

Attribute Driven Design - Cycle II



13.1 Access Layer

13.1.1 Inputs for the system

The Access Layer provides the interface of the application (GUI and web service) and has to be able to handle multiple types of HTTP requests. The HTTP requests have to be transformed into an internal request format, to get rid of the overhead of an HTTP request and to make sure the Application and Persistence Layer don't have to work with these HTTP requests. This has to be done in a way that makes it easy to change the conversion when the HTTP requests or internal requests change.

13.1.2 Architectural drivers

Modifiability Modifiability is crucial in this layer at two points. All incoming HTTP requests are transformed into internal requests before being passed to the Application Layer. New types of requests can be added, or modifications in request handling can be made, without the need to modify the other kind of request. Concretely, HTTP requests can be changed without having to change the internal requests and vice versa. As a consequence, different subsystems of the Access Layer can be modified and use other types of HTTP requests without the other layers noticing it.

Chosen pattern
MVC pattern The MVC pattern allows the subsystem to be fully modifiable: the GUI and web service can be modified without impact on the other components.

Instantiation The Access Layer consists of five subsystems (figure 13.1): RequestModel, GUIController, GUIView, WebServiceView and WebServiceController. The RequestModel (Model) converts all HTTP requests coming from the controllers. There are four different HTTP requests:
- annotationRequest: an HTTP request containing a URL of a web page that has to be annotated.
- updateRequest: an HTTP request to update one or more components within the Application Layer.
- machineLearningRequest: an HTTP request to use a given page for machine learning.
- dataRequest: an HTTP request concerning data. This can be a request to select data, but also to update or delete certain data.
These requests are converted by the RequestModel to internalAnnotationRequests, internalUpdateRequests, internalMachineLearningRequests and internalDataRequests. The GUIController and WebServiceController act as controllers. The GUIView and WebServiceView act as the views of this model. The controllers pass the requests to the RequestModel, which



Figure 13.1: The Access Layer, divided into subsystems using the MVC pattern.

returns the result of the request to the controller. The controller then informs the view.

GUIController and GUIView
The GUI subsystem provides a graphical interface for the users (including the administrator). It will allow users to submit a web page for annotation and correct an annotated web page. The administrator can change settings, input files for machine learning, update the algorithms and choose the used algorithms. The GUI is divided into a GUIController (Controller) and GUIView (View). The GUIController will generate requests depending on the actions of the user and pass them to the RequestModel.

Class name: GUIController <subsystem>
Responsibilities:
- Pass on annotationRequest
- Submit corrected annotations
- Login
- Allow admin to add or change rules/algorithms
- Allow user to select the algorithm
Collaborations:
- RequestModel

Class name: GUIView <subsystem>
Responsibilities:
- Return the result of the request to the user
Collaborations:
- GUIController

Web Service Controller and Web Service View
The Web Service handles the annotationRequests coming from external applications or web services that use the system. The Web Service is divided into a Web Service Controller (Controller) and a Web Service View (View). The Web Service Controller receives the external requests and passes these to the RequestModel.

63

CHAPTER 13. ATTRIBUTE DRIVEN DESIGN - CYCLE II

Class name: Web Service Controller <subsystem>
Responsibilities:
- Pass on annotationRequest
Collaborations:
- RequestModel

Class name: Web Service View <subsystem>
Responsibilities:
- Return the result of the request
Collaborations:
- Web Service Controller

RequestModel
The RequestModel will handle all requests coming from the GUIController and Web Service Controller and pass these to the Application Layer after converting them to internal requests. The RequestModel also handles authentication of requests, to make sure no unauthorised requests reach the Application Layer. This is done here because all requests are gathered in the RequestModel.

Class name: RequestModel <subsystem>
Responsibilities:
- Receive all requests
- Convert incoming requests to internalRequests
- Forward these internalRequests
Collaborations:
- Application Layer.LoadBalancer
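The conversion step performed by the RequestModel can be sketched as a mapping from raw HTTP requests (modelled here as plain dictionaries) onto lightweight internal request objects. The field names and internal classes are assumptions for illustration only.

```python
# Hedged sketch of the RequestModel conversion: HTTP requests are
# stripped of their transport overhead and mapped onto internal request
# objects before they reach the Application Layer.

from dataclasses import dataclass, field


@dataclass
class InternalAnnotationRequest:
    url: str


@dataclass
class InternalDataRequest:
    operation: str              # e.g. "select", "update", "delete"
    payload: dict = field(default_factory=dict)


def to_internal(http_request: dict):
    """Convert an incoming HTTP request (a dict) to its internal form."""
    kind = http_request["type"]
    if kind == "annotationRequest":
        return InternalAnnotationRequest(url=http_request["url"])
    if kind == "dataRequest":
        return InternalDataRequest(operation=http_request["operation"],
                                   payload=http_request.get("payload", {}))
    # updateRequest and machineLearningRequest would be handled the same way.
    raise ValueError(f"unknown request type: {kind}")


req = to_internal({"type": "annotationRequest", "url": "http://example.org"})
```

Because the mapping lives in one place, a change to the HTTP format only touches `to_internal`, while the internal request classes, and everything behind them, stay unchanged, which is exactly the modifiability argument made above.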

13.2 Application Layer

13.2.1 Inputs for the subsystem

This subsystem receives different kinds of requests coming from the Access Layer. The Application Layer must be able to process any amount of requests. Additionally, these requests must be processed efficiently and fast.

13.2.2 Architectural drivers

Scalability, Performance Since it is important to be able to process many requests, scalability is the main architectural driver for this component. This component will have to be able to dynamically adapt the number of components to be able to handle any amount of requests. Performance is another important architectural driver and is related to scalability. By dynamically adapting the number of components, it's possible to guarantee performance.

13.2.3 Architectural pattern

Chosen pattern
Publisher-Subscriber pattern The pattern we have chosen is Publisher-Subscriber. Using this pattern, we can dynamically add or delete components (subscribers) if necessary. In this way, we can ensure both scalability and performance.



Figure 13.2: The Application Layer, divided into subsystems using the publisher-subscriber pattern.

Instantiation The Application Layer (figure 13.2) consists of four components: a LoadBalancer, a MachineLearningComponent, an Engine and a DataSubscriber. In the Publisher-Subscriber pattern, the Access Layer acts as publisher, the LoadBalancer is the event channel, and the Engine, DataSubscriber and MachineLearningComponent are the subscribers. The LoadBalancer receives requests from the Access Layer and distributes them to the subscribers. The MachineLearningComponent filters out internalMachineLearningRequests, the Engine filters out internalAnnotationRequests and the DataSubscriber filters out internalDataRequests. All components in this subsystem can also receive internalUpdateRequests to patch the component.

LoadBalancer
The LoadBalancer receives requests from the Access Layer (the publisher) and sends these to the subscribers. The LoadBalancer receives information, at fixed intervals, from the different subscribers about their load, and will try to distribute the requests as evenly as possible to maximize the overall performance of the system. The LoadBalancer publishes the request and indicates within the request for which subscriber it is meant. Only the targeted subscriber will process this request.



Class name: Load Balancer <subsystem>
Responsibilities:
- Receive internalRequests from Access Layer
- Distribute internalRequests
- Send annotated page responses from the engines back to the Access Layer
- Dynamically add/remove engines as necessary
- Dynamically add/remove MachineLearningComponents as necessary
- Dynamically add/remove DataSubscribers as necessary
Collaborations:
- Access Layer.RequestModel
- Engine
- MachineLearningComponent
- DataSubscriber

MachineLearningComponent
The MachineLearningComponent subscribes to the LoadBalancer to process machineLearningRequests. This component processes a given page using machine learning to produce new rules for a certain algorithm. These rules are sent to the Persistence Layer using internalDataRequests.

Class name: Machine Learning Component <subsystem>
Responsibilities:
- Subscribe to LoadBalancer
- Receive internalRequests from LoadBalancer
- Process internalMachineLearningRequests
- Send internalDataRequests to the Persistence Layer
Collaborations:
- LoadBalancer
- Persistence Layer.DataController

Engine
The Engine also subscribes to the LoadBalancer. This component processes annotationRequests and returns annotated pages to the LoadBalancer. The Engine can also send internalDataRequests to the Persistence Layer. The data from the Persistence Layer which is relevant to the Engine is cached in the engines, to minimize the number of internalDataRequests. The Engine itself will check for updated data in the Persistence Layer at fixed intervals, using an internalDataRequest. If the data in the Persistence Layer has changed in the meanwhile, the Engine will request and cache the updated data.

Class name: Engine <subsystem>
Responsibilities:
- Subscribe to LoadBalancer
- Receive internalRequests from LoadBalancer
- Process internalAnnotationRequests
- Send internalDataRequests to the Persistence Layer
- Cache data received from the Persistence Layer
Collaborations:
- LoadBalancer
- Persistence Layer.DataController

DataSubscriber
The DataSubscriber subscribes to the LoadBalancer. It handles any type of internalDataRequest coming from the Access Layer via the LoadBalancer and sends these to the Persistence Layer.

Class name: DataSubscriber <subsystem>
Responsibilities:
- Receive internalRequests
- Process internalDataRequests
- Send internalDataRequests to the Persistence Layer
Collaborations:
- LoadBalancer
- Persistence Layer.DataController
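The publish-and-target mechanism described above can be sketched as follows: the LoadBalancer publishes every request to all subscribers, but tags each request with the subscriber it is meant for, and only that subscriber processes it. The least-loaded routing rule and all names are illustrative assumptions, not the actual load-balancing policy.

```python
# Sketch of the publisher-subscriber wiring: the LoadBalancer (event
# channel) publishes each request, tagged with its target; every
# subscriber is notified but only the targeted one processes it.

class Subscriber:
    def __init__(self, name):
        self.name = name
        self.processed = []

    def notify(self, request):
        # Only handle requests addressed to this subscriber.
        if request["target"] == self.name:
            self.processed.append(request)


class LoadBalancer:
    def __init__(self):
        self._subscribers = []

    def subscribe(self, subscriber):
        # Subscribers can be added (or removed) dynamically at runtime,
        # which is what makes the layer scalable.
        self._subscribers.append(subscriber)

    def publish(self, request):
        # Naive routing assumption: tag the request for the subscriber
        # that has processed the fewest requests so far.
        target = min(self._subscribers, key=lambda s: len(s.processed))
        request["target"] = target.name
        for sub in self._subscribers:
            sub.notify(request)


balancer = LoadBalancer()
engine_a, engine_b = Subscriber("engine-a"), Subscriber("engine-b")
balancer.subscribe(engine_a)
balancer.subscribe(engine_b)
for i in range(4):
    balancer.publish({"kind": "internalAnnotationRequest", "id": i})
```

Adding a third engine is a single `subscribe` call and requires no change to the publisher, which mirrors the dynamic add/remove responsibility of the LoadBalancer above.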



Figure 13.3: The persistence layer divided into subsystems using the MVC pattern.

13.3 Persistence Layer

13.3.1 Inputs for the system

The Persistence Layer stores all data. This needs to be done in a stable manner, to make sure all internalDataRequests are made persistent. As there can be several clusters running, this subsystem has to make sure the data is consistent between the different clusters, meaning each cluster works with the same, most recent set of data.

13.3.2 Architectural drivers

Stability, Scalability Stability is an important architectural driver, as the Persistence Layer has to be resistant to faults, corrupt data and other errors. When more than one cluster is running, the system has to remain scalable to keep the different clusters consistent. Thus, scalability is another important driver for the Persistence Layer.

13.3.3 Architectural pattern

Chosen pattern
MVC pattern To handle the complexity of scalability and stability, we chose the MVC pattern. Using this pattern, the model can be responsible for data storage, the controller for data modifications, and the view can read the data and relay it to other clusters when modifications are made.



Instantiation The Persistence Layer (figure 13.3) is divided into a DataController (Control), a Repository (Model) and a Synchronizer (View). The DataController receives the internalDataRequests, validates them and passes them to the Repository. The Repository saves the data changes, or selects and sends the requested data back, based on the type of internalDataRequest. If changes were made, the Synchronizer will notice this and notify the other clusters.

DataController
The DataController (Control) will handle all internalDataRequests and verify whether each is a valid request or not. If the internalDataRequest is valid, the DataController transforms it into the database query language and sends this query to the Repository. Validation is done in the DataController to avoid corrupt data. This cannot be checked earlier, as there are requests, coming from other clusters, that don't pass through the other layers. If changes were made to the data, the DataController will notify the Synchronizer to make sure all clusters stay consistent.

Class name: DataController <subsystem>
Responsibilities:
- Receive internalDataRequests
- Validate the internalDataRequests
- Transform the internalDataRequests into queries
- Send the queries to the database
- Notify the Synchronizer of data changes
Collaborations:
- Application Layer.DataSubscriber
- Application Layer.Engine
- Application Layer.MachineLearningComponent
- Repository
- Synchronizer

Repository
The Repository (Model) holds the actual database that makes and keeps the data persistent.

Class name: Repository <subsystem>
Responsibilities:
- Receive database queries
- Execute the queries
- Store data and keep it persistent
Collaborations:
- DataController

Synchronizer
The Synchronizer (View) is notified by the DataController if data in the database has changed. The Synchronizer is connected to the DataControllers of the other clusters and communicates these changes to them, to keep consistency between the different clusters. In this sense the Synchronizer is the view of the other clusters on this Repository. When the Synchronizer notifies the other clusters, a timer is started to make sure all clusters received the update. If the Synchronizer doesn't receive an answer from a cluster within a specified time interval, the update is sent again.

Class name: Synchronizer <subsystem>
Responsibilities:
- Receive dataChanged events
- Notify the other clusters of data changes
Collaborations:
- DataController (local and in the other clusters)
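The notify-and-resend behaviour of the Synchronizer can be sketched as below. This is an assumption-laden illustration: a real implementation would start a timer and re-send on expiry, whereas the sketch compresses that into a bounded retry loop, and the remote-cluster interface is invented for the example.

```python
# Sketch of the Synchronizer's retry behaviour: push each data change
# to every other cluster, and re-send to any cluster that did not
# acknowledge, up to a maximum number of attempts.

class RemoteCluster:
    """Stand-in for the DataController of another cluster."""
    def __init__(self, fail_times=0):
        self._fail_times = fail_times   # simulate missed notifications
        self.received = []

    def send_update(self, update):
        if self._fail_times > 0:
            self._fail_times -= 1
            return False                 # no acknowledgement
        self.received.append(update)
        return True                      # acknowledged


class Synchronizer:
    def __init__(self, clusters, max_attempts=3):
        self._clusters = clusters
        self._max_attempts = max_attempts

    def on_data_changed(self, update):
        # Called by the local DataController when data has changed.
        for cluster in self._clusters:
            for _ in range(self._max_attempts):
                if cluster.send_update(update):
                    break                # acknowledged: stop re-sending


flaky = RemoteCluster(fail_times=1)      # misses the first notification
stable = RemoteCluster()
Synchronizer([flaky, stable]).on_data_changed({"rule": "r1"})
```

After the run, both clusters hold the update exactly once: the flaky cluster received it on the second attempt, which is the behaviour the timer in the text is meant to guarantee.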



13.4 White Box Scenarios

Figure 13.4: Perform a semantic analysis at the point of cycle 2.



Figure 13.5: A user corrects the annotations of a page at the point of cycle 2.



Figure 13.6: The admin selects annotated pages for machine learning that are processed by the MachineLearningComponent at the point of cycle 2.



Figure 13.7: Login scenario at the point of cycle 2.



Figure 13.8: The admin adds (or adapts) rules to the database at the point of cycle 2.



Figure 13.9: The admin sends an update to the Application Layer at the point of cycle 2.



Figure 13.10: The engine gets data at the point of cycle 2.



Figure 13.11: Data synchronization between clusters at the point of cycle 2.


Figure 13.12: Adding a new cluster to the system at the point of cycle 2.



13.5 Deployment

Figure 13.13: The access component



The Application Layer is divided into three separate (groups of) servers. A server is provided for the main functionality of the Application Layer. Next, there is a group of servers for the MachineLearningComponents and a group of servers for the engines. A server (when running in the network), or an additional instantiation of a component on a server, can be added dynamically, at runtime, without influencing the other servers.


Chapter 14

Attribute Driven Design - Cycle III



14.1 Engine

14.1.1 Inputs for the system

Since the Engine is the subsystem responsible for the actual annotation of a given web page, it is safe to assume this subsystem will be the most CPU-intensive. Since we can receive a large amount of requests, we need to be able to perform the jobs of an engine quickly. Furthermore, it is crucial that the annotation process delivers an accurate result.

14.1.2 Architectural drivers

Performance, Accuracy, Modifiability The first driver is performance. Because of the high complexity of this component, we will have to make it as performant as possible, in order to avoid this component becoming the bottleneck for the system's performance. Another advantage is that individual requests will be handled more quickly, and hence a larger amount of requests can be handled in the same time period. The most important use case of the global system is the actual annotating of the web applications. Failure or success of the system will depend on the quality of these annotations, in other words the accuracy. In order to be able to deliver sufficient accuracy, we defined the modifiability driver. We interpret this as a way to quickly switch to a new algorithm that's better (in general) at annotating, or to choose a certain algorithm that's better for a specific type of web page.

Chosen pattern
Pipes & Filters pattern After considering these drivers, we concluded the Pipes & Filters pattern was the pattern we needed here. It allows us to split up each step of the annotation process, and it has the advantage of being adaptable and modifiable. These properties allow us to change the used algorithm easily.

Instantiation
The Engine (figure 14.1) consists of five consecutive subsystems which each handle a part of the actual annotation.

RequestFilter
The RequestFilter subsystem is the first step in the annotation pipeline. It evaluates all requests it receives (using the publisher-subscriber pattern defined in cycle II) and selects those requests which are meant for this Engine. It then passes those requests along to the next pipeline step.

Class name: RequestFilter <subsystem>
Responsibilities:
- Pick out the requests meant for this Engine
- Pass on the requests meant for this Engine
Collaborations:
- Application Layer.LoadBalancer
- InputReader

InputReader
The InputReader receives a request for annotation. This request will contain a URL to the web page that needs to be annotated. The InputReader is responsible for downloading the information at this URL. This subsystem can become quite large due to its functionality.



Figure 14.1: The engine divided into subsystems using the Pipes & Filters pattern.



Class name: InputReader <subsystem>
Responsibilities:
- Receive the annotation request
- Download the actual web page
- Forward the data
Collaborations:
- RequestFilter
- Transcoder

Transcoder
The Transcoder analyses the web page and checks if it already contains semantic data. If it can find such data and it is written in another standard/technology than the standard we chose, it is transcoded into our standard.

Class name: Transcoder <subsystem>
Responsibilities:
- Receive a web page
- Check for existing semantic data
- Transcode semantic data written in other standards to our standard
- Forward data
Collaborations:
- InputReader
- DOMBuilder

DOMBuilder
All semantic algorithms rely on the DOM structure of a web page. The DOMBuilder is the step of the pipeline that reads the structure from the web page and constructs the DOM tree in preparation of the actual algorithm.

Class name: DOMBuilder <subsystem>
Responsibilities:
- Read the DOM structure
- Build the DOM tree
- Forward data
Collaborations:
- Annotator

Annotator
The Annotator is the actual annotation algorithm. This is a separate step in the pipeline so algorithms can easily be swapped. The Annotator subsystem contains a cache that stores all rules used by the algorithm. The Annotator subsystem is responsible for requesting and storing data, because the data type depends on the algorithm used. A disadvantage of this approach is that each algorithm needs to contain logic to access the DataController in order to be able to update.

Class name: Annotator <subsystem>
Responsibilities:
- Annotate the web page using the DOM tree
- Keep rules up-to-date
Collaborations:
- DOMBuilder
- Persistence Layer.DataController
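The five pipeline steps above can be sketched as a chain of filters, each transforming the request and passing it on. The filter bodies are deliberate placeholders (the real download, transcoding, DOM construction and annotation logic is not implemented); only the wiring illustrates the Pipes & Filters pattern.

```python
# Sketch of the engine pipeline: RequestFilter -> InputReader ->
# Transcoder -> DOMBuilder -> Annotator.  All step bodies are stubs.

def request_filter(request):
    # Drop requests not meant for this engine (cf. publisher-subscriber).
    if request.get("target") != "engine":
        return None
    return request

def input_reader(request):
    # A real implementation would download the page at request["url"].
    request["html"] = "<html><body>stub page</body></html>"
    return request

def transcoder(request):
    # Would transcode existing semantic markup into our standard.
    request["transcoded"] = True
    return request

def dom_builder(request):
    # Would parse request["html"] into a DOM tree.
    request["dom"] = ("html", [("body", [])])
    return request

def annotator(request):
    # Would run the chosen annotation algorithm over the DOM tree.
    request["annotations"] = ["stub-annotation"]
    return request

PIPELINE = [request_filter, input_reader, transcoder, dom_builder, annotator]

def run_pipeline(request):
    for step in PIPELINE:
        request = step(request)
        if request is None:          # filtered out at an earlier step
            return None
    return request

result = run_pipeline({"target": "engine", "url": "http://example.org"})
```

Swapping the annotation algorithm is a one-line change, replacing `annotator` in `PIPELINE`, which is exactly the modifiability the pattern was chosen for.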

14.2 Machine Learning

14.2.1 Inputs for the system

The MachineLearningComponent must be able to analyse corrected annotations and, based on the modifications, update its knowledge by formulating new rules and changing existing rules. These rules need to be accurate to make sure the algorithm using them will perform well. Therefore it's necessary to use a good machine learning algorithm.


Figure 14.2: The machine learning component divided into subsystems using the Pipes & Filters pattern.

14.2.2 Architectural drivers

Modifiability, Accuracy The first driver is accuracy. Since the MachineLearningComponent updates the rules used by the annotation engine, it is important that the updated rules are accurate. Otherwise, the annotation engine would have reduced accuracy. Just like the algorithms in the engine, the algorithms in the machine learning component can be replaced to ensure a high accuracy. Modifiability is therefore important for this component.

Chosen pattern
Pipes & Filters pattern The analysis of an annotated page is done in different, successive steps. This is supported by the Pipes & Filters pattern. Using this pattern, it's easy to change the algorithm for machine learning by changing the element containing the actual algorithm.

Instantiation
The MachineLearningComponent (figure 14.2) consists of 3 subsystems which each handle a part of the machine learning process.


RequestFilter
The RequestFilter subsystem is the first step in the machine learning pipeline. It evaluates all requests it receives (using the publisher-subscriber pattern defined in cycle II) and filters the internalMachineLearningRequests out of them. It then passes those requests along to the next pipeline step.

Class name: RequestFilter <subsystem>
Responsibilities:
- Pick out the requests meant for this component
- Pass on the machine learning requests
Collaborations:
- LoadBalancer
- Validator

Validator
The MachineLearningComponent processes pages annotated by users. These pages can contain errors. Therefore the Validator validates the input before the actual machine learning.

Class name: Validator <subsystem>
Responsibilities:
- Validate the given pages
Collaborations:
- RequestFilter
- MachineLearner

MachineLearner
The MachineLearner performs the actual analysis of the annotated page and updates the knowledge base. This component contains the machine learning algorithm and can be replaced if the accuracy is too low.

Class name: MachineLearner <subsystem>
Responsibilities:
- Analyse annotated pages
- Update rules to the database
Collaborations:
- Validator
- Persistence Layer.DataController
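The three-step machine-learning pipeline can be sketched the same way as the engine pipeline: filter the relevant requests, validate the user-corrected pages, then derive rules from them. The rule-extraction logic is a placeholder assumption (one "rule" per corrected annotation); a real MachineLearner would run an actual learning algorithm.

```python
# Sketch of the machine-learning pipeline: RequestFilter -> Validator ->
# MachineLearner.  The learning step is a stub.

def ml_request_filter(request):
    # Keep only internalMachineLearningRequests.
    if request.get("kind") != "internalMachineLearningRequest":
        return None
    return request

def validate(request):
    # User-corrected pages can contain errors; reject pages without
    # any annotations to learn from.
    if not request.get("annotations"):
        return None
    return request

def learn_rules(request):
    # Placeholder assumption: derive one "rule" per corrected annotation.
    return [{"tag": a["tag"], "label": a["label"]}
            for a in request["annotations"]]

def process(request):
    for step in (ml_request_filter, validate):
        request = step(request)
        if request is None:          # filtered or rejected
            return []
    # In the real system these rules would be sent to the Persistence
    # Layer as internalDataRequests.
    return learn_rules(request)

rules = process({
    "kind": "internalMachineLearningRequest",
    "annotations": [{"tag": "h1", "label": "title"}],
})
```

As with the engine, replacing `learn_rules` with a different learning algorithm leaves the filter and validation steps untouched.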



14.3 White boxes

Figure 14.3: Perform a semantic analysis at the point of cycle 3.



Figure 14.4: Analyse an annotated page for machine learning at the point of cycle 3.


Figure 14.5: The engine checks for data updates at the point of cycle 3.

Chapter 15

Global system
Figure 15.1 gives an overview of the global system as the result of the Attribute Driven Design. It shows the layers of cycle I and all subsystems of cycles II and III within these layers. The interaction between subsystems within a layer and between layers is also shown. Detailed information about the subsystems and their interaction can be found in the previous chapters.



Figure 15.1: The global system and all subsystems.

Part VI

Conclusion


Throughout the report, we have developed an architecture for an automatic annotation engine, deployed as a web service. We went through several steps to accomplish this. We performed a State of the Art analysis of the background of the subject. It's safe to say the World Wide Web is evolving towards a Semantic Web, where semantic analysis will play an important role. A large variety of algorithms already exists, and a lot of them are able to perform such an automatic semantic analysis. We also don't have to worry about patents covering these algorithms. Important to note is the fact that the system would be the first engine which performs a semantic analysis completely automatically.

This was taken into account when we formulated our quality attributes and quality scenarios. Based on the main use case, perform a semantic analysis, and our choice to offer the functionality as a web service, we deemed the quality attributes modifiability, scalability and performance most important. If we want to be sure the system will be used, accuracy and completeness are also very important, but we made sure the response measures are feasible, especially because we are the first to do the annotation automatically.

The complete system consists of three layers to ensure modifiability. Furthermore, the system is able to dynamically scale the number of annotation engines to ensure scalability and performance through the use of the Publisher-Subscriber pattern. Finally, we provided the possibility to deploy multiple clusters of the same system to ensure we don't have any single point of failure. Taking all elements into account, we designed a robust system with respect to scalability and performance.


Bibliography
[1] Amaya - w3c's editor. http://www.w3.org/Amaya/.

[2] Annotea project. http://www.w3.org/2001/Annotea/.

[3] The darpa agent markup language homepage. http://www.daml.org.

[4] Del etools - elearning annotation web service. http://www.jisc.ac.uk/whatwedo/programmes/edistributed/eaws.aspx.

[5] Html microdata. http://www.w3.org/TR/microdata/.

[6] Insemtives. http://cordis.europa.eu/fetch?CALLER=ICT_UNIFIEDSRCH&ACTION=D&DOC=28&CAT=PROJ&QUERY=012bc0b1b72e:d11f:65f661f7&RCN=89483.

[7] Introducing rich snippets. http://googlewebmastercentral.blogspot.com/2009/05/introducing-rich-snippets.html.

[8] Jupiterresearch site abandonment report. http://www.akamai.com/html/about/press/releases/2006/press_110606.html.

[9] Object service architecture - web annotation service. http://www.objs.com/OSA/Annotations-Service.html.

[10] Official microformats page. http://www.microformats.org.

[11] Official rdfa page. http://www.w3.org/TR/xhtml-rdfa-primer/.

[12] Semantic web adoption. http://www.slideshare.net/guest262aaa/semantic-web-adoption.

[13] The state of linked data. http://www.readwriteweb.com/archives/the_state_of_linked_data_in_2010.php.

[14] Which future web. http://ercim-news.ercim.eu/which-future-web. [15] Wikipedia daml. http://en.wikipedia.org/wiki/DARPA_Agent_Markup_Language. [16] Wikipedia microdata. http://en.wikipedia.org/wiki/Microdata_(HTML5). [17] Wikipedia microformats. http://en.wikipedia.org/wiki/Microformat. [18] Wikipedia rdfa. http://en.wikipedia.org/wiki/RDFa. [19] Wordnet - a lexical database for english. http://wordnet.princeton.edu. [20] Ben Adida. hgrddl: Bridging microformats and rdfa. Journal of Web Semantics, 2007. [21] Deng Cai, Shipeng Yu, Ji-Rong Wen, and Wei-Ying Ma. Extracting content structure for web pages based on visual representation. 2003.

[22] Yu Chen, Wei-Ying Ma, and Hong-Jiang Zhang. Detecting web page structure for adaptive viewing on small form factor devices. 2003.
[23] Hui Na Chua, Simon David, See Leng Ng, and E Ken Pun (MY). Web server for adapted web content. http://www.google.com/patents/about?id=raCYAAAAEBAJ, Filed Mar. 8, 2004.
[24] Hui Na Chua and See Leng Ng (MY). Web content adaptation process and system. http://www.google.com/patents/about?id=rqCYAAAAEBAJ, Filed Mar. 8, 2004.
[25] Blomme D., Goeminne N., Gielen F., and De Turck F. A two layer approach for ubiquitous web application development. 2009.
[26] S. Dill, N. Eiron, D. Gibson, D. Gruhl, R. Guha, A. Jhingran, T. Kanungo, S. Rajagopalan, A. Tomkins, J. A. Tomlin, and J. Y. Zien. SemTag and Seeker: bootstrapping the semantic web via automated semantic annotation. Twelfth International World Wide Web Conference, 2003.
[27] A. Dingli, F. Ciravegna, and Y. Wilks. Automatic semantic annotation using unsupervised information extraction and integration. K-CAP 2003 Workshop on Knowledge Markup and Semantic Annotation, 2003.
[28] Kappel G., Pröll B., W. Retschitzegger, and W. Schwinger. Customization for ubiquitous web applications - a comparison of approaches. International Journal of Web Engineering Technology, 1(1):79-111, 2003.
[29] Alexander Graf. RDFa vs. microformats: a comparison of inline metadata formats in (X)HTML. DERI technical report, 2007.
[30] Evan N. Gross (US). Method for annotating web content in real time. http://v3.espacenet.com/publicationDetails/biblio?DB=EPODOC&adjacent=true&locale=en_EP&FT=D&date=20091001&CC=WO&NR=2009120775A1&KC=A1, Filed Mar. 25, 2009.
[31] S. Handschuh, S. Staab, and F. Ciravegna. Semi-automatic creation of metadata. SAAKM 2002 - Semantic Authoring, Annotation & Knowledge Markup - Preliminary Workshop Programme, 2002.
[32] Karim Heidari. The impact of semantic web on e-commerce. World Academy of Science, Engineering and Technology, 2009.
[33] Masahiro Hori, Goh Kondoh, Kouichi Ono, Shin-ichi Hirose, and Sandeep Singhal. Annotation-based web content transcoding. 2000.
[34] Rohit Khare. Microformats: the next (small) thing on the semantic web? 2006.
[35] P. Kogut and W. Holmes. Applying information extraction to generate DAML annotations from web pages. First International Conference on Knowledge Capture, 2001.
[36] Milos Kovacevic, Michelangelo Diligenti, Marco Gori, and Veljko Milutinovic. Recognition of common areas in a web page using visual information: a possible application in a page classification. 2009.
[37] Timo Laakko and Tapio Hiltunen. Adapting web content to mobile user agents. 2005.
[38] Markus Stolze. Web page annotation systems. http://v3.espacenet.com/publicationDetails/biblio?DB=EPODOC&adjacent=true&FT=D&date=20040715&CC=US&NR=2004138946A1&KC=A1, Filed Jun. 15, 2004.
[39] D. Maynard. Multi-source and multilingual information extraction. Expert Update.

[40] Saikat Mukherjee, Guizhen Yang, and I. V. Ramakrishnan. Automatic annotation of content-rich HTML documents: structural and semantic analysis. 2003.
[41] F. Nah. Americas Conference on Information Systems (AMCIS), 2003.
[42] Ariel Pashtan, Shriram Kollipara, and Michael Pearce. Adapting content for wireless web services. 2003.
[43] Soila Pertet and Priya Narasimhan. Causes of failure in web applications. Parallel Data Laboratory, Carnegie Mellon University, Pittsburgh, PA 15213-3890, 2005.
[44] B. Popov, A. Kiryakov, A. Kirilov, D. Manov, D. Ognyanoff, and M. Goranov. KIM - semantic annotation platform. 2nd International Semantic Web Conference (ISWC2003), pages 834-849, 2003.
[45] Lawrence Reeve and Hyoil Han. Survey of semantic annotation platforms. 2005.
[46] P. Selvidge. Examining tolerance for online delays. Usability News, 2003.
[47] Hee Sook Shin, Dong Woo Lee, Pyeong Soo Mah, Bum Ho Kim, Soo Sun Cho, Dong Won Han, and Eunjeong Choi (KR). Web content transcoding system and method for small display device. http://www.google.com/patents/about?id=Rp2ZAAAAEBAJ, Filed Oct. 31, 2003.
[48] Ashmeet S. Sidana. Web page annotating and processing. http://www.google.com/patents/about?id=IVkMAAAAEBAJ&dq=annotating+web+pages, Filed May 27, 2003.
[49] David Taniar and Johanna Wenny Rahayu. Web semantics and ontology. Idea Group Inc (IGI), 2006.
[50] Rulespace Inc. (US). Method for scanning, analyzing and rating digital information content. http://www.freepatentsonline.com/6266664.html, Filed Oct. 1, 1998.
[51] M. Vargas-Vera, E. Motta, J. Domingue, M. Lanzoni, A. Stutt, and F. Ciravegna. Ontology driven semi-automatic and automatic support for semantic markup. The 13th International Conference on Knowledge Engineering and Management (EKAW), pages 379-391, 2002.
[52] Yudong Yang and HongJiang Zhang. HTML page analysis based on visual cues. 2001.
