Authorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY TIRUCHIRAPALLI. Downloaded on July 23,2010 at 11:35:59 UTC from IEEE Xplore. Restrictions apply.
deployment so that the engine can run on the computing cloud in a distributed and parallel manner. Then a distributed web page block management scheme based on cloud computing is provided, and a specific algorithm is designed to assign engine processing tasks and coordinate work among the computing clouds. Moreover, a prototype system and a set of evaluation experiments have been implemented.

The rest of the paper is organized as follows. Section 2 presents the software framework of our system. We describe the key component modules of the system in Section 3. A prototype system is implemented and verified by a set of experiments in Section 4. Finally, we conclude the approach and indicate our future work in Section 5.

2. Framework

The original intention of the system design is to use the huge computing ability and storage capacity of cloud computing infrastructure to carry the web page adaptation engine. Figure 1 shows the framework of the whole system. The cloud portal is an interface between wireless end users and the original websites. Here, it represents the new service for wireless web access. From the system framework, we can conclude that a distributed web page adaptation engine and distributed web page block management based on cloud computing must be designed so that the engine can run on the computing cloud in a distributed and parallel manner.

Figure 1 System framework (diagram labels: cloud portal, computing clouds, Amazon EC2, Internet, website)

3. Component modules

3.1. Distributed web page adaptation engine

The distributed web page adaptation engine is composed of two modules: the Structure Processor and the Iterative Page Blocker. The structure processor parses a normal web page and creates a DOM tree as output. The iterative page blocker splits the web page into small blocks so that the refinement performed by the adaptation engine can run on the computing cloud in a distributed and parallel manner.

(1) Structure processor

The structure processor parses a normal web page according to the HTML standard and fills in missing HTML tags such as <li>, <hi> and so on. Then the module constructs the DOM tree corresponding to the page according to the DOM standard. The content and structure of the web page are reflected by the DOM tree, which includes the elements that build up the page (element name, element content and element attributes) as well as the relationships between those elements. Figure 2 shows an example of a DOM tree and its corresponding HTML document. The DOM tree of a structured HTML document is easier to access and operate on than the raw web page stream data.

Figure 2 DOM tree corresponding to HTML document (node labels: ROOT, HEAD, BODY, TITLE, META, TABLE, TR, TD, BR, DATA)

(2) Iterative page blocker

The main tasks of the iterative page blocker are to design a page blocking algorithm and to make that algorithm deployable on the cloud computing model in a distributed and parallel manner. How to split a web page into small blocks in a reasonable and iterative way is therefore the key issue for this module.

Up to now, because of the limited intelligence available from semantic technology in practice, most current studies on web page adaptation are limited to special pages, special websites, or special page formats. These methods include structured HTML analysis [8], natural language processing [9], machine learning [10] and ontology [11]. However, these approaches are too specific to be applied broadly. Because user surfing behavior is random and the response time requirement is strict, these systems usually fail when wireless devices access web pages from a new domain or with a new structure.

Figure 3 Layered iterative relationship at vertical viewpoint (diagram labels: index.html, first layer, second layer, hyperlink, web page)

To design an iterative page blocker, we first analyze page structure. In both the vertical and the horizontal direction, there is a hierarchical, iterative relationship between pages and between blocks inside a web page.

From the vertical viewpoint, we can organize all the web pages of a website into a layered iterative relationship by using
the theory of layered networks as a reference. As shown in Figure 3, the page in the first layer is usually index.html, and this page is parsed into many blocks by the page blocker we proposed in [3]. Each block contains many hyperlinks that link to pages in the next lower layer. The process repeats until the current block contains only content information without any links, or until the hyperlinks that the current block contains link back to a page in an upper layer.

From the horizontal viewpoint, a similar situation holds for the different blocks of a web page. Figure 4 shows a normal web page that is split into three blocks in the first step by the page blocker mentioned above. If we refine the accuracy of the page blocker, some blocks will be split further.
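
The vertical layering can be illustrated with a minimal sketch. It assumes a hypothetical in-memory site model (a map from page URL to the blocks of that page, each block being a list of hyperlinks); the names LayeredSite and assignLayers are illustrative and are not taken from the prototype. Starting from index.html, each page receives the layer at which it is first reached, and a link that points back to a page already placed in a layer is not followed again, which mirrors the stopping conditions stated in the text.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

final class LayeredSite {
    // Hypothetical site model: page URL -> blocks, each block being the list of URLs it links to.
    static Map<String, Integer> assignLayers(Map<String, List<List<String>>> site, String root) {
        Map<String, Integer> layer = new LinkedHashMap<>();
        Queue<String> pending = new ArrayDeque<>();
        layer.put(root, 1);                       // the first layer is the entry page
        pending.add(root);
        while (!pending.isEmpty()) {
            String page = pending.remove();
            int nextLayer = layer.get(page) + 1;
            // A page with no known blocks contributes nothing (content-only case).
            for (List<String> block : site.getOrDefault(page, List.of())) {
                for (String link : block) {
                    // Links back to a page that already has a layer are not followed.
                    if (!layer.containsKey(link)) {
                        layer.put(link, nextLayer);
                        pending.add(link);
                    }
                }
            }
        }
        return layer;
    }

    public static void main(String[] args) {
        Map<String, List<List<String>>> site = Map.of(
            "index.html", List.of(List.of("news.html", "about.html"), List.of("contact.html")),
            "news.html", List.of(List.of("index.html", "story.html")));
        // prints {index.html=1, news.html=2, about.html=2, contact.html=2, story.html=3}
        System.out.println(assignLayers(site, "index.html"));
    }
}
```

A breadth-first traversal is used here only because it assigns each page the shallowest layer at which it appears; the text does not state which traversal order the page blocker itself uses.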
Figure 4 Layered iterative relationship at horizontal viewpoint (http://www.cqupt.edu.cn/english/index.php)
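
The horizontal refinement in Figure 4 can be sketched as a recursive split with a depth limit. This is only an illustration under stated assumptions: the Block type, the refine method and the dotted identifiers such as "2.1" are hypothetical, and the actual BlockID format of the system is not given in this part of the paper. A larger depth limit stands in for a finer page blocker accuracy, so a block that has sub-blocks is split further while a block without sub-blocks is kept whole.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BlockRefiner {
    /** Hypothetical block: a label plus the sub-blocks it can be split into. */
    static final class Block {
        final String label;
        final List<Block> children = new ArrayList<>();

        Block(String label, Block... subBlocks) {
            this.label = label;
            this.children.addAll(List.of(subBlocks));
        }
    }

    /** Splits the page down to maxDepth levels and returns id -> label of the resulting blocks. */
    static Map<String, String> refine(Block page, int maxDepth) {
        Map<String, String> result = new LinkedHashMap<>();
        split(page.children, "", 1, maxDepth, result);
        return result;
    }

    private static void split(List<Block> blocks, String prefix, int depth,
                              int maxDepth, Map<String, String> result) {
        for (int i = 0; i < blocks.size(); i++) {
            Block block = blocks.get(i);
            // Illustrative hierarchical id: the position at each level, joined by dots.
            String id = prefix.isEmpty() ? String.valueOf(i + 1) : prefix + "." + (i + 1);
            if (depth < maxDepth && !block.children.isEmpty()) {
                split(block.children, id, depth + 1, maxDepth, result); // split further
            } else {
                result.put(id, block.label);                            // keep as one block
            }
        }
    }
}
```

For a page made of a header, a body that contains a navigation area and a content area, and a footer, refine(page, 1) yields the three first-step blocks 1, 2 and 3, while refine(page, 2) replaces block 2 with 2.1 and 2.2.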
HTMLcode is the HTML source code of each block, which is created by the iterative page blocker.

(2) Page blocks mapping

This issue focuses on how to get the HTML source code of a target page block in the distributed cloud computing infrastructure by using the Blocks Table defined above. A distributed mapping algorithm is designed that uses a Distributed Hash Table together with the layered iterative relationship encoded when BlockID is constructed. The pseudocode for the function is as follows:

distributedMappingFunc(String BlockID): return HTMLcode
    // The block is in the Blocks Table: return its HTML source directly
    if (BlocksTable.contain(BlockID))
        return BlocksTable.getHTMLcode(BlockID)
    // Otherwise take the longest sub-string of BlockID known to the table
    // and forward the request to the URL recorded for that block
    String newBlockID = BlocksTable.maxSubString(BlockID)
    String blockURL = BlocksTable.getBlockURL(newBlockID)
    request.forward(blockURL)

4. Implementation and evaluation

4.1. Implementation of prototype

The system was implemented based on the above design. We implemented the prototype system entirely in Java. Our scheme runs in the Amazon EC2 cloud. Amazon promises that each server is equivalent to a system with a 1.7 GHz x86 processor, 1.75 GB of RAM, 160 GB of local disk, and 250 Mb/s of network bandwidth. To deploy the scheme we use a cloud tool named EC2Deploy [12], which is a set of tools for deploying, managing and testing Java EE applications on Amazon EC2. An open source Java project named HTMLParser [13] parses semi-structured HTML pages, fills in missing tags and creates the DOM tree. In addition, it provides a powerful Java library that supports various kinds of analysis and operations on the DOM document. The SUN Java ME Wireless Toolkit 2.5.2 simulator is used to simulate a handheld terminal device. Bandwidth is set at 30 Kbps to simulate the current access bandwidth of China Mobile GPRS. Figure 5 shows the processing result corresponding to part of Figure 4.

Figure 5 Snapshot of prototype system

4.2. System evaluation

… respectively. We also compare experimental data under the three different schemes. For comparison, we call the system in [3] the traditional system (TS) and the system in [4] the P2P system (PS), whereas the system described in this paper is referred to as the cloud computing system (C2S).

For efficiency evaluation, 3000 web pages from 10 typical portal websites such as CQUPT, sina and yahoo are chosen to build the efficiency evaluation test set. The total file size is 51600 kB, or 17.2 kB per file. During data collection, the collection system uses a thread pool model with the pool set to 50 threads. System efficiency is mainly determined by system processing speed. Table 1 shows the test results, from which we conclude that the efficiency of the cloud computing system can meet the demand of swiftly processing a large number of online web pages. The efficiency of C2S is higher than that of TS and PS because C2S makes full use of the huge computing ability and storage resources of cloud computing infrastructure.

Table 1 System efficiency evaluation
Deployment scheme   Execute time (s)   Speed (files/s)   Speed (kB/s)
TS                  18.93              158.48            2725.83
PS                  23.83              125.89            2165.34
C2S                 13.47              222.72            3830.73

Table 2 System effectiveness evaluation
Test items                     TS       PS       C2S
Response time                  86.47%   92.12%   94.76%
Distinguish page block         99.03%   88.32%   87.29%
Page block identify            94.54%   90.81%   90.32%
Page block merge               87.30%   88.34%   85.61%
Content block title identify   81.21%   --       --

For the effectiveness of the system, especially the accuracy of the distributed web page adaptation engine, it is hard to give a formal evaluation conclusion. The evaluation of system effectiveness is also subjective. We therefore use general accuracy as the evaluation standard and manual evaluation as the practical method. Thirty students are invited to complete a questionnaire with the five test items in Table 2. All the participants are active web users in their daily lives. First, we give the participants two days to become familiar with operating the three systems; after that, each participant is asked to complete the questionnaire. We randomly select 100 web pages and have them manually evaluated, giving a result of “successful” or “unsuccessful” for each performance index shown in Table 2. We can see from the table that the effect of each system is acceptable and each performance index is above 80%.
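
The speed columns of Table 1 follow directly from the test set size and the execute times, which the short check below reproduces. The class and method names are illustrative only; the inputs (3000 files, 51600 kB, and the three execute times) are the figures reported in the paper.

```java
import java.util.Locale;

final class ThroughputCheck {
    // Files processed per second for a run of the given duration.
    static double filesPerSecond(int files, double seconds) {
        return files / seconds;
    }

    // Kilobytes processed per second for a run of the given duration.
    static double kilobytesPerSecond(double totalKb, double seconds) {
        return totalKb / seconds;
    }

    public static void main(String[] args) {
        int files = 3000;            // size of the efficiency test set
        double totalKb = 51600.0;    // total size of the test set in kB
        String[] scheme = {"TS", "PS", "C2S"};
        double[] seconds = {18.93, 23.83, 13.47};  // execute times from Table 1
        for (int i = 0; i < scheme.length; i++) {
            System.out.println(String.format(Locale.ROOT, "%-4s %8.2f files/s %9.2f kB/s",
                scheme[i],
                filesPerSecond(files, seconds[i]),
                kilobytesPerSecond(totalKb, seconds[i])));
        }
    }
}
```

The same execute times imply that C2S finishes the test set roughly 1.4 times as fast as TS (18.93 / 13.47) and roughly 1.8 times as fast as PS (23.83 / 13.47).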
Figure 6 Comparing customer satisfaction with three different modes

In particular, customer satisfaction is tested and compared across the three different modes. We simulate different numbers of wireless terminal devices, from 20 to 200. Twenty students are invited to complete the questionnaire, expressing their satisfaction with speed, GUI and their overall impression of the three modes. From Figure 6, we can conclude that the proxy mode becomes less efficient as more wireless terminals appear. In contrast, the P2P collaborative mode is less efficient when few terminals exist. Thanks to the huge computing ability and storage resources of cloud computing infrastructure, the cloud computing mode maintains high customer satisfaction no matter how many wireless terminals exist.

5. Conclusions and future work

With the evolution of wireless telecommunication technology and information technology, it becomes more important to provide wireless users with a swift, convenient, economical, personalized and visually appealing way of web surfing. A new wireless web access mode based on cloud computing infrastructure is proposed in this paper. Compared with existing modes, the scheme makes full use of the huge computing ability and storage resources of cloud computing infrastructure and, as the experiments show, provides a new wireless web access service that is swifter and offers a better user experience.

In the next step, we will focus on optimizing the web page adaptation algorithm to deal with web pages of complex structure and on strengthening the security of the cloud computing model.

References

[1] … Universal Access in the Information Society Journal. Vol. 6, No. 3, 2007, pp.259-271.
[2] Zhigang Hua, Xing Xie, Hao Liu, Hanqing Lu and WeiYing Ma. “Design and Performance Studies of an Adaptive Scheme for Serving Dynamic Web Content in a Mobile Computing Environment”. IEEE Transactions on Mobile Computing. Vol. 5, No. 12, December 2006, pp.1650-1662.
[3] Yunpeng Xiao, Yang Tao and Qian Li. “Web page adaptation for mobile device”. In Fourth IEEE International Conference on Wireless Communications, Networking and Mobile Computing. October 2008, to be published.
[4] Yunpeng Xiao, Yang Tao, Wenji Li and Qian Li. “An Effective P2P Collaborative Mode for Mobile Web Surfing”. In 2nd IFIP International Workshop on Advanced Topics in Network Computing, Technology, and Application. October 2008, to be published.
[5] Huan Liu, Dan Orban. “GridBatch: Cloud Computing for Large-Scale Data-Intensive Batch Applications”. In Eighth IEEE International Symposium on Cluster Computing and the Grid. Lyon, France, May 2008, pp.295-305.
[6] Christopher Moretti, Jared Bulosan, Douglas Thain, and Patrick J. Flynn. “All-Pairs: An Abstraction for Data-Intensive Cloud Computing”. In IEEE International Symposium on Parallel and Distributed Processing, 2008. Miami, FL, USA, April 2008, pp.1-11.
[7] Amazon Elastic Compute Cloud. Amazon.com Inc., http://aws.amazon.com/ec2.
[8] Robert Baumgartner, Sergio Flesca and Georg Gottlob. “Visual Web Information Extraction with Lixto”. In Proceedings of the 27th International Conference on Very Large Data Bases, September 11-14, 2001, pp.119-128.
[9] Dayne Freitag. “Machine Learning for Information Extraction in Informal Domains”. Machine Learning, Vol.39, No.2-3, 2000, pp.169-202.
[10] Robert B. Doorenbos, Oren Etzioni and Daniel S. Weld. “A scalable comparison-shopping agent for the World Wide Web”. In Proceedings of the first international conference on Autonomous agents, February 05-08, 1997, Marina del Rey, California, United States, pp.39-48.
[11] DW Embley, DM Campbell, YS Jiang, SW Liddle, DW Lonsdale, et al. “Conceptual-model-based data extraction from multiple-record Web page”. Data and Knowledge Engineering, Vol.31, No.3, 1999, pp.227-251.
[12] Cloudtools, Tools for leveraging Amazon EC2, Google Inc., http://code.google.com/p/cloudtools.
[13] HTMLParser project, Derrick Oswald, SourceForge Inc., http://sourceforge.net/projects/htmlparser.