
ABOUT KLCE, THE HOST INSTITUTE

Koneru Lakshmaiah College of Engineering (KLCE) was established in the year 1980 with four branches of Engineering under the aegis of the Koneru Lakshmaiah Education Foundation and is affiliated to Acharya Nagarjuna University. The College is located on a 44-acre site abutting the Buckingham Canal, about 8 km from Vijayawada in Guntur District. Situated in the rural setting of the lush, green and fertile fields of the Krishna delta, the college is a virtual haven of rural quiet and idyllic beauty. The serene green fields, the Krishna water flowing in the Buckingham Canal, and the pristine silence of the trees make it an ideal place for scholastic pursuits. The campus has been aptly named Green Fields. The college has an administrative office in the heart of Vijayawada city. Presently the college runs 9 UG programs in Engineering and 6 PG programs in Engineering and Management. The National Board of Accreditation (NBA) granted accreditation to all 9 UG programs from the academic year 2007-08. KLCE is also accredited by NAAC under the new methodology with an Institutional CGPA of 3.76 and an 'A' grade. AICTE has approved all the existing UG and PG programs, and the programs have been given permanent affiliation by Acharya Nagarjuna University. KLCE is certified under ISO 9001:2000. Good infrastructure facilities are provided for the teaching-learning process, and the whole campus is on a Wi-Fi enabled network. The college has achieved 100% placement of registered students since the academic year 2005-06. R&D activity on the campus is supported by enhanced budgetary provisions, and KLCE has been granted Research Centre status by Acharya Nagarjuna University. KLCE has MoUs with leading industries, technical institutions and universities at the state, national and international levels.
ABOUT COMPUTER SOCIETY OF INDIA (CSI)
The Computer Society of India is the largest association of computer professionals in India and is committed to the advancement of the theory and practice of Computer Science, Computer Engineering and Technology, Systems Science and Engineering, Information Processing and related arts and sciences. The objectives of CSI include:
To promote the interchange of information in informatics-related disciplines among specialists, and between specialists and the public.
To assist professionals in maintaining the integrity and competence of the profession.
To foster a sense of partnership amongst the professionals engaged in these fields.
ABOUT THE SCHOOL OF COMPUTING AT KLCE
The School of Computing, inaugurated on 07-07-07, is the leading academic and research school in Koneru Lakshmaiah College of Engineering. It has emerged after sustained growth over the past 20 years across four departments and has 90 faculty members with good academic and research experience. The School has the following departments as its constituent branches:
Computer Science and Engineering (CSE)
Electronics and Computer Engineering (ECM)
Information Science and Technology (IST)
Master of Computer Applications (MCA)
We operate on the clear recognition that computer science fundamentals must play a critical role in many emerging technologies, and that software knowledge continues to be key to the IT industry in general. This operating philosophy drives our approach to both teaching and research. Through its constituent departments, the School offers three undergraduate programs (Computer Science & Engineering, Information Science & Technology and Electronics & Computer Engineering), a graduate program in Computer Science & Engineering, and the Master of Computer Applications. The graduate programs are research oriented, thriving on the broad range of active research conducted by faculty members in the School. Key strengths include computer algorithms, software engineering, embedded systems, parallel computing, computer networks, theoretical computer science, and data mining and warehousing. With 25 research faculty, over 250 graduate students and a good number of undergraduate students, the School fosters a dynamic research environment.
Proceedings of the
International Conference on
Web Sciences
(ICWS-2009)
10–11 January, 2009

Editors
L.S.S. Reddy
P. Thrimurthy
K. Rajasekhara Rao
H.R. Mohan
Kodanada Rama Sastry J.

Co-editors
K. Thirupathi Rao
M. Vishnuvardhan

Organised by
School of Computing
Koneru Lakshmaiah College of Engineering
Green Fields Vaddeswaram, Guntur-522502, Andhra Pradesh, India

In Association with
CSI Koneru Chapter and Division II of CSI

ISBN 81-907839-9-8
Rs. 1800


Proceedings of the
International Conference on
Web Sciences
(ICWS-2009)
















www.excelpublish.com



Proceedings of the
International Conference on
Web Sciences
(ICWS-2009)

(10–11 January, 2009)

Editors
L.S.S. Reddy P. Thrimurthy
H.R. Mohan K. Rajasekhara Rao
Kodanada Rama Sastry J.
Co-editors
K. Thirupathi Rao M. Vishnuvardhan

Organised by

School of Computing
Koneru Lakshmaiah College of Engineering
Green Fields Vaddeswaram, Guntur-522502
Andhra Pradesh, India
In Association with

CSI Koneru Chapter & Division II of CSI


EXCEL INDIA PUBLISHERS
New Delhi
First Impression: 2009
K.L. College of Engineering, Andhra Pradesh
Proceedings of the International Conference on Web Sciences (ICWS-2009)
ISBN: 978-81-907839-9-6
No part of this publication may be reproduced or transmitted in any form by any means,
electronic or mechanical, including photocopy, recording, or any information storage and
retrieval system, without permission in writing from the copyright owners.
DISCLAIMER
The authors are solely responsible for the contents of the papers compiled in this volume. The
publishers or editors do not take any responsibility for the same in any manner.
Published by
EXCEL INDIA PUBLISHERS
61/28, Dalpat Singh Building, Pratik Market, Munirka, New Delhi-110067
Tel: +91-11-2671 1755/ 2755/ 5755 Fax: +91-11-2671 6755
E-mail: publishing@excelpublish.com
Website: www.excelpublish.com
Typeset by
Excel Publishing Services, New Delhi - 110067
E-mail: prepress@excelpublish.com
Printed by
Excel Seminar Solutions, New Delhi - 110067
E-mail: seminarsolutions@excelpublish.com
Preface
The term "Web Sciences" was coined by the internationally reputed Professor P. Thrimurthy, presently the Chief Advisor of the ANU College of Engineering & Technology at Acharya Nagarjuna University, Guntur, India, and Shri H.R. Mohan, Chairman, Division II, CSI. The Web is the interconnection of various networks spread across the world, primarily known for hosting and providing access to information of various forms. The younger generation now cannot think of doing anything without the Internet and the Web; the Web has become synonymous with information management over the Internet, and it has made the world a global village. Today no field exists (Engineering, Physics, Zoology, Biology, etc.) that does not use the Internet as its backbone. The Internet is used for many purposes, including research, consultancy, academics and business, and has become the most important medium for communication.
The Internet is used predominantly by information managers, and it is high time that all professionals are made aware of how important it is to use the Internet and the Web in their day-to-day endeavours. It is in this pursuit that Dr. Thrimurthy and Shri H.R. Mohan advocated and advised the conduct of an international conference on Web Sciences.
Koneru Lakshmaiah College of Engineering (KLCE), known for its quality of education, research and consultancy and offering diversified courses in the fields of Engineering, Humanities and Management, took up the idea and decided to host the conference under the aegis of the Computer Society of India, one of the oldest premier professional societies of India. KLCE was formed by the Koneru Lakshmaiah Education Foundation (KLEF). KLCE is an autonomous engineering college and is poised to become a Deemed University (K.L. University) soon.
KLCE is one of the premier institutes of India, established in the year 1980. It has state-of-the-art infrastructure and maintains very high standards in imparting technical and management courses. KLCE is known all over the world through its alumni and through various tie-ups with industry and other peer organizations situated in India and abroad. The driving force behind the success of KLCE is Mr. Koneru Satyanarayana, Chairman of KLEF. KLCE is situated adjacent to the Buckingham Canal at Green Fields, Vaddeswaram, Guntur District, Andhra Pradesh, India, PIN 522502 (near Vijayawada).
The Computer Society of India (CSI) is a premier professional body of India committed to the advancement of the theory and practice of Computer Science, Computer Engineering and Computer Technology. CSI has been tirelessly helping India in promoting computer literacy, national policy and business.
The main aim of the International Conference on Web Sciences is to bring together industry, academia, scientists, sociologists, entrepreneurs and decision makers from around the world. This initiative to combine Engineering, Management and the Social Sciences should lead to the creation and use of new knowledge for the benefit of society at large.
The conference is being conducted, in association with the Computer Society of India, by the School of Computing of KLCE, which comprises four departments: Computer Science and Engineering, Information Science and Technology, Electronics and Computer Engineering, and Master of Computer Applications. The School is strategically positioned to conduct this international conference in terms of its state-of-the-art infrastructure and eminent faculty drawn from academics, industry and research organizations. The School publishes a half-yearly journal, the International Journal of Systems and Technologies (IJST, ISSN 0974-2107), which publishes papers submitted by scholars from all over the world following an international adjudication system.
The conference is planned to deliver knowledge through keynote addresses, invited talks and paper presentations, and the proceedings of the conference are being brought out as a separate publication. Some of the well-received papers delivered at the conference shall be published in IJST. The published papers shall also be hosted at http://www.klce.ac.in.
We anticipate that excellent knowledge will emanate from the discussions, forums and conclusions on the future course of developments in making use of the Web by all disciplines of Engineering, Technology and the Sciences.
We would like to place on record our sincere thanks to the Chairman of KLCE, Sri Koneru Satyanarayana, and to Prof. L.S.S. Reddy, the Principal of KLCE, for their continuous help and encouragement and for making available all the infrastructural facilities needed for organizing an international conference of this magnitude. We thank all the national and international members who served on the technical and organizing committees. We also thank the management, faculty, staff and students of KLCE for their excellent support and cooperation. The faculty and staff of the School of Computing deserve special appreciation for making the conference a grand success and for all their efforts in bringing out the proceedings of the conference on time, with high quality standards.

January 2009

Dr. K. Rajasekhara Rao
Organising Committee
Chief Patron
K. Satyanaryana
Chairman, KLCE

Correspondent and Secretary
K. Siva Kanchana Latha
KLCE

Director
K. Lakshman Havish
KLCE

Patron
L.S.S. Reddy
Principal, KLCE

Convener
Dr. K. Rajasekhara Rao
Vice-Principal, KLCE
CONFERENCE ADVISORY COMMITTEE
Hara Gopal Reddy ANU-Guntur
K. K. Aggarwal President CSI
S. Mahalingam Vice President CSI
Raj Kumar Gupta Secretary CSI
Saurabh H. Sonawala Treasurer CSI
Lalit Sawhney Immd. Past President
M. P. Goel Region I Vice President CSI
Rabindra Nath Lahiri Region II Vice President CSI
S. G. Shah Region III Vice President CSI
Sanjay Mohapatra Region IV Vice President CSI
Sudha Raju Region V Vice President CSI
V. L. Mehta Region VI Vice President CSI
S. Arumugam Region VII Vice President CSI
S. V. Raghavan Region VIII (Intrnl) Vice President CSI
Dr. Swarnalatha R.Rao Division-I Chair Person, CSI
H R Mohan Division-II Chair Person, CSI
Deepak Shikarpur Division-III Chair Person, CSI
C. R. Chakravarthy Division-IV Chair Person, CSI
H. R. Vishwakarma Division-V Chair Person, CSI
P. R. Rangaswami Chairman, Nomination Committee, CSI
Satish Doshi Member, Nomination Committee, CSI
Shivraj Kumar (Retd.) Member, Nomination Committee, CSI
COLLEGE ADVISORY COMMITTEE
P. Srinivasa Kumar OSD, KLCE
Y. Purandar Dean IRP
G. Rama Krishna KLCE
C. Naga Raju KLCE
K. Balaji HOD ECM
V. Srikanth HOD IST
N. Venkatram HOD ECM
V. Chandra Prakash IST
M. Seeta Ram Prasad CSE
K. Thirupathi Rao CSE
Conference Chair
P. Thrimurthy ANU, Guntur
Conference Co-Chair
J.K.R. Sastry KLCE
Technical Programme Chair
H.R. Mohan Chairman, Division II
Technical Programme Committee
Allam Appa Rao VC, JNTU, Kakinada
M.Chandwani Indore
B. Yagna Narayana IIIT Hyderabad
S.N. Patel USA
R.V. Raja Kumar IIT Kharagpur
Wimpie Van den Berg South Africa
K. Suzuki Japan
Trevol Moulden USA
Ignatius John Canada
Ranga Vemuri USA
N.N. Jani India
Yasa Karuna Ratne SriLanka
Prasanna Sri Lanka
Sukumar Nandi IIT Guwahati
Viswanath Nandyal

Contents
Preface v
Committees vii

Session-I: Web Technology

Semantic Extension of Syntactic Table Data
V. Kiran Kumar and K. Rajasekhara Rao 3
e-Learning Portals: A Semantic Web Services Approach
Balasubramanian V., David K. and Kumaravelan G. 8
An Efficient Architectural Framework of a Tool for Undertaking Comprehensive
Testing of Embedded Systems
V. Chandra Prakash, J.K.R. Sastry, K. Rajasekhara Rao and J. Sasi Bhanu 13
Managed Access Point Solution
Radhika P. 21
Autonomic Web Process for Customer Loan Acquiring Process
V.M.K. Hari, G. Srinivas, T. Siddartha Varma and Rukmini Ravali Kota 29
Performance Evaluation of Traditional Focused Crawler and Accelerated
Focused Crawler
N.V.G. Sirisha Gadiraju and G.V. Padma Raju 39
A Semantic Web Approach for Improving Ranking Model of Web Documents
Kumar Saurabh Bisht and Sanjay Chaudhary 46
Crawl Only Dissimilar Pages: A Novel and Effective Approach for Crawler
Resource Utilization
Monika Mangla 52
Enhanced Web Service Crawler Engine (A Web Crawler that Discovers
Web Services Published on Internet)
Vandan Tewari, Inderjeet Singh, Nipur Garg and Preeti Soni 57
Session-II: Data Warehouse Mining

Web Intelligence: Applying Web Usage Mining
Techniques to Discover Potential Browsing Problems of Users
D. Vasumathi, A. Govardhan and K. Suresh 67
Fuzzy Classification to Discover On-line User Preferences Using
Web Usage Mining
Dharmendra T. Patel and Amit D. Kothari 71
Data Obscuration in Privacy Preserving Data Mining
Anuradha T., Suman M. and Arunakumari D. 76
Mining Full Text Documents by Combining Classification
and Clustering Approaches
Y. Ramu 83
Discovery of Semantic Web Using Web Mining
K. Suresh, P. Srinivas Rao and D. Vasumathi 90
Performance Evolution of Memory Mapped Files on Dual Core Processors
Using Large Data Mining Data Sets
S.N. Tirumala Rao, E.V. Prasad, N.B. Venkateswarlu and G. Sambasiva Rao 101
Steganography Based Embedded System Used for Bank Locker System:
A Security Approach
J.R. Surywanshi and K.N. Hande 109
Audio Data Mining Using Multi-Perceptron Artificial Neural Network
A.R. Ebhendra Pagoti, Mohammed Abdul Khaliq and Praveen Dasari 117
A Practical Approach for Mining Data Regions from Web Pages
K. Sudheer Reddy, G.P.S. Varma and P. Ashok Reddy 125
Session-III: Computer Networks

On the Optimality of WLAN Location Determination Systems
T.V. Sai Krishna and T. Sudha Rani 139
Multi-Objective QoS Based Routing Algorithm for Mobile Ad-hoc Networks
Shanti Priyadarshini Jonna and Ganesh Soma 148
A Neural Network Based Router
D.N. Mallikarjuna Rao and V. Kamakshi Prasad 156
Spam Filter Design using HC, SA, TA Feature Selection Methods
M. Srinivas, Supreethi K.P. and E.V. Prasad 161
Analysis & Design of a New Symmetric Key Cryptography Algorithm
and Comparison with RSA
Sadeque Imam Shaikh 168
An Adaptive Multipath Source Routing Protocol for Congestion Control
and Load Balancing in MANET
Rambabu Yerajana and A.K. Sarje 174
Spam Filtering using Statistical Bayesian Intelligence Technique
Lalji Prasad, Rashmi Yadav and Vidhya Samand 180
Ensure Security on Untrusted Platform for Web Applications
Surendrababu K. and Surendra Gupta 186
A Novel Approach for Routing Misbehavior Detection in MANETs
Shyam Sunder Reddy K. and C. Shoba Bindu 195
Multi Layer Security Approach for Defense Against MITM
(Man-in-the-Middle) Attack
K.V.S.N. Rama Rao, Shubham Roy Choudhury, Manas Ranjan Patra
and Moiaz Jiwani 203
Video Streaming Over Bluetooth
M. Siddique Khan, Rehan Ahmad, Tauseef Ahmad and Mohammed A. Qadeer 209
Role of SNA in Exploring and Classifying Communities within B-Schools
through Case Study
Dhanya Pramod, Krishnan R. and Manisha Somavanshi 216
Smart Medium Access Control (SMAC) Protocol for Mobile Ad Hoc Networks
Using Directional Antennas
P. Sai Kiran 226
Implementation of TCP Peach Protocol in Wireless Network
Rajeshwari, S. Patil Satyanarayan and K. Padaganur 234
A Polynomial Perceptron Network for Adaptive Channel Equalisation
Gunamani Jena, R. Baliarsingh and G.M.V. Prasad 239
Implementation of Packet Sniffer for Traffic Analysis and Monitoring
Arshad Iqbal, Mohammad Zahid and Mohammed A. Qadeer 251
Implementation of BGP Using XORP
Quamar Niyaz, S. Kashif Ahmad and Mohammad A. Qadeer 260
Voice Calls Using IP enabled Wireless Phones on WiFi / GPRS Networks
Robin Kasana, Sarvat Sayeed and Mohammad A. Qadeer 266
Internet Key Exchange Standard for: IPSEC
Sachin P. Gawate, N.G. Bawane and Nilesh Joglekar 273
Autonomic Elements to Simplify and Optimize System Administration
K. Thirupathi Rao, K.V.D. Kiran, S. Srinivasa Rao, D. Ramesh Babu
and M. Vishnuvardhan 283
Session-IV: Image Processing

A Multi-Clustering Recommender System Using Collaborative Filtering
Partha Sarathi Chakraborty 295
Digital Video Broadcasting in an Urban Environment an Experimental Study
S. Vijaya Bhaskara Rao, K.S. Ravi, N.V.K. Ramesh, J.T. Ong, G. Shanmugam
and Yan Hong 301
Gray-level Morphological Filters for Image Segmentation and Sharpening Edges
G. Anjan Babu and Santhaiah 308
Watermarking for Enhancing Security of Image Authentication Systems
S. Balaji, B. Mouleswara Rao and N. Praveena 313
Unsupervised Color Image Segmentation Based on Gaussian Mixture Model
and Uncertainty K-Means
Srinivas Yarramalle and Satya Sridevi P. 322
Recovery of Corrupted Photo Images Based on Noise Parameters
for Secured Authentication
Pradeep Reddy C.H., Srinivasulu D. and Ramesh R. 327
An Efficient Palmprint Authentication System
K. Hemantha Kumar 333
Speaker Adaptation Techniques
D. Shakina Deiv, Pradip K. Das and M. Bhattacharya 338
Text Clustering Based on WordNet and LSI
Nadeem Akhtar and Nesar Ahmad 344
Cheating Prevention in Visual Cryptography
Gowriswara Rao G. and C. Shoba Bindu 351
Image Steganalysis Using LSB Based Algorithm for Similarity Measures
Mamta Juneja 359
Content Based Image Retrieval Using Dynamical Neural Network (DNN)
D. Rajya Lakshmi, A. Damodaram, K. Ravi Kiran and K. Saritha 366
Development of New Artificial Neural Network Algorithm for Prediction
of Thunderstorm Activity
K. Krishna Reddy, K.S. Ravi, V. Venu Gopalal Reddy and Y. Md. Riyazuddiny 376
Visual Similarity Based Image Retrieval for Gene Expression Studies
Ch. Ratna Jyothi and Y. Ramadevi 383
Review of Analysis of Watermarking Algorithms for Images in the Presence
of Lossy Compression
N. Venkatram and L.S.S. Reddy 393
Session-V: Software Engineering

Evaluation Metrics for Autonomic Systems
K. Thirupathi Rao, B. Thirumala Rao, L.S.S. Reddy,
V. Krishna Reddy and P. Saikiran 399
Feature Selection for High Dimensional Data: Empirical Study on the Usability
of Correlation & Coefficient of Dispersion Measures
Babu Reddy M., Thrimurthy P. and Chandrasekharam R. 407
Extreme Programming: A Rapidly Used Method in Agile Software Process Model
V. Phani Krishna and K. Rajasekhara Rao 415
Data Discovery in Data Grid Using Graph Based Semantic Indexing Technique
R. Renuga, Sudha Sadasivam, S. Anitha, N.U. Harinee, R. Sowmya
and B. Sriranjani 423
Design of Devnagari Spell Checker for Printed Document: A Hybrid Approach
Shaikh Phiroj Chhaware and Latesh G. Mallik 429
Remote Administrative Suite for Unix-Based Servers
G. Rama Koteswara Rao, G. Siva Nageswara Rao and K. Ram Chand 435
Development of Gui Based Software Tool for Propagation Impairment
Predictions in Ku and Ka Band-Traps
Sarat Kumar K., Vijaya Bhaskara Rao S. and D. Narayana Rao H. 443
Semantic Explanation of Biomedical Text Using Google
B.V. Subba Rao and K.V. Sambasiva Rao 452
Session-VI: Embedded Systems
Smart Image Viewer Using Nios II Soft-Core Embedded Processor
Based on FPGA Platform
Swapnili A. Dumbre, Pravin Y. Karmore and R.W. Jasutkar 461
SMS Based Remote Monitoring and Controlling of Electronic Devices
Mahendra A. Sheti and N.G. Bawane 464
An Embedded System Design for Wireless Data Acquisition and Control
K.S. Ravi, S. Balaji and Y. Rama Krishna 473
Bluetooth Security
M. Suman, P. Sai Anusha, M. Pujitha and R. Lakshmi Bhargavi 480
Managing Next Generation Challenges and Services through Web
Mining Techniques
Rajesh K. Shuklam, P.K. Chande and G.P. Basal 486
Internet Based Production and Marketing Decision Support System
of Vegetable Crops in Central India
Gigi A. Abraham, B. Dass, A.K. Rai and A. Khare 495
Fault Tolerant AODV Routing Protocol in Wireless Mesh Networks
V. Srikanth, T. Sai Kiran, A. Chenchu Jeevan and S. Suresh Babu 500
Author Index 505








Web Technology
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Semantic Extension of Syntactic Table Data

V. Kiran Kumar K. Rajasekhara Rao
Dravidian University KL College of Engineering
Kuppam Vijayawada
kirankumar.v@rediff.com rajasekhar.kurra@klce.ac.in


Abstract

This paper explains how to convert syntactic HTML tables into RDF documents. Most of the data on web pages is laid out in HTML tables. As such data carries no semantic markup, machines are unable to make use of it. This paper discusses the issues involved in converting syntactic HTML tables into semantic content (RDF files). HTML tables can be created with any HTML editor available today. The advantage of this process is that, with little knowledge of RDF, one can easily create an RDF document by designing one's data as an HTML table.
Keywords: Semantic Web, RDF, RDFS, WWW
1. Introduction
The Semantic Web is an emerging technology intended to transform documents on the World Wide Web (WWW) into knowledge that can be processed by machines. RDF is a language for representing resources on the web so that they can be processed by machines rather than merely displayed. Most of the data available in web pages is laid out in HTML tables. As the HTML table lacks semantics, it cannot be processed by machines. This paper discusses the issues involved in converting syntactic HTML tables into semantic content (RDF files).
The rest of the paper is organised as follows: Section 2 discusses HTML tables, Section 3 introduces RDF and RDFS, Section 4 presents the conceptual view of mapping HTML tables to RDF files, Section 5 works through a scenario based on a population survey table and an employee table, Section 6 deals with the key issues in the mapping process, and Section 7 concludes and outlines future work on the tool.
2. HTML Tables
A table represents relationships between data. Before the creation of the HTML table model, the only method available for relative alignment of text was the PRE element. Though it was useful in some situations, the effects of PRE were very limited. Tables were introduced in HTML 3.0, and a great deal of refinement has occurred since then. A table may have an associated caption that provides a short description of the table's purpose. Table rows may be grouped into head, body and foot sections (the THEAD, TFOOT and TBODY elements). The THEAD and TFOOT elements contain header and footer rows, respectively, while TBODY elements supply the table's main row groups. A row group contains TR elements for individual rows, and each TR contains TH or TD elements for header cells or data cells, respectively. This paper classifies HTML tables as regular and irregular.
A regular table is briefly described as a table in which the metadata for the data items are represented in the table header and the stub, which can be organized into one or more nested levels. The hierarchical structure of the column headers is top-down. The table may contain additional metadata in the stub header cell, and it may optionally contain footnotes. For any data cell, the cell's metadata are positioned directly in either its column header or its row header. An irregular table is a table that breaks one or more of the rules of regular tables. This paper restricts itself to regular tables for conversion into an RDFS ontology.
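As an illustration of the regular-table structure described above, the following minimal sketch (in Python, holding the markup in a string so it can be reused later) shows what such a table looks like: a caption, single-level column headers in THEAD, and a TBODY in which the leading TH of each row serves as the stub. The table contents are invented purely for illustration.

# A minimal "regular" table as characterised in Section 2: a caption,
# single-level column headers in THEAD, and a stub (the leading TH of
# each TBODY row) naming each record. The data values are invented.
REGULAR_TABLE_HTML = """
<table>
  <caption>Population</caption>
  <thead>
    <tr><th>Region</th><th>Year</th><th>Count</th></tr>
  </thead>
  <tbody>
    <tr><th>RegionA</th><td>2001</td><td>100000</td></tr>
    <tr><th>RegionB</th><td>2001</td><td>200000</td></tr>
  </tbody>
</table>
"""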
3. RDF and RDF Schema
In February 2004, the World Wide Web Consortium released the Resource Description Framework (RDF) as a W3C Recommendation. RDF is used to represent information and to exchange knowledge on the Web, and it is particularly useful when information is to be processed by machines. RDF uses a general method to decompose knowledge into pieces called triples. A triple consists of a subject, a predicate and an object. In RDF, the English statement
"Tim Berners-Lee invented the World Wide Web"
could be represented by an RDF statement having
a subject "Tim Berners-Lee",
a predicate "invented",
and an object "the World Wide Web".
RDF statements may be encoded using various serialization syntaxes. The RDF statement above would be represented by the graph model shown below.
[Figure: RDF graph - the subject node "Tim Berners-Lee" is linked by an arc labelled "invented" to the object node "World Wide Web"]
Subjects and objects are represented by nodes, and predicates are represented by arcs. RDF's vocabulary description language, RDF Schema, is a semantic extension of RDF. It provides mechanisms for describing groups of related resources and the relationships between these resources. RDF Schema vocabulary descriptions are themselves written in RDF. These resources are used to determine characteristics of other resources, such as the domains and ranges of properties.
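As a small, hedged sketch of the triple model just described (not taken from the paper), the following Python fragment builds the example statement with the rdflib library, assumed to be installed; the namespace URI is invented for the illustration.

from rdflib import Graph, Literal, Namespace

# Hypothetical namespace used only for this illustration.
EX = Namespace("http://example.org/icws#")

g = Graph()
# One triple: subject (Tim Berners-Lee), predicate (invented), object (World Wide Web).
g.add((EX.TimBernersLee, EX.invented, Literal("World Wide Web")))

# Serialize the statement in RDF/XML, one of several possible syntaxes.
print(g.serialize(format="xml"))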
4. The Conceptual View of Mapping
Based on the regular-table concept, a table may have a caption, a table header, a stub header and optional footnotes. In general, the table header and the stub header are organized as <TH> cells. A class in RDF Schema is somewhat like the notion of a class in object-oriented programming languages; it allows resources to be defined as instances of classes and
subclasses of classes. The caption of a table is mapped to an RDFS class. Similarly, the stub, the table header and the data cells of a table are mapped to the subject, predicate and object of RDF triples. Each row of the table is treated as an instance of the class. For a property in RDF, the domain is the class obtained from the caption of the table, and the range is given by the data type of the data cell. If the table header and the stub header are organized in nested levels, those headers are collapsed into a single header by appending the headers one after another. If the table does not have a stub, a stub must be created before converting to RDF. The figures in the next section illustrate the mapping process.
5. The Scenario A Population Survey and Employee Table
Fig. 1 shows a population survey table; it clearly identifies the parts of the table. As the table header and the stub are organized into multiple levels, a conversion is needed from multiple levels to a single level by appending the headers one after another. The resulting table is shown in Fig. 2.
Fig. 1: Table with multiple levels of stub and table header

Fig. 2: Converted table of Fig. 1, with a single level of stub and table header
Let us consider another type of regular table, one that does not have a stub. For such tables, a stub containing some information about the table data should be added; the stub can be created by the user. For example, Fig. 3 shows an employee table that has no stub. As the table does not have a stub, one was created from the caption of the table and the row number and appended as an additional column; the resulting table is shown in Fig. 4. The user is free to create a stub of his or her choice. Once an RDF file is created, the stub plays a crucial role in answering queries on that RDF file.

Fig. 3: A regular table without a stub
Fig. 4: Converted table of Fig. 3 with a user-defined stub
Fig. 5 shows the mapping of the population table to an RDF graph. This representation uses the table stub as the subject, the column header as the predicate and the corresponding data cell as the object of each RDF triple. Each row represents an instance; as there are four rows in the population table, the RDF graph contains four instances of the RDFS class. The conversion uses http://www.dravidianuniversity.ac.in# as the namespace under which all the user-defined terms are placed.


Fig. 5: RDF graph representation of the population table
6. Key Steps Involved in Mapping Process
Based on the scenario outlined above, the following key steps are identified (a minimal code sketch follows the list):
Use the necessary namespaces and URIs in the RDF file.
Map the caption of the table to an RDFS class.
Treat each row of the HTML table as an instance of the RDFS class.
Map the stub, table header and data cells into RDF triples.
Define the domain and range of the properties from the caption of the table and the types of the data cells in the HTML table.
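To make the listed steps concrete, the following Python sketch, written under the assumption that the rdflib library is installed and using only Python's built-in html.parser, illustrates the mapping for a simple single-level regular table such as the sample string shown at the end of Section 2. It is an illustrative sketch, not the paper's tool, and the namespace and all identifiers in it are invented.

from html.parser import HTMLParser
from rdflib import Graph, Literal, Namespace, RDF, RDFS

# Hypothetical namespace; the paper uses http://www.dravidianuniversity.ac.in#
EX = Namespace("http://example.org/tables#")

class RegularTableParser(HTMLParser):
    # Collects the caption, the THEAD column headers and the body rows
    # (the first TH of each TBODY row is taken as the stub).
    def __init__(self):
        super().__init__()
        self.caption, self.headers, self.rows = "", [], []
        self._path, self._row = [], None

    def handle_starttag(self, tag, attrs):
        self._path.append(tag)
        if tag == "tr" and "tbody" in self._path:
            self._row = []

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        if self._path and self._path[-1] == tag:
            self._path.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if "caption" in self._path:
            self.caption += text
        elif "thead" in self._path and self._path[-1] == "th":
            self.headers.append(text)
        elif self._row is not None and self._path[-1] in ("th", "td"):
            self._row.append(text)

def table_to_rdf(html_text):
    parser = RegularTableParser()
    parser.feed(html_text)
    g = Graph()
    cls = EX[parser.caption.replace(" ", "_")]        # caption -> RDFS class
    g.add((cls, RDF.type, RDFS.Class))
    for row in parser.rows:
        stub, cells = row[0], row[1:]
        inst = EX[stub.replace(" ", "_")]             # stub -> subject (instance of the class)
        g.add((inst, RDF.type, cls))
        for header, value in zip(parser.headers[1:], cells):
            prop = EX[header.replace(" ", "_")]       # column header -> predicate
            g.add((prop, RDFS.domain, cls))           # domain comes from the caption class
            g.add((inst, prop, Literal(value)))       # data cell -> object
    return g

Applied to the sample table string given at the end of Section 2, the function yields one RDFS class (from the caption), one instance per body row and one triple per data cell, matching the mapping summarised in the list above; the range of each property would follow from the literal's datatype, which is left untyped here for brevity.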
7. Conclusion and Future Plan
This conversion process helps machines process data that is currently available only in syntactic form by converting it into semantic form. A tool is being developed for the conversion of syntactic table data into semantic RDF files. The tool has two main advantages. First, syntactic data existing on the web in table format can easily be extended to semantic content (an RDF file); table data is thus given a semantic treatment so that machines are able to process it. Second, a layman with little knowledge of RDF can create RDF documents by simply creating HTML tables. The process is currently possible only when the table contains textual data; further research is necessary to see how best non-textual data can also be handled by such a conversion. Another limitation of the process is that it cannot handle nested tables.
References
[1] T. Berners-Lee, J. Hendler and O. Lassila, "The Semantic Web", Scientific American, 2001.
[2] D. Brickley and R.V. Guha, "Resource Description Framework (RDF) Schema Specification 1.0: RDF Schema", W3C Working Draft, 2003.
[3] Karin K. Breitman, Marco Antonio Casanova and Walter Truszkowski, Semantic Web: Concepts, Technologies and Applications, 57–77, September 2006.
[4] P. Hayes, "Resource Description Framework (RDF) Semantics", W3C Working Draft, 2003.
[5] T. Berners-Lee, Weaving the Web: The Past, Present and Future of the World Wide Web by its Inventor, Texere, 2000.
[6] Stephen Ferg, Bureau of Labor Statistics, "Techniques for Accessible HTML Tables", August 23, 2002.
[7] F. Manola and E. Miller, "RDF Primer", W3C Recommendation, 10 February 2004. http://www.w3c.org/TR/rdf-primer
[8] D. Beckett (editor), "RDF/XML Syntax Specification (Revised)", W3C Recommendation, 10 February 2004.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
e-Learning Portals: A Semantic
Web Services Approach

Balasubramanian V. David K. Kumaravelan G.
DCS, Bharathidasan DCS, Bharathidasan DCS, Bharathidasan
University, Tiruchirappalli University, Tiruchirappalli University, Tiruchirappalli
balav.research@gmail.com

Abstract

In recent years there has been a movement from conventional learning to e-Learning. The Semantic Web can be used to bring e-Learning to a higher level of collaborative intelligence by means of creating a new educational model. The Semantic Web has opened new horizons for Internet applications in general and for e-Learning in particular. However, there are several challenges involved in improving this e-Learning model. This paper discusses the motivation for a Semantic Web Service (SWS) based e-Learning model by addressing the major issues, and proposes an SWS-based conceptual architecture for e-Dhrona, an ongoing e-Learning project of the Department of Computer Science, Bharathidasan University, India.
1 Introduction
e-Learning commonly refers to the intentional use of networked information and communications technology in teaching and learning. e-Learning is not just concerned with providing easy access to learning resources, anytime and anywhere, via a repository of learning resources. It is also concerned with supporting features such as the personal definition of learning goals, synchronous and asynchronous communication, and collaboration between learners and instructors. Many institutions and universities in India have initiated e-Learning projects. A few United States based companies, such as Qualcomm and Microsoft, have already committed funding to these e-Learning projects. A few Indian universities have associations with American universities such as California, Cornell, Carnegie Mellon, Harvard, Princeton, Yale and Purdue for creating e-content.
1.1 The Technological Infrastructure of e-Learning Portals
Today many e-Learning applications achieve high standards in enabling instructors to manage online courses via web technologies and database systems. WebCT and Blackboard are the two most advanced and popular e-Learning systems; they provide sets of very comprehensive tools and have the capability to support sophisticated Internet-based learning environments [Muna, 2005]. The International Data Corporation (IDC) predicted in its January 2005 report that e-Learning would be a $21 billion market in 2008 [Vladan, 2004]. A truly effective e-Learning solution must be provided to meet the growing demands of students, employees, researchers and lifelong learners. Efficient management of the information available on the web can lead to an e-Learning environment that provides learners with interaction and the most relevant materials.
2 Issues in Developing an e-Learning Portal
The traditional learning process could be characterized by centralized authority (content is selected by the educator), strong push delivery (teachers push knowledge to students), lack of personalization (content must fulfil the needs of many) and a static, linear learning process (unchanged content). In e-Learning, the course contents are distributed and student oriented. A learner can customize and personalize the contents based on his or her requirements, and the learning process can be carried out at any time and any place with asynchronous interaction [Naidu, 2006].
In current portals, the learning materials are scattered across the application and the user finds it very hard to construct a user-centred course. Due to the lack of a commonly agreed service language, there is no coordination between the software agents that locate the resources for a specific content. Selecting the exact learning materials is also a big issue, since the resources are not properly described with metadata. The contents are not delivered to the learner in a personalized way, as there is no infrastructure to find out the real requirements of the user. If different university portals are implemented with different tools, then sharing of contents between them is not feasible due to interoperability issues.
3 Need for Semantic Portals
The heterogeneity and the distributed nature of the web have led to the need for web portals and web sites providing access to collections of interesting URLs and to information that can be retrieved using search. The advent of the Semantic Web, with its powerful features, enables content publishers to express at least a crude meaning of a page instead of merely dumping HTML text. Autonomous agent software can then use this information to organize and filter the data to meet the user's needs. Current research in the Semantic Web area should eventually give Web users intelligent access to Web services and resources [Berners-Lee, 2001].
The key property of the Semantic Web architecture (common shared meaning and machine-processable metadata), enabled by a set of suitable agents, establishes a powerful approach to satisfying the e-Learning requirements of efficient, just-in-time and task-relevant learning. Learning material is semantically annotated, and for a new learning demand it may easily be combined into a new learning course. According to their preferences, users can find and combine useful learning materials easily. The process is based on semantic querying and navigation through the learning materials, enabled by the ontological background, thus making semantic portals more effective than traditional web portals [Vladan, 2004].
3.1 The Role of Semantic Web Services (SWS) in e-Learning Portals
The advent of the Semantic Web and its related technologies, tools and applications provides a new context for exploitation. The expression of meaning relates directly to numerous open issues in e-Learning [Muna, 2005]. Semantic Web services aim at enabling automatic discovery, composition and invocation of available web services. Based on semantic descriptions of the functional capabilities of available web services, an SWS broker automatically selects and invokes the web services appropriate to achieve a given goal. The benefits of this approach are semantics-based browsing and semantic search. Semantic browsing locates metadata and assembles point-and-click interfaces from a combination of relevant information. Semantic searching is based on a metadata tagging process which enables content providers to describe, index and search their resources. The metadata tagging process helps in effective retrieval of relevant content for a specific search term. By adding semantic information with the help of metadata tagging, the search process goes beyond superficial keyword matching, allowing non-relevant information to be easily removed from the result set.
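As a hedged illustration of the metadata tagging idea (it is not part of the e-Dhrona implementation described here), the following Python sketch annotates a hypothetical course resource with Dublin Core style properties using the rdflib library; the URIs, the LearningMaterial class and the prerequisite property are assumptions made only for this example.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, RDF

# Hypothetical namespace for the course ontology used in this illustration.
COURSE = Namespace("http://example.org/elearning#")

g = Graph()
resource = URIRef("http://example.org/materials/dbms-unit1")

# Describe the learning resource so that agents can discover and filter it.
g.add((resource, RDF.type, COURSE.LearningMaterial))
g.add((resource, DC.title, Literal("Introduction to Relational Databases")))
g.add((resource, DC.subject, Literal("DBMS")))
g.add((resource, DC.language, Literal("en")))
g.add((resource, COURSE.prerequisite, Literal("Discrete Mathematics")))

print(g.serialize(format="turtle"))

With resources described this way, a search for a subject term can be answered by matching triples rather than by keyword matching over raw text.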
4 The e-Dhrona
The e-Learning portal e-Dhrona is a virtual environment for the various online and regular courses of the University, as well as for teachers and students to exchange academic material. Through e-Dhrona, effective educational and training courses can be brought to the PCs of students of Bharathidasan University's internal departments, constituent colleges and affiliated colleges. The objective is to provide academic information and centralized knowledge that is customizable, accessible 24/7, flexible, convenient and user-centric. With the abundance of courses and the shortage of faculty support, the e-Dhrona project helps in providing a standard for academic content.
4.1 The e-Dhrona Architecture with Semantic Web Services
The e-Dhrona project can cater to the needs of its users if it is enabled with Semantic Web Services (SWS) technology. The major benefit of SWS technology is the ability to compose web services that are located at different sources. Semantic Web services are the result of the evolution of the syntactic definition of web services and of the Semantic Web. One way to create semantic web services is to map concepts in a web service description to ontological concepts [Fensel, 2007]. Using this approach, a user can explicitly define the semantics of a web service for a given domain. The role of the ontology is to formally describe the shared meaning of the vocabulary (set of symbols) used; the ontology contains the set of possible mappings between symbols and their meanings. However, the shared-understanding problem in e-Learning occurs on several ontological levels, which describe several aspects of document usage. When a student searches for learning material, the most important things to be considered are the content of the course, the context of the course and the materials associated with it. Fig. 1 shows a conceptual semantic e-Learning architecture which provides high-level services to people looking for appropriate online courses.
The process of building this multi-step architecture comprises the following:
Knowledge Warehouse: This is the basic and core element of the architecture. It is a repository where ontologies, metadata, inference rules, educational resources, course descriptions and user profiles are stored. The metadata is placed within an external metadata repository (e.g. the RDF repository). Building the knowledge base includes creating the e-Dhrona ontology and building the RDF repository.
e-Dhrona Ontology: The knowledge engineer creates and maintains the e-Dhrona ontology. An ontology editor (such as OntoEdit, Protégé or OI-modeler) can be used for creating the initial ontology, and the knowledge engineer updates it at later stages using the appropriate editor. In this way, the development of the ontology is an iterative process, centred on the architecture and driven by use cases, where each stage refines the previous one.

Fig. 1: The Semantic e-Learning Architecture of e-Dhrona
RDF Repository: The RDF repository holds the metadata, in the form of RDF triples, of every web page that can be provided by any of the subsequent services. The context and content parser is used to generate the RDF repository. The web services engine captures new RDF and updates the repository. Off-line updates on any of the databases involved can be imported and processed by the context and content parser on a regular basis.
The Web Services Interface: It represents the Semantic Web portal. The dynamic web generator displays a portal page for each user; e-Dhrona web users access their pages via the Common User Interface.
Search Engine: It provides an API with methods for querying the knowledge base. RDQL (RDF Data Query Language) can be used as the ontology query language (a query sketch is given after this list).
Inference Engine: The inference engine answers queries and derives new knowledge by an intelligent combination of the facts in the knowledge warehouse with the ontology.
Software Agents: Software applications that access the e-Dhrona knowledge base repository and web resources.
Common Access Interface: It provides an integrated interface through which readers, as well as authors and administrators of academic institutions, can access the contents and upload or modify data with the appropriate authority.
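The paper names RDQL as the query language; the hedged sketch below instead uses SPARQL, RDQL's successor, which the rdflib library supports, purely to show the kind of semantic query the search engine would run against the RDF repository. The file name, graph, namespace and property names are the invented ones from the earlier metadata example, not e-Dhrona's actual schema.

from rdflib import Graph

g = Graph()
g.parse("repository.ttl", format="turtle")   # hypothetical dump of the RDF repository

# Find learning materials on a given subject, together with their titles.
query = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX course: <http://example.org/elearning#>
SELECT ?material ?title WHERE {
    ?material a course:LearningMaterial ;
              dc:subject "DBMS" ;
              dc:title ?title .
}
"""
for material, title in g.query(query):
    print(material, title)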
From a pedagogical perspective, semantic portals are an enabling technology allowing
students to determine the learning agenda and be in control of their own learning. In
particular, they allow students to perform semantic querying for learning materials (linked to
shared ontologies) and construct their own courses, based on their own preferences, needs
and prior knowledge [Biswanath, 2006].
5 Related Work
The Ontology-based Intelligent Authoring Tool can be used for building the ontologies [Apted, 2002]. The tool uses four ontologies (domain, teaching strategies, learner model and interface ontologies) for the construction of the learning model and the teaching strategy model, but it falls short in exploiting modern web technologies. Our proposed framework has so far been developed only in parts. Future work is concerned with the implementation of complete ontological representations of the introduced semantic layers, as well as of current e-Learning metadata standards and their mappings. Nevertheless, the availability of appropriate Web services aimed at supporting specific process objectives has to be seen as an important prerequisite for developing SWS-based applications.
6 Conclusion
In developing interactive learning environments, the Semantic Web is playing a big role; with ontological engineering, XML and RDF it is possible to build practical systems for learners and teachers. In this paper we have presented a case study of e-Dhrona, an e-Learning scenario that exploits ontologies for describing the semantics (content), for defining the context and for structuring the learning materials. This three-dimensional, semantically structured space enables easier and more comfortable search and navigation through the learning material. The proposed model can provide related, useful information for searching and sequencing learning resources in web-based e-Learning systems.
References
[1] [Adelsberger et al., 2001] Adelsberger H., Bick M., Körner F. and Pawlowski J.M., "Virtual Education in Business Information Systems (VAWI): Facilitating Collaborative Development Processes Using the Essen Learning Model", in Proceedings of the 20th ICDE World Conference on Open Learning and Distance Education, Düsseldorf, Germany, 2001.
[2] [Apted et al., 2002] Apted, T. and Kay, J., "Automatic Construction of Learning Ontologies", Proceedings of the ICCE Workshop on Concepts and Ontologies in Web-based Educational Systems (pp. 57-64), Auckland, New Zealand, 2002.
[3] [Berners-Lee et al., 2001] Berners-Lee, T., Hendler, J. and Lassila, O., "The Semantic Web", Scientific American, 284, pp. 34-43, 2001.
[4] [Biswanath, 2006] Biswanath Dutta, "Semantic Web Based E-learning", Proceedings of the International Conference on ICT for Digital Learning Environment, Bangalore, India, 2006.
[5] [Fensel et al., 2007] Fensel D., Lausen H. et al., Enabling Semantic Web Services, Springer-Verlag, Berlin Heidelberg, 2007.
[6] [Muna et al., 2005] Muna S. Hatem, Haider A. Ramadan and Daniel C. Neagu, "e-Learning Based on Context Oriented Semantic Web", Journal of Computer Science, 1(4): 500-504, 2005.
[7] [Naidu, 2006] Som Naidu, E-Learning: A Guide Book of Principles, Procedures and Practices, Commonwealth Educational Media Centre for Asia, New Delhi, 2006.
[8] [Vladan, 2004] Vladan Devedzic, "Education and the Semantic Web", International Journal of Artificial Intelligence in Education, Vol. 14, pp. 39-65, 2004.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
An Efficient Architectural Framework of a Tool
for Undertaking Comprehensive Testing
of Embedded Systems

V. Chandra Prakash J.K.R. Sastry
Koneru Lakshmaiah College of Engineering Koneru Lakshmaiah College of Engineering
Green Fields Vaddeswaram Green Fields Vaddeswaram
Guntur District- 522501 Guntur District- 522501
K. Rajasekhara Rao J. Sasi Bhanu
Koneru Lakshmaiah College of Engineering Koneru Lakshmaiah College of Engineering
Green Fields Vaddeswaram Green Fields Vaddeswaram
Guntur District- 522501 Guntur District- 522501

Abstract

Testing and debugging embedded systems is difficult and time consuming, for the simple reason that embedded systems have neither storage nor a user interface. Users are extremely intolerant of buggy embedded systems. Embedded systems deal with the external environment by sensing physical parameters, and they must also provide outputs that control that environment.
In the case of embedded systems, the issue of testing must consider both hardware and software; the malfunctioning of hardware is detected through software failures. Cost-effective testing of embedded software is of critical concern in maintaining a competitive edge. Testing an embedded system manually is very time consuming and a costly proposition. Tool-based testing of an embedded system has to be considered and put into use to reduce the cost of testing and to complete the testing of the system as quickly as possible.
Tools [5, 6, 7] are available in the market for testing embedded systems, but they each cover only fragments of the testing, and even those fragments are not addressed in a unified manner. The tools fail to address the integration testing of the software components, the hardware components and the interface between them. In this paper an efficient architectural framework of a testing tool that helps in undertaking comprehensive testing of embedded systems is presented.
1 Introduction
In the case of embedded systems, the issue of testing must consider both hardware and software. The malfunctioning of hardware is detected through software failures. The target embedded system does not support the hardware and software platform needed for developing and testing the software; software development cannot be done on the target machine. The software is developed on a host machine, then installed in the target machine and executed there. The testing of an embedded system must broadly meet the following testing goals [1]:
Finding bugs early in the development process, which is difficult because the target machine is often not available early in development, or the hardware being developed in parallel with the software is unstable or buggy.
Exercising all of the code, including the handling of exceptional conditions on the target machine; this is difficult because most of the code in an embedded system deals with uncommon or unlikely situations, or with events occurring in particular sequences and with particular timing.
Overcoming the difficulty of developing reusable and repeatable tests, which must be executed against repeatable event-occurrence sequences on the target machine.
Maintaining an audit trail of test results, event sequences, code traces, core dumps, etc., which are required for debugging.
To realize these testing goals, testing must be carried out on the host machine first and then along with the target. Embedded software must be of the highest quality and must adopt excellent testing strategies. In order to decide on the testing strategy, the types of testing to be carried out and the phases in which they are carried out, it is necessary to analyse the different types of test cases that must be used [2].
Every embedded application comprises two different types of tasks. One type of task deals with emergent processing requirements; the other type undertakes input/output processing and various housekeeping activities. The tasks that deal with emergent requirements are initiated for execution on interrupt.
Test cases must be identified in sufficient number that all the types of tasks that make up the application are tested. Several types of testing, such as integration testing, regression testing, functional testing and module testing, must also be conducted.
The types of testing to be carried out include unit testing; black-box testing, to test the behaviour of the system during and after the occurrence of external events; environment testing, to test the user interface through LCD output, push-button input, etc.; integration testing, to test the integration of the hardware components, the software components and the interface between the hardware and the software; and regression testing, to test the behaviour of the system after changes are incorporated into the code. The testing system must support testing of the hardware, of the software, and of the software together with the hardware.
2 Testing Approaches
Several authors have proposed different approaches to testing embedded systems. Jason [8] and others have suggested testing the modules of embedded systems by isolating the modules at run time and improving the integration of testing into the development environment; this method, however, fails to support the regression of events. Nancy [9] and others suggested an approach to unit testing of embedded systems using agile methods and multiple strategies. Testing of embedded software is bound up with the testing of hardware; even with evolving hardware in the picture, agile methods work well provided multiple testing strategies are used. This has powerful implications for improving the quality of high-reliability systems, which commonly have embedded software at their heart. Tsai [14] and others have suggested end-to-end integration testing of embedded systems by specifying test scenarios as thin threads, each thread representing a single function. They have also developed a Web-based tool for carrying out end-to-end integration testing.
Nam Hee Lee [11] suggested a different approach to integration testing based on interaction scenarios, since integration testing must consider sequences of external input events and internal interactions. Regression testing [10] has been a popular quality testing technique. Most regression testing is based on code or software design; Tsai and others have suggested regression testing based on test scenarios, and the approach they propose is functional regression testing. They have also suggested a Web-based tool to undertake the regression testing. Jakob [16] and others have suggested testing embedded systems by simulating the hardware on the host and combining the software with the simulators. This approach, however, cannot deal with all kinds of hardware-related test scenarios: the complete behaviour of hardware, especially unforeseen behaviour, cannot be simulated on a host machine. Tsai [17] and others have suggested a testing approach based on verification patterns, the key concept being to recognize scenarios as patterns and to apply the corresponding testing approach whenever similar patterns are recognized in any embedded application. The key to this approach, however, is the ability to identify all the test scenarios that occur across all types of embedded applications.
While all these approaches address particular types of testing, none of them covers comprehensive testing, that is, testing the hardware, the software and both together.
3 Testing Requirement Analysis
Looking at software development, hardware development, integration and migration of code into the target system, and then the testing of the target system, the following testing scenarios exist [19]. The entire embedded-system application code is divided primarily into two components: hardware-independent code and hardware-dependent code. The hardware-independent code consists of tasks that carry out mundane housekeeping and data processing and tasks that control processing on a particular device, whereas the hardware-dependent code consists of interrupt service routines and the drivers that control the operation of the devices. It is necessary to identify the different types of test cases that test these different types of code segments.
Unit testing, integration testing and regression testing of the hardware-independent code can be carried out by scaffolding the code, i.e. by simulating the hardware. Some of the testing related to response time, throughput, portability, and built-in peripherals such as ROM, RAM, DMA and UART requires the use of an instruction set simulator within the testing tool.
Some of the testing such as the existence of pre conditions can also be made by using the
assert macros. The testing system should have the ability to insert inline assert macros to test
existing of a particular condition before a piece of code is executed and the result of
16 An Efficient Architectural Framework of a Tool for Undertaking Comprehensive Testing
Copyright ICWS-2009
executing such a macro must also be recorded as a test-case result. Testing such as null
pointer evaluation, validation for a range of values, verification of whether a function is called
by an ISR (Interrupt Service Routine) or a task, and checking and resetting event bits can be
carried out by using the assert macros.
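For illustration only, the fragment below sketches the idea of recording assertion outcomes as test-case results; in the actual tool these checks would be inline C assert macros compiled into the code under test, and the helper names and sample conditions here are assumptions, not part of the tool.

```python
# Illustrative recorder (not the tool's actual implementation): assertion
# outcomes are logged so that each check becomes one recorded test-case result.

test_log = []

def check(test_id, condition, description):
    """Evaluate a precondition and record the outcome as a test-case result."""
    entry = {"id": test_id, "check": description, "passed": bool(condition)}
    test_log.append(entry)
    return entry["passed"]

# Sample checks mirroring those named in the text (values are assumed).
buffer_ptr = None
value = 42
called_from_isr = False

check("TC-001", buffer_ptr is not None, "null-pointer evaluation before use")
check("TC-002", 0 <= value <= 255, "value lies within the permitted range")
check("TC-003", not called_from_isr, "routine invoked from a task, not an ISR")

for entry in test_log:
    print(entry)
```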
Although about 80% of code testing can be done on the host, the following types of testing
cannot be carried out on the host alone; they require a testing process that uses both the host
machine and the target machine, or just the target machine itself.
Logic analyzers help in testing the hardware. Tests related to the timing of signals, the
triggering of signals, the occurrence of events, the computation of response time, and the patterns
of occurrence of signals can only be carried out with the help of logic analyzers. This
kind of testing is normally done on its own, without any integration with the software. Therefore
a logic analyzer driven by the testing software would provide a good platform for testing.
The real testing of hardware along with software can be achieved through in-circuit
emulators, which replace the microprocessor in the target machine. The other chips in the
embedded system are connected in the same way as they are connected to the microprocessor.
The emulator has software built into a separate memory called overlay memory; overlay
memory is additional memory, different from either ROM or RAM. The emulator
software provides support for debugging, recording of memory content, recording the
trace of program execution during dumps, and tracing program execution for a test case. In the
event of any failure, the emulator still helps in interacting with the host machine: the entire dump
of the overlay memory can be viewed and the reason for the failure can be investigated. The testing
software should therefore be interfaced with the in-circuit emulator, particularly for supporting the
debugging mechanism during failures and breakdowns.
Monitors are software components that reside on the host and provide a debugging and testing
interface to the monitor on the target. They provide the communication interface with the target
to place the code in RAM or flash; if necessary, some of the locator functions are executed in the
process. Users interact with the monitor on the host to set breakpoints and run the program, and
the data are communicated to the monitor on the target to facilitate the execution. Monitors can be
used to test memory leakages, perform function usage analysis, test for weak code, test for changes
in data at specified memory locations, and test for heavily used functions. Monitors also help in
testing inter-task communication through mailboxes, queues and pipes, including the overflow and
underflow conditions. Thus the testing tool should have built-in monitors.
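The sketch below illustrates, under assumed command names and a stand-in loopback transport, the kind of host-side monitor interface described above; a real monitor would exchange such commands with its target-side counterpart over a serial or network link.

```python
# Hypothetical host-side monitor client sketch. The command strings and the
# loopback transport are assumptions for illustration only.

class LoopbackTransport:
    """Stand-in transport that echoes an acknowledgement for every command."""
    def send(self, command: str) -> str:
        return f"ACK {command}"

class HostMonitor:
    def __init__(self, transport):
        self.transport = transport

    def load_image(self, address: int, image: bytes) -> str:
        # Place code in RAM/flash on the target (locator work happens target-side).
        return self.transport.send(f"LOAD {address:#x} {len(image)}")

    def set_breakpoint(self, address: int) -> str:
        return self.transport.send(f"BREAK {address:#x}")

    def read_heap_stats(self) -> str:
        # Used to look for memory leaks across repeated test runs.
        return self.transport.send("HEAPSTATS")

monitor = HostMonitor(LoopbackTransport())
print(monitor.load_image(0x20000000, b"\x00" * 1024))
print(monitor.set_breakpoint(0x080001A4))
print(monitor.read_heap_stats())
```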
4 Architectural Analysis of Existing Tools
In the literature, the most important tools in use are Tornado [6], Windriver [5] and
Tetware [7]. Tsai [10, 14, 17] and others have introduced tools which provide limited testing
on the host for the purpose of carrying out either unit testing, integration testing, regression
testing or END-TO-END testing, but not all of them together.
4.1 Tornado Architecture
The Tornado architecture provides a testing process at both the target machine and the host
machine, along with the required interfaces between them. Fig 4.1 shows the
architecture of the Tornado tool. The model provides support for a simulator on the target
side, which adds too much code and may hamper the achievement of the required response time
and throughput. The architecture has provision to test the hardware under the control and
initiation of software. This model has no provision for scaffolding, instruction set simulation,
etc. The testing of hardware using a logic analyzer is undertaken at the target, which again
adds a heavy overhead on the target. This architecture does not support third-party tools that
help in the identification and testing of memory leakages, functional coverage, etc.

Fig. 4.1: Tornado Tool Architecture
4.2 Windriver Architecture
This architecture is an improvement over the Tornado architecture and is shown in Fig 4.2.
It uses interfaces to third-party tools, provides a user interface with which testing is carried out,
and also uses simulator software on the host side. It also uses an emulator on the target side,
thus helping testing and debugging under failure conditions. This architecture, however, has no
support for scaffolding, nor for testing of hardware initiated either at the host or at the target.
4.3 Tetware Architecture
The Tetware architecture is based on test cases that are fed as input at the host. Tetware
provides a huge API library that interfaces with a library resident on the target. This
architecture is shown in Fig. 4.3. It relies on heavy code being resident on the
target, which hampers the response time and throughput very heavily. This architecture also
has no support for scaffolding, instruction set simulation or environment checking through
assert macros.

Fig. 4.2: Windriver Tool Architecture

Fig. 4.3: Tetware Tool Architecture
5 Proposed Architectural Framework for a Comprehensive Testing Tool
Considering the different testing requirements from the point of view of scaffolding,
instruction set simulation and the assert macros needed for testing at the host, a host-based
architecture is proposed; it is shown in Fig. 5.1. The proposed architecture provides a user
interface and also a database to store the test data, test results and the historical data
transmitted by the target. The host-side architecture also provides a communication interface.
The most important advantage of this model is the provision of an interface to test the hardware
through probes that connect to the target through either a USB or a serial interface. The host-side
architecture also provides a scaffolding facility to test the hardware-independent code on the host itself.
On the target side, testing and debugging are done through an in-circuit emulator, and a
flash programmer is provided for burning the program into either flash or ROM. The
communication interface resident on the target provides the communication link with the host.
The proposed architecture reduces the size of the code on the target, and therefore preserves the
originally intended response time and throughput, and does not demand any extra hardware on the
target side, thus providing a cost-effective solution.

Fig. 5.1: Proposed Architecture
6 Summary and Conclusions
Testing an embedded system is complex because the target machine has limited resources and
typically has no user interface. The testing goals therefore cannot be achieved when testing is to be
done with the target machine only. If testing is done using the host machine alone, the hardware-dependent
code can never be tested. It is therefore evident that testing an embedded system requires
an architecture that considers both the host machine and the target machine. Comprehensive testing
can only be carried out by using a suite of tools which includes scaffolding software, simulators,
assert macros, logic analyzers, and in-circuit emulators. All the tools used must function
in an integrated manner so that comprehensive testing can be carried out. The proposed architectural
framework meets all the functional requirements for testing an embedded system comprehensively.
References
Books
[1] David E. Simon, An Embedded Software Primer, Pearson Education.
[2] Raj Kamal, Embedded Systems: Architecture, Programming and Design, Tata McGraw-Hill Publishing Company.
[3] Prasad KVKK, Software Testing Tools, DreamTech Press, India.
[4] Frank Vahid and Tony Givargis, Embedded Systems Design: A Unified Hardware/Software Introduction, John Wiley and Sons.
WEB Sites
[5] Windriver, http://www.windriver.com
[6] Open Group, http://www.opengroup.com
[7] Real Time Inc, http://www.rti.com
Journal Articles
[8] Jayson McDonald et al., Module Testing Embedded Software: An Industrial Project, Proceedings of the Seventh International Conference on Engineering of Complex Computer Systems, 2001.
[9] Nancy Van et al., Taming the Embedded Tiger: Agile Test Techniques for Embedded Software, Proceedings of the Agile Development Conference (ADC-04).
[10] Wei-Tek Tsai et al., Scenario-Based Functional Regression Testing, Proceedings of the 25th Annual International Computer Software and Applications Conference, 2002.
[11] Nam Hee Lee et al., Automated Embedded Software Testing Using Task Interaction Scenarios.
[12] D. Deng et al., Model-Based Testing and Maintenance, Proceedings of the IEEE Sixth International Symposium on Multimedia Software Engineering, 2004.
[13] Raymond Paul, END-TO-END Integration Testing, Proceedings of the Second Asia-Pacific Conference on Quality Software, 2001.
[14] W.T. Tsai et al., END-TO-END Integration Testing Design, Proceedings of the 25th Annual International Computer Software and Applications Conference, 2001.
[15] Jerry Gao et al., Testing Coverage Analysis for Software Component Validation, Proceedings of the 29th Annual International Computer Software and Applications Conference, 2005.
[16] Jakob et al., Testing Embedded Software Using Simulated Hardware, ERTS 2006, 25-27 January 2006.
[17] W.T. Tsai, L. Yu et al., Rapid Verification of Embedded Systems Using Patterns, Proceedings of the 27th Annual International Computer Software and Applications Conference, 2003.
[18] Dr. Sastry JKR, Dr. K. Rajashekara Rao, Sasi Bhanu J, Comprehensive Requirement Specification of a Cost-Effective Embedded Testing Tool, National Conference on Software Engineering (NCSOFT), CSI, May 2007.
Managed Access Point Solution

Radhika P.
Vignans Nirula Institute of Technology & Science for Women Pedpalakalur; Guntur-522 005
e-mail: rspaturi@yahoo.com

Abstract

This paper highlights the general framework of Wireless LAN access point
software called Managed Access Point Solution (MAPS). It is a software
package that combines the latest 802.11 wireless standards with networking
and security components. MAPS enables Original Equipment Manufacturers
(OEMs) / Original Design Manufacturers (ODMs), to deliver leading-edge
Wi-Fi devices, such as business-class wireless gateways, broadband access
points / routers and hot-spot infrastructure nodes, for the small-to-medium
business (SMB) market. The software is designed with security for Wi-Fi use
and secure client software supporting personal and enterprise security modes.
1 Introduction
Today's embedded systems are increasingly built by integrating pre-existing software
modules, allowing OEMs to focus their efforts on their core competitive advantage - the
embedded device's application. The first wave was the move towards standard commercial
and open-source operating systems replacing home-grown ones. The next one is the move to
well-designed, configurable building blocks of software IP which just plug into the operating
environment for the application. The focus is on building modular software products that are
pre-integrated with the hardware and operating systems they support, which ensures that you can
spend more time using the functionality in the way that best suits your needs, rather than
"porting" it to your environment. The components can be fitted into an embedded software
application with a minimum amount of effort, using a streamlined and simplified licensing
model that includes royalty-free distribution and full source code.
With ever-increasing cost and time-to-market pressures, building leading-edge embedded
devices is a high-risk proposition. Until now, OEMs and ODMs had to go with monolithic
software packages or use in-house resources to engineer device software to their
requirements. As a result, customizing devices to meet customers' specific requirements
has increased time to market and added to the cost. True turnkey solutions are therefore
needed that combine:
A rich set of field-proven, standard components.
An array of customizable options.
A team of professional services experts to provide all hardware/software integration,
porting, testing, and validation.
Flexible licensing options.
1.1 Wireless Technology
Wireless LAN technology is propagating in various embedded application domains, and
WLAN security standards have become the key concern for many organizations. New
standards address many of the initial security concerns while still maintaining and enhancing
the mobility and untethered aspects of a wireless LAN. New applications in industrial
networks, M2M and the consumer space demand more secure, more standardized middleware
that hooks up to traditional wired LANs, rather than just a monolithic wireless access box for PCs.

2 MAPS Product Overview
MAPS provides OEMs/ODMs with a production-ready solution for building secure, managed
access point devices, while reducing development cost, risk, and time to market. With
MAPS, OEMs can easily differentiate their Wi-Fi Access Point products by choosing from
a wide range of advanced networking and security modules.

2.1 Field-proven Software Modules
Specific instances of the Managed Access Point Solution are created by leveraging pre-existing
software blocks that have proven their merit in thousands of deployments, which also
minimizes risk for OEMs by keeping licensing terms flexible. Only the Managed Access
Point Solution offers such a comprehensive set of features with completely modular
packaging that allows for full customization to meet an OEM's specific requirements.
2.2 SMBware Software Modules
The Small-to-Medium Business ware (SMBware) family of embedded software solutions gives
OEMs/ODMs the ability to bring differentiated, leading-edge devices to the small-to-medium
business (SMB) market segment. To create a fully customized device, OEMs first select from a
comprehensive set of SMBware software modules. In addition to SMBware modules, OEMs
can select third-party modules or modules developed in-house. These modules are then
integrated to create validated software packages that meet an OEM's specific needs.
The final custom touches are then added: specific features such as BSPs, bootloaders,
drivers, and hardware accelerators are developed for the OS platforms running the Managed
Access Point Solution, software modules are integrated, and end-user device management
interfaces are customized. The result is a standard, field-tested software solution in a
production-ready custom package, with all hardware integration, porting, testing, and
validation completed.

2.3 Features of MAPS
Dual-band, multimode networks (5 GHz and 2.4 GHz, 54 Mbps) capable of
delivering high-performance throughput.
Power over Ethernet (PoE), which eliminates extra cabling and the necessity to locate
a device near a power source.
Wireless distribution system (WDS), which extends a network's wireless range
without additional cabling.
Advanced 802.11 security standards, including WEP, WPA and WPA2 (802.11i), for
any generation of wireless security, in Personal and Enterprise modes.
802.11n MIMO technology, which uses multiple radios to create a robust signal that
travels farther with fewer dead spots at high data rates.
Wi-Fi Multimedia (WMM), which provides improved quality of service over wireless
connections for better video and voice performance.
2.4 IEEE 802.11
The IEEE 802.11 is a set of standards for wireless local area network (WLAN) computer
communication, developed by the IEEE LAN/MAN Standards Committee (IEEE 802) in the
5 GHz and 2.4 GHz public spectrum bands. The 802.11 family includes over-the-air
modulation techniques that use the same basic protocol. The most popular are those defined
by the 802.11b and 802.11g protocols, which are amendments to the original standard.
802.11-1997 was the first wireless networking standard, but 802.11b was the first widely accepted
one, followed by 802.11g and 802.11n. Security was originally purposefully weak due to
the export requirements of some governments [1], and was later enhanced via the 802.11i
amendment after governmental and legislative changes.
802.11n is a new multi-streaming modulation technique that is still under draft development,
but products based on its proprietary pre-draft versions are being sold. Other standards in the
family (c-f, h, j) are service amendments and extensions or corrections to previous
specifications. 802.11b and 802.11g use the 2.4 GHz ISM band, operating in the United
States under Part 15 of the US Federal Communications Commission Rules and Regulations.
Because of this choice of frequency band, 802.11b and g equipment may occasionally suffer
interference from microwave ovens and cordless telephones. Bluetooth devices, while
operating in the same band, in theory do not interfere with 802.11b/g because they use a
frequency hopping spread spectrum signaling method (FHSS) while 802.11b/g uses a direct
sequence spread spectrum signaling method (DSSS). 802.11a uses the 5 GHz U-NII band,
which offers 8 non-overlapping channels rather than the 3 offered in the 2.4GHz ISM
frequency band.
The segment of the radio frequency spectrum used varies between countries. In the US,
802.11a and 802.11g devices may be operated without a license, as allowed in Part 15 of the
FCC Rules and Regulations. Frequencies used by channels one through six (802.11b) fall
within the 2.4 GHz amateur radio band. Licensed amateur radio operators may operate
802.11b/g devices under Part 97 of the FCC Rules and Regulations, allowing increased
power output but not commercial content or encryption.[2]
2.5 KEY 802.11 standards
MAPS supports key IEEE standards for WLANs including:
2.5.1 802.11e
Full Wi-Fi Multimedia standard plus MAC enhancements for QoS. Improves audio, video
(e.g., MPEG-2), and voice applications over wireless networks and allows network
administrators to give priority to time-sensitive traffic such as voice.
2.5.2 802.11i
Strengthens wireless security by incorporating stronger encryption techniques, such as the
Advanced Encryption Standard (AES), into the MAC layer. Adds pre-authentication support
for fast roaming between APs.
2.5.3 802.11n
Uses multiple-input, multiple-output (MIMO) techniques to boost wireless bandwidth and
range. Multiple radios create a robust signal that travels farther, with fewer dead spots.
3 Benefits of MAPS
Complete turnkey solution for building Wi-Fi devices with secure, managed access
points lessens OEMs' development costs, risk, and time to market.
Selected MAPS and SMBware networking and security modules enable OEMs to
easily differentiate products.
Adherence to standards enables:
802.11 a/b/g/n support for maximum flexibility and high performance.
PoE for simplified power requirements.
WEP, WPA, WPA2 (802.11i) for advanced security.
WDS, using a wireless medium, for a flexible and efficient distribution
mechanism.
MIMO technology for stronger signals and fewer dead spots.
Comprehensive management capabilities including secure remote management.
Support for a broad range of Wi-Fi chipsets.
Branding options offer a cost-effective, customized look and feel.
4 Technical Specifications of MAPS
The deployment scenario of MAPS can be shown as follows:

4.1 Interfaces
Ethernet connection to wired LAN (single or multiple)
DSL/Cable/Dialup/WWAN connection to ISP
Optional Ethernet LAN switch (managed/unmanaged)
Wi-Fi Supplicant upstream connection
4.2 Protocol Support
IP routing
Bridging
TCP/IP, UDP, ICMP
PPPoE, PPTP client
DHCP, NTP
RIP v1, v2
Optional IPSec (ESP, AH), IKE, IKEv2
IEEE 802.11 standards
4.3 Networking Capabilities
Static routing, dynamic routing
Unlimited users (subject to capacity)
Static IP address assignment
DHCP client for device IP configuration
4.4 DHCP Address Reservation
NAT or classical routing
Port triggering
UPnP
Configurable MTU
Multiple LAN sub-nets
802.11 MIB support
4.5 Device Management
Intuitive, easily brandable browser-based GUI
SNMP v2.c and v3 support
Advanced per-client, AP and radio statistics
Telnet and serial console CLI support
Remote management restricted to IP address or range
Custom remote management port
GUI-based firmware upgrade
SMTP authentication for email
5 Conclusion
In this paper, the general framework of Wireless LAN access point software called Managed
Access Point Solution (MAPS) is discussed. It explains how this software package combines
the latest 802.11 wireless standards with networking and security components. MAPS
enables OEMs/ODMs to deliver leading-edge Wi-Fi devices. The technical specifications,
features and benefits of MAPS have been highlighted. The software is designed with security
for Wi-Fi use and secure client software supporting personal and enterprise security modes.
References
[1] http://books.google.co.in/books?hl=en&id=uEc4njiIXhYC&dq=ieee+802.11+handbook&printsec=frontcover&source=web&ots=qH5LuA0v2y&sig=j_baDrrbtCrbEJZuoXT4mMpKk1s
[2] http://en.wikipedia.org/wiki/IEEE_802.11
[3] Bob O'Hara and Al Petrick, IEEE 802.11 Handbook: A Designer's Companion.
[4] http://www.teamf1.com
Autonomic Web Process for Customer
Loan Acquiring Process

V.M.K. Hari G. Srinivas
Department of MCA Department of Information Technology
Adikavinannaya University GITAM University
T. Siddartha Varma Rukmini Ravali Kota
Department of Information Technology Department of Information Technology
GITAM University GITAM University

Abstract

Web services were developed to address common problems associated with
RPC (Remote Procedure Calls). As web services are agnostic about
implementation details, operating system, programming language and platform,
they play a very important role in distributed computing, inter-process
communication and B2B interactions. Due to rapid developments in web service
technologies and semantic web services, it is possible to achieve automated
B2B interactions. Using these services, we apply an autonomic
web process to the customer loan acquiring process. In this work we
adopt the framework for dynamic configuration of web services and process
adaptation in case of events proposed by Kunal Verma, to provide improved
services to the loan acquiring process. We give details of how to
achieve autonomic computing features for the loan acquiring process. Because of
this autonomic loan acquiring process, a customer service can automatically configure
itself with optimal bank loan services to request loans and, if loan
approvals are delayed, it can take optimal actions by changing loan service
providers for risk avoidance.
1 Introduction
As web services provide an agnostic environment for B2B communication, by adding semantics
to the web service standards, web service publication, discovery and composition can be
automated [1][2]. We provide details about autonomic web process services for the customer loan
acquiring process. In order to achieve loan acquiring process automation, we try to define
loan service requirements, policies and constraints semantically. To add semantics to loan
services, we suggest a Bank Ontology for standard interfaces (operation, input, output,
exceptions). Using this Bank Ontology, each Bank's loan service interfaces are semantically
annotated and a WSDL-S file is generated using tools like MWSAF. These WSDL-S files are
published in UDDI structures so that their interfaces can be accessed globally through queries.
To search the UDDI semantically and syntactically, we provide a domain ontology for the loan
service interfaces of Banks.
Loan service policies are described using WS-Policy. Quantitative constraints are given as input
to an ILP solver in LINDO API matrix format. Logical constraints are represented as
SWRL rules and stored in the form of ontology rules [4][5].
In a dynamic configuration environment for the loan acquisition process, not all services are known
at configuration time. To adapt the process at run time, an abstract process with a well-defined
process flow and controls is needed. To create the abstract process, WS-BPEL constructs are
used [6]. This abstract process can be deployed in the IBM BPWS4J SOAP engine server. During
process execution, the service templates of the abstract process are replaced by actual services. By
analyzing the process constraints and policies, the loan acquiring process services are selected for
execution. During process execution, process state maintenance is required to handle events such as
delayed loan approvals and loan request cancellations. The process state for the various
services can be maintained with a service manager for each service. The service manager is modeled as an
MDP for handling the decision framework. Runtime changes are efficiently handled by the
METEOR-S architecture by using service managers [7].
2 Autonomic Loans Acquiring Process
In general, the online LOANS ACQUIRE PROCESS involves two modules: the Company Loans
Requirement module and the Loan Acquiring module.
Company Loans Requirement module functions:
They will set constraints (quantitative and logical) for LOANS ACQUIRE PROCESS.
Quantitative constraints include:
a. Maximum rate of interest of whole loan acquiring process.
b. Maximum rate of interest of each loan.
c. Maximum approval time of loan acquisition.
d. Maximum installment rate (EMI) for each loan.
e. Maximum number of equal monthly installments (EMIs).
Logical constraints include:
a. Faithfulness of Bank service.
b. Restrictions on acquiring more than one type of loan, etc.
Loan Acquiring module functions are:
a. It gets loan details from different Bank services.
b. It should select the best Bank services.
c. It must satisfy constraints of Company Loans Requirement module
d. It should place loan requests for optimal Bank services.
In this LOANS ACQUIRE PROCESS the Loan Acquiring module functions are done by
humans.
In this whole process the following events may occur:
Loan approvals are delayed or cancelled because the Bank services may not have the
required capital.
Physical failures, e.g. trusted bank services are not available.
Logical failures, e.g. some loan requests are cancelled due to the delay of other Bank
services.
To react optimally to the above events, an autonomic web process is used. An autonomic web process
is a web process with autonomic computing. By adding autonomic computing
features like self-configuration, self-healing and self-optimization to the web process, it provides
improved services to customers. [5]
3 Autonomic Computing Features for LOANS ACQUIRE PROCESS
3.1 Self Configuration
Whenever a new optimal Bank service is registered, the LOANS ACQUIRE PROCESS
should configure itself with the new optimal service without violating the constraints of the
Company Loans Requirement module.
3.2 Self Healing
LOANS ACQUIRE PROCESS should continuously detect and diagnose various failures (e.g.
preferred services are not available or delay in loan approval time).
3.3 Self Optimization
LOANS ACQUIRE PROCESS should monitor the optimality criteria attributes (e.g. rate of interest,
loan approval time) and reconfigure the process with new optimal suppliers. While
reconfiguring the process, LOANS ACQUIRE PROCESS should obey the constraints,
policies and requirements of the Company Loans Requirement module.
In order to achieve autonomic computing features, the LOANS ACQUIRE PROCESS
requirements, policies and constraints are represented semantically. To get an autonomic computing
environment, we can use existing web services standards extended with semantics. Web services
provide inter-process communication for the distributed computing paradigm. By adding
semantics to web service standards we can provide automated capabilities related to web
service composition, publication and discovery [7]. LOANS ACQUIRE PROCESS requires a
dynamic configuration environment with event handling capability. To provide this
environment we follow Kunal Verma's research work [7].
First we try to represent the policies, constraints and requirements of LOANS ACQUIRE PROCESS
semantically.
As LOANS ACQUIRE PROCESS requires details of various Bank services, we have to provide a
Bank services ontology so that a loan requirement process can use standard messages and
protocols for implementing the B2B interactions of LOANS ACQUIRE PROCESS. We define the loan
acquiring process in terms of 4 steps. (We are defining the loan acquiring process in favor
of the customer, so we do not consider the bank's loan approval process.)
1. Customers register with their details and the amount of loan required. (Assumption:
bank services may analyze the details of customers and then give details about the bank's
loans.)
2. It will get loan details from Bank services.
3. Analyze loan details from different bank services.
4. It will place loan requests with a trusted (faithful) Bank service.
A LOANS ACQUIRE PROCESS requesting loan services should have the following
capabilities:
1. Get Loan Details.
2. Loan Request.
3. Cancel Loan.
The Bank domain ontology standard provides standard messages for the above capabilities. It also
provides standard inputs, outputs, faults and assertions for the messages. All Bank loan
services should be semantically annotated with this ontology.

Fig. 1: LOANS ACQUIRE PROCESS steps (Start, Customer registration, Get Loan details, Analyze Constraints, Place Loan requests, Stop).

Fig. 2: Service templates annotated using the Bank Loan service ontology standard: Bank Service Interfaces with input and output messages (loanDetailsRequest/Response, loanRequestDetails/Response, loanCancelRequest/Response) for the getLoanDetails, Loan Request and Loan Cancel operations. [9][10]
4 Semantics about Bank Loan Web Services
Web services have primarily been designed for providing inter-operability between business
applications. WSDL is an XML standard for defining web services. As the number of Bank
supplier services increases, interoperability among these services is difficult because names
of service inputs, outputs and functions differ among services. We can solve this by relating service
elements to ontological concepts of the Bank ontology standard.
The semantic annotations on WSDL elements [2] used for annotating Bank loan web service
interfaces are:
Annotating loan service message types (XSD complex types and elements) can use:
Extension attribute: modelReference (semantic association)
Extension attribute: schemaMapping
Annotating loan service operations can use:
Extension elements: precondition and effect (child elements of the operation element)
Extension attribute: category (on the interface element)
Extension attribute: modelReference (action) (on the operation element)
A LOANS ACQUIRE PROCESS service template should consist of service-level metadata,
semantic operations and service-level policy assertions. To represent the service request template
we can use WSDL-S. WSDL-S provides semantic representations for input, output, operation,
preconditions, effects and faults using its attributes and elements. [11][12]
Service requesting template WSDL-S = <service level metadata, semantic operations, service
level policy assertions> (semantic template)
To provide autonomic communication and interoperability among bank web services,
semantics are added to the bank services. We add semantics to the web services by using the various
attributes of WSDL-S to relate Bank loan service elements to ontology concepts of the Bank
ontology standard. [11]
Web service descriptions have two principal entities:
1. Functions provided by the service.
2. Data (input, output) exchanged by the service.
Adding semantics to Data
Loan web service inputs and outputs can be semantically associated with the Bank Ontology standard
inputs and outputs using the modelReference attribute of WSDL-S. [12][19][2]
Mappings of the Bank web service's actual invocation details are needed to indicate the exact
correspondence of the data types of the two services. To solve this problem, SAWSDL provides two
attributes for schema mapping:
1. Lifting Schema Mapping (XML instance to ontology instance)
2. Lowering Schema Mapping (ontology instance to XML instance)
Adding semantics to Functions: A Bank service function should be semantically defined so
that it can be invoked from another service.
A semantic operation is represented by the following tuple: [12][19][2]
<Operation: FunctionalConcept,
input: SemanticType,
output: SemanticType,
fault: SemanticFault,
pre: SemanticPrecondition,
effect: SemanticEffect>
Well-defined interfaces (URIs) are provided by the WSDL-S services.
When a semantic operation is invoked, the service manager has to perform the following checks (a minimal sketch follows this list):
1. Is input of semantic type?
2. Are all preconditions satisfied?
3. Execute the operation.
4. Is output of semantic type?
5. After completion of operation are all effects satisfied?
6. If any fault thrown during operation execution, then throw Fault.
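A minimal sketch of these checks is given below; the helper names, the layout of the semantics dictionary and the toy getLoanDetails operation are assumptions for illustration, not part of the WSDL-S specification.

```python
# Sketch of a service manager wrapping a semantically described operation:
# typed input, preconditions, execution, typed output, effects, fault propagation.

class SemanticFault(Exception):
    pass

def invoke_semantic_operation(operation, payload, semantics):
    if not semantics["input_type"](payload):                         # 1. input of semantic type?
        raise SemanticFault("input does not match the declared semantic type")
    if not all(pre(payload) for pre in semantics["preconditions"]):  # 2. preconditions satisfied?
        raise SemanticFault("precondition violated")
    try:
        result = operation(payload)                                  # 3. execute the operation
    except Exception as fault:                                       # 6. propagate faults
        raise SemanticFault(f"operation raised a fault: {fault}")
    if not semantics["output_type"](result):                         # 4. output of semantic type?
        raise SemanticFault("output does not match the declared semantic type")
    if not all(eff(payload, result) for eff in semantics["effects"]):  # 5. effects hold?
        raise SemanticFault("declared effect not observed")
    return result

# Illustrative use with a toy getLoanDetails operation (values assumed).
semantics = {
    "input_type": lambda p: isinstance(p, dict) and "amount" in p,
    "preconditions": [lambda p: p["amount"] > 0],
    "output_type": lambda r: isinstance(r, dict) and "rate_of_interest" in r,
    "effects": [lambda p, r: r["amount"] == p["amount"]],
}
details = invoke_semantic_operation(
    lambda p: {"amount": p["amount"], "rate_of_interest": 10.5}, {"amount": 500000}, semantics)
print(details)
```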
Non-functional requirements of Loan web services can be specified using WS-Policy: [13]
The Web Services Policy Framework (WS-Policy) provides a general purpose model and
corresponding syntax to describe the policies of a web service. WS-Policy defines a base set
of constructs that can be used and extended by other Web services specifications to describe a
broad range of service requirements and capabilities.[13]
The goal of WS-Policy is to provide the mechanisms needed to enable Web services
applications to specify policy information. Specifically, this specification defines the
following:
An XML Info set called a policy expression that contains domain-specific, Web
Service policy information for e.g. loan services.
A core set of constructs to indicate how choices and/or combinations of domain
specific policy assertions apply in a (e.g. loan) Web services environment.
A policy is a collection of assertions. Each assertion can be defined using the following tuple:
Policy (P) = Union of Assertions (A)
A = <Domain attribute, operator, value, unit, assertion type, assertion category>
A Bank Service loan policy is expressed using WS-Policy. The policy contains information
about the expected delay probability and the penalties from various states. The Bank Service
loan policy has the following information:
The Loan service gives a probability of 85% for loan approval.
The Loan can be cancelled at any time based on the terms given below.
If the Loan has not been delayed but has not yet been approved, it can be cancelled
with a penalty of 5% to the customer.
If the Loan has been approved without a delay, it can be cancelled with a penalty of
20% to the customer.
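The following sketch encodes the assertion tuple A = <domain attribute, operator, value, unit, assertion type, assertion category> and populates it with the sample Bank loan policy above; the attribute names and categories are assumed for illustration.

```python
# Sketch of a policy as a union (here: a list) of assertion tuples.
from collections import namedtuple

Assertion = namedtuple(
    "Assertion",
    ["domain_attribute", "operator", "value", "unit", "assertion_type", "category"])

bank_loan_policy = [
    Assertion("loan_approval_probability", ">=", 85, "percent", "guarantee", "service_level"),
    Assertion("cancellation_penalty_before_approval", "==", 5, "percent", "penalty", "cancellation"),
    Assertion("cancellation_penalty_after_approval", "==", 20, "percent", "penalty", "cancellation"),
]

for assertion in bank_loan_policy:
    print(assertion)
```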
Autonomic Web Process for Customer Loan Acquiring Process 35
Copyright ICWS-2009
To create a dynamic configuration environment for the autonomic web process LOANS
ACQUIRE PROCESS [14][7], we adopt the 3 steps proposed by Kunal Verma: [7]
Abstract process creation.
Semantic web service discovery.
Constraints Analysis.
4.1 Abstract Process Creation for Loan Acquiring Process: [14][7][6]
To create the abstract process, all the constructs of WS-BPEL can be used [6]. WS-BPEL
provides a language for the specification of Executable and Abstract business processes. By
doing so, it extends the Web Services interaction model and enables it to support business
transactions. WS-BPEL defines an interoperable integration model that should facilitate the
expansion of automated process integration in both the intra corporate and the business-to-
business spaces.
Business processes can be described in two ways. Executable business processes model
actual behavior of a participant in a business interaction. Abstract business processes are
partially specified processes that are not intended to be executed.[6]
Abstract Processes serve a descriptive role, with more than one use case. One such use case
might be to describe the observable behavior of some or all of the services offered by an
executable Process. Another use case would be to define a process template that embodies
domain-specific best practices. Such a process template would capture essential process logic
in a manner compatible with a design-time representation, while excluding execution details
to be completed when mapping to an Executable Process.
The advantage of WS-BPEL is that the process can be configured by replacing semantic templates with
actual services at a later time. Since WSDL-S adds semantics to WSDL by using extensibility
attributes, it allows us to capture all the information in semantic templates and also makes the
abstract process executable.
4.2 Semantic Web Service Discovery: [14][15][7]
How does LOANS ACQUIRE PROCESS find out what loan web services are available that
meet its particular needs?
To answer this question we can use a UDDI registry. UDDI is a central, replicable registry of
information about web services, based on a catalog of services, and it supports lookup by both
humans and machines. UDDI catalogs three types of registrations [16]:
Yellow Pages - let you find services by various industry categories.
White Pages - let you find a business by its name or other characteristics.
Green Pages - provide an information model for how an organization does business
electronically, identifying business processes as well as how to use them.
In UDDI
Bank Loan Organizations populate registry with information about their web services.
UDDI Registry assigns a unique identifier to each service and business registration.
While storing these organizations and their services in the UDDI registry, we can follow the technical
note on storing services [4], i.e. semantic template information of the form:
WSDL-S = <service level metadata, union of semantic operations, service level policy
assertions>
Service-level metadata is stored in the Business Service template. Semantic operations are stored in
Binding templates under category bags. Semantic operation parts are stored in the key references
of the category bags.
While UDDI implementations only search for string matches, we can incorporate a SNOBASE-based
ontology inference search mechanism to also consider the Bank Loan Service Process
domain ontological relationships for matching. This discovery module can be implemented
using the UDDI4J API. [15]
To search the UDDI, WSDL and SOAP messages can be used.
4.3 Constraints Analysis [14][1][7]
In order to perform constraint analysis for the Loan Acquire Process, its constraints should be
represented in a consistent form for the ILP solver. The quantitative constraints of the loan acquiring
process can be described as follows [4].
Equations for Setup
1. Set the bounds on i and j, where i iterates over the number of different loans (M) for
which operations are to be selected and j iterates over the number of candidate loan
services for each loan, N(i). For example, M = 2, as operations have to be selected
for two activities, Personal Loan Request and Business Loan Request. Also,
since there are two candidate services for both operations, N(1) = 2 and N(2) = 2.
2. Create a binary variable for each selectable operation of a candidate service. Each
candidate service is assigned a binary variable: the candidate services for Personal
Loan Request (i = 1) are assigned $X_{11}$ and $X_{12}$, and the candidate services for Business Loan
Request (i = 2) are assigned $X_{21}$ and $X_{22}$.
3. Set up constraints stating that exactly one operation must be chosen for each activity:
$\sum_{j=1}^{N(1)} X_{1j} = 1$, i.e. $X_{11} + X_{12} = 1$  (a)
$\sum_{j=1}^{N(2)} X_{2j} = 1$, i.e. $X_{21} + X_{22} = 1$  (b)
Equations for Quantitative Constraints
4. It is also possible to have constraints on particular loan service activities. There is a
constraint on activity 1 (Business Loan Request) that the number of installments
(NEMIs) must be at least 30. This can be expressed as the following constraint:
$\sum_{j=1}^{N(1)} NEMIs_{j} \, X_{1j} \geq 30$
5. There is an entire-process constraint that the loan approval time of the process should be
at most 8 days:
$\sum_{i=1}^{M} \sum_{j=1}^{N(i)} ApprovalTime_{ij} \, X_{ij} \leq 8$
6. There is an entire-process constraint that the loan installment amount (EMI) of the process
should be at most 1200 rupees:
$\sum_{i=1}^{M} \sum_{j=1}^{N(i)} EMI_{ij} \, X_{ij} \leq 1200$
7. Create the objective function. In this case, the total interest should be minimized:
Minimize $\sum_{i=1}^{M} \sum_{j=1}^{N(i)} Interest_{ij} \, X_{ij}$
These equations can be given as input to the LINDO API for solving the constraints, which returns the
optimal services.
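The paper feeds the matrix form of these constraints to the LINDO API; purely as a hedged illustration, the sketch below expresses the same binary program with the open-source PuLP library instead, using made-up candidate data.

```python
# Illustrative binary program (PuLP stands in for LINDO; all data is assumed).
import pulp

banks = {  # (activity, candidate) -> illustrative attribute values
    (1, 1): {"nemis": 36, "approval_days": 5, "emi": 700, "interest": 10.5},
    (1, 2): {"nemis": 24, "approval_days": 3, "emi": 500, "interest": 9.8},
    (2, 1): {"nemis": 48, "approval_days": 4, "emi": 600, "interest": 11.0},
    (2, 2): {"nemis": 60, "approval_days": 2, "emi": 450, "interest": 10.2},
}

prob = pulp.LpProblem("loan_service_selection", pulp.LpMinimize)
x = {k: pulp.LpVariable(f"x_{k[0]}_{k[1]}", cat="Binary") for k in banks}

# Objective: minimise the total interest of the selected services.
prob += pulp.lpSum(banks[k]["interest"] * x[k] for k in banks)

# Exactly one candidate per activity.
for i in (1, 2):
    prob += pulp.lpSum(x[(i, j)] for j in (1, 2)) == 1

# Activity-level constraint: at least 30 installments for activity 1.
prob += pulp.lpSum(banks[(1, j)]["nemis"] * x[(1, j)] for j in (1, 2)) >= 30

# Process-level constraints: approval time and EMI bounds.
prob += pulp.lpSum(banks[k]["approval_days"] * x[k] for k in banks) <= 8
prob += pulp.lpSum(banks[k]["emi"] * x[k] for k in banks) <= 1200

prob.solve()
print([k for k in banks if x[k].value() == 1])
```

With the assumed data, the solver picks one candidate per activity while respecting the NEMIs, approval-time and EMI bounds, which is exactly the role LINDO plays in the proposed framework.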
To create the logical constraints, we first need to provide the LOANS ACQUIRE PROCESS domain
knowledge using ontology rules. Ontology rules are represented using SWRL (Semantic Web
Rule Language) and are stored in the form of an ontology. There are two aspects of logical
constraint analysis: Step 1) creating the rules based on the constraints at design time, and
Step 2) applying the SWRL reasoner to see if the constraints are satisfied at configuration
time [5]. Let us first examine creating the rules. These rules are created with the help of the
ontology shown in Figure 3. Here is a sample rule that captures the requirements outlined in
the motivating scenario.
1. SBBI Bank Loan Service 1 should be a trusted service. This is expressed in SWRL
abstract syntax using the following expression:
BankService(?S1) and faithfullness(?S1, trusted) => trustedService(?S1)
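The following is not a SWRL reasoner, only a toy forward-chaining check of the same rule over a small, assumed fact base, included to show what the reasoner verifies at configuration time.

```python
# Toy check of: BankService(?S1) and faithfullness(?S1, trusted) => trustedService(?S1)
facts = {
    ("BankService", "SBBI_Loan_Service_1"),
    ("faithfullness", ("SBBI_Loan_Service_1", "trusted")),
    ("BankService", "XYZ_Loan_Service"),
}

def trusted_services(facts):
    services = {s for (p, s) in facts if p == "BankService"}
    return {s for s in services if ("faithfullness", (s, "trusted")) in facts}

print(trusted_services(facts))  # {'SBBI_Loan_Service_1'}
```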

Fig. 3: Bank Domain Ontology (Bank services such as sbbi, sbhi, abbi and icii provide Home, Personal and Business Loans, each loan described by its NEMIs, Instalment and ROI attributes).
5 Conclusion and Future Work
In this work we provide a framework to create an autonomic loan acquiring process. We give a
loan acquiring process domain ontology standard for Bank loan service messages and their
interfaces, and we present the work with examples. This work explores ideas for an autonomic
loan acquiring process for improving customer services.
In future we intend to evaluate this autonomic loan process in the METEOR-S
environment and will provide the results of the autonomic loan acquiring process.
References
[1] R. Aggarwal, K. Verma, J. Miller and W. Milnor, Constraint Driven Web Service Composition in
METEOR-S, Proceedings of the 2004 IEEE International Conference on Service Computing (SCC 2004),
Shanghai, China, pp. 23-30, 2004.
[2] SAWSDL, Semantic Annotations for Web Services Description Language Working Group, 2006,
http://www.w3.org/2002/ws/sawsdl/
[3] A. Patil, S. Oundhakar, A. Sheth, K. Verma, METEOR-S Web service Annotation Framework, The
Proceedings of the Thirteenth International World Wide Web Conference (WWW 2004), New York, pp.
553-562, 2004.
[4] LINDO API for Optimization, http://www.lindo.com/
[5] J. Colgrave, K. Januszewski, L. Clément, T. Rogers, Using WSDL in a UDDI Registry, Version 2.0.2,
http://www.oasis-open.org/committees/uddi-spec/doc/tn/uddi-spec-tc-tn-wsdl-v202-20040631.htm
http://lsdis.cs.uga.edu/projects/METEOR-S
[6] SWRL, http://www.daml.org/2003/11/swrl/
[7] L. Lin, and I. B. Arpinar Discovery of Semantic Relations between Web Services, IEEE International
Conference on Web Services (ICWS 2006), Chicago, Illinois, 2006 (to appear).
[8] Web Services Business Process Execution Language Version 2.0, Public Review Draft (wsbpel-specification-draft-01), http://docs.oasis-open.org/wsbpel/2.0/, 23rd August, 2006.
[9] RosettaNet eBusiness Standards for the Global Supply Chain, http://www.rosettanet.org/; K. Verma, Configuration and Adaptation of Semantic Web Processes, Doctor of Philosophy dissertation, Athens, Georgia, 2006.
[10] Rosetta Net Ontology http://lsdis.cs.uga.edu/ projects/meteor-s/wsdl-/ontologies/rosetta.owl
[11] K. Sivashanmugam, K. Verma, A. P. Sheth, J. A. Miller, Adding Semantics to Web Services Standards,
Proceedings of the International Conference on Web Services (ICWS 2003), Las Vegas, Nevada, pp. 395-
401, 2003.
[12] Web Service Description Language (WSDL), www.w3.org/TR/ws
[13] Web Service Policy Framework (WS-Policy), available at http://www106.ibm.com/developerworks/library/
ws-polfram/, 2003.
[14] K. Verma, K. Gomadam, J. Lathem, A. P. Sheth, J. A. Miller, Semantics enabled Dynamic Process
Configuration. LSDIS Technical Report, March 2006.
[15] M. Paolucci, T. Kawamura, T. Payne and K. Sycara, Semantic Matching of Web Services Capabilities, The
Proceedings of the First International Semantic Web Conference, Sardinia, Italy, pp. 333-347, 2002.
[16] Universal Description, Discovery and Integration (UDDI), http://www.uddi.org
[17] K. Verma, R. Akkiraju, R. Goodwin, Semantic Matching of Web Service Policies, Proceedings of Second
International Workshop on Semantic and Dynamic Web Processes (SDWP 2005), Orlando, Florida, pp. 79-
90, 2005.
[18] K. Verma, A. Sheth, Autonomic Web Processes. In Proceedings of the Third International Conference on
Service Oriented Computing (ICSOC 2005), Vision Paper, Amsterdam, The Netherlands, pp. 1-11, 2005.
[19] K. Verma, P. Doshi, K. Gomadam, J. A. Miller, A. P. Sheth, Optimal Adaptation in Web Processes with
Coordination Constraints, Proceedings of the Fourth IEEE International Conference on Web Services
(ICWS 2006), Chicago, IL, 2006 (to appear).
[20] R. Bellman, Dynamic Programming and Stochastic Control Processes, Information and Control 1(3), pp. 228-239, 1958.
[21] WSDL-S, W3C Member Submission on Web Service Semantics, http://www.w3.org/Submission/WSDL-S/
[22] Web Service Modeling Language (WSML), http://www.wsmo.org/wsml/
Performance Evaluation of Traditional Focused
Crawler and Accelerated Focused Crawler

N.V.G. Sirisha Gadiraju G.V. Padma Raju
S.R.K.R Engineering College S.R.K.R Engineering College
Bhimavaram Bhimavaram
siri_gnvg@yahoo.co.in gvpadmaraju@gmail.com

Abstract

Search engines collect data from the Web by crawling it. In spite of
consuming enormous amounts of hardware and network resources, these
general-purpose crawlers end up fetching only a fraction of the visible
web. When the information need is only about a specific topic, a special type of
crawler called a topical crawler complements search engines. In this paper
we compare and evaluate the performance of two topical crawlers, the
Traditional Focused Crawler and the Accelerated Focused Crawler. A Bayesian
classifier guides these crawlers in fetching topic-relevant documents. The
crawlers are evaluated using two methods: one based on the number of topic-relevant
target pages found and retrieved, and the second based on the lexical
similarity between crawled pages and the topic descriptions provided
by the editors of Dmoz.org. Due to the limited amount of resources consumed
by these crawlers, they have applications in niche search engines and business
intelligence.
1. Introduction
The size of the publicly indexable World-Wide-Web has exceeded 23.68 billion pages in the
year 2008. This is very large compared to the one billion pages in the year 2000. Dynamic
content on the web is also growing day by day. Search engines are therefore increasingly
challenged when trying to maintain current indices using exhaustive crawling. Exhaustive
crawls also consume vast storage and bandwidth resources.
Focused crawlers [Chakrabarti et al., 1999] aim to search and retrieve only the subset of the
World-Wide Web that pertains to a specific topic of relevance. Due to the limited resources
used by a good focused crawler, users can run them on their own PCs.
The major problem in focused crawling is that of properly assigning credit to all pages along
a crawl route that yields highly relevant documents. A classifier can be used to assign credit
(priority) to unvisited URLs. The classifier is trained with positive and negative example pages
for a specific topic and is then used to predict the relevance of an unvisited URL to that
topic. The Naive Bayesian classifier is a popular classifier used to automatically tag
documents; it is based on the fact that if we know the probabilities of words (features)
appearing in a certain category of document, then, given the set of words (features) in a new
document, we can predict the relevance of the new document to the given category
or topic. Relevance can take any value between 0 and 1.
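As an illustration of this idea (with toy training text, not the ODP data used in the experiments), the sketch below trains a multinomial naive Bayes classifier and uses its class probability as a relevance score in [0, 1].

```python
# Toy relevance scoring with a naive Bayes classifier (data is assumed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

positive = ["mutual funds investment portfolio returns", "fund manager equity bond scheme"]
negative = ["roller skating rink wheels tricks", "database theory relational algebra queries"]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(positive + negative)
y = [1] * len(positive) + [0] * len(negative)

classifier = MultinomialNB().fit(X, y)

candidate = ["top performing mutual funds and their portfolio returns"]
relevance = classifier.predict_proba(vectorizer.transform(candidate))[0][1]
print(f"estimated relevance: {relevance:.2f}")
```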
A major characteristic, and difficulty, of the text classification problem is the high
dimensionality of the feature space. The native feature space consists of the unique terms
(words or phrases) that occur in the documents, which can be tens or hundreds of thousands of
terms for even a moderate-sized text corpus. This is prohibitively high for the classification
algorithm. It is highly desirable to reduce the number of features automatically without
sacrificing classification accuracy. Stop word removal [Sirotkin et al., 1992] and stemming
[Porter, 1980], along with a term goodness criterion like document frequency, help in
achieving the desired degree of term elimination from the full vocabulary of a document
corpus.
The remainder of this paper is structured as follows: Section 2 describes the two crawling
methods. Section 3 describes the document frequency feature selection method. Section 4
states the procedure used in obtaining the test data; the evaluation schemes and a comparison
of results are also given in Section 4. Section 5 summarizes our conclusions.
2. Traditional Focused Crawling and Accelerated Focused Crawling
The Traditional Focused Crawler and the Accelerated Focused Crawler start from a set of topic-relevant
URLs called seed URLs. The documents represented by these seed URLs are
fetched and the links embedded in these seed documents are collected. The relevancy of these
links to the target topic is determined by means of a classifier. The links are then added to a
priority queue of unvisited links with a priority equal to the relevancy calculated above.
The document representing the link at the front of the queue is fetched, and the process
repeats until a predefined goal is attained.
In determining the relevancy of a link to the target topic, the Traditional Focused Crawler uses
features of the parent page, whereas the Accelerated Focused Crawler uses features around the
link itself. Both crawlers are trained on a set of topic-relevant documents known as seed
pages. The Accelerated Focused Crawler is also trained on documents retrieved by the Traditional
Focused Crawler.
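The best-first crawl loop shared by both crawlers can be sketched as follows; fetch_page, extract_links and score are placeholders standing in for the real downloader, DOM parser and trained classifier.

```python
# Minimal best-first (focused) crawl loop driven by a priority queue.
import heapq

def focused_crawl(seed_urls, fetch_page, extract_links, score, max_pages=100):
    frontier = [(-1.0, url) for url in seed_urls]   # max-priority via negated scores
    heapq.heapify(frontier)
    visited, crawled = set(), []

    while frontier and len(crawled) < max_pages:
        priority, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch_page(url)
        if page is None:
            continue
        crawled.append((url, -priority))
        for link in extract_links(page):
            if link not in visited:
                # TFC scores the link using features of the parent page,
                # AFC using features around the link itself.
                heapq.heappush(frontier, (-score(page, link), link))
    return crawled

# Toy run over an in-memory "web" so the sketch is executable.
web = {"seed": ["a", "b"], "a": ["c"], "b": [], "c": []}
result = focused_crawl(
    ["seed"],
    fetch_page=lambda u: web.get(u),
    extract_links=lambda page: page,
    score=lambda page, link: 0.5,
    max_pages=10)
print(result)
```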
3. Feature Selection Method
Features are gathered from the DOM tree [Chakrabarti et al., 2002] representation of the
document using a DOM parser. Of all the features gathered from the document corpus, good
features are selected using the document frequency criterion.
Document frequency (DF) is the number of documents in which a feature appears. Only the
terms that occur in a large number of documents are retained. DF thresholding is the simplest
technique for vocabulary reduction. It scales easily to very large corpora with an
approximately linear computational complexity [Yang et al., 1997] in the number of training
documents. The DF of a feature is given by

$DF(w) = \sum_{i=1}^{n} \frac{\text{No. of documents in class } i \text{ containing the word } w}{\text{Total no. of documents in class } i}$
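A small sketch of DF thresholding over a toy two-class corpus is shown below; the documents, terms and cut-off value are assumed for illustration.

```python
# DF thresholding: keep terms whose summed per-class document frequency
# (as defined above) meets an assumed cut-off.
from collections import defaultdict

classes = {
    "positive": [{"funds", "mutual", "returns"}, {"funds", "portfolio"}, {"mutual", "funds"}],
    "negative": [{"skating", "rink"}, {"skating", "wheels", "funds"}],
}

df = defaultdict(float)
for docs in classes.values():
    n_docs = len(docs)
    for doc in docs:
        for term in doc:
            df[term] += 1.0 / n_docs  # fraction of this class's documents containing the term

threshold = 0.9
selected = {term for term, score in df.items() if score >= threshold}
print(sorted(selected))  # the retained vocabulary
```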

4. Experiments
4.1. Test Beds Creation
For evaluating the crawlers, the topic-relevant seed URLs, training URLs, target URLs and
topic descriptions are collected from the content.rdf file of the Open Directory Project (ODP). Topics
that are at a distance of 3 from the ODP root are picked, and all topics with fewer than 100
relevant URLs are removed so that we have topics with a critical mass of URLs for training
and evaluation. Among these, the topics Food_Service, Model_Aviation,
Nonprofit_Resources, Database_Theory, Mutual_Funds and Roller_Skating are actually crawled.
The ODP relevant set for a given topic is divided into two random subsets. The first set is the
seeds. This set of URLs was used to initialize the crawl as well as provide the set of positive
examples to train the classifiers. The second set is the targets. It is a holdout set that was not
used either to train or to seed the crawlers. These targets were used only for evaluating the
crawlers. The topics were divided into two test beds. The crawlers have crawled all the topics
present in the two test beds. The first test bed is Food_Service, Model_Aviation,
Nonprofit_Resources. The second test bed is Mutual_Funds, Roller_Skating,
Database_Theory. For training a classifier we need both positive and negative examples. The
positive examples for a topic are the pages corresponding to the seed URLs for that topic in
the test bed. The negative examples are the set of seed URLs of other two topics in the same
test bed.
4.2 Evaluation Scheme
Table 1: Evaluation schemes. $S_c^t$ is the set of pages crawled by crawler $c$ at time $t$, $T_d$ is the target set, and $D_d$ and $p$ are the vectors representing the topic description and a crawled page respectively; $\sigma$ is the cosine similarity function.

Relevance assessment based on target pages:
Recall = $|S_c^t \cap T_d| / |T_d|$, Precision = $|S_c^t \cap T_d| / |S_c^t|$
Relevance assessment based on target descriptions:
Recall = $\sum_{p \in S_c^t} \sigma(p, D_d)$, Precision = $\sum_{p \in S_c^t} \sigma(p, D_d) / |S_c^t|$
The crawlers' effectiveness is measured using two measures, recall and precision.
Table 1 shows the assessment schemes used in this paper. It consists of two sets of crawler
effectiveness measures differentiated mainly by the source of evidence used to assess relevance.
The first set (precision, recall) focuses only on the target pages that have been identified for
the topic. The second set (precision, recall) employs relevance assessments based on the
lexical similarity between crawled pages (whether or not they are in the target set) and the topic
descriptions.
descriptions. All the four measures are dynamic in that they provide a temporal
characterization of the crawl strategy. It is suggested that these four measures are sufficient to
provide a reasonably complete picture of the crawl effectiveness [Pant et al., 2001].
To find the cosine similarity between a crawled page and the topic description for the
specific topic, both have to be represented in a mutually compatible format. For this,
features and their term frequency values are extracted from the crawled page as well as from the
topic description, and the vectors $p$ and $D$ are constructed. The cosine similarity function $\sigma$
is given by

$\sigma(p, D) = \frac{\sum_{i} p_i D_i}{\sqrt{(\sum_{i} p_i^2)(\sum_{i} D_i^2)}}$

where $p_i$ and $D_i$ are the term frequency weights of term $i$ in page $p$ and topic description $D$
respectively. Recall for the full crawl is estimated by summing up the cosine similarity scores
over all the crawled pages. For precision, the proportion of retrieved pages that is relevant is
estimated as the average similarity of crawled pages.
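The description-based measures can be sketched as follows, with toy term-frequency vectors in place of real crawled pages: recall is the summed cosine similarity and precision the average similarity.

```python
# Cosine similarity between term-frequency vectors, and the description-based
# recall (summed similarity) and precision (average similarity). Data is assumed.
import math
from collections import Counter

def cosine(p, d):
    terms = set(p) | set(d)
    dot = sum(p.get(t, 0) * d.get(t, 0) for t in terms)
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

topic_description = Counter("mutual funds investment returns portfolio".split())
crawled_pages = [
    Counter("mutual funds offer diversified portfolio returns".split()),
    Counter("roller skating rink opening hours".split()),
]

similarities = [cosine(page, topic_description) for page in crawled_pages]
dmoz_recall = sum(similarities)                    # summed over all crawled pages
dmoz_precision = dmoz_recall / len(crawled_pages)  # average similarity
print(f"recall={dmoz_recall:.3f} precision={dmoz_precision:.3f}")
```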
4.3 Results Comparison
Offline analysis is done after all the pages have been downloaded and the experiments are
over. Precision and Recall are measured as shown in Table 1. The horizontal axis in all the
plots is time approximated by the number of pages crawled. The vertical axis shows the
performance metric i.e. recall or precision.Nature of topic is said to have impact on crawler
performance [Chakrabarti et al., 1999; Pant et al., 2001]. From the Figures 1 and 2 we can see
that the crawlers have fetched more number of topic relevant target pages when crawled for a
co-operative topic like Nonprofit_Resources. The effect of the size of training data and test
data on crawler performance is also studied. When crawled for the topics Food_Services,
Model_Aviation and Nonprofit_Resources training data is large (nearly 200 to 300
documents relevant to topic). Crawlers were able to fetch more number of target pages as
shown by recall values in Figures 1 and 2.

Fig. 1: Target Recall of Traditional Focused Crawler Fig. 2: Target Recall of Accelerated Focused Crawler

Fig. 3: Target recall of Traditional Focused Crawler Fig. 4: Target recall of Accelerated Focused Crawler
The number of URLs specified as targets is only 30 for each of the topics in this test bed. When crawling for the topics Database_Theory, Mutual_Funds and Roller_Skating, the crawlers were trained with only 75 documents relevant to the specified topic, and the number of URLs specified as targets is 25 for Database_Theory, 306 for Mutual_Funds and 455 for Roller_Skating. Though the target set is large in the second test bed, the crawlers were able to
fetch only a few of them, as shown in Figures 3 and 4. This shows that when a classifier is used to guide the crawlers, good training (a large set of training data) yields good crawler performance. The Accelerated Focused Crawler is trained over the documents fetched by the Traditional Focused Crawler, i.e. it has a large training data set. This once again suggests using the Accelerated Focused Crawler instead of the Traditional Focused Crawler for fetching topic-relevant documents.

Fig. 5: DMoz recall for Traditional Focused Crawler. Fig. 6: DMoz recall for Accelerated Focused Crawler

Fig. 7: DMoz recall of Traditional Focused Crawler Fig. 8: DMoz recall of Accelerated Focused Crawler
When the crawlers were evaluated using the lexical similarity between crawled pages (whether or not they are in the target set) and topic descriptions, high recall values were attained by the Accelerated Focused Crawler for all topics, as shown in Figures 5, 6, 7 and 8. In Figure 8 the DMoz recall values are in descending order of magnitude for the topics Mutual_Funds, Roller_Skating and Database_Theory. This is because Mutual_Funds was described briefly compared to Roller_Skating, and Roller_Skating was described briefly compared to Database_Theory, by the DMoz editors. This result suggests that when crawlers are driven by keyword queries or topic descriptions it is better to describe the topic or theme using only, and all, prominent terms of the topic. This results in collecting more pages relevant to the topic.
Figure 10 shows that when the available training set of documents is small (as in the case of Database_Theory, Mutual_Funds and Roller_Skating) the Accelerated Focused Crawler has
outperformed the Traditional Focused Crawler in average target recall. Figure 9 shows that the performance of the Traditional Focused Crawler and the Accelerated Focused Crawler is nearly the same when the training data set is large. The Traditional Focused Crawler found 0 targets for the topic Database_Theory (Figure 3). This is because the target set size is only 27 for this topic, whereas for the other two topics in the same test bed the number of targets is greater than 300.

Fig. 9: Average target recall of the crawlers. Topics-FMN   Fig. 10: Average target recall of the crawlers. Topics-DMR
These target sets were not intentionally chosen that way; rather, the number of links specified as relevant by the DMoz editors for the topic Database_Theory was small compared to the other two topics. This result may indicate that Database_Theory is a less popular topic on the Web. The Accelerated Focused Crawler (Figure 4) has shown better performance in this case also. This shows that the popularity of a topic also affects crawler performance.
5 Conclusion
The Accelerated Focused Crawler is a simple enhancement of the Traditional Focused Crawler. It assigns better priorities to the unvisited URLs in the crawl frontier. No manual training is given to the crawlers; they are trained with documents relevant to the topic gathered from DMoz.org. Features are extracted from the DOM representation of the parent (source) page, which is simple compared to other techniques. When only a small training data set is available, it is advisable to use the Traditional Focused Crawler only to train the Accelerated Focused Crawler, and not for fetching topic-relevant documents. When the training data is large we can make use of the Traditional Focused Crawler for fetching topic-relevant documents. In either case the Accelerated Focused Crawler performed well compared to the Traditional Focused Crawler. The popularity of the topic, as well as its nature, i.e. whether it is competitive or collaborative, also has an effect on crawler performance. When crawlers are driven by keyword queries or topic descriptions, describing the topic or theme using only, and all, prominent terms of the topic enhances the performance of the crawl.
References
[1] [Chakrabarti et al., 1999] S. Chakrabarti, M. van den Berg and B. Dom, Focused crawling: a new approach to topic-specific web resource discovery. In Proceedings of the Eighth International Conference on World Wide Web, Toronto, Canada, pages 1623-1640, 1999.
[2] [Chakrabarti et al., 2002] S. Chakrabarti, K. Punera and M. Subramanyam, Accelerated Focused Crawling through Online Relevance Feedback. In Proceedings of the 11th International Conference on World Wide Web, Honolulu, Hawaii, USA, pages 148-159, 2002.
[3] [Pant et al., 2001] F. Menczer, G. Pant, M. E. Ruiz and P. Srinivasan, Evaluating topic-driven Web crawlers. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, USA, pages 241-249, 2001.
[4] [Porter, 1980] M. F. Porter, An algorithm for suffix stripping. Program, Volume 14, Issue 3, pages 130-137, 1980.
[5] [Sirotkin et al., 1992] J. W. Wilbur and K. Sirotkin, The automatic identification of stop words. Journal of Information Science, Volume 18, Issue 1, pages 45-55, 1992.
[6] [Yang et al., 1997] Y. Yang and J. O. Pedersen, A Comparative Study on Feature Selection in Text Categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pages 412-420, 1997.
A Semantic Web Approach for Improving Ranking
Model of Web Documents

Kumar Saurabh Bisht Sanjay Chaudhary
DA-IICT DA-IICT
Gandhinagar, Gujarat 382007 Gandhinagar, Gujarat 382007
kumar_bisht@daiict.ac.in sanjay_chaudhary@daiict.ac.in

Abstract

Ranking models are used by Web search engines to answer user queries based on keywords. Traditionally, ranking models are based on a static snapshot of the Web graph, which is essentially the link structure of the Web documents. A visitor's browsing activity is directly related to the importance of a document; however, in this traditional static model the document importance arising from interactive browsing is neglected. Thus this model lacks the ability to take advantage of user interaction for document ranking.
In this paper we propose a model based on the Semantic Web to improve the local ranking of Web documents. The model uses the Ant Colony algorithm to enable Web servers to interact with Web surfers and thus improve the local ranking of Web documents. The local ranking can then be used to generate the global Web ranking.
1 Introduction
In today's ICT era, information seeking has become a part of social behavior. With the plethora of information available on the Web, an efficient mechanism for information retrieval is of primary importance. Search engines are an important tool for finding information on the Web. All of the search engines try to retrieve data based on the ranking of Web documents. However, traditional ranking models rely on a static snapshot of the Web graph, including the document content and the information conveyed. An important missing part of this static model is that the information based on users' interactive browsing is not accounted for. This affects the relevancy and importance metrics of a document, as the judgments of users collected at run time can be very important.
In this paper we propose a model based on the Semantic Web that enables a web server to record the interactive experience of users.
The given approach works on three levels:
1. To make the existing model flexible, i.e. the metrics related to relevance and importance can be modified according to run-time/browsing-time user judgments.
2. The new enhancement should automate the processing of the metrics recorded during browsing time.
3. The model should enable the web server to play an active role in the user's choice of highly ranked pages.
Our model has two components:
1. An ontology that keeps the interactive experience of user in machine understandable
form.
2. A processing module based on Ant algorithm.
Preliminary experiments have shown encouraging results in the improvement of local
document ranking.
The following sections give details about the Semantic Web and related terminology, followed by a brief discussion of the Ant algorithm, the proposed approach, and the experimental results with conclusions.
2 Semantic Web
The Semantic Web is an evolving extension of the Web in which the semantics of information and services on the web are defined [Lee 2007], which enables information to be processed not only by people but also by machines. At its core, the Semantic Web framework comprises a set of design standards and technologies.

Fig. 1: A typical RDF triple in ontology from proposed approach
The formal specifications for information representation in the Semantic Web are the Resource Description Framework (RDF), a metadata model for modeling information through a variety of syntax formats, and the Web Ontology Language (OWL). The RDF metadata model is based upon the idea of making statements about Web resources in the form of subject-predicate-object expressions called RDF triples. Figure 1 provides a formal description of concepts, terms and relationships within a given knowledge domain.
2.1 Terminologies
2.1.1 Ontology
An ontology is an explicit and formal specification of a conceptualization [Antoniou and Harmelen, 2008]. An ontology formally describes a domain of discourse; it is a formal representation of a set of concepts within a domain and the relationships between those concepts. Typically, an ontology consists of a finite list of terms and the relationships between these terms. The terms denote important concepts (classes of objects) of the domain, and the relationships typically include hierarchies of classes. See Figure 1, where hasImportance is a relationship between the two concepts document and importance, and document1 and X are instances of these concepts. The major advantage of ontologies is that they support semantic interoperability and hence provide a shared understanding of concepts. Ontologies can be developed using data models like RDF and OWL.
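For illustration only, the triple of Figure 1 could be written with the rdflib library as follows (rdflib and the namespace URI are assumptions; the paper does not prescribe an implementation):

from rdflib import Graph, Namespace, Literal

# Hypothetical namespace mirroring the resource IDs shown in Figure 1.
DOC = Namespace("http://example.org/system_document/world/")

g = Graph()
# Triple: document1 --hasImportance--> X (the importance value).
g.add((DOC.document1, DOC.hasImportance, Literal("X")))

print(g.serialize(format="turtle"))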
2.1.2 Owl
The Web Ontology Language (OWL) is a family of knowledge representation languages for
authoring ontologies, and is endorsed by the World Wide Web Consortium [Dean et al.,
2004]. The data described by an OWL ontology is interpreted as a set of "individuals" and a
set of "property assertions" which relate these individuals to each other. An OWL ontology
consists of a set of axioms which place constraints on sets of individuals (called "classes")
and the types of relationships permitted between them. These axioms provide semantics by
allowing systems to infer additional information based on the data explicitly provided
[Baader et al., 2003].
3 Ant Colonies
Ant colonies are a highly distributed and structured social organization [Dorigo et al., 1991]. On account of this structure these colonies can perform complex tasks, which has formed the basis of various models for the design of algorithms for optimization and distributed control problems. Several aspects of ant colonies have inspired different ant algorithms suited for different purposes, and these kinds of algorithms are good at dealing with distributed problems. One of these is Ant Colony Optimization, which works on the principle of ants' coordination by depositing a chemical on the ground. These chemicals are called pheromones and they are used for marking paths on the ground, which increases the probability that other ants will follow the same path.
The functioning of an ACO algorithm can be summarized as follows. A set of computational, concurrent and asynchronous agents (a colony of ants) moves through paths looking for food. Whenever an ant encounters an obstacle it moves either left or right based on a decision policy. The decision policy is based on two parameters, called trails (τ) and attractiveness (η). The trail refers to the pheromones deposited by preceding ants, and an ant following that path increases the attractiveness of that path by depositing more pheromones on it. Each path's attractiveness decreases with time as the trail evaporates (update) [Colorni et al., 1991]. With more and more ants following the shortest path to food, the pheromone trail of that path keeps increasing and hence the shortest, optimal path is found.
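The text does not give the exact decision rule; a common ACO formulation, sketched below under that assumption, chooses the next path with probability proportional to the trail τ raised to a power α times the attractiveness η raised to a power β, and periodically evaporates the trails:

import random

def choose_path(paths, alpha=1.0, beta=2.0):
    """Pick the next path with probability proportional to trail^alpha * attractiveness^beta.

    paths: list of (trail, attractiveness) pairs; alpha and beta are the usual
    ACO weighting exponents (the values here are illustrative).
    """
    weights = [(tau ** alpha) * (eta ** beta) for tau, eta in paths]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(range(len(paths)), weights=probs, k=1)[0]

def evaporate(trails, rho=0.1):
    """Periodic evaporation: each trail decays by a factor (1 - rho)."""
    return [tau * (1.0 - rho) for tau in trails]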
4 Proposed Model
In our model we emulate web surfing with the Ant Colony model. The pheromone counter of a link represents the attractiveness of the path to the desired Web document. The more hits a link receives, the more important it is; at the same time, continued hits are necessary to maintain this level of importance because the pheromone counter dwindles with time. The Web surfers are the ants that navigate through the links of the Web documents to reach particular information. The Web server in this model is not a passive listener that merely caters to the requests of users; it is also the maintaining agent that records the pheromone of web links and ensures that updating of the pheromone is taken care of by the processing module.
4.1 Model Working
People looking for information visit web pages through various links/pages. Every visit is converted by the server into a pheromone count and recorded. If a person does not find a page useful, he or she will not visit that page again and the pheromone count of that page will dwindle with time, reducing its attractiveness, whereas repeated visits will increase the attractiveness of the page by increasing its pheromone count. The web server records the pheromone count and also other interaction counts (more detail in the following section on server-side enhancement).
4.1.1 Server Side Enhancement
The server maintains the pheromone count and other interactions corresponding to a page in an ontology and also periodically updates them. Figure 2 shows a sample ontology. Currently the following interaction metrics are recorded in the ontology:
1. Number of hits.
2. Visitor evaluation (1 = informative, 0 = not informative) of the relevance of the page.
3. Time stamp (last visit).
<owl:Class rdf:ID="hits">
<rdfs: rdf:resource="h_200 "/>
</owl:Class>
<owl:Class rdf:ID="evaluation">
<rdfs: rdf:resource="e_1 "/>
<owl:Class rdf:ID="Date">
<rdfs:subClassOf>
<owl:Class rdf:ID="time_stamp"/>
</rdfs:subClassOf>
</owl:Class>
<owl:Class rdf:ID="time_hr">
<rdfs:subClassOf rdf:resource="#time_stamp"/>
</owl:Class>
<Date rdf:ID="d_11_7_2008"/>
<time_hr rdf:ID="t_1320"/>
Fig. 2: Server ontology
The above ontology shows that the document was visited on 11 July 2008 at 1:20 pm and that it was the 200th hit.
4.1.2 Pheromone Representation
Now we can sum up the entire picture by describing how the pheromone count works. The role of the pheromone is to record the trail and thus indicate the importance of the link/document. The count changes continually based on the time stamp of the last visit (i.e. the time elapsed since the last count change caused by a visit).
Here is how it works:
The pheromone associated with the link/document is defined as:

  P_count : D → {V, T}    (1)
where V is the pheromone density at a particular time and T is the time stamp of the last visit. The value of V can be updated in two ways:
Positive update: when the user visits the page, and when the user gives a positive evaluation of the page.
Negative update: with time, evaporation decreases the pheromone count; a negative user evaluation of the page also decreases it.
It may be noted that equal weight is given to the user visit and the user input, to avoid malicious degradation of the pheromone: the visit cancels a malicious input. For example, if a user repeatedly visits a page and gives the negative input 0, the update also accounts for the visit, so the net negative update is 0; however, evaporation continues to take place from the last time-stamped pheromone count.
The pheromone accumulation of a page at the (n+1)-th visit is done as follows:

  P_new = P_current + 1    (2)
The negative pheromone mechanism is realized by using the radioactive-decay (half-life) formula:

  P_count(t) = P_count(T) * (1/2)^((t − T) / λ)    (3)

where λ is the degradation parameter, set heuristically, and T is the last time stamp of update, so P_count at time t depends on P_count at the last update.
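A minimal sketch of the pheromone record and its updates, assuming the half-life reading of Eq. (3) reconstructed above (the class and attribute names are illustrative, not from the paper):

import time

class Pheromone:
    """Pheromone record {V, T} for one document, following Eqs. (1)-(3)."""

    def __init__(self, half_life):
        self.value = 0.0            # V: pheromone density
        self.stamp = time.time()    # T: time stamp of last visit
        self.half_life = half_life  # lambda, set heuristically

    def decay(self, now=None):
        """Evaporation (Eq. 3): halve the count every half_life seconds."""
        now = time.time() if now is None else now
        elapsed = now - self.stamp
        self.value *= 0.5 ** (elapsed / self.half_life)
        self.stamp = now

    def visit(self, evaluation):
        """Positive update on a visit (Eq. 2); a negative evaluation (0) is
        cancelled by the visit itself, as described in the text, while the
        evaporation applied in decay() still takes effect."""
        self.decay()
        self.value += 1.0
        if evaluation == 0:
            self.value -= 1.0   # net change 0 for a malicious negative input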
5 Experiment Results and Conclusion
We used the proposed model to ascertain the local page rank on a server set up using Apache Tomcat 5.5 for a collection of 70 web documents. Tables 1 and 2 show the observed results.
Table 1: Results for λ = 2
  Percentage of pages ranked within an error margin of 10%: 48*
  Percentage of pages ranked within an error margin of 25%: 56*
  Percentage of pages ranked within an error margin of 40%: 73*

Table 2: Results for λ = 4
  Percentage of pages ranked within an error margin of 10%: 21*
  Percentage of pages ranked within an error margin of 25%: 62*
  Percentage of pages ranked within an error margin of 40%: 87*

* Result values are approximated
The results clearly show the potential of this model. More than 50% of the pages were ranked within an error margin of 25%, which is encouraging in view of the sandbox environment of the experiment. In this paper we have presented our idea, based on the Ant Colony algorithm, in the context of learning and web data mining. The proof-of-concept implementation of the proposed model shows improvements over the existing system.
Future work holds promise in improving the current Ant Colony algorithm by making it more relevant based on the information gained from user experience. Other optimizations lie in the fine-tuning of the parameters and the increment strategy of pheromone accumulation. We expect better results with more fine-tuning of the approach in the future.
References
[1] [Antoniou and Harmelen, 2008] Grigoris Antoniou and Frank van Harmelen, A Semantic Web Primer, pp.
11, MIT Press, Cambridge, Massachusetts, 2008.
[2] [Dorigo, Maniezzo and Colorni, 1991] M. Dorigo, V. Maniezzo, and A. Colorni, The ant system: an
autocatalytic optimizing process, Technical Report TR91-016, Politecnico di Milano, 1991.
[3] [Lee 2007], Tim Berners-Lee, MIT Technology Review, 2007
[4] [Dean et al., 2004] M Dean, G Schreiber, W3C reference on OWL, W3C document, 2004
[5] [Colorni et al., 1991] A. Colorni, M. Dorigo, and V. Maniezzo, Distributed optimization by ant colonies,
Proceedings of ECAL'91, European Conference on Artificial Life, Elsevier Publishing, Amsterdam, 1991.
[6] [Baader et al., 2003] F. Baader, D. Calvanese, D. McGuinness, D. Nardi, P. F. Patel-Schneider (Eds.), The
Description Logic Handbook: Theory, Implementation, and Applications, Cambridge University Press,
2003
Crawl Only Dissimilar Pages: A Novel and Effective
Approach for Crawler Resource Utilization

Monika Mangla
Terna Engineering College, Navi Mumbai
manglamona@gmail.com

Abstract

Usage of search engines has become a significant part of today's life [Page and Brin, 1998]. While using search engines, we come across many web documents that are replicated on the internet [Brin and Page, 1998] [Bharat and Broder, 1999]. Identification of such replicated sites is an important task for search engines, since replication limits crawler performance (processing time, data storage cost) [Burner, 1997]. Sometimes even entire collections (such as JAVA FAQs or Linux manuals) are replicated, which limits the usage of system resources [Dean and Henzinger, 1999]. In this paper, the usage of graphs is discussed to avoid crawling a web page if a mirrored version of that page has been crawled earlier. Crawling only dissimilar web pages enhances the effectiveness of the web crawler. The paper discusses how to represent web sites in the form of a graph, and how this graph representation is to be exploited for crawling non-mirrored web pages only, so that similar web pages are not crawled multiple times. The proposed method is capable of handling the challenge of finding replicas among an input set of several million web pages containing hundreds of gigabytes of textual data.
Keywords: Site replication; Mirror; Search engines; Collection
1 Introduction
The World Wide Web (WWW) is a vast and ever-growing source of information organized in the form of a large distributed hypertext system [Slattery and Ghani, 2001]. The web has more than 350 million pages and it is growing at the rate of about one million pages per day. Such enormous growth and flux necessitates the creation of highly efficient crawling systems [Smith, 1997] [Pinkerton, 1998]. The World Wide Web depends upon crawlers (also known as robots or spiders) for acquiring relevant web pages [Miller and Bharat, 1998]. A crawler follows the hyperlinks present in documents to move from one web page to another, or sometimes from one web site to another. Many documents across the web are replicated; sometimes entire document collections are replicated over multiple sites. For example, the documents containing the JAVA Frequently Asked Questions (FAQs) are found replicated over many sites, which results in accessing the same documents a number of times, thus limiting resource utilization and the effectiveness of the crawler [Cho, Shivkumar and Molina]. Other examples of replicated collections are C tutorials, C++ tutorials and Windows manuals. Even the same job opening is advertised on multiple sites linking to the same web page. Replicated collections consist of thousands of pages which are mirrored on several sites,
in the order of tens or sometimes even hundreds. A considerable amount of crawling resources is spent crawling these mirrored or similar pages multiple times.
If a method is devised to crawl a page if and only if no similar page has been crawled earlier, then visiting similar web pages multiple times can be avoided and resources can be utilized in an effective and efficient manner. This paper suggests a methodology that utilizes a graph representation of web sites for performing this task. Section 2 focuses on the general structure of hypertext documents and their representation in the form of a graph, and the working of the crawler is discussed in Section 3. In Section 4, a methodology is proposed that crawls a page if and only if no similar page has been crawled so far.
2 Representing Web in form of Graph
The World Wide Web is a communication system for the retrieval and display of multimedia documents with the help of hyperlinks. The hyperlinks present among web pages are employed by a web crawler for moving from one web page to another, and the crawled web pages are stored in the database of a search engine [Chakrabarti, Berg and Dom]. A hyperlink in a web page may point somewhere within the same web page, to some other web page, or to a page on some other web site [Sharma, Gupta and Aggarwal]. Thus hyperlinks may be divided into three types:
Book Mark: A link to a place within the same document. It is also known as a Page link.
Internal: A link to a different document within the same website.
External: A link to a web-site outside the site of the document.
While representing the web in the form of a graph, the hyperlinks among web pages are used. In this graph representation of the web, all web pages are represented by vertices of the graph. A hyperlink from page P_i to page P_j is represented by an edge from vertex V_i to V_j, where V_i and V_j represent pages P_i and P_j respectively. The graph representation of the web does not differentiate between internal and external links, while bookmarks are not represented in the graph at all.
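A hedged sketch of building such a graph as an adjacency structure, assuming pages and hyperlink targets are given as URLs (the helper name is illustrative):

from urllib.parse import urljoin, urldefrag

def add_links(graph, page_url, hrefs):
    """Add edges for the hyperlinks found on page_url.

    Bookmarks (fragment-only links to the same page) are dropped; internal
    and external links are treated alike, as in the graph model above.
    """
    edges = graph.setdefault(page_url, set())
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute and absolute != page_url:     # skip pure bookmarks
            edges.add(absolute)
    return graph

# Example with hypothetical pages:
web_graph = {}
add_links(web_graph, "http://site-a.example/index.html",
          ["#top", "about.html", "http://site-b.example/page.html"])
print(web_graph)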
Documents on the WWW can be moved or deleted, and the referenced information may change, resulting in broken hypertext links. The flexible and dynamic nature of the WWW therefore necessitates regular maintenance of the graph representation in order to prevent structural collapse. A module may be run at some fixed interval that reflects any changed hyperlinks in the graph representation.
3 The Crawler
A web crawler is a program that automatically navigates the web, downloads web pages and indexes them. Web crawlers utilize the graph structure of the web to move from one page to another. A crawler picks up a seed URL and downloads the corresponding robots.txt file, which contains downloading permissions and information about the files that should be excluded by the crawler. The crawler stores a web page and then extracts the URLs appearing in that web page. The same process is repeated for all web pages whose URLs have been extracted from earlier pages. The architecture of a web crawler is shown in Figure 1. The key purpose of designing web
crawlers is to retrieve web pages and add them to a local repository. Crawlers are utilized by search engines to prepare their repository of web pages. The different actions performed by a web crawler are as follows (a minimal crawler sketch is given after the list):

Fig. 1: Web crawler Architecture
1. The downloaded document is marked as having been visited.
2. The external links are extracted from the document and put into a queue.
3. The contents of the document are indexed.
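As referenced above, a minimal, hedged crawler sketch using only the Python standard library is shown below; it omits the robots.txt handling, politeness delays and richer error handling that a production crawler needs:

import re
import urllib.request
from collections import deque

HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

def crawl(seed, max_pages=10):
    visited, repository = set(), {}        # visited marker and local index
    queue = deque([seed])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue
        visited.add(url)                    # 1. mark the document as visited
        for link in HREF_RE.findall(html):  # 2. extract links and enqueue them
            if link.startswith("http"):
                queue.append(link)
        repository[url] = html              # 3. index (here: store) the contents
    return repository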
4 The Proposed Solution
In the suggested methodology, a clustering technique is used to group all similar pages. Different algorithms are available for deciding whether two web pages are similar. A web page can be divided into a number of chunks, each of which is later converted to a hash value; if two web pages share more hash values than some threshold, the pages are considered to be similar. Many variations of finding similar web pages are available; some algorithms consider web pages similar based on the structure of the web pages only. The proposed approach does not emphasize how to determine whether pages are similar; it only ensures that a page is crawled only if no similar page has been crawled so far.
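A hedged sketch of the chunk-hashing idea described above (the chunk size and threshold are illustrative values, not taken from the paper):

import hashlib

def chunk_hashes(text, chunk_size=50):
    """Split the page text into fixed-size chunks and hash each chunk."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return {hashlib.md5(c.encode("utf-8")).hexdigest() for c in chunks}

def similar(page_a, page_b, threshold=5):
    """Pages are treated as similar when they share more than `threshold`
    chunk hashes."""
    return len(chunk_hashes(page_a) & chunk_hashes(page_b)) > threshold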
In the proposed modus operandi, a color tag is associated with every web page; initially the color tag of every web page is set to white. The color tag can take a value from the set {white, gray, black}. The color tag gray signifies that a similar page is being crawled: if the color of any web page is set to gray, the crawling process for the corresponding page is in progress. Therefore, if any similar web page is fetched for crawling, that page will not be crawled, so as to facilitate crawling of dissimilar pages only.
In the proposed method, a cluster consists of the set of all similar pages. Each cluster is associated with additional information: the number of web pages in the cluster whose color tag is set to gray (represented by Gray(C_i) for cluster C_i) and the number of web pages in the cluster whose color tag is set to black (represented by Black(C_i) for cluster C_i). Before the crawling process starts, the color tag of all web pages is set to white; therefore the values of Gray(C_i) and Black(C_i) for every cluster are initialized to zero. To cope with the dynamic nature of the web, a refresh tag is also associated with every web page [Cho and Molina, 2000]. In the beginning the value of the refresh tag is set to zero; when a web page is modified, its refresh tag is set to the maximum refresh tag value in the cluster incremented by one. Thus the web page with the maximum refresh tag value is the one that was modified most recently. The refresh tag ensures that the most updated web page is crawled at any time, so the proposed method is armed with the capability of crawling the most refreshed data.
In the suggested approach, whenever a web page is crawled, no other page in the same cluster should be crawled in the future. The cluster that a web page P_i belongs to is represented by the function Cluster(P_i).
Proposed Algorithm
Step 1: Initialize Gray(C_i) = 0, Black(C_i) = 0 for every cluster C_i
Step 2: For each cluster C_i
  Step 2.1: Select a page P_i with Color(P_i) = White and refresh(P_i) = maximum refresh value
  Step 2.2: Color(P_i) = Gray
  Step 2.3: Gray(C_i) = Gray(C_i) + 1
  Step 2.4: Scan the page and find the list of adjacent pages of page P_i
  Step 2.5: For all adjacent pages of page P_i:
    If Gray(Cluster(Adj(P_i))) = 0 AND Black(Cluster(Adj(P_i))) = 0
      Crawl Adj(P_i)
      Color(Adj(P_i)) = Gray
      Gray(Cluster(Adj(P_i))) = Gray(Cluster(Adj(P_i))) + 1
Step 3: Color(P_i) = Black
Step 4: Black(C_i) = Black(C_i) + 1
In the suggested algorithm, the adjacent pages of a web page P_i are the web pages linked from P_i by means of hyperlinks. While following the hyperlinks present in a web page, the algorithm tests whether any similar web page has been crawled earlier using Step 2.5: this step checks the cluster that an adjacent page of P_i belongs to and finds the number of pages with the color tag set to gray and the number set to black. If no page in the cluster has its color tag set to gray or black, then all pages in it have the color tag white, which means no similar page has been crawled; therefore the web page is downloaded, its hyperlinks are extracted, and its color tag is set to gray. While picking a web page from a cluster, the page that was refreshed most recently is selected, so that the results cope with the changing nature of the World Wide Web.
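A hedged sketch of the proposed algorithm follows; the cluster assignment, adjacency lists, refresh values and the crawl_page callable are assumed inputs produced by the clustering and crawling components described above:

WHITE, GRAY, BLACK = "white", "gray", "black"

def crawl_dissimilar(pages, cluster_of, adjacent, refresh, crawl_page):
    """pages: iterable of all known page URLs
    cluster_of: dict page -> cluster id; adjacent: dict page -> linked pages
    refresh: dict page -> refresh tag (higher = modified more recently)
    crawl_page: callable that downloads one page
    """
    color = {p: WHITE for p in pages}
    gray = {c: 0 for c in set(cluster_of.values())}    # Gray(Ci)
    black = {c: 0 for c in set(cluster_of.values())}   # Black(Ci)

    for ci in gray:                                    # Step 2: for each cluster
        members = [p for p in pages if cluster_of[p] == ci and color[p] == WHITE]
        if not members:
            continue
        pi = max(members, key=lambda p: refresh[p])    # most recently refreshed white page
        color[pi] = GRAY                               # Steps 2.2-2.3
        gray[ci] += 1
        crawl_page(pi)                                 # Step 2.4: scan the page
        for adj in adjacent.get(pi, []):               # Step 2.5
            cj = cluster_of[adj]
            if gray[cj] == 0 and black[cj] == 0:       # no similar page crawled yet
                crawl_page(adj)
                color[adj] = GRAY
                gray[cj] += 1
        color[pi] = BLACK                              # Steps 3-4
        black[ci] += 1
    return color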
In the proposed technique, finding the set of similar web pages needs to be a continuous procedure, so that any changes can be reflected in the clusters used by the algorithm. Just as content on the World Wide Web changes, the structure of Web pages is also prone to change: some Web pages are deleted, new Web pages are inserted on a daily basis, and there may be changes in the structure of hyperlinks as well. Accommodating all these structural changes requires the creation of the Web graph to be a continuous procedure, repeated at regular intervals.
5 Conclusion
It has been observed that a number of web pages are replicated over a number of web sites. A crawler, while crawling the Web, crawls through these replicated web pages multiple times and thus resources are underutilized. These resources could be utilized in an efficient and effective manner if visits to replicated web pages are restricted to once. In this paper, an approach for crawling only dissimilar web pages has been suggested. In order to ensure the quality and freshness of the pages downloaded from a set of similar pages, the inclusion of a refresh tag has been proposed. Following the suggested approach, the efforts of the crawler can be reduced by a significant amount, while producing results that are better organized, more up to date and more relevant when presented to the user.
References
[1] [Page and Brin, 1998] L. Page and S. Brin, The anatomy of a search engine, Proc. of the 7th International
WWW Conference (WWW 98), Brisbane,
[2] [Brin and Page, 1998] Sergey Brin and Lawrence Page, The anatomy of a large-scale hypertextual Web search engine. Proceedings of the Seventh International World Wide Web Conference, pages 107-117, April 1998.
[3] [Bharat and Broder, 1999] Krishna Bharat and Andrei Z. Broder. Mirror, Mirror, on the Web: A study of
host pairs with replicated content. In Proceedings of 8th International Conference on World Wide Web
(WWW'99), May 1999.
[4] [Burner, 1997] Mike Burner, Crawling towards Eternity: Building an archive of the World Wide Web,
Web Techniques Magazine, 2(5), May 1997.
[5] [Dean and Henzinger, 1999] J. Dean and M. Henzinger, Finding related pages in the world wide web,
Proceedings of the 8th International World Wide Web Conference (WWW8), pages 1467-1479, 1999.
[6] [Slattery and Ghani, 2001] Y. Yang, S. Slattery, and R. Ghani, A study of approaches to hypertext
categorization, Journal of Intelligent Information Systems. Kluwer Academic Press, 2001.
[7] [Smith, 1997] Z. Smith, The Truth About the Web: Crawling towards Eternity, Web Techniques
Magazine, 2(5), May 1997.
[8] [Pinkerton, 1998] Brian Pinkerton, Finding what people want: Experiences with the web crawler, Proc. of
WWW Conf., Australia, April 14-18, 1998.
[9] [Miller and Bharat, 1998] Robert C. Miller and Krishna Bharat, SPHINX: A framework for creating personal, site-specific Web Crawlers. Proceedings of the Seventh International World Wide Web Conference, pages 119-130, April 1998.
[10] [Cho, Shivkumar and Molina] Junghoo Cho, Narayanan Shivakumar and Hector Garcia-Molina, Finding
replicated web Collections,
[11] [Chakrabarti, Berg and Dom] S. Chakrabarti, M. van den Berg, and B. Dom, Distributed hypertext resource discovery through examples. Proceedings of the 25th International Conference on Very Large Databases (VLDB), pages 375-386.
[12] [Sharma, Gupta and Aggarwal] A. K. Sharma, J. P. Gupta, D. P. Agarwal, Augmented Hypertext
Documents Suitable for Parallel Retrieval Of Information
[13] [Cho and Molina, 2000] Junghoo Cho and Hector Garcia-Molina. Synchronizing a database to improve
freshness. In Proceedings of the 2000 ACM SIGMOD, 2000.
Enhanced Web Service Crawler Engine
(A Web Crawler that Discovers Web Services Published on Internet)

Vandan Tewari Inderjeet Singh
SGSITS, Indore SGSITS, Indore
vandantewari@gmail.com inderjeet_kalra10@yahoo.com
Nipur Garg Preeti Soni
SGSITS, Indore SGSITS, Indore
garg_nipur15@yahoo.co.in preeti_soni25@yahoo.co.in

Abstract

As Web Services proliferate, the size and magnitude of UDDI Business Registries (UBRs) are likely to increase. The ability to discover web services of interest across multiple UBRs then becomes a major challenge, especially when using the primitive search methods provided by existing UDDI APIs. Also, UDDI registration is voluntary, and therefore web services can easily become passive. For a client, finding services of interest should be time-effective and highly productive (i.e. a retrieved service of interest should also be active, else the whole searching time and effort is wasted). If a service explored from the UBRs turns out to be passive, it leads to the wastage of a lot of processing power and time for both the service provider and the client. Previous research work shows the intriguing result that only 63% of available web services are active. This paper proposes the Enhanced Web Service Crawler Engine (EWSCE), which provides more relevant results and gives output within acceptable time limits. The proposed EWSCE is intelligent: it performs verification and validation tests on discovered Web Services to ensure that they are active before presenting them to the user. Further, this crawler is able to work with federated UBRs. During discovery, if a web service fails the validation test, EWSCE stores it in a special database for further reuse and automatically deletes it from the corresponding UBR.
1 Background
1.1 Web Service
A Web service is a software system designed to support interoperable machine-to-machine
interaction over a network. It has an interface described in a machine-processable format
(specifically WSDL). Other systems interact with the Web service in a manner prescribed by
its description using SOAP-messages, typically conveyed using HTTP with an XML
serialization in conjunction with other Web-related standards.
1.2 WSDL
Web Service Description Language (WSDL) is an XML-based language that provides a model for describing web services. WSDL is often used in combination with SOAP and XML Schema to
provide web services over the Internet. A client program connecting to a web service can read
the WSDL to determine what functions are available on the server. Any special data types
used are embedded in the WSDL file in the form of XML Schema. The client can then use
SOAP to actually call one of the functions listed in the WSDL.
1.3 UDDI
Universal Description, Discovery and Integration (UDDI) is a platform-independent,
XML-based registry for businesses worldwide to list themselves on the Internet. UDDI was
originally proposed as a core Web service standard. It is designed to be interrogated by SOAP
messages and to provide access to Web Services Description Language documents describing
the protocol bindings and message formats required to interact with the web services listed in
its directory.
1.4 Web Crawler
A web crawler (also known as a web spider, web robot, or web scutter) is a program or
automated script which browses the World Wide Web in a methodical, automated manner.
This process is called web crawling or spidering. Many sites, in particular search engines, use
spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a
copy of all the visited pages for later processing by a search engine that will index the
downloaded pages to provide fast searches. Crawlers can also be used for automating
maintenance tasks on a website, such as checking links or validating HTML code. Also,
crawlers can be used to gather specific types of information from Web pages, such as
harvesting e-mail addresses.
2 Related Work
Web service discovery is the first step towards the usage of SOA for business applications over the internet and is an interesting area of research in ubiquitous computing. Many researchers have proposed discovering Web services through a centralized UDDI registry [M Paolucci et al., 2002; U Thaden et al., 2003]. Although centralized registries can provide effective methods for the discovery of Web services, they suffer from problems associated with centralized systems, such as a single point of failure and other bottlenecks. Other approaches [C. Zhou et al., 2003; K. Sivashanmugam et al., 2004] focused on having multiple public/private registries grouped into registry federations for discovering Web services over federated registry sources; but, similar to the centralized registry environment, this does not provide any means for the advanced search techniques that are essential for locating appropriate business applications. In addition, a federated registry environment can potentially allow inconsistent policies to be employed, which significantly impacts the practicability of conducting inquiries across the federated environment and can at the same time significantly affect the productiveness of discovering Web services in a real-time manner across multiple registries. Some other approaches focused on a peer-to-peer framework architecture for service discovery and ranking [E. Al-Masri et al., 2007], providing a conceptual model based on Web service reputation, and providing a keyword-based search
engine for querying Web services. Finding relevant services on the web is still an active area of research, and the study provides some details and statistics about Web services. In previous work where web services were discovered through search engines [E. Al-Masri et al., 2008], web search engines treat web services and general web documents in a similar way, which results in irrelevancy in the fetched information about web services. Further, the limited search methods used by these crawlers also limit the relevancy of the fetched data and lengthen the search time due to the large search space. We have tried to mitigate these problems in our proposed EWSCE by using a local cache and a refined search mechanism. Also, providing an example binding instance increases the effectiveness of discovery. In this paper we propose a search engine capable of discovering web services effectively on the web.
3 Our Proposal
As Web Services proliferate, the size and magnitude of UDDI Business Registries (UBRs) are likely to increase. UDDI registration is voluntary for service providers and therefore a web service can easily become passive, i.e. the provider has revoked the Web Service but there is still an entry for it in UDDI. If a client asks for a particular web service, the search engine can return a service that no longer exists. To overcome this deficit, it is proposed to design an Enhanced Web Service Crawler Engine for discovering web services across multiple UBRs. The EWSCE automatically refreshes the UDDI or UBRs for web service updates. The refresh rate of EWSCE is very high, which ensures that none of the services existing in the UBRs remains passive.
This Crawler has few basic modules as follows:
UBRs Crawl Module: This module maintains a table of IP Addresses of all federated
UBRs as initial seeds in its local memory area and also maintain a cache for fast
processing.
Validation Module: This modules check validity of searched web services. If WSDL
of a web services exists it ensures that it is an active service and then the Find module
fetches Access point URLs and WSDL document corresponding to that web service.
We then parse this WSDL document and find its method name which ensures the
validity of the discovered web service.
Find Module: This module takes initial seeds from the local IPtable of local disk and
finds out the respective web services corresponding to the keyword entered by the
user.
Modify Module: Those Services which fail the validation test, their access point URL
is sent to this Modify module.Modify module deletes the entry of those passive web
services from the UBRs.
3.1 Proposed Algorithm for Search Engine
For the implementation of the proposed search engine, the following algorithm is proposed:

3.2 Proposed Architecture: Following is the Proposed Architecture of EWSCE

Step 1: START
Step 2: Accept the keyword from the end user and initialize the IP table for the initial seeds
Do
  Step 3: Visit each seed, find the Access Point URLs against the requested keyword and store them locally
  Step 4: Parse the WSDL document against the Access Point URL for each discovered service
  Step 5: If the web service is active then
            store it locally
          else
            fetch the Business Key against that Access Point URL from the UBR and pass it to the Modify module, which deletes the web service from the UBR and stores it locally for future reference
Until all seeds from the WS list to crawl are visited
Step 6: Display the list of Access Point URLs to the end user in the form of hyperlinks to show binding instances of the web services
Step 7: END
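A hedged sketch of this crawl-and-validate loop follows. The UDDI inquiry and deletion helpers (find_services, delete_from_ubr) are hypothetical placeholders, and the "?wsdl" convention for fetching the service description is an assumption, not part of the paper:

import urllib.request
import xml.etree.ElementTree as ET

def is_active(access_point_url):
    """Validation test: fetch the WSDL and check that at least one operation
    (method) can be parsed from it; returns False for passive services."""
    try:
        wsdl = urllib.request.urlopen(access_point_url + "?wsdl", timeout=5).read()
        tree = ET.fromstring(wsdl)
        operations = [e for e in tree.iter() if e.tag.endswith("operation")]
        return len(operations) > 0
    except Exception:
        return False

def crawl_ubrs(keyword, seeds, find_services, delete_from_ubr):
    """find_services(seed, keyword) and delete_from_ubr(seed, url) stand in
    for UDDI inquiry/publish calls; they are not a real UDDI API."""
    active, passive = [], []
    for seed in seeds:                         # visit every federated UBR seed
        for url in find_services(seed, keyword):
            if is_active(url):
                active.append(url)             # stored locally, shown to the user
            else:
                passive.append(url)            # kept for reference, removed from UBR
                delete_from_ubr(seed, url)
    return active, passive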
4 Results
4.1 Scenario I for Testing: The user wants to find web services related only to calculation.
Figure 4.1 shows the form in which the user entered the keyword "Calculation" to search for Web Services related to calculations.

Fig. 4.1: User Entered the keyword to perform the search
The user now has the list of Web Services related to the keyword "Calculation", as shown in Figure 4.2.

Fig. 4.2: Search Results- List of Access Point URL
4.2 Results for Scenario I
As shown in Figure 4.2, the crawler gives the list of web services as hyperlinks to their actual definitions, i.e. links to the web services deployed at the provider side. The user can check the validity of the search results by binding to the given Access Point URL. This means that only a relevant list of web services (i.e. active Web Services) is given to the user; there are no irrelevant links. Suppose the user clicks powerservice in the above list; then an input window
with two text boxes opens for the user to calculate a to the power b, as shown in Figure 4.3, and the output is displayed at the client side as shown in Figure 4.4.

Fig. 4.3: Input Screen against hyperlink

Fig. 4.4: Final Result : After executing web service
5 Conclusions & Discussions
In this paper an EWSCE has been presented for the purpose of effective and fruitful discovery of web services. The proposed solution provides an efficient Web service discovery model in which the client neither has to search multiple UBRs nor has to suffer from the problem of handling passive web services. As the number of web services increases, the success of a business depends on both the speed and the accuracy of getting information about the relevant required web service. In ensuring accuracy, EWSCE has an edge over other WSCEs. The crawler update rate of the proposed engine is high; besides this, the engine periodically refreshes the repository whenever idle and automatically deletes passive web services from it.
In the future this proposed search engine can be made more intelligent by extending the current framework with AI techniques such as service rating for returning relevant services. Further, the response time for the discovery of a required service can be improved by using a local cache. Also, to make virtual UBRs smarter, a procedure for dealing with newly appearing services can be included.
References
[1] [C. Zhou et.al., 2003] C Zhou, L.Chia, B.Silverajan and B. Lee. UX-an architecture providing QoS-aware
and federated support for UDDI.In Proceedings of ICWS, pp.171-176,2003.
[2] [E. Al-Masri et.al.,2007] E. Al-Masri, & Q.H., Mahmoud, A Framework for Efficient Discovery of Web
Service across Heterogeneous Registries, In Proceedings of IEEE Consumer Communication and
Networking Conference(CCNC), 2007.
[3] [E. Al-Masri et.al., 2008] Eyhab Al-Masri and Qusay H. Mahmoud, Investigating Web Services on the World Wide Web. In Proceedings of WWW 2008, pages 795-804, 2008.
[4] [K. Sivashanmugam et.al., 2004] K. Sivashanmugam, K. Verma and A Seth. Discovery of web services in a
federated environment. In proceedings of ISWC, pp270-278, 2004.
[5] [M Paolucci et al.,2002] M Paolucci, T. Kawamura, T. Payne and K Sycara. Semantic matching of web
service capabilities. In proceedings of ISWC, pp1104-1111, 2002.
[6] [U Thaden et.al., 2003] U. Thaden, W. Siberski and W. Nejdl, A semantic web based peer-to-peer service registry network. Technical report, Learning Lab Lower Saxony, 2003.







Data Warehouse Mining
Web Intelligence: Applying Web Usage
Mining Techniques to Discover Potential
Browsing Problems of Users

D. Vasumathi A. Govardhan
Dept. of CSE Dept. of CSE
JNTU Hyderabad JNTU Hyderabad
vasukumar_devara@yahoo.co.in govardhan_cse@yahoo.co.in
K. Suresh
Dept. of I.T
VCE, Hyderabad
kallamsuresh@yahoo.co.in

Abstract

In this paper, a web usage mining based approach is proposed to discover
potential browsing problems. Two web usage mining techniques in the
approach are introduced, including Automatic Pattern Discovery (APD) and
Co-occurrence Pattern Mining with Distance Measurement (CPMDM). A
combination method is also discussed to show how potential browsing
problems can be identified.
1 Introduction
Website design is an important criterion for the success of a website. In order to improve website design, it is essential to understand how the website is used by analyzing users' browsing behaviour. Currently there are many ways to do this, and analysis of click-stream data is claimed to be the most convenient and cheapest method [3]. Web usage mining is a tool that applies data mining techniques to analyze web usage data [1], and it is a suitable technique for discovering potential browsing problems. However, traditional web usage mining techniques, such as clustering, classification and association rules, are not sufficient for discovering potential browsing problems. In this paper, we propose an approach which is based on the concept of web usage mining and follows the KDD (Knowledge Discovery in Databases) process.
Two main techniques are included: Automatic Pattern Discovery, and co-occurrence pattern mining, which is an improvement over traditional traversal pattern mining. These techniques can be used to discover potential browsing problems.
2 An Approach for Applying Web Usage Mining Techniques
In this paper, we proposed an approach for applying web usage mining techniques to discover
potential browsing problems. Figure 1 presents the proposed approach, which is based on the
KDD process [2]. In this approach, the KDD process will be run as a normal process, from
data collection and preprocessing, to pattern discovery and analysis, recommendation and
action. The second step (pattern discovery and analysis) will be the main focus of this paper.

Fig. 1: A KDD based Approach for Discovering Potential Browsing Problems
3 Automatic Patterns Discovery (APD)
In our previous work [4], some interesting patterns have already been identified, including the Upstairs and Downstairs patterns, the Mountain pattern and the Fingers pattern.
The Upstairs pattern is found when the user moves forward in the website and never returns to a previously visited web page. The Downstairs pattern occurs when the user moves backward, that is, returns to visited pages. The Mountain pattern occurs when a Downstairs pattern immediately follows an Upstairs pattern. The Fingers pattern occurs when a user moves from one web page to another web page and then immediately returns to the first web page. These patterns are claimed to be very useful for discovering potential browsing problems (see [4] for further detail). The APD method is based on the concept of sequential mining to parse the browsing routes of users. The APD method is performed by a three-level browsing route transformation algorithm: the level-1 elements are Same, Up and Down; the level-2 elements are Peak and Trough; and the final level discovers the Stairs, Fingers and Mountain patterns (see [5] for more detail about the APD method; a simplified sketch is given after Table 2). Table 1 shows an example of number-based browsing sequences, which are transformed from the browsing routes of users (each number denotes the occurrence sequence of the visited web page in a user's session). Table 2 shows the final patterns discovered by performing the APD method.
Table 1. Number-based Browsing Sequences
  Number 1: 0,1,2
  Number 2: 0,0,1,0,2,0,3,0,4,0,5,6,7,6,7,8,6,4,5,0

Table 2. Final Patterns
  Number 1: Upstairs
  Number 2: Finger, Finger, Finger, Finger, Mountain, Mountain, Mountain
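The full three-level transformation is given in [5]; the following is only a simplified sketch of the level-1 labelling and a naive Fingers detector, and it is not guaranteed to reproduce the exact pattern counts of Table 2:

def level1(sequence):
    """Level-1 transformation: label each step Up, Down or Same by comparing
    consecutive page numbers in a number-based browsing sequence."""
    labels = []
    for prev, cur in zip(sequence, sequence[1:]):
        labels.append("Up" if cur > prev else "Down" if cur < prev else "Same")
    return labels

def count_fingers(sequence):
    """A Finger: the user visits a page, moves to another page and immediately
    returns to the first page (sub-sequence x, y, x with y != x)."""
    return sum(1 for a, b, c in zip(sequence, sequence[1:], sequence[2:])
               if a == c and b != a)

print(level1([0, 1, 2]))   # ['Up', 'Up'] -> an Upstairs pattern (row 1 of Table 1)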







4 Co-occurrence Pattern Mining with Distance Measurement (CPMDM)
CPMDM is another technique that can be used to analyse the browsing behavior of users. It improves co-occurrence pattern mining by introducing a distance measurement. A co-occurrence pattern describes the co-occurrence frequency (or probability) of two web pages in users' browsing routes. The additional measurement, distance, measures how many browsing steps it takes to get from one page to another in a co-occurrence pattern. There are three different directions of the distance measurement: Forward, Backward and Two-Way. The Forward distance measures the distance from web page A to B of the co-occurrence pattern A→B. The Backward distance, on the other hand, measures the distance from B to A of the co-occurrence pattern A→B. The Two-Way distance combines the forward and backward distances: it ignores the direction of the association rule and takes all co-occurrence patterns involving A and B.
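A hedged sketch of computing the support and the average forward distance of a co-occurrence pattern A→B over a list of sessions (the data and function names are illustrative):

def cpmdm(sessions, a, b):
    """Support and average forward distance of the co-occurrence pattern A -> B.

    sessions: list of browsing routes, each a list of page identifiers.
    Support is the fraction of sessions containing A followed later by B;
    the forward distance is the number of browsing steps from A to the next B.
    """
    hits, distances = 0, []
    for route in sessions:
        found = False
        for i, page in enumerate(route):
            if page == a:
                for j in range(i + 1, len(route)):
                    if route[j] == b:
                        distances.append(j - i)
                        found = True
                        break
                if found:
                    break
        hits += found
    support = hits / len(sessions) if sessions else 0.0
    avg_distance = sum(distances) / len(distances) if distances else 0.0
    return support, avg_distance

# Hypothetical sessions: home page 'H', undergraduate page 'U'
print(cpmdm([["H", "X", "U"], ["H", "Y"], ["U", "H"]], "H", "U"))  # (0.33..., 2.0)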
5 Combining APD and CPMDM for Discovering Browsing Problems
The analysis results of APD and CPMDM are two quite different analyses of users' browsing behaviour. However, there will be some bias if only one of these two methods is used to assess the website's design. Therefore, if the analysis results of APD and CPMDM are combined, more concrete indications of potential problems in the website's design can be discovered.
Table 3 shows an example of combining the APD and CPMDM methods for discovering potential browsing problems. In this case, the starting page of the co-occurrence patterns is the home page of the University of York website. In the table, Support is the probability of the co-occurrence pattern and Distance is the average forward distance of the pattern. The proportions of Stairs and Fingers patterns are measured using the APD method. In this case, we consider the Fingers pattern to be a problematic pattern, and the longer the distance, the more difficult it is for a user to traverse from one page to another. Therefore, the browsing route from the home page to the /uao/ugrad/courses page can easily be identified as a route where potential browsing problems may occur.
Table 3. Combining the APD and CPMDM for people who view the home page and then view:
  URL                  Support   Distance (average)   Stairs pattern   Finger pattern
  /uao/ugrad           0.25      9.1271               44%              39%
  /gso/gsp/            0.173     5.3195               52%              26%
  /uao/ugrad/courses/  0.127     16.9021              34%              47%
6 Conclusion
This paper proposed an approach for analysing users' browsing behaviour based on web usage mining techniques. The concepts of APD and CPMDM have been briefly introduced, and the combination method has also been discussed. The example of the combination method showed that potential browsing problems of users can be discovered easily. The approach proposed in this paper is therefore beneficial for the area of website design improvement.
References
[1] Cooley, R., Mobasher, B. and Srivastava, J. (1997) Web Mining: Information and Pattern Discovery on the World Wide Web. In Proceedings of the 9th IEEE ICTAI Conference, pp. 558-567, Newport Beach, CA, USA.
[2] Lee, J., Podlaseck, M., Schonberg, E., Hoch, R. (2001) Visualization and Analysis of Click stream Data of
Online Stores for Understanding Web Merchandising, Journal of Data Mining and Knowledge Discovery,
Vol. 5, pp. 59-84.
[3] Kohavi, R., Mason, L. and Zheng, Z. (2004) Lessons and Challenges from Mining Retail E-commerce
Data Machine Learning, Vol. 57, pp. 83-113
[4] Ting, I. H., Kimble, C., Kudenko, D. (2004) Visualizing and Classifying the Pattern of Users Browsing Behavior for Website Design Recommendation. In Proceedings of the 1st KDDS Workshop, 20-24 September, Pisa, Italy.
[5] Ting, I., Clark, L., Kimble, C., Kudenko, D. and Wright, P. (2007) "APD-A Tool for Identifying
Behavioural Patterns Automatically from Clickstream Data" Accepted to appear in KES2007 Conference,
12- 14 September.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Fuzzy Classification to Discover On-line User
Preferences Using Web Usage Mining
Dharmendra T. Patel, Amit D. Kothari
Charotar Institute of Computer Applications, Changa-Gujarat University, Gujarat
dtpatel1978@yahoo.com

Abstract
Web usage mining is an important sub-category of Web Mining. Every day, users access many web sites for different purposes. Information about on-line users (such as time, date, host name, amount of data transferred, platform, URL, etc.) is recorded in server log files. This information is very valuable for decision-making tasks in business communities. Web usage mining is an important technique for discovering useful user patterns from the information recorded in server log files. In this paper, a method to discover on-line user preferences is suggested. It is based on the vector space model and fuzzy classification of on-line user sessions. Web usage mining concepts such as clustering and classification, based on user interactions with the web, are used to discover useful usage patterns. The paper shows why fuzzy classification is better than crisp classification for several applications such as recommendation.
Keywords: Web Mining, Web Usage Mining, Vector Space Model, Fuzzy Classification
Method
1 Introduction and Related Work
Web mining is the application of data mining techniques to web data. Today the WWW grows at an amazing rate, both as an information gateway and as a medium for conducting business. When a user interacts with the web, a lot of information (date, time, host name, amount of data transferred, platform, version, etc.) is left behind in server log files. Web usage mining, a sub-category of web mining, mines this user-related information recorded in server log files to discover important usage patterns. Figure 1 depicts the web usage mining process [3].
Preprocessing is the main task of the WUM process. The inputs of the preprocessing phase, for usage processing, may include the web server logs, referral logs, registration files, index server logs, and optionally usage statistics from a previous analysis. The outputs are the user session file, transaction file, site topology, and page classifications [2]. The first step in usage pattern discovery is session extraction from log files. Various sessionization strategies are described in [4]. Normally, sessions are represented as vectors whose coordinates determine which items have been seen. Once the sessions are obtained they can be clustered, so that each cluster groups similar sessions. As a consequence, it is possible to acquire knowledge about typical user visits, treated here as predefined usage patterns. A classification of the on-line users to
one of the predefined classes is typically based on a similarity calculation between each predefined pattern and the current session. The current session is assigned to the most similar cluster [7]. Unfortunately, the majority of clustering algorithms [1] divide the whole vector space into separate groups, which does not work ideally in real-life cases. This problem has been noticed in [5]. Fuzzy clustering may be a solution to the above problem, but it does not solve the problem of classifying an on-line session that is situated on the border of two or more clusters. Independently of the clustering type (fuzzy or not), fuzzy classification is required.
Fig. 1: Web Usage Mining Process
The purpose of this paper is to present a method to discover on-line user preferences based on the previous user behaviour and the fuzzy classification of the on-line user's session to one of the precalculated usage patterns. It is assumed that users enter the web site to visit abstract items (web pages, e-commerce products, etc.) whose features (for example textual content) and relations between them are not known. As a result, a preference vector is created. Each coordinate of this vector corresponds to one item and measures the relevance of that item to the user's interests. The obtained vector can be used for recommendation, ordering search results or personalized advertisements. To apply fuzzy classification to an on-line user, this paper recommends the following steps:
1. Session Clustering to find out Usage Patterns.
2. Classification of Online User to Usage Patterns.
3. Preference Vector Calculation.
2 Session Clustering to Find out Usage Patterns
When a user visits a web site, information about the visit is stored in server log files. Web usage mining can be applied to these log files to find usage patterns. The first step in usage pattern discovery is session extraction from the log files, using a sessionization strategy appropriate to the requirements. Historical user sessions are then clustered in the vector space model in order to discover typical usage patterns. Unlike the classification step, the clustering process is not fuzzy. Let h be the historical session vector that corresponds to a particular session; then h_j = 1 if the item d_j has been visited in the session represented by h, and h_j = 0 otherwise.
Sessions with only one or two visited items, or sessions in which almost all items occur, may worsen the clustering results. For this reason, it is better to cluster only those vectors in which the number of visited items is lower than n_max and greater than n_min. The n_min and n_max parameters are very important for the clustering result. Too low a value of n_min may cause many sessions to be dissimilar to any other, and as a consequence many clusters with a small number of elements will appear. Too high a value of n_min or too low a value of n_max removes valuable vectors. Too high a value of n_max may result in a small number of clusters with many elements.
Once the historical sessions are created and selected, they are clustered using a clustering algorithm [8]. It is recommended to use an algorithm that does not require the number of clusters to be specified explicitly. As a result of clustering, the set C = {c_1, c_2, c_3, ..., c_n} of n clusters is created. Each cluster can be regarded as the set of session vectors that belong to it, C_j = {h_1, h_2, h_3, ..., h_card(C_j)}. The clusters can also be represented by their mean vectors, called centroids:

c_j = (1 / card(C_j)) · Σ_{h ∈ C_j} h        (2.1)

These calculated centroids will also be referred to as usage patterns. The purpose of a centroid is to measure how often a given item has been visited in the sessions that belong to the cluster.
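A minimal Python sketch of this step, assuming NumPy and invented helper names: it builds the binary historical session vectors, filters them by n_min/n_max, and computes one centroid (usage pattern) per cluster. The clustering itself is left to any algorithm that does not need the number of clusters fixed in advance.

import numpy as np

def session_to_vector(session, item_index):
    """Binary historical session vector h: h_j = 1 iff item d_j was visited."""
    h = np.zeros(len(item_index))
    for item in session:
        h[item_index[item]] = 1.0
    return h

def filter_sessions(vectors, n_min=2, n_max=30):
    """Keep only sessions whose number of visited items lies between n_min and n_max."""
    return [h for h in vectors if n_min < h.sum() < n_max]

def centroids(clusters):
    """Usage patterns: the mean vector of the session vectors in each cluster."""
    return [np.mean(np.vstack(cluster), axis=0) for cluster in clusters]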
3 Classification of On-line User to Usage Pattern
The clusters and centroids obtained in the previous section are very valuable for on-line users. The current session vector s is used in order to classify the current user behaviour to the closest usage pattern. As in the historical session vector, every coordinate corresponds to a particular item. When the user visits the item d_i, the coordinates of the vector s change according to the following formula:

s_i := 1,   s_j := t · s_j for j ≠ i        (3.1)

The constant t ∈ [0, 1] regulates the influence of previously visited items on the classification process. If the parameter t is set to 0, items seen before have no influence; if t = 1, items visited before have the same impact as the current item. The similarity between the current session vector s and the centroid of the j-th cluster can be calculated using the (extended) Jaccard formula:

sim(s, c_j) = (s · c_j) / (‖s‖² + ‖c_j‖² − s · c_j)        (3.2)

The centroid c_max of the closest usage pattern fulfils the following condition:

sim(s, c_max) = max_{j=1..n} sim(s, c_j)        (3.3)
The main reason for using the Jaccard formula is that zero coordinates do not increase the similarity value. The fuzzy classification uses another approach: the similarity between a given usage pattern and the current session vector is treated as a membership function that measures the grade of membership of the current session vector in the usage pattern (0 means it does not belong to the pattern at all, 0.5 that it belongs partially, and 1 that it belongs entirely). The membership function is a fundamental element of fuzzy set theory [6].
It is important to emphasize that the preferences of the user can vary even during the same
site visit. For this reason the online classification should be recalculated every time the user
sees a new item.
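A Python sketch of the on-line classification step, assuming NumPy; the default decay constant and the guard against empty vectors are our additions.

import numpy as np

def update_session(s, i, t=0.5):
    """Formula (3.1): decay previously visited items by t and mark item d_i."""
    s = t * s
    s[i] = 1.0
    return s

def jaccard(s, c):
    """Extended Jaccard similarity (3.2); zero coordinates add nothing."""
    dot = float(np.dot(s, c))
    denom = float(np.dot(s, s)) + float(np.dot(c, c)) - dot
    return dot / denom if denom > 0 else 0.0

def memberships(s, centroids):
    """Fuzzy classification: one membership grade in [0, 1] per usage pattern."""
    return np.array([jaccard(s, c) for c in centroids])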
4 Preference Vector Calculation
Users enter the web site to visit abstract items (web pages, e-commerce products, etc.) whose features (for example textual content) and relations between them are not known. As a result, a preference vector is created. Each coordinate of the vector corresponds to one item and measures the relevance of that item to the user's interests. In this paper a preference vector calculation method is described, which discovers on-line user preferences based on the previous user behaviour and the fuzzy classification of the on-line user's session to one of the precalculated usage patterns.
The preference vector p can be obtained by calculating the similarity between the current session vector and the usage patterns. The values of the preference vector's coordinates change iteratively every time a new document or product is visited. Before the user enters the site, p_0 = 0 and the session vector s_0 = 0: the preferences are not known yet and no item has been visited in this session. When the i-th item is requested, the preference vector is modified:

(4.1)
where:
1. p_{i-1} remembers the previous preferences of the user. The a ∈ (0, 1) parameter regulates the influence of the old preference vector on the current one.
2. The membership-weighted combination of the usage patterns promotes items that were frequently visited in the clusters whose centroids are similar to the current session.
3. The factor (1 − s_i) weakens the influence of the items that have already been seen in this session.
It is important to underline that all usage patterns influence the preference vector. Instead of classifying the current session vector to the closest usage pattern only, the fuzzy classification is used. The introduction of the fuzzy classification is especially profitable when the session is situated at a similar distance from several clusters. If a user wants information common to many clusters, equation 4.1, which is based on fuzzy classification, is very useful. If only the closest usage pattern were used (instead of fuzzy classification), the formula would have the following form:
(4.2)
Preference vector calculation is based on user behaviour: if the user wants information common to several clusters, equation 4.1, based on fuzzy classification, is profitable; otherwise equation 4.2 can be used to determine the closest usage pattern. The preference vector has many characteristics, and it has to be developed based on those characteristics.
5 Conclusions and Future Work
In this paper a user preference discovery method based on fuzzy classification has been presented. It has been shown that if on-line sessions are situated between two or more usage patterns, fuzzy classification behaves better than crisp classification. Although preference vector calculation using fuzzy classification is more time-consuming (compare formulas 4.1 and 4.2), it is possible to limit the fuzzy classification to 2 or 3 patterns to alleviate this problem.
Future work will concentrate on the integration of the presented method into real-time applications such as recommendation, ordering of search results or personalized advertisements. Another direction is to use graph-based representations instead of the vector space model, in order to retain more information than vectors do.
References
[1] Berson, A. and Smith, S. J. Data Warehousing, Data Mining and OLAP.
[2] Proceedings of the Fifth International Conference on Computational Intelligence and Multimedia Applications.
[3] Srivastava, J., Cooley, R., Deshpande, M. and Tan, P. N. Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data. SIGKDD Explorations, Vol. 1, pp. 12-23, 2000.
[4] Web Usage Mining for E-business Applications. ECML/PKDD-2002 Tutorial.
[5] Mining Web Access Logs Using Relational Competitive Fuzzy Clustering. In: 8th International Fuzzy Systems Association World Congress (IFSA 99).
[6] Fuzzy Thinking: The New Science of Fuzzy Logic. Hyperion, New York, 1993.
[7] Integrating Web Usage and Content Mining for More Effective Personalization. LNCS 1875, Springer Verlag, pp. 156-76.
[8] Cooley, R. Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data. Ph.D. Thesis, Department of Computer Science, University of Minnesota, 2000.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Data Obscuration in Privacy Preserving Data Mining
Anuradha T., Suman M., Arunakumari D.
K.L. College of Engineering, Vijayawada
atadiparty@yahoo.co.in, suman.maloji@gmail.com, enteraruna@yahoo.com

Abstract
There has been increasing interest in the problem of building accurate data mining models over aggregate data while protecting privacy at the level of individual records, by disclosing only randomized (obscured) values. The model is built over the randomized data after first compensating for the randomization (at the aggregate level). The randomization algorithm is chosen so that aggregate properties of the data can be recovered with sufficient precision, while individual entries are significantly distorted. How much distortion is needed to protect privacy can be determined using a privacy measure. This paper presents some methods and results in randomization for numerical and categorical data, and discusses the issues of measuring privacy.
1 Introduction
One approach to privacy in data mining is to obscure or randomize the data: making private
data available, but with enough noise added that exact values cannot be determined. Consider
a scenario in which two or more parties owning confidential databases wish to run a data
mining algorithm on the union of their databases without revealing any unnecessary
information. For example, consider separate medical institutions that wish to conduct a joint
research while preserving the privacy of their patients. In this scenario it is required to protect
privileged information, but it is also required to enable its use for research or for other
purposes. In particular, although the parties realize that combining their data has some mutual
benefit, none of them is willing to reveal its database to any other party.
In this case, there is one central server, and many clients (the medical institutions), each
having a piece of information. The server collects this information and builds its aggregate
model using, for example, a classification algorithm or an algorithm for mining association
rules. Often the resulting model no longer contains personally identifiable information, but
contains only averages over large groups of clients.
The usual solution to the above problem consists in having all clients send their personal
information to the server. However, many people are becoming increasingly concerned about
the privacy of their personal data. They would like to avoid giving out much more about
themselves than is required to run their business with the company. If all the company needs
is the aggregate model, a solution is preferred that reduces the disclosure of private data while
still allowing the server to build the model.
One possibility is as follows: before sending its piece of data, each client perturbs it so that
some true information is taken away and some false information is introduced. This approach
is called randomization or data obscuration. Another possibility is to decrease precision of the
transmitted data by rounding, suppressing certain values, replacing values with intervals, or
replacing categorical values by more general categories up the taxonomical hierarchy. The
usage of randomization for preserving privacy has been studied extensively in the framework
of statistical databases. In that case, the server has a complete and precise database with the
information from its clients, and it has to make a version of this database public, for others to
work with. One important example is census data: the government of a country collects
private information about its inhabitants, and then has to turn this data into a tool for research
and economic planning.
2 Numerical Randomization
Let each client C_i, i = 1, 2, ..., N, have a numerical attribute x_i. Assume that each x_i is an instance of a random variable X_i, where all X_i are independent and identically distributed. The cumulative distribution function (the same for every X_i) is denoted by F_X. The server wants to learn the function F_X, or a close approximation of it; this is the aggregate model which the server is allowed to know. The server can know anything about the clients that is derivable from the model, but we would like to limit what the server knows about the actual instances x_i.
The paper [4] proposes the following solution. Each client randomizes its x_i by adding to it a random shift y_i. The shift values y_i are independent identically distributed random variables with cumulative distribution function F_Y; their distribution is chosen in advance and is known to the server. Thus, client C_i sends the randomized value z_i = x_i + y_i to the server, and the server's task is to approximate the function F_X given F_Y and the values z_1, z_2, ..., z_N. Also, it is necessary to understand how to choose F_Y so that
the server can approximate F_X reasonably well, and
the value of z_i does not disclose too much about x_i.
The amount of disclosure is measured in [4] in terms of confidence intervals. Given a confidence level c%, for each randomized value z we can define an interval [z − w_1, z + w_2] such that for all nonrandomized values x we have

P[Z − w_1 ≤ x ≤ Z + w_2 | Z = x + Y, Y ~ F_Y] ≥ c%.

In other words, here we consider an attack where the server computes a c%-likely interval for the private value x given the randomized value z that it sees. The shortest width w = w_1 + w_2 of such a confidence interval is used as the amount of privacy at the c% confidence level. Once the distribution function F_Y is determined and the data is randomized, the server faces the reconstruction problem: given F_Y and the realizations of N i.i.d. random samples Z_1, Z_2, ..., Z_N, where Z_i = X_i + Y_i, estimate F_X. In [4] this problem is solved by an iterative algorithm based on Bayes' rule. Denote the density of X_i (the derivative of F_X) by f_X, and the density of Y_i (the derivative of F_Y) by f_Y; then the reconstruction algorithm is as follows:
1. f_X^0 := uniform distribution;
2. j := 0   // iteration number
3. repeat
       f_X^{j+1}(a) := (1/N) Σ_{i=1}^{N} [ f_Y(z_i − a) · f_X^j(a) ] / [ ∫ f_Y(z_i − z) · f_X^j(z) dz ];
       j := j + 1;
   until (stopping criterion met).
For efficiency, the density functions f_X^j are approximated by piecewise constant functions over a partition of the attribute domain into k intervals I_1, I_2, ..., I_k. The formula in the algorithm above is approximated by (where m(I_t) is the midpoint of I_t):

f_X^{j+1}(I_p) := (1/N) Σ_{i=1}^{N} [ f_Y(m(z_i) − m(I_p)) · f_X^j(I_p) ] / [ Σ_{t=1}^{k} f_Y(m(z_i) − m(I_t)) · f_X^j(I_t) · |I_t| ]
It can also be written in terms of cumulative distribution functions, where F_X((a, b]) = F_X(b) − F_X(a) = P[a < X ≤ b] and N(I_s) is the number of randomized values z_i inside interval I_s:

F_X^{j+1}(I_p) := Σ_{s=1}^{k} (N(I_s) / N) · [ f_Y(m(I_s) − m(I_p)) · F_X^j(I_p) ] / [ Σ_{t=1}^{k} f_Y(m(I_s) − m(I_t)) · F_X^j(I_t) ]
Experimental results show that the class prediction accuracy for decision trees constructed
over randomized data (using "By Class" or "Local") is reasonably close (within 5-15%) to the
trees constructed over original data, even with heavy enough randomization to have 95%-
confidence intervals as wide as the whole range of an attribute. The training set had 100,000
records.
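As an illustration of the scheme in [4], the Python sketch below randomizes numerical values with additive noise and runs the discretized iterative Bayes reconstruction described above. The Gaussian noise, bin layout and iteration count are our own choices, not those used in the original experiments.

import numpy as np

rng = np.random.default_rng(0)

def randomize(x, sigma=1.0):
    """Each client adds an independent shift y_i with a known distribution F_Y
    (here: Gaussian noise, purely for illustration)."""
    return x + rng.normal(0.0, sigma, size=len(x))

def reconstruct_density(z, f_y, edges, n_iter=50):
    """Estimate the piecewise-constant density of X on the given bins from the
    randomized values z, given a vectorized noise density f_y."""
    z = np.asarray(z)
    mids = 0.5 * (edges[:-1] + edges[1:])
    widths = np.diff(edges)
    fx = np.full(len(mids), 1.0 / (edges[-1] - edges[0]))     # start uniform
    for _ in range(n_iter):
        # f_Y(z_i - m(I_t)) * f_X^j(I_t) for every sample i and bin t
        like = f_y(z[:, None] - mids[None, :]) * fx[None, :]
        denom = (like * widths[None, :]).sum(axis=1, keepdims=True)
        fx = (like / denom).mean(axis=0)                      # Bayes update
    return fx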
3 Itemset Randomization
Papers [6, 7] consider randomization of categorical data in the context of association rules. Suppose that each client C_i has a transaction t_i, which is a subset of a given finite set of items I, |I| = n. For any subset A ⊆ I, its support in the dataset of transactions T = {t_i}_{i=1..N} is defined as the fraction of transactions containing A as their subset:

supp_T(A) := |{t_i | A ⊆ t_i, i = 1...N}| / N;

an itemset A is frequent if its support is at least a certain threshold s_min. An association rule A ⇒ B is a pair of disjoint itemsets A and B; its support is the support of A ∪ B, and its confidence is the fraction of transactions containing A that also contain B:

conf_T(A ⇒ B) := supp_T(A ∪ B) / supp_T(A).

An association rule holds for T if its support is at least s_min and its confidence is at least c_min, which is another threshold. Association rules were introduced in [2], and [3] presents the efficient Apriori algorithm for mining association rules that hold for a given dataset. The idea of Apriori is to make use of the antimonotonicity property:

A ⊆ B ⇒ supp_T(A) ≥ supp_T(B).
Conceptually, it first finds frequent 1-item sets, then checks the support of all 2-item sets
whose 1-subsets are frequent, then checks all 3-item sets whose 2-subsets are frequent, etc. It
stops when no candidate itemsets (with frequent subsets) can be formed. It is easy to see that
the problem of finding association rules can be reduced to finding frequent itemsets. A
natural way to randomize a set of items is by deleting some items and inserting some new
items. A select-a-size randomization operator is defined for a fixed transaction size |t| = m and has two parameters: a randomization level 0 ≤ ρ ≤ 1 and a probability distribution (p[0], p[1], ..., p[m]) over the set {0, 1, ..., m}. Given a transaction t of size m, the operator generates a randomized transaction t' as follows:
1. The operator selects an integer j at random from the set {0, 1, ..., m} so that P[j is selected] = p[j].
2. It selects j items from t, uniformly at random (without replacement). These items, and no other items of t, are placed into t'.
3. It considers each item a ∉ t in turn and tosses a coin with probability ρ of heads and 1 − ρ of tails. All those items for which the coin comes up heads are added to t'.
If different clients have transactions of different sizes, then select-a-size parameters have to be chosen for each transaction size. So, this (nonrandomized) size has to be transmitted to the server with the randomized transaction.
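A direct Python transcription of the select-a-size operator as described above; the item and parameter names are ours.

import random

def select_a_size(t, universe, p, rho):
    """Randomize transaction t (a set of items drawn from `universe`) using the
    retention distribution p = (p[0], ..., p[m]) and randomization level rho."""
    m = len(t)
    # 1. choose how many true items to retain, j distributed according to p
    j = random.choices(range(m + 1), weights=p, k=1)[0]
    # 2. keep j items of t, chosen uniformly without replacement
    t_prime = set(random.sample(list(t), j))
    # 3. every item not in t is inserted independently with probability rho
    for a in universe:
        if a not in t and random.random() < rho:
            t_prime.add(a)
    return t_prime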
4 Limiting Privacy Breaches
Consider the following simple randomization R: given a transaction t, we consider each item in turn, and with probability 80% replace it with a new random item; with probability 20% we leave the item unchanged. Since most of the items get replaced, we may suppose that this randomization preserves privacy well. However, this is not so, at least not all the time. Indeed, let A = {x, y, z} be a 3-item set with partial supports

s_3 = supp_T(A) = 1%;   s_2 = 5%;   s_1 + s_0 = 94%.
Assume that overall there are 10,000 items and 10 million transactions, all of size 10. Then 100,000 transactions contain A, and 500,000 more transactions contain all but one item of A. How many of these transactions contain A after they are randomized? The following is a rough average estimate:

A ⊆ t and A ⊆ R(t):        100,000 · 0.2³ = 800
|A ∩ t| = 2 and A ⊆ R(t):  500,000 · 0.2² · (8 · 0.8 / 10,000) ≈ 12.8
|A ∩ t| ≤ 1 and A ⊆ R(t):  < 10⁷ · 0.2 · (9 · 0.8 / 10,000)² ≈ 1.04
So, there will be about 814 randomized transactions containing A, out of which about 800, or 98%, contained A before randomization as well. Now, suppose that the server receives from client C_i a randomized transaction R(t) that contains A. The server now knows that the actual, nonrandomized transaction t at C_i contains A with probability about 98%. On the other hand, the prior probability of A ⊆ t is just 1%. The disclosure of A ⊆ R(t) has caused a probability jump from 1% to 98%. This situation is a privacy breach.
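The figures above can be re-derived with a few lines of arithmetic; the Python sketch below recomputes the rough estimate under the stated assumptions (10,000 items, 20% retention, transactions of size 10).

# Recomputing the rough average estimate from the text.
n_items, keep = 10_000, 0.2          # 80% of items replaced, 20% kept
full      = 100_000 * keep**3                          # A retained intact
two_of_a  = 500_000 * keep**2 * (8 * 0.8 / n_items)    # missing item re-inserted
at_most_1 = 1e7 * keep * (9 * 0.8 / n_items)**2        # generous upper bound
total = full + two_of_a + at_most_1
print(full, two_of_a, at_most_1)                       # 800.0, 12.8, ~1.04
print(f"P[A in t | A in R(t)] is roughly {full / total:.0%}")   # about 98%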
Intuitively, a privacy breach with respect to some property P(t) occurs when, for some possible outcome of the randomization (that is, some possible view of the server), the posterior probability of P(t) is higher than a given threshold called the privacy breach level. Of course, there are always some properties that are likely; so, we only have to look at interesting properties, such as the presence of a given item in t. In order to prevent privacy breaches from happening, transactions are randomized by inserting many false items, as well as deleting some true items. So many false items should be inserted into a transaction that a false itemset is about as likely to be seen as a true one. In the select-a-size randomization operator, it is the randomization level ρ that determines the probability of a false item being inserted.
The other parameters, namely the distribution (p[0], p[1], ..., p[m]), are set in [11] so that, for a certain cutoff integer K, any number of items from 0 to K is retained from the original transaction with probability 1/(K + 1), while the rest of the items are inserted independently with probability ρ. The question of optimizing all select-a-size parameters to achieve maximum recoverability for a given breach level is left open.
The parameters of the randomization are checked for privacy as follows. It is assumed that the server knows the maximum possible support of an itemset for each itemset size, among transactions of each transaction size, or upper bounds on these. Based on this knowledge, the server computes partial supports for (imaginary) privacy-challenging itemsets, and tests randomization parameters by computing the posterior probabilities P[a ∈ t | A ⊆ R(t)] from the definition of privacy breaches. The randomization parameters are selected to keep variance low while preventing privacy breaches for the privacy-challenging itemsets.
Graphs and experiments with real-life datasets show that, given several million transactions, it is possible to find randomization parameters so that the majority of 1-item, 2-item, and 3-item sets with support at least 0.2% can be recovered from the randomized data, for a privacy breach level of 50%. However, long transactions (longer than about 10 items) have to be discarded, because privacy-preserving randomization parameters for them are so randomizing that too little is left for support recovery. Those itemsets that were recovered incorrectly (false drops and false positives) were usually close to the support threshold, i.e. there were few outliers. The standard deviation of the 3-itemset support estimator was at most 0.07% for one dataset and less than 0.05% for the other; for 1-item and 2-item sets it is smaller still.
5 Measures of Privacy
Privacy is measured in terms of confidence intervals. The nonrandomized numerical attribute x_i is treated as an unknown parameter of the distribution of the randomized value Z_i = x_i + Y_i. Given an instance z_i of the randomized value Z_i, the server can compute an interval I(z_i) = [x⁻(z_i), x⁺(z_i)] such that x_i ∈ I(z_i) with at least a certain probability c%; this should be true for all x_i. The length |I(z_i)| of this confidence interval is treated as a privacy measure of the randomization. One problem with this method is that the domain of the nonrandomized value and its distribution are not taken into account. Consider an attribute X with the following density function:

f_X(x) = 0.5 if 0 ≤ x ≤ 1 or 4 ≤ x ≤ 5, and 0 otherwise.
Assume that the perturbing additive Y is distributed uniformly in [−1, 1]; then, according to the confidence interval measure, the amount of privacy is 2 at confidence level 100%. However, if we take into account the fact that X must lie in [0, 1] ∪ [4, 5], we can compute a confidence interval of size 1 (not 2). The interval is computed as follows:

I(z) = [0, 1] if −1 ≤ z ≤ 2,   and   I(z) = [4, 5] if 3 ≤ z ≤ 6.
Moreover, in many cases the confidence interval can be even shorter: for example, for z = −0.5 we can give the interval [0, 0.5] of size 0.5. Privacy can also be measured using Shannon's information theory. The average amount of information in the nonrandomized attribute X depends on its distribution and is measured by its differential entropy

h(X) = E_{x~X}[ −log₂ f_X(x) ] = − ∫ f_X(x) log₂ f_X(x) dx.
The average amount of information that remains in X after the randomized attribute Z is disclosed can be measured by the conditional differential entropy

h(X | Z) = E_{(x,z)~(X,Z)}[ −log₂ f_{X|Z=z}(x) ] = − ∫∫ f_{X,Z}(x, z) log₂ f_{X|Z=z}(x) dx dz.
The average information loss for X that occurs by disclosing Z can be measured by the difference between the two entropies:

I(X; Z) = h(X) − h(X | Z) = E_{(x,z)~(X,Z)}[ log₂ ( f_{X|Z=z}(x) / f_X(x) ) ].
This quantity is also known as the mutual information between the random variables X and Z. It is proposed in [1] to use the following functions to measure the amount of privacy, Π(X), and the amount of privacy loss, P(X | Z):

Π(X) := 2^{h(X)};   P(X | Z) := 1 − 2^{−I(X;Z)}.
In the example above we have Π(X) = 2; Π(X | Z) = 2^{h(X|Z)} ≈ 0.84; P(X | Z) ≈ 0.58.
A possible interpretation of these numbers is that, without knowing Z, we can localize X
within a set of size 2; when Z is revealed, we can (on average) localize X within a set of size
0.84, which is less than 1. However, even this information-theoretic measure of privacy is not without difficulties. Suppose that clients would not like to disclose the property X ≤ 0.01. The prior probability of this property is 0.5%; however, if the randomized value Z happens to be in [−1, −0.99], the posterior probability P[X ≤ 0.01 | Z = z] becomes 100%. Of course, Z ∈ [−1, −0.99] is unlikely: it occurs for about 1 in 100,000 records. But every time it does occur, the property X ≤ 0.01 is fully disclosed, i.e. becomes 100% certain. The mutual information, being an average measure, does not notice this rare disclosure. Nor does it alert us to the fact that whether X ∈ [0, 1] or X ∈ [4, 5] is fully disclosed for every record; this time it is because the prior probability of each of these properties is high (50%). The notion of privacy breaches, on the other hand, captures these disclosures. Indeed, for any privacy breach level below 100% and for some randomization outcome (namely, for Z ≤ −0.99), the posterior probability of the property X ≤ 0.01 is above the breach level. A limitation of the privacy breach definition is that we have to specify which properties are privacy-sensitive, i.e. whose probabilities must be kept below the breach level. Specifying too many privacy-sensitive properties may require too destructive a randomization, leading to a very imprecise aggregate model at the server. Thus, the question of the right privacy measure is still open.
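For the running example (f_X equal to 0.5 on [0, 1] and [4, 5], Y uniform on [−1, 1]) the quantities above can be checked numerically. The Python sketch below computes h(X) and Π(X) analytically and uses a small Monte-Carlo simulation of our own construction to exhibit the rare but complete disclosure of the property X ≤ 0.01.

import numpy as np

# h(X) = -∫ f_X log2 f_X dx over [0,1] ∪ [4,5], with f_X = 0.5 on that support
h_X = -(0.5 * np.log2(0.5) + 0.5 * np.log2(0.5))   # = 1 bit
print(h_X, 2 ** h_X)                               # Pi(X) = 2, as in the text

rng = np.random.default_rng(0)
n = 2_000_000
x = np.where(rng.random(n) < 0.5, rng.uniform(0, 1, n), rng.uniform(4, 5, n))
z = x + rng.uniform(-1, 1, n)
print((x <= 0.01).mean())                      # prior  P[X <= 0.01]  ~ 0.005
rare = z <= -0.99                              # occurs for roughly 1 in 100,000 records
print(rare.mean(), (x[rare] <= 0.01).mean())   # posterior is 1.0 whenever it occurs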
6 Conclusion
The research in using randomization for preserving privacy has shown promise and has
already led to interesting and practically useful results. This paper looks at privacy from a different angle than the conventional cryptographic approach. It raises the important question of measuring privacy, which should be addressed in the purely cryptographic setting as well, since the disclosure through legitimate query answers must also be measured. Randomization does not rely on intractability hypotheses from algebra or number theory, and does not require costly cryptographic operations or sophisticated protocols. It is possible that future studies will combine the statistical approach to privacy with cryptography and secure multiparty computation, to the mutual benefit of all of them.
References
[1] D. Agrawal and C. C. Aggarwal. On the design and quantification of privacy preserving data mining
algorithms.In Proceedings of the 20th Symposium on Principles of Database Systems, Santa Barbara,
California, USA, May 2001.
[2] A. Evfimievski. Randomization in privacy preserving data mining. SIGKDD Explorations, Vol. 4, Issue 2, pp. 43-47.
[3] R. J. A. Little. Statistical analysis of masked data. Journal of Official Statistics, 9(2):407-426, 1993.
[4] R. Agrawal and R. Srikant. Privacy preserving data mining. In Proceedings of the 19th ACM SIGMOD
Conference on Management of Data, Dallas, Texas, USA, May 2000.
[5] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. CRC Press, Boca Raton, Florida, USA, 1984.
[6] S. J. Rizvi and J. R. Haritsa. Maintaining data privacy in association rule mining. In Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, China, August 2002.
[7] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserving mining of association rules. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery in Databases and Data Mining, pages 217-228, Edmonton, Alberta, Canada, July 23-26, 2002.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Mining Full Text Documents by Combining
Classification and Clustering Approaches
Y. Ramu
S.V. Engg. College for Women, Bhimavaram - 534204
yramumail@yahoo.co.in

Abstract
The area of Knowledge Discovery in Text (KDT) and Text Mining (TM) is growing rapidly, mainly because of the strong need to analyze the vast amount of textual data that resides on internal file systems and the Web. Most present-day search engines aid in locating relevant documents based on keyword matches. However, to provide the user with more relevant information, we need a system that also incorporates the conceptual framework of the queries. Training the search engine to retrieve documents based on a combination of keyword and conceptual matching is therefore essential. An automatic classifier is used to determine the concepts to which new documents belong. Currently, the classifier is trained by selecting documents randomly from each concept's training set, and it ignores the hierarchical structure of the concept tree. In this paper, I present a novel approach to select these training documents by using document clustering within the concepts. I also exploit the hierarchical structure in which the concepts themselves are arranged. Combining these approaches to text classification, I can achieve an improvement in accuracy over the existing system.
1 Introduction
The vast majority of data found in an organization, with some estimates running as high as 80%, is textual: reports, emails, etc. This type of unstructured data usually lacks metadata and, as a consequence, there is no standard means to facilitate search, query and analysis. Today, the Web has developed into a medium of documents for people rather than of data and information that can be processed automatically.
A human editor can only recognize that a new event has occurred by carefully following all the web pages or other textual sources. This is clearly inadequate for the volume and complexity of the information involved. The need for automated extraction of useful knowledge from huge amounts of textual data in order to assist human analysis is therefore apparent. Knowledge discovery and text mining are mostly automated techniques that aim to discover high-level information in huge amounts of textual data and present it to the potential user (analyst, decision-maker, etc.).
Knowledge Discovery in Text (KDT) is the non-trivial process of identifying valid, novel,
potentially useful, and ultimately understandable patterns in unstructured textual data.
Unstructured textual data is a set of documents. In this paper, I use the term document to refer
to a logical unit of text. This could be a Web page, a status memo, an invoice, an email, etc. It can be complex and long, and is often more than text: it can include graphics and multimedia content.
2 Motivation
Search engines often provide too many irrelevant results. This is mostly due to the fact that a single word might have multiple meanings [Krovetz 92]. Thus, current search engines that match documents only on keywords prove inaccurate.
To overcome this problem, it is better to take into account both the keyword for which the user is searching and the meaning, or concept, in which the user is interested. The conceptual arrangement of information can be found on the Internet in the form of directory services such as Yahoo!
These arrange Web pages conceptually in a hierarchical browsing structure. While it is
possible that a lower level concept may belong to more than one higher-level concept, in our
study we consider the hierarchy as a classification tree. In this case, each concept is a child of
at most one parent concept.
During indexing, we can use an automatic classifier to assign newly arriving documents to
one or more of the preexisting classes or concepts. However, the most successful paradigm
for organizing large amounts of information is by categorizing the different documents
according to their topic, where topics are organized in a hierarchy of increasing specificity
[Koller 97]. By utilizing known hierarchical structure, the classification problem can be
decomposed into a smaller set of problems corresponding to hierarchical splits in the tree.
For any classifier, the performance will improve if the documents that are used to train the classifier are the best representatives of the categories. Clustering can be used to select the documents that best represent a category.
3 Related Work
3.1 Text Classification
Text classification is the process of matching a document with the best possible concept(s)
from a predefined set of concepts. Text classification is a two step process: Training and
Classification.
i) Training: The system is given a set of pre-classified documents. It uses these to learn
the features that represent each of the concepts.
ii) Classification: A classifier uses the knowledge that it has already gained in the
training phase to assign a new document to one or more of the categories. Feature
selection plays an important role in document classification.
3.2 Hierarchical Text Classification
In flat text classification, categories are treated in isolation of each other and there is no
structure defining the relationships among them. A single huge classifier is trained which
categorizes each new document as belonging to one of the possible basic classes.
In hierarchical text classification we can address this large classification problem using a
divide-and-conquer approach [Sun 01]. [Koller 97] proposed an approach that utilizes the
hierarchical topic structure to decompose the classification task into a set of simpler
problems, one at each node in the classification tree.
At each level in the category hierarchy, a document can be first classified into one or more
sub-categories using some flat classification methods. We can use features from both the
current level as well as its children to train this classifier. The following are the motivations
for taking hierarchical structure into account [DAlessio 00]:
1. The flattened classifier loses the intuition that topics that are close to each other in
hierarchy have more in common with each other, in general, than topics that are
spatially far apart. These classifiers are computationally simple, but they lose
accuracy because the categories are treated independently and relationship among the
categories is not exploited.
2. Text categorization in a hierarchical setting provides an effective solution for dealing with very large problems. By treating the problem hierarchically, it can be decomposed into several problems, each involving a smaller number of categories. Moreover, decomposing a problem can lead to more accurate specialized classifiers.
Category structures for hierarchical classification can be classified into [Sun 03]:
Virtual category tree
Category tree
Virtual directed acyclic category graph
Directed acyclic category graph
3.3 Document Clustering
There are many different clustering algorithms, but they fall into a few basic types [Manning
99]. One way to group the algorithms is into hierarchical clustering and flat (non-hierarchical) clustering:
i) Hierarchical Clustering: Produces a hierarchy of clusters, with the usual interpretation that each node stands for a subclass of its mother node. There are two basic approaches to generating a hierarchical clustering: agglomerative and divisive.
ii) Flat (Non-Hierarchical) Clustering: Simply creates a certain number of clusters; the relationships between clusters are often undetermined. Most algorithms that produce a flat clustering are iterative: they start with a set of initial clusters and improve them by iterating a reallocation operation that reassigns objects. Non-hierarchical algorithms often start out with a partition based on randomly selected seeds (one seed per cluster), and then refine this initial partition [Manning 99].
3.4 Documents Indexing
The indexing process comprises two phases: classifier training and document collection indexing.
i) Classifier training: During this phase a fixed number of sample documents for each concept are collected and merged, and the resulting super-documents are preprocessed and indexed using the TF*IDF method. This essentially represents each concept by the centroid of the training set for that concept.
ii) Document collection indexing: New documents are indexed using a vector space method to create a traditional word-based index. Then, each document is classified by comparing the document vector to the centroid of each concept. The similarity values thus calculated are stored in the concept-based index.
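A minimal Python sketch of the training phase using scikit-learn's TF-IDF implementation; the concept names and toy documents are invented, and the real system's preprocessing is not reproduced.

from sklearn.feature_extraction.text import TfidfVectorizer

# Merge each concept's training documents into one super-document and index it;
# each row of concept_matrix then acts as that concept's centroid vector.
training = {
    "databases": ["sql query optimisation", "relational schema design"],
    "networks":  ["tcp congestion control", "routing protocols for ad hoc networks"],
}
super_docs = {concept: " ".join(docs) for concept, docs in training.items()}

concepts = list(super_docs)
vectorizer = TfidfVectorizer()
concept_matrix = vectorizer.fit_transform(super_docs[c] for c in concepts)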
4 Implementation (Approach)
4.1 Incorporating Clustering
Feature selection for text classification plays a primary role towards improving the
classification accuracy and computational efficiency. With any large set of classes, the
boundaries between the categories are fuzzy. The documents that are near the boundary line
will add noise if used for training and confuse the classifier. Thus, we want to eliminate
documents, and the words they contain, from the representative vector for the category. It is
important for us to carefully choose the documents from each category on which the feature
selection algorithms operate during training. Hence, in order to train the classifier, we need
to:
Identify within-category clusters (Cluster Mining), and
Extract the clusters' representative pages.
Here, cluster mining differs from the conventional use of clustering techniques to compute a partition of a complete set of data (web documents in our case). Its aim is to identify only some representative clusters of Web pages within a Web structure. So we use clustering techniques to get information about the arrangement of documents within each category space and select the best possible representative documents from those clusters. In essence, we are doing document mining within the framework of cluster mining.
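One way to realise this selection, sketched below with scikit-learn: cluster the documents of a single category and keep the ones closest to each cluster centroid. The selection rule (closest to centroid) is just one of the variants compared later in Section 5, and the parameter values are illustrative.

from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def representative_docs(doc_vectors, n_clusters=2, per_cluster=15):
    """Cluster one category's documents and return the indices of the documents
    closest to each cluster centroid (an illustrative selection rule)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(doc_vectors)
    chosen = []
    for k in range(n_clusters):
        members = [i for i, lbl in enumerate(km.labels_) if lbl == k]
        sims = cosine_similarity(doc_vectors[members], km.cluster_centers_[k:k+1]).ravel()
        ranked = [members[i] for i in sims.argsort()[::-1]]
        chosen.extend(ranked[:per_cluster])
    return chosen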
4.2 Incorporating Hierarchical Classification
There are two approaches adopted by existing hierarchical classification methods [Sun 01]
i) Big Bang approach: In this, only a single classifier is used in the classification
process.
ii) Top-Down Level Based approach: In this one or more classifiers are constructed at
each category level and each classifier works as a flat classifier.
In our approach, we adopt a top-down level based approach that utilizes the hierarchical topic
structure to decompose the classification task into a set of simpler problems. The classifiers
we use are based on the vector space model.
4.3 System Architecture
We study and evaluate the performance of the text classifier when it is trained by using
documents selected with the help of clustering from each category and by using a top-down,
level-based, hierarchical approach of text classification. To do this, we need the following
components:
i) A system to perform document clustering with in each category. We then choose the
documents based on the result of clustering so that the documents that are best
representatives of the category are selected.
ii) Automatic classifier(s) that will be trained for evaluation purposes. We will have one comprehensive classifier for flat classification, and one classifier for each non-leaf node in the tree in the case of hierarchical classification.
iii) A mechanism to test the classifier with documents that it has not seen before to
evaluate its classification accuracy. We use the accuracy of classification as an
evaluation measure.
4.4 Classification Phase (Testing)
Similar to the processing of the training documents, a term vector is generated for the
document to be classified. This vector is compared with all the vectors in the training inverted
index and the category vectors most similar to the document vector are the categories to
which the document is assigned. The similarity between the vectors is determined by the
cosine similarity measure, i.e., the inner product of the vectors. This gives a measure of the
degree of similarity of the document with a particular category. The results are then sorted to
identify the top matches. A detailed discussion on tuning the various parameters, such as
number of tokens per document, number of categories per document to be considered, etc., to
improve the performance of the categorizer can be found in [Gauch 04].
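Continuing the training sketch from Section 3.4, classification of a new document then amounts to a cosine comparison against the concept centroids. Function and variable names carry over from that sketch and are our own, not those of the actual system.

from sklearn.metrics.pairwise import cosine_similarity

def classify(document, vectorizer, concept_matrix, concepts, top_n=10):
    """Rank concepts by cosine similarity between the document vector and the
    concept centroids built during training."""
    doc_vec = vectorizer.transform([document])
    sims = cosine_similarity(doc_vec, concept_matrix).ravel()
    top = sims.argsort()[::-1][:top_n]
    return [(concepts[i], float(sims[i])) for i in top]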
5 Experimental Observations and Results
5.1 Experimental Set up
Source of Training Data: Because the Open Directory Project hierarchy [ODP 02] is readily available for download from their web site in a compact format, it was chosen as the source for the classification tree. In our work with hierarchical text classification, the top few levels of the tree are sufficient. We decided to classify documents into classes from the top three levels only.
5.2 Experiment: Determining the Baseline
Currently, KeyConcept uses a flat classifier. Documents are randomly selected from the
categories. To evaluate our experiments, we must first establish a baseline level of
performance with the existing classifier.
Chart 1 provides us with the baseline against which we can compare our further work. It shows the percentage of documents within the top n (n = 1, 2, ..., 10) concepts plotted against the rank n. 46.6% of the documents are correctly classified as belonging to their true category, and the correct answer appears within the top 10 selections over 80% of the time.
Chart 1: Baseline. Performance of the flat classifier when it is trained using 30 documents randomly selected from each concept.
5.3 Experiment: Effect of Clustering on Flat Classification

Chart 2: Using within-category clustering to select the documents to train the flat classifier.
Chart 2 shows the comparison of the results obtained from the six experiments. It is clear from the chart that the experiment which selects the documents farthest from the centroid yields the poorest results: the percentage of exact matches in this case is just 29.6%, a fall of 36% compared to our random baseline of 46.6%. The experiment which selects the documents closest to the centroid from each concept gives 49.5% exact matches, an improvement of 3% in exact terms over random training. In experiment 3 we train the classifier with the 30 documents in each concept that are farthest from each other; the percentage of exact matches is 48.6%, an improvement of 2% over our baseline.
The accuracy of the classifier is 51.6%, 52.2% and 52.9% for experiments 4, 5 and 6
respectively. The best observed performance among these 6 trials is for experiment-6,
selecting from two clusters after discarding outliers, which shows an improvement of 6.3%
over baseline.
6 Future Work
In this paper, we presented a novel approach to text classification that combines within-concept clustering with a hierarchical approach. We are going to conduct experiments to determine how deeply we need to traverse the concept tree to collect training documents.
References
[1] [Krovetz 92] Robert Krovetz and Bruce W. Croft. Lexical Ambiguity and Information Retrieval. ACM
Transactions on Information Systems, 10(2), April 1992, pages. 115-141.
[2] [Koller 97] D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In Proceedings of the 14th International Conference on Machine Learning, 1997.
[3] [Sun 01]. A. Sun and E. Lim. Hierarchical Text Classification and Evaluation. In Proceedings of the 2001
IEEE International Conference on Data Mining (ICDM2001), California, USA, November 2001, pages
521-528.
[4] [D'Alessio 00] S. D'Alessio, K. Murray, R. Schiaffino, and A. Kershenbaum. The effect of using hierarchical classifiers in text categorization. In Proceedings of the 6th International Conference "Recherche d'Information Assistee par Ordinateur", Paris, FR, 2000, pages 302-313.
[5] [Sun 03]: A. Sun, E. Lim, and W. Ng. Performance Measurement Framework for Hierarchical Text
Classification. Journal of the American Society for Information Science and Technology, 54(11), 2003.
Pages 1014-1028.
[6] [Manning 99]. C. D. Manning and H. Schutze. Foundations of Statistical Natural Language Processing. The
MIT Press. 1999.
[7] [Gauch 04] S. Gauch, J. M. Madrid, S. Induri, D. Ravindran, and S. Chadalavada. KeyConcept: A
Conceptual Search Engine. Information and Telecommunication Technology Center, Technical Report:
ITTC-FY2004-TR-8646-37, University of Kansas. 2004
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Discovery of Semantic Web Using Web Mining
K. Suresh, P. Srinivas Rao, D. Vasumathi
I.T., VCE, Hyderabad; JNTU Hyderabad; JNTUCEH
kallamsuresh@yahoo.co.in, srinuit2006@gmail.com, vasukumar_devara@yahoo.co.in

Abstract
The Semantic Web is the second generation WWW, enriched by machine-processable information which supports the user in his tasks. The main idea of the Semantic Web is to enrich the current Web with machine-processable information in order to allow for semantic-based tools supporting the human user. Semantic Web Mining aims at combining the two fast-developing research areas Semantic Web and Web Mining. Web Mining aims at discovering insights about the meaning of Web resources and their usage. Given the primarily syntactical nature of the data Web Mining operates on, the discovery of meaning is impossible based on these data alone. In this paper, we discuss the interplay of the Semantic Web with Web Mining, with a specific focus on usage mining.
1 Introduction
Web usage mining is the application of data mining methods to the analysis of recordings of Web usage, most often in the form of Web server logs. One of its central problems is the large number of patterns that are usually found: among these, how can the interesting patterns be identified? For example, an application of association rule analysis to a Web log will typically return many patterns like the observation that 90% of the users who made a purchase in an online shop also visited the homepage, a pattern that is trivial because the homepage is the site's main entry point. Statistical measures of pattern quality like support and confidence, and measures of interestingness based on the divergence from prior beliefs, are a primarily syntactical approach to this problem. They need to be complemented by an understanding of what a site and its usage patterns are about, i.e. a semantic approach. A popular approach for modeling sites and their usage is related to OLAP techniques: a modeling of the pages in terms of (possibly multiple) concept hierarchies, and an investigation of patterns at different levels of abstraction, i.e. a knowledge discovery cycle which iterates over various roll-ups and drill-downs. Concept hierarchies conceptualize a domain in terms of taxonomies such as product catalogs, topical thesauri, etc. The expressive power of this form of knowledge representation is limited to is-a relationships. However, for many applications a more expressive form of knowledge representation is desirable, for example ontologies that allow arbitrary relations between concepts.
A second problem facing many current analyses that take semantics into account is that the conceptualizations often have to be hand-crafted to represent a site that has grown independently of an overall conceptual design, and that the mapping of individual pages to this conceptualization may have to be established. It would thus be desirable to have a rich semantic model of a site, of its content and its (hyperlink) structure: a model that captures the complexity of the manifold relationships between the concepts covered in a site, and a model that is built into the site in the sense that the pages requested by visitors are directly associated with the concepts and relations treated by it.
The Semantic Web is just this: today's Web enriched by a formal semantics in the form of ontologies that capture the meaning of pages and links in a machine-understandable form. The main idea of the Semantic Web is to enrich the current Web with machine-processable information in order to allow for semantic-based tools supporting the human user. In this paper, we discuss on the one hand how the Semantic Web can improve Web usage mining, and on the other hand how usage mining can be used to build up the Semantic Web.
2 Web Usage Mining
Web mining is the application of data mining techniques to the content, structure, and usage of Web resources. This can help to discover global as well as local structure within and between Web pages. Like other data mining applications, Web mining can profit from a given structure on the data (as in database tables), but it can also be applied to semi-structured or unstructured data like free-form text. This means that Web mining is an invaluable help in the transformation from human-understandable content to machine-understandable semantics. A distinction is generally made between Web mining that operates on the Web resources themselves (often further differentiated into content and structure mining), and mining that operates on visitors' usage of these resources. These techniques, and their application for understanding Web usage, will be discussed in more detail in section 5. In Web usage mining, the primary Web resource that is being mined is a record of the requests made by visitors to a Web site, most often collected in a Web server log [5]. The content and structure of Web pages, and in particular those of one Web site, reflect the intentions of the authors and designers of the pages and the underlying information architecture. The actual behavior of the users of these resources may reveal additional structure.
First, relationships may be induced by usage where no particular structure was designed. For
example, in an online catalog of products, there is usually either no inherent structure
(different products are simply viewed as a set), or one or several hierarchical structures given
by product categories, manufacturers, etc. Mining the visits to that site, however, one may
find that many of the users who were interested in product A were also interested in product
B. Here, interest may be measured by requests for product description pages, or by the
placement of that product into the shopping cart (indicated by the request for the respective
pages). The identified association rules are at the center of cross-selling and up-selling
strategies in E-commerce sites: When a new user shows interest in product A, she will
receive a recommendation for product B (cf. [3, 4]).
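As a minimal illustration (not taken from the cited systems), the following C++ sketch shows how such an "interest in product A implies interest in product B" association could be checked over recorded sessions; the data layout and the support/confidence thresholds are arbitrary assumptions.

#include <set>
#include <string>
#include <vector>

// Each session is the set of product IDs whose description pages were requested.
bool alsoInterested(const std::vector<std::set<std::string>>& sessions,
                    const std::string& a, const std::string& b,
                    double minSupport = 0.01, double minConfidence = 0.3) {
    int withA = 0, withBoth = 0;
    for (const auto& s : sessions) {
        if (s.count(a)) {
            ++withA;
            if (s.count(b)) ++withBoth;
        }
    }
    if (sessions.empty() || withA == 0) return false;
    double support    = static_cast<double>(withBoth) / sessions.size();
    double confidence = static_cast<double>(withBoth) / withA;
    return support >= minSupport && confidence >= minConfidence;   // rule A -> B holds
}

A recommender would evaluate such rules for the product the visitor is currently viewing and recommend the consequents of the rules that hold.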
Second, relationships may be induced by usage where a different relationship was intended.
For example, sequence mining may show that many of the users who visited page C later
went to page D, along paths that indicate a prolonged search (frequent visits to help and index
pages, frequent backtracking, etc.) [1, 2]. This can be interpreted to mean that visitors wish to
reach D from C, but that this was not foreseen in the information architecture, hence that
there is at present no hyperlink from C to D. This insight can be used for static site
improvement for all users (adding a link from C to D), or for dynamic recommendations
personalized for the subset of users who go to C ("you may wish to also look at D"). It is
useful to combine Web usage mining with content and structure analysis in order to make
sense of observed frequent paths and the pages on these paths. This can be done using a
variety of methods. Many of these methods rely on a mapping of pages into an ontology. An
underlying ontology and the mapping of pages into it may already be available, the mapping
of pages into an existing ontology may need to be learned, and/or the ontology itself may
have to be inferred first. In the following sections, we will first investigate the notions of
semantics (as used in the Semantic Web) and ontologies in more detail. We will then look at
how the use of ontologies, and other ways of identifying the meaning of pages, can help to
make Web Mining go semantic. Lastly, we will investigate how ontologies and their
instances can be learned.
3 Semantic Web
The Semantic Web is based on a vision of Tim Berners-Lee, the inventor of the WWW. The
great success of the current WWW leads to a new challenge: a huge amount of data is
interpretable by humans only; machine support is limited. Berners-Lee suggests enriching the
Web with machine-processable information which supports the user in his tasks. For instance,
today's search engines are already quite powerful, but still too often return overly large or
inadequate lists of hits. Machine-processable information can point the search engine to the
relevant pages and can thus improve both precision and recall. For instance, it is today almost
impossible to retrieve information with a keyword search when the information is spread over
several pages. The process of building the Semantic Web is still very much ongoing. Its
structure has to be defined, and this structure then has to be filled with life. In order to make
this task feasible, one should start with the simpler tasks first. The following steps show the
direction where the Semantic Web is heading:
1. Providing a common syntax for machine understandable statements.
2. Establishing common vocabularies.
3. Agreeing on a logical language.
4. Using the language for exchanging proofs.
Berners-Lee suggested a layer structure for the Semantic
Web: (i) Unicode/URI, (ii) XML/Name Spaces/ XML Schema, (iii) RDF/RDF Schema, (iv)
Ontology vocabulary, (v) Logic, (vi) Proof, (vii) Trust.
This structure reflects the steps listed above. It follows the understanding that each step
alone will already provide added value, so that the Semantic Web can be realized in an
incremental fashion. On the first two layers, a common syntax is provided. Uniform resource
identifiers (URIs) provide a standard way to refer to entities, while Unicode is a standard for
exchanging symbols. The Extensible Markup Language (XML) fixes a notation for describing
labeled trees, and XML Schema allows the definition of grammars for valid XML documents. XML
documents can refer to different namespaces to make explicit the context (and therefore
meaning) of different tags. The formalizations on these two layers are nowadays widely
accepted, and the number of XML documents is increasing rapidly.
The Resource Description Framework (RDF) can be seen as the first layer which is part of
the Semantic Web. According to the W3C recommendation [40], RDF is a foundation for
processing metadata; it provides interoperability between applications that exchange machine
understandable information on the Web. RDF documents consist of three types of entities:
resources, properties, and statements. Today the Semantic Web community considers these
levels rather as one single level, as most ontologies allow for logical axioms. Following [2],
an ontology is an explicit formalization of a shared understanding of a conceptualization. This
high-level definition is realized differently by different research communities. However, most
of them have a certain understanding in common, as most of them include a set of concepts, a
hierarchy on them, and relations between concepts. Most of them also include axioms in
some specific logic. To give a flavor, we present here just the core of our own definition [3],
as it is reflected by the Karlsruhe Ontology framework KAON. It is built in a modular way,
so that different needs can be fulfilled by combining parts.

Fig. 1: The Relation between the WWW, Relational Metadata, and Ontologies.
Definition 1. A core ontology with axioms is a tuple $O := (C, \leq_C, R, \sigma, \leq_R, A)$ consisting of
two disjoint sets $C$ and $R$ whose elements are called concept identifiers and relation identifiers, respectively,
a partial order $\leq_C$ on $C$, called concept hierarchy or taxonomy,
a function $\sigma : R \to C^{+}$ called signature (where $C^{+}$ is the set of all finite tuples of elements in $C$),
a partial order $\leq_R$ on $R$, called relation hierarchy, where $r_1 \leq_R r_2$ implies $|\sigma(r_1)| = |\sigma(r_2)|$ and $\pi_i(\sigma(r_1)) \leq_C \pi_i(\sigma(r_2))$ for each $1 \leq i \leq |\sigma(r_1)|$, with $\pi_i$ being the projection on the $i$-th component, and
a set $A$ of logical axioms in some logical language $L$.
This definition constitutes a core structure that is quite straightforward, well-agreed upon, and
that may easily be mapped onto most existing ontology representation languages. Step by
step the definition can be extended by taking into account axioms, lexicons, and knowledge
bases [1].As an example, have a look at the top of Figure 1. The set C of concepts is the set
fTop, Project, Person, Researcher, Literalg, and the concept hierarchy _C is indicated by the
arrows with a bold head. The set R of relations is the set fworks-in, researcher, cooperates-
with, nameg. The relation worksin has (Person, Project) as signature, the relation name
has (Person, Literal) as signature.4 In this example, the hierarchy on the relations is flat, i. e.,
_R is just the identity relation. For an example of a non-flat relation, have a look at Figure 2.
[Figure: a concept tree with root at the top and facility, accommodation and food provider below it; further nodes include hotel, youth_hostel, family_hotel, wellness_hotel, minigolf, tennis_court, fast food restaurant, Italian, German, vegetarian-only and regular, connected by relations such as belongs_to and is_sports_facility.]
Fig. 2: Parts of the ontology of the content
The objects of the metadata level can be seen as instances of the ontology concepts. For
example, URI-SWMining is an instance of the concept Project, and thus by inheritance
also of the concept Top. Up to here, RDF Schema would be sufficient for formalizing the
ontology. Often, ontologies also contain logical axioms. By applying logical deduction, one
can then infer new knowledge from the information which is stated implicitly. The axiom in
Figure 1 states for instance that the cooperates-with relation is symmetric. From it, one can
logically infer that the person addressed by URI-AHO is cooperating with the person
addressed by URI-GST (and not only the other way around).
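To make the structure of Definition 1 and the Figure 1 example more concrete, the following is a minimal C++ sketch of one possible in-memory representation; the type names, the encoding of the partial orders as explicit pair sets, and the signatures that are not stated in the text (e.g., for cooperates-with and researcher) are illustrative assumptions, not part of the KAON framework.

#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct CoreOntology {
    std::set<std::string> concepts;                                    // C
    std::set<std::string> relations;                                   // R
    std::set<std::pair<std::string, std::string>> conceptOrder;        // <=_C as (sub, super) pairs
    std::map<std::string, std::vector<std::string>> signature;         // sigma: R -> C+
    std::set<std::pair<std::string, std::string>> relationOrder;       // <=_R
    std::vector<std::string> axioms;                                   // A, kept as plain strings here
};

CoreOntology figure1Example() {
    CoreOntology o;
    o.concepts  = {"Top", "Project", "Person", "Researcher", "Literal"};
    o.relations = {"works-in", "researcher", "cooperates-with", "name"};
    o.conceptOrder = {{"Project", "Top"}, {"Person", "Top"},
                      {"Researcher", "Person"}, {"Literal", "Top"}};
    o.signature["works-in"]        = {"Person", "Project"};
    o.signature["name"]            = {"Person", "Literal"};
    o.signature["cooperates-with"] = {"Person", "Person"};      // assumed signature
    o.signature["researcher"]      = {"Project", "Researcher"}; // assumed signature
    // The relation hierarchy is flat in this example, so relationOrder stays empty (identity only).
    o.axioms = {"symmetric(cooperates-with)"};                  // the axiom discussed for Fig. 1
    return o;
}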
A priori, any knowledge representation mechanism can play the role of a Semantic Web
language. Frame Logic (or FLogic; [2]), for instance, provides a semantically founded
knowledge representation based on the frame and slot metaphor. Probably the most popular
framework at the moment is that of Description Logics (DL). DLs are subsets of first-order logic
which aim at being as expressive as possible while still being decidable. The description logic
SHIQ provides the basis for the Web language DAML+OIL. Its latest version is currently being
established by the W3C Web Ontology Working Group (WebOnt) under the name OWL.
Several tools are in use for the creation and maintenance of ontologies and metadata, as well
as for reasoning within them. Our group has developed OntoEdit [4, 5], an ontology editor
which is connected to Ontobroker [1], an inference engine for FLogic. It provides means for
semantics-based query handling over distributed resources. In this paper, we will focus our
interest on the XML, RDF, ontology and logic layers.
4 Using Semantics for Usage Mining and Mining the Usage of the Semantic Web
Semantics can be utilized for Web Mining for different purposes. Some of the approaches
presented in this section rely on a comparatively ad hoc formalization of semantics, while
others exploit the full power of the Semantic Web. The Semantic Web offers a good basis to
enrich Web Mining: The types of (hyper)links are now described explicitly, allowing the
knowledge engineer to gain deeper insights into Web structure mining; and the contents of the
pages come along with a formal semantics, allowing her to apply mining techniques which
require more structured input. Because the distinction between the use of semantics for Web
mining and the mining of the Semantic Web itself is anything but sharp, we will discuss both in an
integrated fashion. Web usage mining benefits from including semantics into the mining
process for the simple reason that the application expert as the end user of mining results is
interested in events in the application domain, in particular user behavior, while the data
available (Web server logs) are technically oriented sequences of HTTP requests.
A central aim is therefore to map HTTP requests to meaningful units of application events.
Application events are defined with respect to the application domain and the site, a non-
trivial task that amounts to a detailed formalization of the site's business model. For example,
relevant E-business events include product views and product click-throughs, in which a
user shows specific interest in a specific product by requesting more detailed information
(e.g., going from the Beach Hotel to a listing of its prices in the various seasons). Web server logs
generally contain at least some information on an event that was marked by the user's request
for a specific Web page, or by the system's generation of a page to acknowledge the successful
completion of a transaction. For example, consider a tourism Web site that allows visitors to
search hotels according to different criteria, to look at detailed descriptions of these hotels, to
make reservations, and so on. In the Web site, a hotel room reservation event may be
identified by the recorded delivery of the page
reserve.php?user=12345&hotel=BeachHotel&people=2&arrive=01May&depart=04May,
which was generated after the user chose a room for 2 persons in the Beach Hotel and typed
in the arrival and departure dates of his desired
stay. What information the log contains, and whether this is sufficient, will depend on the
technical set-up of the site as well as on the purposes of the analysis. So what are the aspects
of application events that need to be reconstructed using semantics? In the following sections,
we will show that a requested Web page is, first, about some content, second, the request for
a specific service concerning that content, and third, usually part of a larger sequence of
events. We will refer to the first two as atomic application events, and to the third as complex
application event.
4.1 Atomic Application Events: Content
A requested Web page is about something, usually a product or other object described in the
page. For example, search_hotel.html?facilities=tennis may be a page about hotels, more
specifically a listing of hotels, with special attention given to a detailed indication of their
sports facilities. To describe content in this way, URLs are generally mapped to concepts. The
concepts are usually organized in taxonomies (also called concept hierarchies, see [1] and
the definition in Section 3). For example, a tennis court is a facility. Introducing relations, we
note that a facility belongs-to an accommodation, etc. (see Fig. 2).
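As an illustration of such a mapping, the following minimal C++ sketch classifies requested URLs into content concepts and rolls them up along a small fragment of a taxonomy like the one in Fig. 2; the URL patterns and the hierarchy entries are invented for the example and do not come from the cited site.

#include <map>
#include <string>

// child concept -> parent concept (a small, hand-made fragment of a Fig. 2-style taxonomy)
const std::map<std::string, std::string> parentOf = {
    {"tennis_court", "sports_facility"}, {"sports_facility", "facility"},
    {"wellness_hotel", "hotel"},         {"hotel", "accommodation"},
    {"facility", "root"},                {"accommodation", "root"}
};

// A crude classifier based on the URL stem and query string.
std::string conceptForUrl(const std::string& url) {
    if (url.find("facilities=tennis") != std::string::npos) return "tennis_court";
    if (url.find("hotel=Wellness") != std::string::npos)    return "wellness_hotel";
    if (url.find("hotel") != std::string::npos)             return "hotel";
    return "root";
}

// Roll a concept up the taxonomy to a more abstract level (0 = the concept itself).
std::string generalize(std::string node, int levels) {
    for (int i = 0; i < levels; ++i) {
        auto it = parentOf.find(node);
        if (it == parentOf.end()) break;
        node = it->second;
    }
    return node;
}

Requests abstracted in this way can then be fed into the usual mining algorithms at whatever level of the hierarchy the analyst chooses.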
4.2 Atomic Application Events: Service
A requested Web page reflects a purposeful user activity, often the request for a specific
service. For example, search_hotel.html?facilities=tennis was generated after the user had
initiated a search by hotel facilities (stating tennis as the desired value). This way of
analyzing requests gives a better sense of what users wanted and expected from the site, as
opposed to what they received in terms of the eventual content of the page.
To a certain extent, the requested service is associated with the request's URL stem and the
delivered page's content (e.g., the URL stem search_hotel.html says that the page was the result of a
search request). However, the delivered page's content may also be meaningless for the
understanding of user intentions, as is the case when the delivered page was a "404 File not
found" error. More information is usually contained in the specifics of the user query that led to the
creation of the page. This information may be contained in the URL query string, which is
recorded in the Web server log if the common request method GET is used.
The query string may also be recorded by the application server in a separate log. As an
example, we have used an ontology to describe a Web site which operates on relational
databases and also contains a number of static pages, together with an automated
classification scheme that relies on mapping the query strings for dynamic page generation to
concepts [5]. Pages are classified according to multiple concept hierarchies that reflect
content (type of object that the page describes), structure (function of pages in object search),
and service (type of search functionality chosen by the user). A path can then be regarded as a
sequence of (more or less abstract) concepts in a concept hierarchy, allowing the analyst to
identify strategies of search. This classification can make Web usage mining results more
comprehensible and actionable for Web site redesign or personalization: The semantic
analysis has helped to improve the design of search options in the site, and to identify
behavioral patterns that indicate whether a user is likely to successfully complete a search
process, or whether he is likely to abandon the site.
The latter insights could be used to dynamically generate help messages for new users. Oberle
[4] develops a scheme for application server logging of user queries with respect to a full-
blown ontology (a knowledge portal in the sense of [2]). This allows the analyst to utilize
the full expressiveness of the ontology language, which enables a wide range of inferences
going beyond the use of taxonomy-based generalizations. He gives examples of possible
inferences on queries to a community portal, which can help support researchers in finding
potential cooperation partners and projects. A large-scale evaluation of the proposal is under
development. The ontologies of content and services of a Web site as well as the mapping of
pages into them may be obtained in various ways. At one extreme, ontologies may be
handcrafted ex post; at the other extreme, they may be the generating structure of the Web
site (in which case also the mapping of pages to ontology elements is already available). In
most cases, mining methods themselves must be called upon to establish the ontology
(ontology learning) and/or the mapping (instance learning), for example by using methods of
learning relations (e.g., [3]) and information extraction (e.g., [1, 3]).
4.3 Complex Application Events
A requested Web page, or rather, the activity/ies behind it, is generally part of a more
extended behavior. This may be a problem-solving strategy consciously pursued by the user
(e.g., to narrow down search by iteratively refining search terms), a canonical activity
sequence pertaining to the site type (e.g., catalog search/browse, choose, add-to cart, pay in
an E-commerce setting [37]), or a description of behavior identified by application experts in
exploratory data analysis. An example of the latter is the distinction of four kinds of online
shopping strategies by [4]: directed buying, search/deliberation, hedonic browsing, and
knowledge building. The first group is characterized by focused search patterns and
immediate purchase.
The second is more motivated by a future purchase and therefore tends to browse through a
particular category of products rather than directly proceed to the purchase of a specific
product. The third is entertainment- and stimulus-driven, which occasionally results in
spontaneous purchases. The fourth also shows exploratory behavior, but for the primary goal
of information acquisition as a basis for future purchasing decisions. Moe characterized these
browsing patterns in terms of product and category pages visited on a Web site. Spiliopoulou,
Pohle, and Teltzrow [4] transferred this conceptualization to the analysis of a non-
commercial information site. They formulated regular expressions that capture the behavior
of search/deliberation and knowledge building, and used sequence mining to identify these
behaviors in the site's logs.
[Figure: two behavioral patterns, Search/deliberation and Knowledge building]
Fig. 3: Parts of the ontology of the complex application events of the example site.
4.4 How is Knowledge about Application Events used in Mining?
Once requests have been mapped to concepts, the question arises how knowledge is gained
from these transformed data. We will investigate the treatment of atomic and of complex
application events in turn. Mining using multiple taxonomies is related to OLAP data cube
techniques: objects (in this case, requests or requested URLs) are described along a number
of dimensions, and concept hierarchies or lattices are formulated along each dimension to
allow more abstract views. The analysis of data abstracted using taxonomies is crucial for
many mining applications to generate meaningful results: In a site with dynamically
generated pages, each individual page will be requested so infrequently that no regularities
may be found in an analysis of navigation behavior. Rather, regularities may exist at a more
abstract level, leading to rules like "visitors who stay in Wellness Hotels also tend to eat in
restaurants". Second, patterns mined in past data are not helpful for applications like
recommender systems when new items are introduced into the product catalog and/or site
structure: the new Pier Hotel cannot be recommended simply because it was not in the
tourism site before and thus could not co-occur with any other item, be recommended by
another user, etc.
Knowledge of regularities at a more abstract level could help to derive a recommendation
of the Pier Hotel because it too is a Wellness Hotel (and there are criteria for recommending
Wellness Hotels). After the preprocessing steps in which access data have been mapped into
taxonomies, two main approaches are taken in subsequent mining steps. In many cases,
mining operates on concepts at a chosen level of abstraction: for example, on sessions
transformed into points in a feature space [3], or on sessions transformed into sequences
of content units at a given level of description (for example, association rules can be sought
between abstract concepts such as Wellness Hotels, tennis courts, and restaurants). This
approach is usually combined with interactive control of the software, so that the analyst can
re-adjust the chosen level of abstraction after viewing the results (e.g., in the miner WUM;
see [5] for a case study). As an alternative to this static approach, other algorithms identify the
most specific level of relationships by choosing concepts dynamically. This may lead to rules
like "people who stay at Wellness Hotels tend to eat at vegetarian-only Indian restaurants",
linking hotel-choice behavior at a comparatively high level of abstraction with restaurant-choice
behavior at a comparatively detailed level of description.
Semantic Web Usage Mining for complex application events involves two steps of mapping
requests to events. As discussed in Section 4.3 above, complex application events are usually
defined by regular expressions in atomic application events (at some given level of
abstraction in their respective hierarchies). Therefore, in a first step, URLs are mapped to
atomic application events at the required level of abstraction. In a second step, a sequence
miner can then be used to discover sequential patterns in the transformed data.
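A minimal sketch of this second step is given below: a session that has already been transformed into a string of atomic-event symbols is matched against a regular expression standing for one complex application event. The single-letter encoding and the pattern itself are illustrative assumptions, not the expressions used in the cited studies.

#include <iostream>
#include <regex>
#include <string>

int main() {
    // H = home page, C = category page, P = product page (abstracted atomic events).
    // "Knowledge building" here: repeated browsing through categories with several product views.
    const std::regex knowledgeBuilding("H(C+P+)+");

    std::string session = "HCCPPCPPP";   // one visitor's abstracted click sequence
    if (std::regex_match(session, knowledgeBuilding))
        std::cout << "session matches the knowledge-building pattern\n";
    return 0;
}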
The shapes of sequential patterns sought, and the mining tool used, determine how much
prior knowledge can be used to constrain the patterns identified: They range from largely
unconstrained first-order or k-th order Markov chains [7], to regular expressions that specify
the atomic activities completely (the name of the concept) or partially (a variable matching a
set of concepts) [4, 2]. Examples of the use of regular expressions describing application-
relevant courses of events include search strategies [5], a segmentation of visitors into
customers and non-customers [7], and a segmentation of visitors into different interest groups
based on the customer buying cycle model from marketing [4]. To date, few commonly
agreed-upon models of Semantic Web behavior exist.
5 Extracting Semantics from Web Usage
The effort behind the Semantic Web is to add semantic annotation to Web documents in
order to access knowledge instead of unstructured material. The purpose is to allow
knowledge to be managed in an automatic way. Web Mining can help to learn definitions of
structures for knowledge organization (e.g., ontologies) and to provide the population of
such knowledge structures. All approaches discussed here are semi-automatic. They assist the
knowledge engineer in extracting the semantics, but cannot completely replace her. In order
to obtain high quality results, one cannot replace the human in the loop, as there is always a
lot of tacit knowledge involved in the modeling process [5].
A computer will never be able to fully consider background knowledge, experience, or social
conventions. If this were the case, the Semantic Web would be superfluous, since then
machines like search engines or agents could operate directly on conventional Web pages.
The overall aim of our research is thus not to replace the human, but rather to provide him
with more and more support. In [6], we have discussed how content, structure, and usage
mining can be used for creating Semantics. Here we focus on the contribution of usage
mining. In the World Wide Web as in other places, much knowledge is socially constructed.
This social behavior is reflected by the usage of the Web. One tenet related to this view is
that navigation is not only driven by formalized relationships or the underlying logic of the
available Web resources, but that it is an information browsing strategy that takes advantage
of the behavior of like-minded people. Recommender systems based on collaborative
filtering have been the most popular application of this idea. In recent years, the idea has
been extended to consider not only ratings, but also Web usage as a basis for the
identification of like-mindedness ("people who liked/bought this book also looked at ..."); see
[3] for a recent mining-based system; see also [6] for a classic, although not mining-based,
application. Web usage mining by its definition always creates patterns that structure pages in
some way.
6 Conclusions and Outlook
In this paper, we have studied the combination of the two fast-developing research areas
Semantic Web and Web Mining, especially usage mining. We discussed how Semantic Web
Usage Mining can improve the results of classical usage mining by exploiting the new
semantic structures in the Web; and how the construction of the Semantic Web can make use
of Web Mining techniques. A truly semantic understanding of Web usage needs to take into
account not only the information stored in server logs, but also the meaning that is constituted
by the sets and sequences of Web page accesses. The examples provided show the potential
benefits of further research in this integration attempt. One important focus is to make search
engines and other programs able to better understand the content of Web pages and sites. This
is reflected in the wealth of research efforts that model pages in terms of an ontology of the
content. Overall, three important directions for further interdisciplinary cooperation between
mining and application experts in Semantic Web Usage Mining have been identified:
1. the development of ontologies of complex behavior,
2. the deployment of these ontologies in Semantic Web description and mining tools and
3. continued research into methods and tools that allow the integration of both experts'
and users' background knowledge into the mining cycle. Web mining methods should
increasingly treat content, structure, and usage in an integrated fashion in iterated
cycles of extracting and utilizing semantics, to be able to understand and (re)shape the
Web.
References
[1.] C.C. Aggarwal. Collaborative crawling: Mining user experiences for topical resource discovery. In KDD-2002: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, CA, July 23-26, 2002, pages 423-428, New York, 2002. ACM.
[2.] M. Baumgarten, A.G. Buchner, S.S. Anand, M.D. Mulvenna, and J.G. Hughes. User-driven navigation pattern discovery from internet data. In M. Spiliopoulou and B. Masand, editors, Advances in Web Usage Analysis and User Profiling, pages 74-91. Springer, Berlin, 2000.
[3.] B. Berendt. Detail and context in web usage mining: Coarsening and visualizing sequences. In R. Kohavi, B.M. Masand, M. Spiliopoulou, and J. Srivastava, editors, WEBKDD 2001: Mining Web Log Data Across All Customer Touch Points, pages 1-24. Springer-Verlag, Berlin Heidelberg, 2002.
[4.] B. Berendt, B. Mobasher, M. Nakagawa, and M. Spiliopoulou. The impact of site structure and user environment on session reconstruction in web usage analysis. In Workshop Notes of the Fourth WEBKDD Workshop: Web Mining for Usage Patterns & User Profiles at KDD-2002, July 23, 2002, pages 115-129, 2002.
[5.] B. Berendt and M. Spiliopoulou. Analysing navigation behavior in web sites integrating multiple information systems. The VLDB Journal, 9(1):56-75, 2000.
[6.] B. Berendt, A. Hotho, and G. Stumme. Towards semantic web mining. In [22], pages 264-278.
[7.] J.L. Borges and M. Levene. Data mining of user navigation patterns. In M. Spiliopoulou and B. Masand, editors, Advances in Web Usage Analysis and User Profiling.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Performance Evolution of Memory Mapped Files
on Dual Core Processors Using Large
Data Mining Data Sets

S.N. Tirumala Rao, RNEC, Ongole, naga_tirumalarao@yahoo.com
E.V. Prasad, J.N.T.U.C.E., drevprasad@yahoo.co.in
N.B. Venkateswarlu, AITAM, Tekkali, venkat_ritch@yahoo.com
G. Sambasiva Rao, SACET, Chirala, sambasivarao_sacet@yahoo.com

Abstract

In recent years, major CPU designers have shifted from ramping up clock
speeds to adding on-chip multicore processors. A study is carried out with data
mining (DM) algorithms to explore the potential of multi-core hardware
architectures with OpenMP. The concept of memory mapped files is widely
supported by most modern operating systems. The performance of memory
mapped files on multicore processors is also studied. In our experiments,
popular clustering algorithms such as k-means and max-min are used.
Experiments are carried out with both serial and parallel versions.
Experimental results with both simulated and real data demonstrate the
scalability of our implementation and the effective utilization of parallel hardware,
which benefits DM problems involving large data sets.
Keywords: OpenMP, mmap(), fread(), k-means and max-min
1 Introduction
The goal of Data Mining is to discover knowledge hidden in data repositories. This activity
has recently attracted a lot of attention. High energy physics experiments produce hundreds
of terabytes of data, the credit card and banking sectors hold large databases of customer
transactions, and web search engines collect web documents worldwide. Regardless of the
application field, Data Mining (DM) allows analysts to dig into huge datasets to reveal patterns and
correlations useful for high-level interpretation. Finding clusters, association rules, classes
and time series are the most common DM tasks. Evidently, classification algorithms and
clustering algorithms are employed for this purpose. All require the use of algorithms whose
complexity, both in time and in space, grows at least linearly with the dataset size. Because of
the size of the data and the complexity of the algorithms, DM algorithms are reported to
be time consuming and to hinder quick policy decision making. There have been many attempts to
reduce the CPU time requirements of DM applications [Venkateswarlu et al., 1995; Gray and
More, 2004].
Many DM algorithms require a computation to be iteratively applied to all records of a
dataset. In order to guarantee scalability, even on a serial or a small-scale parallel platform
(workstation cluster), the increase in I/O activity must be carefully taken into account. The
work of [Palmerini, 2001] recognized two main categories of algorithms with respect to the
patterns of their I/O activities: Read and Compute (R&C) algorithms, which reuse the same
dataset at each iteration, and Read, Compute and Write (RC&W) ones, which at each iteration
rewrite the dataset to be used at the next step. It also suggested the employment of
Out-of-Core (OOC) techniques, which explicitly take care of data movements and are
reported to show low I/O overhead. An important OS feature is time-sharing among processes,
widely known as multi-threading, with which one can overlap I/O actions with useful
computations. [Stoffel et al., 1999; Bueherg, 2006] demonstrated the advantage of such
features in designing efficient DM algorithms.
Most commercial data mining tools and public domain tools, such as Clusta, Xcluster,
Rosetta, FASTLab, Weka, etc., support DM algorithms which accept data sets in flat file or
CSV form only. Thus, they use standard I/O functions such as fgetc() and fscanf(). However,
fread() is also in wide use with many DM algorithms [Chen et al., 2002; Islam, 2003].
Moreover, earlier studies [Islam, 2003] indicated that kernel-level I/O fine tuning is very
important in getting better throughput from the system while running DM algorithms. In
recent years, many network and other applications which incur heavy I/O overhead are
reported to be using a special I/O feature known as mmap() to improve their performance.
For example, the performance of the Apache server was addressed in [www.isi.edu]. In addition,
CPU time benefits have been reported from using memory mapping rather than
conventional I/O in the Mach operating system. [Carig and Leroux] have reported that
effective utilization of multi-core technology will profoundly improve the performance and
scalability of networking equipment, video game platforms, and a host of other embedded
applications.
2 Parallel Processing
Parallel processing used to be reserved for supercomputers. With the rise of the internet, many
companies came to need web and database servers capable of handling thousands of requests per second.
These servers used a technology known as Symmetric Multiple Processing (SMP), which is
still the most common form of parallel processing. Requests for web pages, however, are
atomic, so if you have a mainframe with 4 CPUs you can run four copies of the web
server (one on each CPU) and dispatch incoming requests to whichever CPU is the least
busy. Now, parallel computing is becoming extremely common. Dual-CPU systems (using
SMP) are much cheaper than they used to be, putting them within the reach of many consumers.
Moreover, many single-CPU machines now have parallel capabilities. Intel's hyper-threading
CPUs are capable of running multiple threads simultaneously under certain conditions. Now,
with Intel's and AMD's dual-core processors, many people who buy a single-CPU system
actually have the functionality of a dual-CPU system.
2.1 Parallelization by Compiler
The first survey of parallel algorithms for hierarchical clustering using distance-based metrics
is given in [Olson, 1995]. A parallelizing compiler generally works in two different ways:
fully automatic and programmer-directed. Fully automatic parallelization has several
important caveats: wrong results may be produced, performance may actually degrade, and it is much
less flexible than manual parallelization [parallel computing].
3 Traditional File I/O
The traditional way of accessing files is to first open them with the open system call and then
use read, write and lseek calls to do sequential or random access I/O. Detailed
experimental results of traditional file I/O with DM algorithms showed that fread() gives better
performance than fgetc() on single-core machines; refer to figure 1 (Annexure) in
[Tirumala Rao et al., 2008].
3.1 Memory Mapping
Memory mapping of a file is a special file access technique that is widely supported in
popular operating systems such as Unix and Windows, and it has been reported that mapping a
large file into memory (the address space) can significantly enhance I/O system
performance. Detailed experimental results of traditional file I/O and memory mapping
(mmap()) with DM algorithms showed that mmap() gives better performance than fread() on
single-core machines; refer to figure 2 (Annexure) in [Tirumala Rao et al., 2008].
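To illustrate the two access styles being compared, the following is a minimal C++ sketch that scans the same flat binary file of fixed-size records once with fread() and once through mmap(); DIM, the record layout and the function names are illustrative assumptions, and error handling is reduced to asserts.

#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static const int DIM = 10;                 // attributes per record (assumed)

double sumWithFread(const char* path) {
    FILE* f = std::fopen(path, "rb");
    assert(f);
    double rec[DIM], total = 0.0;
    while (std::fread(rec, sizeof(double), DIM, f) == DIM)
        total += rec[0];                   // touch the first attribute of every record
    std::fclose(f);
    return total;
}

double sumWithMmap(const char* path) {
    int fd = open(path, O_RDONLY);
    assert(fd >= 0);
    struct stat st;
    fstat(fd, &st);
    // Map the whole file read-only; the kernel pages the data in on demand.
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    assert(p != MAP_FAILED);
    const double* data = static_cast<const double*>(p);
    std::size_t nrec = st.st_size / (DIM * sizeof(double));
    double total = 0.0;
    for (std::size_t i = 0; i < nrec; ++i)
        total += data[i * DIM];            // same access pattern, but no explicit read calls
    munmap(p, st.st_size);
    close(fd);
    return total;
}

The same scan logic runs over the mapped region as over the buffered stream, which is what makes mmap() easy to retrofit into existing DM codes.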
4 OpenMP
OpenMP is an API that provides a portable, scalable model for developers of shared memory
parallel applications. The API supports C/C++ and FORTRAN on multiple architectures,
including UNIX and Windows. Writing a shared memory parallel program used to require
vendor-specific constructs, which raised a lot of portability issues; this problem was
solved by OpenMP [www.OpenMP.org]. The OpenMP API consists of a set of compiler
directives for expressing parallelism, work sharing, data environment and synchronization.
These directives are added to an existing serial program in such a way that they can be safely
discarded by compilers which do not understand the API, so OpenMP extends the base program
without requiring any change to it. It supports incremental parallelism and unified code
for both serial and parallel applications. It also supports both coarse-grained and fine-grained
parallelism.
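As a minimal illustration of such a directive, the following sketch parallelizes a simple reduction loop; when the code is compiled without OpenMP support the pragma is simply ignored and the loop runs serially, which is exactly the property described above. The loop body is only an example.

#include <cstdio>
#include <vector>

int main() {
    const int n = 1000000;
    std::vector<double> v(n, 1.0);
    double sum = 0.0;

    // Distribute the iterations over the available cores; the reduction clause gives
    // each thread a private partial sum that is combined when the loop ends.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += v[i];

    std::printf("sum = %f\n", sum);
    return 0;
}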
4.1 OpenMP vs. POSIX
Explicit threading methods, such as Windows threads or POSIX threads use library calls to
create, manage, and synchronize threads. Use of explicit threads requires an almost complete
restructuring of the affected code. On the other hand, OpenMP is a set of pragmas, API
functions, and environment variables that enable threads to be incorporated into applications at a
relatively high level. The OpenMP pragmas are used to denote regions in the code that can be
run concurrently. An OpenMP-compliant compiler transforms the code and inserts the proper
function calls to execute these regions in parallel. In most cases, the serial logic of the
original code can be preserved and is easily recovered by ignoring the OpenMP pragmas at
compilation time.
4.2 OpenMP vs. MPI
In the past, OpenMP has been confined to Symmetric Multi-Processing (SMP) machines and
teamed with Message Passing Interface (MPI) technology to make use of multiple SMP
systems. Most parallel data clustering approaches target distributed memory multiprocessors
and their implementation is based on message passing, which would require significant
programming effort. A new system, Cluster OpenMP, is an implementation of
OpenMP that can make use of multiple SMP machines without resorting to MPI. This
advance has the advantage of eliminating the need to write explicit messaging code, as well
as not mixing programming paradigms. The shared memory in Cluster OpenMP is
maintained across all machines through a distributed shared-memory subsystem. Cluster
OpenMP is based on the relaxed memory consistency of OpenMP, allowing shared variables
to be made consistent only when absolutely necessary [OpenMp, 2006].
4.3 OpenMP vs. Traditional Parallel Programming
OpenMP is a set of extensions that make it easy for programmers to take full advantage of a
system. It has been possible to write parallel programs for a long time. Historically, this has
been done by forking the main program into multiple processes or multiple threads manually.
This strategy has two major drawbacks. First, spawning processes is extremely platform
dependent. Second, it creates a lot of overhead for both the CPU and the programmer as it
can be quite complicated keeping track of what is going on in all of the threads. OpenMP
takes most of the work out of it for you. Most importantly, OpenMP makes it much easier to
parallelize computationally intensive mathematical calculations.
4.4 Our Contribution
Previously, [Hadjidoukas, 2008] reported that OpenMP provides a means of transparently
managing the asymmetry and non-determinism in CURE (clustering using
representatives). This paper aims to develop efficient parallelized clustering algorithms,
such as k-means and max-min, that target shared memory multi-core processors by
employing the mmap() facility under popular operating systems such as Windows XP and Linux.
This work mainly focuses on the shared memory architecture under the OpenMP environment,
which supports multiple levels of parallelism. Thus, we are able to satisfy the need for nested
parallelism exploitation in order to achieve load balancing. Our experimental results
demonstrate significant performance gains in the parallelized versions of the above algorithms.
These experiments were carried out with both synthetic and real data.
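Although the paper does not list source code, the following is a minimal sketch of how the mmap()-based, OpenMP-parallelized k-means assignment step (the MODM variant referred to below) could look; DIM, K, the flat binary record layout and the function name are illustrative assumptions rather than the authors' implementation.

#include <cassert>
#include <cfloat>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

const int DIM = 10;   // attributes per record (assumed)
const int K   = 2;    // number of clusters (assumed)

// centers holds K*DIM doubles; label receives one cluster index per record.
void assignClusters(const char* path, const std::vector<double>& centers,
                    std::vector<int>& label) {
    int fd = open(path, O_RDONLY);
    assert(fd >= 0);
    struct stat st;
    fstat(fd, &st);
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    assert(p != MAP_FAILED);
    const double* data = static_cast<const double*>(p);
    std::size_t n = st.st_size / (DIM * sizeof(double));
    label.resize(n);

    // Every record is independent, so the iterations can be shared among the cores.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(n); ++i) {
        const double* rec = data + i * DIM;
        int best = 0;
        double bestDist = DBL_MAX;
        for (int c = 0; c < K; ++c) {
            double d = 0.0;
            for (int j = 0; j < DIM; ++j) {
                double diff = rec[j] - centers[c * DIM + j];
                d += diff * diff;          // squared Euclidean distance
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        label[i] = best;
    }
    munmap(p, st.st_size);
    close(fd);
}

The centroid update step would follow the same pattern, accumulating per-thread partial sums before recomputing the centers.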
5 Experimental Set-Up
In this study, a randomly generated data set and the Poker hand data set [Cattral and Oppacher,
2007] are used with the selected algorithms, k-means and max-min. The random data set
is generated to have 10 million records with a dimensionality of 1026. The Poker hand data set
has 1 million records with ten attributes (dimensions). It is in ASCII format with comma-separated
values, which is converted to binary format before applying our algorithms. Thus,
experiments are carried out with the converted binary data.
The k-means and max-min algorithms are tested with a file size of 2 GB. The computational time
requirements of these algorithms with fread() and mmap(), and of the versions parallelized with
OpenMP, are observed with both data sets under various conditions. An Intel Pentium dual core
2.80 GHz processor with 1 GB RAM and 1 MB cache memory is used in our study. A Fedora 9 Linux
(kernel 2.6.25-14, Red Hat version 6.0.52) environment equipped with GNU C++ (gcc version 4.3),
and a Windows XP environment
with VC++ 2008 are installed on a machine with a dual-boot option to study the performance of
the parallelization of the above DM algorithms with OpenMP and mmap() under the same
hardware setup.
These experiments have been carried out in order to check the performance of the parallelized
algorithms against the sequential algorithms with varying dimensionality. Figures 1 to 3 of the
appendix present our observations. From here on, algorithms implemented with fread() and OpenMP
are termed FODM and algorithms implemented with mmap() and OpenMP are termed
MODM. It can be observed that the parallelized FODM and MODM versions consistently take less
time than the sequential algorithms, and that MODM is more scalable than FODM. Our
observations also show that the parallelized DM algorithms give better performance under the
Linux environment than under the Windows XP environment.
[Chart: clock ticks vs. dimensionality (10 to 20) for Serial-fread(), Parallel-fread(), Serial-mmap() and Parallel-mmap(); N = 10000000, K = 2]
Fig. 1: k-means algorithm with random data for 10 million records and 2 clusters under Linux.

[Chart: clock ticks vs. dimensionality (2 to 10) for Serial-fread() and Parallel-fread(); N = 1000000, K = 10]
Fig. 2: k-means algorithm with Poker hand data for 1 million records and 10 clusters under Linux.
[Chart: clock ticks vs. dimensionality (2 to 10) for Serial-fread(), Parallel-fread(), Serial-mmap() and Parallel-mmap(); N = 10000000, K = 10]
Fig. 3: k-means algorithm with random data for 10 million records and 10 clusters under Windows XP.
To see the benefit of the parallelized DM algorithms over the sequential algorithms with a
varying number of samples, further experiments were conducted; the results are presented in
Figures 4 and 5. The advantage of MODM and FODM over the sequential algorithms is
observed. It is also seen that parallelized MODM gives better performance than parallelized
FODM irrespective of the data size.
[Chart: clock ticks vs. records in millions (1 to 10) for Serial-fread(), Parallel-fread(), Serial-mmap() and Parallel-mmap(); D = 10, K = 2]
Fig. 4: k-means algorithm with random data for dimensionality 10 and 2 clusters under Linux.
[Chart: clock ticks vs. records in millions (1 to 10) for Serial-fread(), Parallel-fread(), Serial-mmap() and Parallel-mmap(); D = 10, K = 2]
Fig. 5: k-means algorithm with random data for dimensionality 10 and 2 clusters under Windows XP.
[Chart: clock ticks vs. number of clusters (2 to 10) for Serial-fread() and Parallel-fread(); D = 10, N = 10000000]
Fig. 6: k-means algorithm with random data for 10 million records and dimensionality 10 under Linux.
Experiments are also carried out to verify the performance of the parallelized DM algorithms over
the sequential DM algorithms with a varying number of clusters. The observations in Figures 6
and 7 show that the parallelized DM algorithms outperform the sequential ones. It is also
observed that the parallelized DM algorithm with mmap() shows a greater benefit than the
parallelized DM algorithm with fread(), independent of the number of samples.
Figures 8 and 9 demonstrate that the benefit of mmap() is greater with a dual core than with a single
core. In all these experiments, N denotes the number of records, D the dimensionality of the
data set, and K the number of clusters.
[Chart: clock ticks vs. number of clusters (2 to 10) for Serial-fread(), Parallel-fread(), Serial-mmap() and Parallel-mmap(); N = 10000000, D = 10]
Fig. 7: k-means algorithm with random data for 10 million records and dimensionality 10 under Windows XP.
[Chart: percentage benefit of mmap() in clock ticks vs. records in millions (0.3 to 1.0) on single core and dual core; D = 10, K = 2]
Fig. 8: k-means algorithm with Poker hand data for dimensionality 10 and 2 clusters under Linux.
[Chart: percentage benefit of mmap() in clock ticks vs. number of clusters (2 to 10) on single core and dual core]
Fig. 9: k-means algorithm with Poker hand data for dimensionality 10 and 2 clusters under Linux.
6 Conclusion
The parallelization of the k-means and max-min algorithms with OpenMP has been studied on selected
operating systems. Experiments show that the parallelized DM algorithms are more scalable than
the sequential algorithms, even on a personal computer. They also show that parallelized algorithms
with mmap() are more scalable than parallelized algorithms with fread(), irrespective of the
number of samples, dimensions and clusters. Our observations also reveal that the computational
benefit of mmap() over fread()-based algorithms is independent of the number of dimensions,
samples and clusters. The advantage of mmap() on a dual core is higher than on a single core.
References
[1] [Bueherg, 2006] Gregery Bueherg, Towards Data Mining on Emerging Architectures, SIAM Conference on Data Mining, April 2006.
[2] [Carig and Leroux] Robert Carig, Paul N. Leroux, Leveraging Multi-Core Processors for High-Performance Embedded Systems, www.qnx.com.
[3] [Cattral and Oppacher, 2007] Robert Cattral and Franz Oppacher, Carleton University, Department of Computer Science, Intelligent Systems Research Unit, Canada, http://archive.ics.uci.edu/ml/datasets/Poker+Hand
[4] [Chen et al., 2002] Yen-Yu Chen, Qingqing Gan, Torsten Suel, I/O-Efficient Techniques for Computing PageRank, Department of Computer and Information Science, Polytechnic University, Brooklyn, Technical Report CIS-2002-03.
[5] [Gray and More, 2004] A. Gray and A. More, Data Structures for Fast Statistics, International Conference on Machine Learning, Alberta, Canada, July 2004.
[6] [Hadjidoukas, 2008] Panagiotis E. Hadjidoukas and Laurent Amsaleg, Parallelization of a Hierarchical Data Clustering Algorithm using OpenMP, Department of Computer Science, University of Ioannina, Ioannina, Greece, pages 289-299, 2008.
[7] [Islam, 2003] Tuba Islam, An Unsupervised Approach for Automatic Language Identification, Master's Thesis, Bogazici University, Istanbul, Turkey, 2003.
[8] [Olson, 1995] C.F. Olson, Parallel Algorithms for Hierarchical Clustering, Parallel Computing, pages 1313-1325, 1995.
[9] [OpenMP, 2006] Processors White Papers: Extending OpenMP to Clusters, Intel, May 2006.
[10] [Palmerini, 2001] Paolo Palmerini, Design of Efficient Input/Output Intensive Data Mining Applications, ERCIM News, No. 44, January 2001.
[11] [parallel computing] Introduction to Parallel Computing, https://computing.llnl.gov/tutorials/parallel_comp/
[12] [Stoffel et al., 1999] Kilian Stoffel and Abdelkader Belkoniene, Parallel k-Means Clustering for Large Data Sets, Proceedings of Euro-Par, 1999.
[13] [Tirumala Rao et al., 2008] S.N. Tirumala Rao, E.V. Prasad, N.B. Venkateswarlu and B.G. Reddy, Significant Performance Evaluation of Memory Mapped Files with Clustering Algorithms, IADIS International Conference on Applied Computing, Portugal, pages 455-460, 2008.
[14] [Venkateswarlu et al., 1995] N.B. Venkateswarlu, M.B. Al-Daoud and S.A. Roberts, Fast k-Means Clustering Algorithms, University of Leeds, School of Computer Studies Research Report Series, Report 95.18.
[15] [www.isi.edu] Optimized Performance Analysis of the Apache-1.0.5 Server, www.isi.edu.
[16] [www.OpenMP.org] OpenMP Architecture Review Board, OpenMP Specifications. Available at http://www.openmp.org.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Steganography Based Embedded System used for Bank
Locker System: A Security Approach

J.R. Surywanshi, G.H. Raisoni College of Engineering, Nagpur, Jaya_surywanshi2006@yahoo.co.in
K.N. Hande, G.H. Raisoni College of Engineering, Nagpur, Kapilhande@gmail.com

Abstract

Steganography literally means "covered message" and involves transmitting
secret messages through seemingly innocuous files. The goal is not only that
the message remains hidden, but also that the fact that a hidden message was sent
goes undetected. In this project we apply this concept to a hardware-based
application. We have developed an embedded system that automatically
opens and closes a bank locker. There is no role for a key in opening and closing
the locker. Instead of a key, security is provided through
steganography. The secret code is provided to the bank locker through a
simple mobile device. This is a completely wireless application. We use a
Bluetooth device for sending and receiving the signals. Operating a locker
through simple mobile signals is a new achievement. The bank
locker owner can use his locker without a time-consuming process, and
any kind of misuse of the locker is avoided.
1 Introduction
Everyone knows the general process of a bank locking system. It is a totally manual process.
If we modernize this process with the help of steganography, we obtain a new system that
provides tight security. In steganographic communication, senders and receivers agree on a
steganographic system and a shared secret key that determines how a message is encoded in
the cover medium. To send a hidden message, for example, Alice creates a new image with a
digital camera. Alice supplies the steganographic system with her shared secret key and her
message. The steganographic system uses the shared secret key to determine how the hidden
message should be encoded in the redundant bits. The result is a stego image that Alice sends
to Bob. When Bob receives the image, he uses the shared secret and the agreed-upon
steganographic system to retrieve the hidden message. Figure 1 shows an overview of the
encoding step.
As mentioned above, the roles of Alice and Bob are performed by the locker owner and the bank
authority who has the right to provide the security policy to the bank's locker owners. We have
used a simple mobile device to provide the stego message for locking and unlocking the bank
locker instead of a key.
The locker automatically turns ON and OFF without the use of an actual key; it is an automatically
operating system. A microcontroller is used to provide the signals to a DC motor, and the DC
motor changes its position as required for opening and closing the lock. A Bluetooth device
handles this whole process, receiving and sending the signals from
the PC to the mobile device and vice versa. The software is developed such that if anyone knows the
image and tries to guess the secret code, only three attempts are allowed; more than that will
automatically deactivate the locker system.

Fig. 1: Modern steganographic communication. The encoding step of a steganographic system identifies
redundant bits and then replaces a subset of them with data from a secret message
Provision is also made in case someone steals the mobile and tries to access the locker system:
every access to the locker system is automatically recorded on the bank server, so
the person cannot lie about it. The working process of this project is very simple. At
design time the bank authority provides a simple image to the locker owner. The
locker owner tells his secret code to the bank authority. The secret code may be the
signature or any identity of that person. This secret code is embedded into the given image.
The image is stored on the bank server and is also transferred to the bank locker
owner's personal mobile. During the stego process, the pairing address of the mobile is also
inserted. This provides security at the device level.
In short, three kinds of security are provided:
Image based
Personal secret Code
Device based (Pairing address of mobile)
It means that at decoding time the process checks for the same image, the same secret code, and the
same mobile device. This is the complete design-level process. It has to be
performed when a new owner wants to open his new locker. After this setup,
when the owner of the locker wants to operate it, he can use it without needing the bank
authority. For example, when he/she wants to open the locker, he/she goes to the
bank and simply sends the signals through his/her mobile device to the locker. This is
a totally wireless application; Bluetooth performs this activity. The signals are first checked
by the computer. The PC checks all three security levels. If all the information is
correct, it gives the signals to the microcontroller. The microcontroller gives signals to
the DC motor, and this DC motor locks and unlocks the system. After the completion of the
work, the locker user again transfers the signals to the PC, and the bank server transfers the
unlocking signals to the microcontroller. Figure 2 shows the signals transferred by the user for
operating his locker.
2 System Overview
This project is developed in two phases:
Software phase
Hardware phase
2.1 Software Development Phase
First we develop the software phase. For this we consider a single image, which can be any JPEG
image. This image is openly available to all the people who will use this mobile, but nobody
else has the secret code, so there is no harm if the mobile is kept anywhere.
Different algorithms are available for steganography:
Discrete Cosine Transform
Sequential
Pseudo random
Subtraction
Statistic aware embedding
We use the Discrete Cosine Transform here.

Fig. 2: The complete process, where the computer embeds the secret code within the image and transfers it to the
mobile device.
2.1.1 DCT Based Information Hiding Process
Transform coding is simply the compression of images in the frequency domain. It
constitutes an integral component of contemporary image processing applications. Transform
coding relies on the premise that pixels in an image exhibit a certain level of correlation with
their neighboring pixels. A transformation is, therefore, defined to map the spatial
(correlated) data into transformed (uncorrelated) coefficients. Clearly, the transformation
should utilize the fact that the information content of an individual pixel is relatively small,
i.e., to a large extent the visual contribution of a pixel can be predicted using its neighbors. The
Discrete Cosine Transform (DCT) is an example of transform coding. JPEG is an image
compression standard, which was proposed by the Joint Photographic Experts Group. JPEG
transforms the information from the color domain into the frequency domain by applying the Discrete
Cosine Transform (DCT). The image is divided into blocks of 8x8 pixels, which are
transformed into the frequency domain. Each block of an image is represented by 64
components, which are called DCT coefficients. The global and important information of an
image block is represented in lower DCT coefficients, while the detailed information is
represented in upper coefficients. The compression of an image is achieved by omitting the
upper coefficients. The following equation is used for quantization. The result is rounded to
the nearest integer.
cq = round(ci / q) (1)
where
ci is the original transform coefficient (a real number),
q is the quantization factor (an integer between 1 and 255), and
cq is the resulting quantized coefficient.
The reverse process, dequantization of the quantized coefficients, is performed with the following formula:
ci' = cq × q (2)
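The following small sketch illustrates the quantization and dequantization steps, using the symbols defined above.

def quantize(ci: float, q: int) -> int:
    """Eq. (1): divide by the quantization factor and round to the nearest integer."""
    return round(ci / q)

def dequantize(cq: int, q: int) -> float:
    """Eq. (2): the reverse mapping; the rounding error is not recoverable."""
    return cq * q

print(quantize(37.6, 16), dequantize(quantize(37.6, 16), 16))   # 2 32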
The algorithm used for this approach is given here. For each color component, the JPEG image format uses a discrete cosine transform (DCT) to transform successive 8×8 pixel blocks of the image into 64 DCT coefficients each. The DCT coefficients F(u,v) of an 8×8 block of image pixels f(x,y) are given by:
F(u,v) = (1/4) C(u) C(v) Σ(x=0 to 7) Σ(y=0 to 7) f(x,y) cos[(2x+1)uπ/16] cos[(2y+1)vπ/16] (3)
where C(x) = 1/√2 when x equals 0 and C(x) = 1 otherwise. Afterwards, the following
operation quantizes the coefficients:
F^Q(u,v) = round( F(u,v) / Q(u,v) ) (4)
where Q(u,v) is a 64-element quantization table. We can use the least-significant bits of the
quantized DCT coefficients as redundant bits in which to embed the hidden message. The
modification of a single DCT coefficient affects all 64 image pixels. In some image formats (such as GIF), an image's visual structure exists to some degree in all of the image's bit layers.
Steganographic systems that modify least significant bits of this image format are often
susceptible to visual attacks. This is not true for JPEGs. The modifications are in the
frequency domain instead of the spatial domain, so there are no visual attacks against the
JPEG format.
Input: message, cover image
Output: stego image
while data left to embed do
    get next DCT coefficient from cover image
    if DCT ≠ 0 and DCT ≠ 1 then
        get next LSB from message
        replace DCT LSB with message LSB
    end if
    insert DCT into stego image
end while
Fig. 3. The JSteg algorithm. As it runs, the algorithm sequentially replaces the least-significant bit of discrete
cosine transform (DCT) coefficients with message data. It does not require a shared secret.
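A minimal Python sketch of the replacement rule in Fig. 3 is shown below, assuming the quantized DCT coefficients are already available as a flat list of integers (JPEG parsing and entropy coding are omitted).

from typing import Iterator, List

def message_bits(message: bytes) -> Iterator[int]:
    """Yield the message bits, most-significant bit first."""
    for byte in message:
        for i in range(8):
            yield (byte >> (7 - i)) & 1

def jsteg_embed(coeffs: List[int], message: bytes) -> List[int]:
    stego = list(coeffs)
    bit_stream = message_bits(message)
    for idx, c in enumerate(stego):
        if c in (0, 1):              # Fig. 3: skip coefficients equal to 0 or 1
            continue
        try:
            b = next(bit_stream)
        except StopIteration:        # all message bits have been embedded
            break
        stego[idx] = (c & ~1) | b    # replace the coefficient's least-significant bit
    return stego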
Figure 4 shows two images with a resolution of 640×480 in 24-bit color. The uncompressed original image is almost 1.2 Mbytes (the two JPEG images shown are about 0.3 Mbytes). Figure 4a is unmodified; Figure 4b contains the first chapter of Lewis Carroll's The Hunting of the Snark. After compression, the chapter is about 15 Kbytes. The human eye cannot detect which image holds steganographic content.
Embedding process (a code sketch of both steps follows the extraction list below):
Compute the DCT coefficients for each 8×8 block
Quantize the DCT coefficients using the standard JPEG quantization table
Modify the coefficients according to the bit to hide:
If bit = 1, all coefficients are modified to odd numbers
If bit = 0, all coefficients are modified to even numbers
All coefficients quantized to 0 remain intact
Inverse quantization
Inverse DCT
Extracting process:
Compute the DCT coefficients for each 8×8 block
Quantize the DCT coefficients using the standard JPEG quantization table
Count the number of coefficients quantized to odd and to even values
If odd coefficients are in the majority, then bit = 1
If even coefficients are in the majority, then bit = 0
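The sketch below illustrates the block-parity embedding and extraction steps just listed, operating on one 8×8 block of already-quantized DCT coefficients; it is a sketch of the listed rule, not the authors' code.

from typing import List

def embed_bit(q_block: List[int], bit: int) -> List[int]:
    """Force the parity of every non-zero quantized coefficient in one 8x8 block."""
    out = []
    for c in q_block:
        if c == 0:                          # coefficients quantized to 0 remain intact
            out.append(c)
        elif abs(c) % 2 != bit:             # wrong parity: push it one step away from zero
            out.append(c + 1 if c > 0 else c - 1)
        else:
            out.append(c)
    return out

def extract_bit(q_block: List[int]) -> int:
    """Majority vote over the parities of the non-zero quantized coefficients."""
    odd = sum(1 for c in q_block if c != 0 and abs(c) % 2 == 1)
    even = sum(1 for c in q_block if c != 0 and abs(c) % 2 == 0)
    return 1 if odd > even else 0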

Fig. 4: Embedded information in a JPEG. (a) The unmodified original picture; (b) the picture with the first chapter of The Hunting of the Snark embedded in it.
2.2 Hardware Development
In the hardware phase we designed a microcontroller kit containing a PIC microcontroller and a DC motor interface. The motor rotates in the clockwise and anticlockwise directions to open or close the bank locker.
2.2.1 Block Descriptions
PC: This block is the only point where the system accepts user input. It will be a Windows-
based software application to be run on any Windows PC or laptop. Here the user will be able
to manipulate the various functions of the motor (run, stop, accelerate and decelerate) using easy-to-learn on-screen controls.
USB Bluetooth Adapter: To bridge the connection between the PIC and the PC, there will
be Bluetooth modules connected to both sides. The PC side will be implemented using this
common adapter which lets a Bluetooth connection be made as a serial link. The USB adapter
can be installed easily in Windows, just as any other USB device would. The signals will then
be received and manipulated from the motor control software.
Bluetooth Adapter: This is the other end of the Bluetooth wireless connection; it is the module that receives the wireless signals from the USB transmitter and sends them to the control unit.
Control Unit: The control unit consists of the PIC microcontroller and the pulse width
modulator. The PIC will be used to receive feedback from the motor to determine speed and
adjust the signal accordingly. The PWM will be used to control the duty cycle of the motor
by regulating the power output.
Step-down DC to DC Converter: The step-down converter is used to supply voltages lower than those available from the voltage source. In this case we are using a 12 V battery as the power supply, so the step-down converter is able to output voltages from 0 to 12 V, depending on the duty ratio designated by the control unit. This changes the output voltage to the motor accordingly, thus varying the speed at which the motor runs. This module is designed using several resistors, an inductor, a capacitor, a diode, and a MOSFET transistor used as switches.

Fig. 5: Complete process of transferring start/stop signals from the PC to the DC motor through computer commands. Wireless control is provided via Bluetooth.
12 V Battery: This is the power supply for the circuit. It will be used to power the PIC, as
well as supply power to the step-down converter which will be varied from 0-12 V depending
on the voltage needed for the requested speed. A 12 V lead acid battery will be used.
H-Bridge: The purpose of this unit is to allow the motor to come to a complete stop if
requested, as well as change the direction the motor is running in. This receives its commands
from the control unit.
DC Motor: This is a 12 V permanent magnet DC motor. It will be powered by the 12V
battery through the step-down dc to dc converter and controlled via the control unit and H-
Bridge.
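An illustrative sketch of how the PC-side commands described above map to control-unit outputs (PWM duty cycle, H-bridge state and buck-converter output voltage) follows; the function and command names are hypothetical, not the authors' firmware.

V_BATTERY = 12.0  # volts, from the lead-acid supply

def control_outputs(command: str, requested_speed: float) -> dict:
    """Map a PC command and a speed fraction (0..1) to duty cycle, H-bridge state and
    the step-down converter output voltage (duty ratio times the 12 V supply)."""
    if command == "stop":
        return {"duty": 0.0, "h_bridge": "brake", "v_out": 0.0}
    direction = "forward" if command == "run_cw" else "reverse"
    duty = min(max(requested_speed, 0.0), 1.0)
    return {"duty": duty, "h_bridge": direction, "v_out": duty * V_BATTERY}

print(control_outputs("run_cw", 0.5))   # {'duty': 0.5, 'h_bridge': 'forward', 'v_out': 6.0}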
3 Conclusion
This steganography-based embedded system allows users to operate a bank locker without any communication with bank staff and without any time-consuming process. It demonstrates a new way of applying steganography in a hardware-based application, and it has been worked out successfully in this project. In future it could be used in general-purpose applications such as home security. Through this work we have tried to develop a steganography-based embedded system that provides tight security for hardware-based application systems.
References
[1] A. Westfeld and A. Pfitzmann, "Attacks on Steganographic Systems," Proc. Information Hiding, 3rd Int'l Workshop, Springer-Verlag, 1999, pp. 61-76.
[2] B. Chen and G.W. Wornell, "Quantization Index Modulation: A Class of Provably Good Methods for Digital Watermarking and Information Embedding," IEEE Trans. Information Theory, vol. 47, no. 4, 2001, pp. 1423-1443.
[3] F.A.P. Petitcolas, R.J. Anderson, and M.G. Kuhn, "Information Hiding: A Survey," Proc. IEEE, vol. 87, no. 7, 1999, pp. 1062-1078.
[4] Farid, "Detecting Hidden Messages Using Higher-Order Statistical Models," Proc. Int'l Conf. Image Processing, IEEE Press, 2002.
[5] J. Fridrich and M. Goljan, "Practical Steganalysis: State of the Art," Proc. SPIE Photonics Imaging 2002, Security and Watermarking of Multimedia Contents, vol. 4675, SPIE Press, 2002, pp. 1-13.
[6] N.F. Johnson and S. Jajodia, "Exploring Steganography: Seeing the Unseen," Computer, vol. 31, no. 2, 1998, pp. 26-34.
[7] N.F. Johnson and S. Jajodia, "Steganalysis of Images Created Using Current Steganographic Software," Proc. 2nd Int'l Workshop in Information Hiding, Springer-Verlag, 1998, pp. 273-289.
[8] R.J. Anderson and F.A.P. Petitcolas, "On the Limits of Steganography," J. Selected Areas in Comm., vol. 16, no. 4, 1998, pp. 474-481.
[9] Wireless Bluetooth Controller for DC Motor, ECE 445 Project Proposal, February 5, 2007.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Audio Data Mining Using Multi-Perceptron
Artificial Neural Network

A.R. Ebhendra Pagoti Mohammed Abdul Khaliq
DIT, GITAM University DIT, GITAM University
Rushikonda, Visakhapatnam-530046 Rushikonda, Visakhapatnam-530046
Andhra Pradesh, India Andhra Pradesh, India
Praveen Dasari
DIT, GITAM University, Rushikonda, Visakhapatnam-530046, Andhra Pradesh, India

Abstract

Data mining is the activity of analyzing a given set of data; it is the process of
finding patterns in large relational databases. Data mining includes extracting,
transforming, and loading transaction data onto a data warehouse system, storing and
managing the data in a multidimensional database system, providing access to the data,
analyzing the data with application software, and presenting it visually. Audio data
contains information about each audio file, such as signal-processing components
(power spectrum, cepstral values) that are representative of that particular file. The
relationships among patterns provide information which can be converted into
knowledge about historical patterns and future trends. This work implements an
artificial neural network (ANN) approach for audio data mining. Acquired audio is
preprocessed to remove noise, followed by feature extraction using the cepstral
method. The ANN is trained with the cepstral values to produce a set of final weights.
During the testing process (audio mining), these weights are used to mine the audio
file. In this work, 50 audio files have been used as an initial attempt to train the ANN.
The ANN is able to produce only about 90% mining accuracy due to the low
correlation of the audio data.
Keywords: ANN, Backpropagation Algorithm, Cepstrum, Feature Extraction, FFT, LPC,
Perceptron, Testing, Training, Weights.
1 Introduction
Data mining is concerned with discovering patterns meaningfully from data. Data mining
has deep roots in the fields of statistics, artificial intelligence, and machine learning. With the
advent of inexpensive storage space and faster processing over the past decade, the research
has started to penetrate new grounds in areas of speech and audio processing as well as
spoken language dialog. It has gained interest due to audio data that are available in plenty.
Algorithmic advances in automatic speech recognition have also been a major, enabling
technology behind the growth in data mining. Large-vocabulary continuous speech
recognizers are now trained on record amounts of data, such as several hundreds of millions
of words and thousands of hours of speech. Pioneering research in robust speech processing,
large-scale discriminative training, finite state automata, and statistical hidden Markov
modeling have resulted in real-time recognizers that are able to transcribe spontaneous
speech. The technology is now highly attractive for a variety of speech mining applications.
Audio mining research includes many ways of applying machine learning, speech processing,
and language processing algorithms [1]. It helps in the areas of prediction, search,
explanation, learning, and language understanding. These basic challenges are becoming
increasingly important in revolutionizing business processes by providing essential sales and
marketing information about services, customers, and product offerings. A new class of
learning systems can be created that can infer knowledge and trends automatically from data,
analyze and report application performance, and adapt and improve over time with minimal
or zero human involvement. Effective techniques for mining speech, audio, and dialog data
can impact numerous business and government applications. The technology for monitoring
conversational audio to discover patterns, capture useful trends, and generate alarms is
essential for intelligence and law enforcement organizations as well as for enhancing call
center operation. It is useful for analyzing, monitoring, and tracking
customer preferences and interactions to better establish customized sales and technical
support strategies. It is also an essential tool in media content management for searching
through large volumes of audio warehouses to find information, documents, and news.
2 Technical Work Preparation
2.1 Problem Statement
Audio files are to be mined properly, with high accuracy, given partial audio information. This can be achieved using an ANN. This work implements the supervised backpropagation algorithm (BPA). The BPA is trained with the features of the audio data for different numbers of nodes in the hidden layer, and the configuration with the optimal number of nodes is chosen for proper audio mining.
2.2 Overview of Audio Mining
Audio recognition is a classic example of something the human brain does well but digital computers do poorly. Digital computers can store and recall vast amounts of data, perform mathematical calculations at blazing speed, and do repetitive tasks without becoming bored or inefficient, yet they perform very poorly when faced with raw sensory data. Teaching the same computer to understand audio is a major undertaking. Digital signal processing generally approaches the problem of audio recognition in two steps: 1) feature extraction, 2) feature matching. Each word in the incoming audio signal is isolated and then analyzed to identify the type of excitation and resonant frequency [2]. These parameters are then compared with previous examples of spoken words to identify the closest match. Often, these systems are limited to a few hundred words, can only accept signals with distinct pauses between words, and must be retrained. While this is adequate for many commercial applications, these limitations are humbling when compared to the abilities of human hearing.
There are two main approaches to audio mining. 1. Text-based indexing: Text-based
indexing, also known as large-vocabulary continuous speech recognition, converts speech to
text and then identifies words in a dictionary that can contain several hundred thousand
entries. 2. Phoneme-based indexing: Phoneme-based indexing doesn't convert speech to text
but instead works only with sounds. The system first analyzes and identifies sounds in a piece
of audio content to create a phonetic-based index. It then uses a dictionary of several dozen
phonemes to convert a user's search term to the correct phoneme string. (Phonemes are the smallest units of speech in a language, such as the long "a" sound, that distinguish one utterance from another; all words are sets of phonemes.) Finally, the system looks for the
search terms in the index. A phonetic system requires a more proprietary search tool because
it must phoneticize the query term, and then try to match it with the existing phonetic string
output. Although audio mining developers have overcome numerous challenges, several
important hurdles remain. Precision is improving but it is still a key issue impeding the
technology's widespread adoption, particularly in such accuracy-critical applications as court
reporting and medical dictation. Audio mining error rates vary widely depending on factors
such as background noise and cross talk. Processing conversational speech can be particularly
difficult because of such factors as overlapping words and background noise [3][4].
Breakthroughs in natural language understanding will eventually lead to big improvements.
The problem of audio mining is an area with many different applications. Audio
identification techniques include Channel vocoder, linear prediction, Formant vocoding,
Cepstral analysis. There are many current and future applications for audio mining. Examples
include telephone speech recognition systems, or voice dialers on car phones.
2.3 Schematic Diagram
The sequence of Audio mining can be schematically shown as below.

Fig.1: Sequence of audio processing
2.4 Artificial Neural Network
A neural network is constructed by highly interconnected processing units (nodes or neurons)
which perform simple mathematical operations [5]. Neural networks are characterized by
their topologies, weight vectors and activation function which are used in the hidden layers
and output layer [6]. The topology refers to the number of hidden layers and connection
between nodes in the hidden layers. The activation functions that can be used are sigmoid,
hyperbolic tangent and sine [7]. A very good account of neural networks can be found in
[11]. The network models can be static or dynamic [8]. Static networks include single layer
perceptrons and multilayer perceptrons. A perceptron or adaptive linear element (ADALINE)
[9] refers to a computing unit. This forms the basic building block for neural networks. The
input to a perceptron is the summation of input pattern vectors by weight vectors. In Figure 2,
the basic function of a single layer perceptron is shown.

Fig. 2: Operation of a neuron
In Figure 3, a multilayer perceptron is shown schematically. Information flows in a feed-
forward manner from input layer to the output layer through hidden layers. The number of
nodes in the input layer and output layer is fixed. It depends upon the number of input
variables and the number of output variables in a pattern. In this work, there are six input
variables and one output variable. The number of nodes in a hidden layer and the number of
hidden layers are variable. Depending upon the type of application, the network parameters
such as the number of nodes in the hidden layers and the number of hidden layers are found
by trial and error method.

Fig. 3: Multilayer Perceptron
In most applications one hidden layer is sufficient. The activation function used to train the ANN is the sigmoid function, given by:
f(x) = 1 / (1 + exp(-x)), where f(x) is a non-linear differentiable function, (1)
x = Σ(i=1 to Nn) Wij(p) xi(p) + θ(p), where Nn is the total number of nodes in the nth layer,
Wij is the weight connecting the ith neuron of a layer with the jth neuron of the next layer,
θ is the threshold applied to the nodes in the hidden layers and the output layer, and p is the
pattern number. In the first hidden layer, xi is treated as an input pattern vector; for the successive layers, xi is the output of the ith neuron of the preceding layer. The output xi of a neuron in the hidden layers and in the output layer is calculated by:
xi^(n+1)(p) = 1 / (1 + exp(-x(p))) (2)
For each pattern, the error E(p) at the output layer is calculated by:
E(p) = (1/2) Σ(i=1 to NM) (di(p) - xi^M(p))^2 (3)
where M is the total number of layers, including the input layer and the output layer, NM is the number of nodes in the output layer, di(p) is the desired output for a pattern, and xi^M(p) is the calculated output of the network for the same pattern at the output layer. The total error E over all patterns is calculated by:
E = Σ(p=1 to L) E(p), where L is the total number of patterns. (4)
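A compact sketch of equations (1)-(4) in NumPy follows; the layer sizes and random values are purely illustrative.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))               # eq. (1)

def layer_output(x_prev, W, theta):
    # weighted sum of the previous layer's outputs plus the threshold, then squashed (eq. (2))
    return sigmoid(W @ x_prev + theta)

def pattern_error(d, y):
    return 0.5 * np.sum((d - y) ** 2)             # eq. (3)

def total_error(desired, produced):
    return sum(pattern_error(d, y) for d, y in zip(desired, produced))   # eq. (4)

# toy 6-6-1 forward pass on a single pattern
rng = np.random.default_rng(0)
x = rng.random(6)
W1, W2 = rng.random((6, 6)), rng.random((1, 6))
hidden = layer_output(x, W1, np.zeros(6))
output = layer_output(hidden, W2, np.zeros(1))
print(pattern_error(np.array([0.5]), output))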
2.5 Implementation
The flowchart in Fig. 4 explains the sequence of implementation of audio mining. Fifty audio
files were chosen, and the feature extraction procedure is applied as described below.
Pre-emphasis and windowing. Audio is intrinsically a highly non-stationary signal. Signal
analysis, whether FFT-based or based on Linear Predictor Coefficients (LPC), must be carried
out on short segments across which the audio signal can be assumed to be stationary. Feature
extraction is performed on 20 to 30 ms windows with a 10 to 15 ms shift between two
consecutive windows. To avoid problems due to truncation of the signal, a weighting window
with appropriate spectral properties must be applied to the analyzed chunk of signal; common
choices are the Hamming, Hanning and Blackman windows.
Normalization. Feature normalization can be used to reduce the mismatch between signals
recorded in different conditions. Normalization consists of mean removal and, eventually,
variance normalization. Cepstral mean subtraction (CMS) is a good compensation technique
for convolutive distortions. Variance normalization consists of normalizing the feature
variance to one and is used in signal recognition to deal with noise and channel mismatch.
Normalization can be global or local: in the first case the mean and standard deviation are
computed globally, while in the second case they are computed on a window centered on the
current time.
Feature extraction by LPC. LPC starts with the assumption that an audio signal is produced
by a buzzer at the end of a tube, with occasional added hissing and popping sounds. Although
apparently crude, this model is actually a close approximation to the reality of signal
production. LPC analyzes the signal by estimating the formants, removing their effects from
the signal, and estimating the intensity and frequency of the remaining buzz. The process of
removing the formants is called inverse filtering, and the remaining signal after the
subtraction of the filtered modeled signal is called the residue. The numbers which describe
the intensity and frequency of the buzz, the formants, and the residue signal can be stored or
transmitted elsewhere. LPC synthesizes the signal by reversing the process: it uses the buzz
parameters and the residue to create a source signal, uses the formants to create a filter, and
runs the source through the filter, resulting in audio. The steps are:
1. Audio files, in mono or stereo, are recorded naturally or inside the lab, or taken from a
standard database.
2. Features are extracted after removing noise, provided it is a fresh recording; for existing
audio, noise removal is not required.
3. Two phases are adopted: a training phase and a testing phase.
4. Training phase: in this phase, a set of representative numbers (the weights) is obtained
from an initial set of numbers. The BPA is used for learning the audio files.
5. Testing phase: in this phase, the representative numbers obtained in step 4 are used along
with the features obtained from a test audio file to obtain an activation value. This value is
compared with a threshold, and a final decision is taken to retrieve an audio file or to trigger
further action, such as activating a system on a mobile phone.
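A minimal sketch of the framing, windowing and cepstral mean subtraction described above follows; the 25 ms frame length and 10 ms shift are typical choices within the quoted ranges, not the authors' exact settings.

import numpy as np

def frame_signal(signal, fs, frame_ms=25, shift_ms=10):
    """Split the signal into Hamming-windowed frames (frame_ms long, shift_ms apart)."""
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    window = np.hamming(frame_len)
    frames = [signal[start:start + frame_len] * window
              for start in range(0, len(signal) - frame_len + 1, shift)]
    return np.array(frames)

def cepstral_mean_subtraction(features):
    """Remove the per-coefficient mean over time (CMS)."""
    return features - features.mean(axis=0)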

2.6 Results and Discussion
Cepstrum analysis is a nonlinear signal processing technique with a variety of applications in
areas such as speech and image processing. The complex cepstrum for a sequence x is
calculated by finding the complex Natural logarithm of the Fourier transform of x, then the
inverse Fourier transform of the resulting sequence. The complex cepstrum transformation is
central to the theory and application of homomorphic systems, that is, systems that obey
certain general rules of superposition. The real cepstrum of a signal x, sometimes called
simply the cepstrum, is calculated by determining the natural logarithm of the magnitude of the Fourier transform of x, and then obtaining the inverse Fourier transform of the resulting sequence. It is difficult to reconstruct the original sequence from its real cepstrum transformation, as the
real cepstrum is based only on the magnitude of the Fourier transform of the sequence. Table 1 gives the cepstral coefficients for 25 sample audio files. Each row is a pattern used for training the ANN with the BPA. The topology of the ANN used is 6-6-1: 6 nodes in the input layer, 6 nodes in the hidden layer and 1 node in the output layer are used for proper training of the ANN, followed by audio mining.
Table 1 Cepstral Features Obtained from Sample Audio Files

F1-F6 are cepstral values. More than 6 values can be chosen for an audio file. Target labels should be less than 1 and greater than zero. When the number of audio files increases, more decimal places have to be incorporated in the labels.
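A minimal sketch of the real cepstrum computation described above follows; keeping the first few coefficients per frame yields features of the kind shown in Table 1 (the choice of six coefficients mirrors the 6-input topology and is otherwise an assumption).

import numpy as np

def real_cepstrum(frame, n_coeffs=6):
    spectrum = np.fft.fft(frame)
    log_magnitude = np.log(np.abs(spectrum) + 1e-12)   # small epsilon avoids log(0)
    cepstrum = np.fft.ifft(log_magnitude).real
    return cepstrum[:n_coeffs]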
3 Conclusion
Audio of common birds and pet animals has been recorded casually. Each audio file is suitably preprocessed, followed by cepstral analysis and training of the ANN using the BPA. A set of final weights with the 6-6-1 configuration is obtained after 7350 iterations, reaching a mean squared error of 0.0125. Fifty patterns have been used for training the ANN, and thirty patterns were used for testing (audio mining). The results are close to 90% mining accuracy, as the audio was recorded in the open. The recognition percentage and audio mining accuracy have to be tested with a large number of new audio files from the same set of birds and pet animals.
References
[1] Lie Lu and Hong-Jiang Zhang, "Content analysis for audio classification and segmentation," IEEE Transactions on Speech and Audio Processing, vol. 10, pp. 504-516, October 2002.
[2] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 6, pp. 708-716, November 2000.
[3] Haleh Vafaie and Kenneth De Jong, "Feature space transformation using genetic algorithms," IEEE Intelligent Systems, vol. 13, no. 2, pp. 57-65, March/April 1998.
[4] Usama M. Fayyad, "Data Mining and Knowledge Discovery: Making Sense Out of Data," IEEE Expert, October 1996, pp. 20-25.
[5] Fortuna L., Graziani S., LoPresti M. and Muscato G. (1992), "Improving back-propagation learning using auxiliary neural networks," Int. J. of Control, 55(4), pp. 793-807.
[6] Lippmann R. P. (1987), "An introduction to computing with neural nets," IEEE Acoustics, Speech and Signal Processing Magazine, vol. 35, no. 4, pp. 4-22.
[7] Yao Y. L. and Fang X. D. (1993), "Assessment of chip forming patterns with tool wear progression in machining via neural networks," Int. J. Mach. Tools & Mfg., 33(1), pp. 89-102.
[8] Hush D. R. and Horne B. G. (1993), "Progress in supervised neural networks," IEEE Signal Processing Magazine, pp. 8-38.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
A Practical Approach for Mining Data
Regions from Web Pages

K. Sudheer Reddy G.P.S. Varma P. Ashok Reddy
Infosys Technologies Ltd., Hyd. S.R.K.R.Engg College L.B.R.College of Engineering
sudheerreddy_k@infosys.com gpsvarma@yahoo.com ashokreddimca@gmail.com

Abstract

In recent years government agencies and industrial enterprises are using the
web as the medium of publication. Hence, a large collection of documents,
images, text files and other forms of data in structured, semi structured and
unstructured forms are available on the web. It has become increasingly
difficult to identify relevant pieces of information since the pages are often
cluttered with irrelevant content like advertisements, copyright notices, etc., surrounding the main content. This paper deals with the techniques that help
us mine such data regions in order to extract information from them to provide
value-added services. In this paper we propose an effective automatic
technique to perform the task. This technique is based on three important
observations about data regions on the web.
1 Introduction
Web information extraction is an important problem for information integration, because
multiple web pages may present the same or similar information using completely different
formats or syntaxes, which makes integration of information a challenging task. Due to the
heterogeneity and lack of structure of web data, automated discovery of targeted information
becomes a complex task. A typical web page consists of many blocks or areas, e.g., main
content areas, navigation areas, advertisements, etc. For a particular application, only part of
the information is useful, and the rest are noises. Hence it is useful to separate these areas
automatically for several practical applications. Pages in data-intensive web sites are usually
automatically generated from the back-end DBMS using scripts. Hence, the structured data
on the web are often very important since they represent their host pages' essential information, e.g., details about the list of products and services.
In order to extract and make use of information from multiple sites to provide value added
services, one needs to semantically integrate information from multiple sources. There are
several approaches for structured data extraction, which is also called wrapper generation.
The first approach is to manually write an extraction program for each web site based on
observed format patterns of the site. This manual approach is very labor intensive and time
consuming. It thus does not scale to a large number of sites. The second approach is wrapper
induction or wrapper learning, which is currently the main technique. Wrapper learning
works as follows: the user first manually labels a set of training pages. A learning system then generates rules from the training pages. The resulting rules are then applied to extract
target items from web pages. These methods either require prior syntactic knowledge or
substantial manual efforts. Example wrapper induction systems include WEIN.
The third approach is the automatic approach. Structured data objects on the web are normally database records retrieved from underlying web databases and displayed in web pages with some fixed templates, so automatic methods aim to find patterns or grammars in the web pages and then use them to extract data. Examples of automatic systems are IEPAD and ROADRUNNER.
Another problem with the existing automatic approaches is their assumption that the relevant
information of a data record is contained in a contiguous segment of HTML code, which is
not always true. MDR (Mining Data Records) basically exploits the regularities in the HTML
tag structure directly. It is often very difficult to derive accurate wrappers entirely based on
HTML tags. MDR algorithm makes use of the HTML tag tree of the web page to extract data
records from the page. However, an incorrect tag tree may be constructed due to the misuse
of HTML tags, which in turn makes it impossible to extract data records correctly. MDR has
several other limitations which will be discussed in the latter half of this paper. We propose a
novel and more effective method to mine the data region in a web page automatically. The
algorithm is called VSAP (Visual Structure based Analysis of web Pages). It finds the data
regions formed by all types of tags using visual cues.
2 Related Work
Extracting the regularly structured data records from web pages is an important problem. So
far, several attempts have been made to deal with the problem. Related work, mainly in the
area of mining data records in a web page automatically, is MDR (Mining Data Records).
MDR automatically mines all data records formed by table and form related tags i.e.,
<TABLE>, <FORM>, <TR>, <TD>, etc. assuming that a large majority of web data records
are formed by them.
The algorithm is based on two observations:
(a) A group of data records is always presented in a contiguous region of the web page and is formatted using similar HTML tags. Such a region is called a data region. (b) The nested structure of the HTML tags in a web page usually forms a tag tree, and a set of similar data records is formed by some child sub-trees of the same parent node.
The algorithm works in three steps:
Step 1 Building the HTML tag tree by following the nested blocks of the HTML tags in the
web page.
Step 2 Identifying the data regions by finding the existence of multiple similar generalized
nodes of a tag node. A generalized node (or a node combination) is a collection of child
nodes of tag node, with the following two properties:
(i) All the nodes have the same parent.
(ii) The nodes are adjacent.
Then each generalized node is checked to decide if it contains multiple records or only one
record. This is done by string comparison of all possible combinations of component nodes using the normalized edit distance method. A data region is a collection of two or more generalized nodes with the following properties:
(i) The generalized nodes all have the same parent.
(ii) The generalized nodes all have the same length.
(iii)The generalized nodes are all adjacent.
(iv) The normalized edit distance (string comparison) between adjacent generalized nodes
is less than a fixed threshold.
To find the relevant data region, MDR makes use of content mining.
Step 3 Identifying the data records involves finding the data records within each generalized node in a data region. All three steps of MDR have certain serious limitations, which will be discussed in the latter half of this paper.
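For reference, a minimal sketch of one common normalized edit distance (Levenshtein distance divided by the length of the longer string) follows; the exact normalization used by MDR may differ.

def normalized_edit_distance(a: str, b: str) -> float:
    if not a and not b:
        return 0.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1] / max(len(a), len(b))

print(normalized_edit_distance("<tr><td>", "<tr><th>"))  # 0.125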
2.1. How to use MDR
In running MDR, we used their default settings. MDR system was downloaded at:
http://www.cs.uic.edu/~liub/WebDataExtraction/MDR-download.html
1. Click on "mdr.exe". You will get a small interface window.
2. You can type or paste a URL (including http://) or a local path into the Combo Box; the Combo Box contains a list of URLs which you have added. At the beginning it may be empty.
3. If you are interested in extracting tables (or with rows and columns of data), Click on
"Extract" in the Table section.
4. If you are interested in extracting other types of data records, click on "Extract" in the
"Data Records (other types)" section. We separate the two functions for efficiency
reasons.
5. After the execution, the output file will be displayed in an IE window. The extracted
tables or data regions and data records are there.
Options: Only show the data regions with the "$" sign: when dealing with e-commerce websites, most data records of interest are merchandise. If this option is checked, MDR only outputs the data regions in which the data records are merchandise (here we assume every item of merchandise has a price with a "$" sign). In this way, some data regions that also contain regular-pattern data records will not be displayed.
3 The Proposed Technique
We propose a novel and more effective method to mine the data region in a web page
automatically. The algorithm is called VSAP (Visual Structure based Analysis of web Pages). The visual information (i.e., the locations on the screen at which tags are rendered) helps the system in three ways:
a) It enables the system to identify the gaps that separate records, which helps to segment data records correctly, because the gaps within a data record (if any) are typically smaller than those between data records.
b) The visual and display information also contains information about the hierarchical
structure of the tags.
c) Visual structure analysis of web pages shows that the relevant data region tends to occupy the major central portion of the web page.
The system model of the VSAP technique is shown in fig 1.
It consists of the following components.
Parsing and Rendering Engine
Largest Rectangle Identifier
Container Identifier
Data Region Identifier
The output of each component is the input of the next component.

Fig. 1: System Model
The VSAP technique is based on three observations:
a) A group of data records that contains descriptions of a set of similar objects is typically presented in a contiguous region of a page.
b) The area covered by a rectangle that bounds the data region is more than the area
covered by the rectangles bounding other regions, e.g. Advertisements and links.
c) The height of an irrelevant data record within a collection of data records is less than
the average height of relevant data records within that region.
Definition 1: A data region is defined as the most relevant portion of a webpage.
E.g. A region on a product related web site that contains a list of products forms the data
region.
Definition 2: A data record is defined as a collection of data that forms a meaningful independent entity. E.g. a product listed inside a data region on a product-related web site is a data record.
Fig 2 illustrates an example, a segment of a web page that shows a data region containing a list of four books. The full description of each book is a data record.

Fig. 2: An Example of a data region containing 4 data records
The overall algorithm of the proposed technique is as follows:
Algorithm VSAP (HTML document)
a) Set maxRect=NULL
b) Set dataRegion=NULL
c) FindMaxRect (BODY);
d) FindDataRegion (maxRect);
e) FilterDataRegion (dataRegion);
End
The lines 1 and 2 specify initializations. The line 3 finds the largest rectangle within a
container. Line 4 identifies the data region which consists of the relevant data region and
some irrelevant regions also. Line 5 identifies the actual relevant data region by filtering the
bounding irrelevant regions. As mentioned earlier, the proposed technique has two main
steps. This section presents them in turn.
3.1 Determining the Co-ordinates of All Bounding Rectangles
In the first step of the proposed technique, we determine the coordinates of all the bounding rectangles in the web page. The VSAP approach uses the MSHTML parsing and rendering
engine of Microsoft Internet Explorer 6.0. This parsing and rendering engine of the web
browser gives us these coordinates of a bounding rectangle. We scan the HTML file for tags.
For each tag encountered, we determine the coordinate of the top left corner, height and
width of the bounding rectangle of the tag.
Definition: Every HTML tag specifies a method for rendering the information contained
within it. For each tag, there exists an associated rectangular area on the screen. Any
information contained within rectangular area obeys the rendering rules associated with the
tag. This rectangle is called the bounding rectangle for the particular tag.
A bounding rectangle is constructed by obtaining the coordinate of the top left corner of the tag and the height and width of that tag. The left and top coordinates of the tag are obtained from the offsetLeft and offsetTop properties of the HTMLObjectsElement class; these values are relative to the parent tag. The height and width of the tag are available from the offsetHeight and offsetWidth properties of the same class.
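A minimal sketch of this rectangle computation follows, assuming each parsed tag is represented by a node carrying the offsetLeft/offsetTop/offsetHeight/offsetWidth values reported by the rendering engine; the class and function names are illustrative, not part of VSAP.

from dataclasses import dataclass, field
from typing import List

@dataclass
class TagNode:
    name: str
    offset_left: int      # offsetLeft: relative to the parent tag
    offset_top: int       # offsetTop: relative to the parent tag
    width: int            # offsetWidth
    height: int           # offsetHeight
    children: List["TagNode"] = field(default_factory=list)
    abs_left: int = 0
    abs_top: int = 0

    def area(self) -> int:
        return self.width * self.height

def compute_absolute_rects(node: TagNode, parent_left: int = 0, parent_top: int = 0) -> None:
    """Accumulate parent-relative offsets into absolute screen coordinates."""
    node.abs_left = parent_left + node.offset_left
    node.abs_top = parent_top + node.offset_top
    for child in node.children:
        compute_absolute_rects(child, node.abs_left, node.abs_top)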
Fig. 3 shows a sample web page of a product-related website, containing a number of books and their descriptions, which form the data records inside the data region.

Fig. 3: A Sample Web page of a product related website
For each HTML tag on a web page, there exists an associated rectangle on the screen, which forms the bounding rectangle for that specific tag. Fig 4 shows the bounding rectangles for the <TD> tags of the web page shown in Fig 3.

Fig. 4: Bounding Rectangles for <TD> tag corresponding to the web page in Fig 3
3.2 Identifying the Data Regions
The second step of the proposed technique is to identify the data region of the web page. The
data region is the most relevant portion of a web page that contains a list of data records. The
three steps involved in identifying the data region are:
Identify the largest rectangle.
Identify the container within the largest rectangle.
Identify the data region containing the data records within this container.
3.2.1 Identification of the Largest Rectangle
Based on the height and width of bounding rectangles obtained in the previous step, we
determine the area of the bounding rectangles of each of the children of the BODY tag. We
then determine the largest rectangle amongst these bounding rectangles. The reason for doing this is the observation that the largest bounding rectangle will always contain the most relevant data on that web page. In Fig 5 the largest rectangle is shown with a dotted border. The procedure FindMaxRect identifies the largest rectangle amongst all the bounding rectangles of the children of the BODY tag. It is as follows.
Procedure FindMaxRect (BODY)
for each child of BODY tag
begin
    find the coordinates of the bounding rectangle of the child
    if the area of the bounding rectangle > area of maxRect then
        maxRect = child
    endif
end

Fig. 5: Largest Rectangle amongst bounding rectangles of children of BODY tag
3.2.2 Identification of the Container within the Largest Rectangle
Once we have obtained the largest rectangle, we form a set of the bounding rectangles contained in it whose area exceeds half the area of the largest rectangle; the rationale is that the most important data of the web page must occupy a significant portion of the page. Next, we determine the bounding rectangle having the smallest area in this set. The reason for determining the smallest rectangle within the set is that the smallest such rectangle will contain little beyond the data records. Thus a container is obtained. It contains the data region and some irrelevant data.

Fig. 6: The container identified from sample web page in Fig 3
Definition: A container is a superset of the data region which may or may not contain
irrelevant data. For example, the irrelevant data contained in the container may include
advertisements on the right and bottom of the page and the links on the left side. The Fig 6
shows the container identified from the web page shown in fig 3.
The procedure FindDataRegion identifies the container in the web page, which contains the relevant data region along with some irrelevant data. It is as follows:
Procedure FindDataRegion (maxRect)
ListChildren = depth-first listing of the children of the tag associated with maxRect
for each tag in ListChildren
begin
    if area of bounding rectangle of tag > half the area of maxRect then
        if area of bounding rectangle of dataRegion > area of bounding rectangle of tag then
            dataRegion = tag
        endif
    endif
end
Fig 7 shows an enlarged view of the container shown in Fig 6. We note that there is some irrelevant data, both at the top and at the bottom of the actual data region containing the data records.

Fig. 7: The Enlarged view of the container shown in Fig 6
3.2.3 Identification of Data Region Containing Data Records within the Container
To filter the irrelevant data from the container, we use a filter. The filter determines the average height of the children within the container; those children whose heights are less than the average height are identified as irrelevant and are filtered out. Fig 8 shows the filter applied to the container in Fig 7 in order to obtain the data region. We note that the irrelevant data, in this case at the top and bottom of the container, is removed by the filter. The procedure FilterDataRegion filters the irrelevant data from the container and gives the actual data region as the output. It is as follows:
Procedure FilterDataRegion (dataRegion)
totalHeight = 0
for each child of dataRegion
    totalHeight += height of the bounding rectangle of the child
avgHeight = totalHeight / number of children of dataRegion
for each child of dataRegion
begin
    if height of the child's bounding rectangle < avgHeight then
        remove child from dataRegion
    endif
end
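For concreteness, the three VSAP procedures can be re-expressed over the TagNode sketch given in Section 3.1; this is an illustrative restatement of the pseudocode, not the authors' implementation.

from typing import List, Optional

# assumes the TagNode class from the earlier sketch in Section 3.1

def find_max_rect(body: "TagNode") -> Optional["TagNode"]:
    # largest bounding rectangle among the direct children of <BODY>
    return max(body.children, key=lambda c: c.area(), default=None)

def find_data_region(max_rect: "TagNode") -> "TagNode":
    # smallest descendant whose area still exceeds half of maxRect's area (the container)
    container = max_rect
    stack = list(max_rect.children)          # depth-first listing of the children
    while stack:
        node = stack.pop()
        if max_rect.area() / 2 < node.area() < container.area():
            container = node
        stack.extend(node.children)
    return container

def filter_data_region(container: "TagNode") -> List["TagNode"]:
    # drop children shorter than the average child height (headers, ads, footers)
    if not container.children:
        return []
    avg_height = sum(c.height for c in container.children) / len(container.children)
    return [c for c in container.children if c.height >= avg_height]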

Fig. 8: Data Region obtained after filtering the container in Fig 7
The VSAP technique, as described above, is able to mine the relevant data region containing
data records from the given web page efficiently.
4 MDR Vs VSAP
In this section we evaluate the proposed technique. We also compare it with MDR.
The evaluation consists of three aspects as discussed in the following:
4.1 Data Region Extraction
We compare the first step of MDR with our system for identifying the data regions.
MDR is dependent on certain tags, such as <TABLE> and <TBODY>, for identifying the data region. But a data region need not always be contained only within table-related tags such as <TABLE> and <TBODY>; it may also be contained within other tags such as <P>, <LI> and <FORM>. In the proposed VSAP system, data region identification is independent of specific tags and forms. Unlike MDR, where an incorrect tag tree may be constructed due to the misuse of HTML tags, there is no such possibility in VSAP, because the hierarchy of tags is constructed based on the visual cues of the web page. In MDR, the entire tag tree needs to be scanned in order to mine data regions, whereas VSAP does not scan the entire tag tree but only the largest child of the <BODY> tag. Hence, this method proves very efficient in improving the time complexity compared to other contemporary algorithms.
4.2 Data Record Extraction
We compare the record extraction step of MDR with VSAP. MDR identifies the data records based on keyword search (e.g., "$"), whereas VSAP depends purely on the visual structure of the web page; it does not make use of any text or content mining. This proves to be very advantageous, as it avoids the additional overhead of performing keyword search on the web page.
MDR not only identifies the relevant data region containing the search result records but also extracts records from the other sections of the page, e.g., some advertisement records, which are irrelevant. In MDR, comparison of generalized nodes is based on string comparison using the normalized edit distance method. However, this method is slow and inefficient compared to VSAP, where the comparison is purely numeric, since we are comparing the coordinates of the bounding rectangles; it scales well for all web pages. A single data record may be composed of multiple sub-trees, and due to noisy information MDR may find a wrong combination of sub-trees. In the VSAP system, the visual gaps between data records help to deal with this problem.
4.3 Overall Time Complexity
The complexity of VSAP is much lower than that of existing algorithms. The existing algorithm MDR has a complexity of the order O(NK), without considering string comparison, where N is the total number of nodes in the tag tree and K is the maximum number of tag nodes that a generalized node can have (normally a small number, less than 10). Our algorithm VSAP has a complexity of the order O(n), where n is the number of tag comparisons made.
5 Conclusion
In this paper, we have proposed a new approach to extract structured data from web pages. Although the problem has been studied by several researchers, existing techniques are either inaccurate or make many strong assumptions. A novel and effective method, VSAP, is proposed to mine the data region in a web page automatically. It is a purely visual-structure-oriented method that can correctly identify the data region, and it is not affected by errors due to the misuse of HTML tags. Most current algorithms fail to correctly determine the data region when it consists of only one data record; most also fail in the case where a series of data records is separated by an advertisement, followed again by a single data record. VSAP works correctly for both of the above cases. The number of comparisons done in VSAP is significantly smaller than in other approaches. Further, the comparisons are made on numbers, unlike other methods where strings or trees are compared. Thus VSAP overcomes the drawbacks of existing methods and performs significantly better than these methods.
Scope for Future Work
Extraction of data fields from the data records contained in these mined data regions will be considered in future work, also taking into account complexities such as web pages featuring dynamic HTML. The extracted data can be put into a suitable format and eventually stored in a relational database; data extracted from each web page can then be integrated into a single collection. This collection of data can be further used for various knowledge discovery applications, e.g., making a comparative study of products from various companies, smart shopping, etc.
References
[1] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques.
[2] Arun K. Pujari, Data Mining Techniques.
[3] Pieter Adriaans and Dolf Zantinge, Data Mining.
[4] George M. Marakas, Modern Data Warehousing, Mining, and Visualization: Core Concepts, 2003.
[5] R. Baeza-Yates, Algorithms for String Matching: A Survey.
[6] J. Hammer, H. Garcia-Molina, J. Cho, and A. Crespo, Extracting Semi-structured Information from the Web.
[7] A. Arasu and H. Garcia-Molina, Extracting Structured Data from Web Pages.






Computer Networks
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
On the Optimality of WLAN Location
Determination Systems

T.V. Sai Krishna T. Sudha Rani
B.V.C. Engineering College Aditya Engineering College
Odalarevu, J.N.T.U Kakinada Surampalem, J.N.T.U Kakinada
sai.bvc@gmail.com sudha_mahi84@yahoo.co.in

Abstract

This paper presents a general analysis for the performance of WLAN location
determination systems. In particular, we present an analytical method for
calculating the average distance error and probability of error of WLAN
location determination systems. These expressions are obtained with no
assumptions regarding the distribution of signal strength or the probability of
the user being at a specific location, which is usually taken to be a uniform
distribution over all the possible locations in current WLAN location
determination systems. We use these expressions to find the optimal strategy
to estimate the user location and to prove formally that probabilistic
techniques give more accuracy than deterministic techniques, which has been
taken for granted without proof for a long time. The analytical results are
validated through simulation experiments and we present the results of testing
actual WLAN location determination systems in an experimental testbed.
Keywords: Analytical analysis, optimal WLAN positioning strategy, simulation experiments,
WLAN location determination.
1 Introduction
WLAN location determination systems use the popular 802.11 [10] network infrastructure to
determine the user location without using any extra hardware. This makes these systems
attractive in indoor environments where traditional techniques, such as the Global Positioning
System (GPS) [5], fail to work or require specialized hardware. Many applications have been
built on top of location determination systems to support pervasive computing. This includes
[4] location-sensitive content delivery, direction finding, asset tracking, and emergency
notification.
In order to estimate the user location, a system needs to measure a quantity that is a function
of distance. Moreover, the system needs one or more reference points to measure the distance
from. In case of the GPS system, the reference points are the satellites and the measured
quantity is the time of arrival of the satellite signal to the GPS receiver, which is directly
proportional to the distance between the satellite and the GPS receiver. In case of WLAN
location determination systems, the reference points are the access points and the measured
quantity is the signal strength, which decays logarithmically with distance in free space.
Unfortunately, in indoor environments, the wireless channel is very noisy and the radio
frequency (RF) signal can suffer from reflection, diffraction, and multipath effect [9], [12],
which makes the signal strength a complex function of distance. To overcome this problem,
WLAN location determination systems tabulate this function by sampling it at selected
locations in the area of interest. This tabulation has been known in literature as the radio map,
which captures the signature of each access point at certain points in the area of interest.
WLAN location determination systems usually work in two phases: offline phase and
location determination phase. During the offline phase, the system constructs the radio-map.
In the location determination phase, the vector of samples received from each access point
(each entry is a sample from one access point) is compared to the radio-map and the nearest
match is returned as the estimated user location. Different WLAN location determination
techniques differ in the way they construct the radio map and in the algorithm they use to
compare a received signal strength vector to the stored radio map in the location
determination phase.
In this paper, we present a general analysis of the performance of WLAN location
determination systems. In particular, we present a general analytical expression for the
average distance error and probability of error of WLAN location determination systems.
These expressions are obtained with no assumptions regarding the distribution of signal
strength or user movement profile. We use these expressions to find the optimal strategy to
use during the location determination phase to estimate the user location. These expressions
also help to prove formally that probabilistic techniques give more accuracy than
deterministic techniques, which has been taken for granted without proof for a long time. We
validate our analysis through simulation experiments and discuss how well it models actual
environments. For the rest of the paper we will refer to the probability distribution of the user
location as the user profile. To the best of our knowledge, our work is the first to analyze the
performance of WLAN location systems analytically and provide the optimal strategy to
select the user location.
The rest of this paper is structured as follows. Section 2 summarizes the previous work in the
area of WLAN location determination systems. Section 3 presents the analytical analysis for
the performance of the WLAN location determination systems. In section 4, we validate our
analytical analysis through simulation and measurement experiments. Section 5 concludes the
paper and presents some ideas for future work.
2 Related Work
Radio map-based techniques can be categorized into two broad categories: deterministic
techniques and probabilistic techniques. Deterministic techniques, such as [2], [8], represent
the signal strength of an access point at a location by a scalar value, for example, the mean
value, and use non-probabilistic approaches to estimate the user location. For example, in the
Radar system [2] the authors use nearest neighborhood techniques to infer the user location.
On the other hand, probabilistic techniques, such as [3], [6], [7], [13], [14], store information
about the signal strength distributions from the access points in the radio map and use
probabilistic techniques to estimate the user location. For example, the Horus system from
the University of Maryland [14], [15] uses the stored radio map to find the location that has
the maximum probability given the received signal strength vector.
All these systems base their performance evaluation on experimental testbeds which may not
give a good idea on the performance of the algorithm in different environments. The authors
in [7], [14], [15] showed that their probabilistic technique outperformed the deterministic
technique of the Radar system [2] in a specific testbed and conjectured that probabilistic
techniques should outperform deterministic techniques. This paper presents a general
analytical method for analyzing the performance of different techniques. We use this analysis
method to provide a formal proof that probabilistic techniques outperform deterministic
techniques. Moreover, we show the optimal strategy for selecting locations in the location
determination phase.
3 Analytical Analysis
In this section, we give an analytical method to analyze the performance of WLAN location
determination techniques. We start by describing the notations used throughout the paper. We
provide two expressions: one for calculating the average distance error of a given technique
and the other for calculating the probability of error (i.e. the probability that the location
technique will give an incorrect estimate).
3.1 Notations
We consider an area of interest whose radio map contains N locations. We denote the set of
locations as L. At each location, we can get the signal strength from k access points. We
denote the k-dimensional signal strength space as S. Each element in this space is a k-
dimensional vector whose entries represent the signal strength readings from different access
points. Since the signal strength values returned by the wireless cards are typically integers,
the signal strength space S is discrete. For a vector s ∈ S, f*_A(s) represents the
estimated location returned by the WLAN location determination technique A when supplied
with the input s. For example, in the Horus system [14], [15], f*_Horus(s) returns the
location l ∈ L that maximizes P(l|s). Finally, we use Euclidean(l1, l2) to denote the
Euclidean distance between two locations l1 and l2.
3.2 Average Distance Error
We want to find the average distance error (denoted by E(DErr)). Using conditional
probability, this can be written as:

E(DErr) = Σ_{l∈L} E(DErr | l is the correct user location) · P(l is the correct user location)   (1)

where P(l is the correct user location) depends on the user profile.
We now proceed to calculate E(DErr | l is the correct user location). Using conditional
probability again:

E(DErr | l is the correct user location)
  = Σ_{s∈S} E(DErr | s, l is the correct user location) · P(s | l is the correct user location)   (2)
  = Σ_{s∈S} Euclidean(f*_A(s), l) · P(s | l is the correct user location)
where Euclidean(f*_A(s), l) represents the Euclidean distance between the estimated
location and the correct location. Equation 2 says that to get the expected distance error given
that we are at location l, we need to take the weighted sum, over all possible signal strength
vectors s ∈ S, of the Euclidean distance between the estimated user location f*_A(s) and the
actual location l.
Substituting equation 2 in equation 1 we get:

E(DErr) = Σ_{l∈L} Σ_{s∈S} Euclidean(f*_A(s), l) · P(s | l is the correct user location) · P(l is the correct user location)   (3)
Note that the effect of the location determination technique is summarized in the function
f*_A. We seek to find the function that minimizes the probability of error. We defer the
optimality analysis until we have presented the probability of error analysis.
3.3 Probability of Error
In this section, we want to find an expression for the probability of error, which is the
probability that the location determination technique will return an incorrect estimate. This
can be obtained from equation 3 by noting that every non-zero distance error (represented by
the function Euclidean(f*_A(s), l)) is considered an error.
More formally, we define the function g as:

g(x) = 0 if x = 0, and g(x) = 1 if x > 0.
The probability of error can be calculated from equation 3 as:

P(Error) = Σ_{l∈L} Σ_{s∈S} g(Euclidean(f*_A(s), l)) · P(s | l is the correct user location) · P(l is the correct user location)   (4)
In the next section, we present a property of the term g(Euclidean(f*_A(s), l)) and use
this property to get the optimal strategy for selecting the location.
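As a rough illustration of how equations 3 and 4 can be evaluated numerically, the following sketch assumes a discrete radio map stored as per-location dictionaries of P(s|l), a user profile P(l), location coordinates, and an arbitrary technique f_A; the names and data layout are our own assumptions, not the paper's code.

```python
# Minimal sketch of equations 3 and 4 for a discrete radio map.
# Assumed layout: radio_map[l][s] = P(s | l) for signal vector s (a tuple),
# profile[l] = P(l), coords[l] = (x, y), and f_A(s) returns an estimated location.
import math

def expected_distance_error(radio_map, profile, coords, f_A):
    """Equation 3: weighted sum of distance errors over locations and signal vectors."""
    total = 0.0
    for l, p_l in profile.items():
        for s, p_s_given_l in radio_map[l].items():
            est = f_A(s)
            dist = math.dist(coords[est], coords[l])   # Euclidean(f_A(s), l)
            total += dist * p_s_given_l * p_l
    return total

def probability_of_error(radio_map, profile, coords, f_A):
    """Equation 4: g(x) = 1 for any non-zero distance error, 0 otherwise."""
    total = 0.0
    for l, p_l in profile.items():
        for s, p_s_given_l in radio_map[l].items():
            if f_A(s) != l:                            # g(Euclidean(...)) = 1
                total += p_s_given_l * p_l
    return total
```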
3.4 Optimality
We will base our optimality analysis on the probability of error.
Lemma 1: For a given signal strength vector s, g(Euclidean(f*_A(s), l)) will be zero for only
one location l ∈ L and one for the remaining N - 1 locations.
Proof: The proof can be found in [11]; it has been omitted due to space constraints.
The lemma states that only one location gives a value of zero for the function
g(Euclidean(f*_A(s), l)) in the inner sum. This means that the optimal strategy should select this
location in order to minimize the probability of error. This leads us to the following theorem.
Theorem 1 (Optimal Strategy): Selecting the location that maximizes the probability
P(s|l)·P(l) is both a necessary and sufficient condition to minimize the probability of error.
Proof: The proof can be found in [11].
Theorem 1 suggests that the optimal location determination technique should store in the
radio map the signal strength distributions in order to be able to calculate P(s|l). Moreover, the
optimal technique needs to know the user profile in order to calculate P(l).
Corollary 1: Deterministic techniques are not optimal.
Proof: The proof can be found in [11]. Note that we did not make any assumption about the
independence of access points, the user profile, or the signal strength distribution in order to
obtain the optimal strategy.
A major assumption made by most current WLAN location determination systems is that all
user locations are equi-probable. In this case, P(l) = 1/N and Theorem 1 can be rewritten as:
Theorem 2: If the user is equally probable to be at any of the radio map locations L, then
selecting the location l that maximizes the probability P(s|l) is both a necessary and sufficient
condition to minimize the probability of error.
Proof: The proof is a special case of the proof of Theorem 1.
This means that, for this special case, it is sufficient for the optimal technique to store the
histogram of signal strength at each location. This is exactly the technique used in the Horus
system [14], [15].
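The selection rule of Theorems 1 and 2 can be written very compactly; the sketch below reuses the hypothetical data layout from the earlier sketch and simply returns the location maximizing P(s|l)·P(l), which under a uniform profile reduces to maximizing P(s|l), as in the Horus system.

```python
# Minimal sketch of the optimal selection rule (Theorems 1 and 2); data layout as above.
def optimal_location(s, radio_map, profile):
    return max(profile, key=lambda l: radio_map[l].get(s, 0.0) * profile[l])
```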
Figure 1 shows a simplified example illustrating the intuition behind the analytical
expressions and the theorems. In the example, we assume that there are only two locations in
the radio map and that at each location only one access point can be heard whose signal
strength, for simplicity of illustration, follows a continuous distribution. The user can be at
any one of the two locations with equal probability. For the Horus system (Figure 1.a),
consider the line that passes through the point of intersection of the two curves.

Fig. 1: Expected error for the special case of two locations
Since for a given signal strength the technique selects the location that has the maximum
probability, the error if the user is at location 1 is the area of curve 1 to the right of this line. If
the user is at location 2, the error is the area of curve 2 to the left of this line. The expected
error probability is half the sum of these two areas as the two locations are equi-probable.
This is the same as half the area under the minimum of the two curves (shaded in the figure).
For the Radar system (Figure 1.b), consider the line that bisects the signal strength space
between the two distribution averages. Since for a given signal strength the technique selects
the location whose average signal strength is closer to the signal strength value, the error if the
user is at location 1 is the area under curve 1 to the right of this line. If the user is at location
2, the error is the area under curve 2 to the left of this line. The expected error probability is
half the sum of these two areas, as the two locations are equi-probable (half the shaded area in
the figure). From Figure 1, we can see that the Horus system outperforms the Radar system,
since the expected error of the former is less than that of the latter (by the hashed area in
Figure 1.b). The two systems would have the same expected error only if the line bisecting the
signal strength space between the two averages passed through the intersection point of the
two curves, which is not true in general. This has been proved formally in the above theorems.
We provide simulation and experimental results to validate our analysis in Section 4.
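The following small numeric example (our own, with made-up Gaussian parameters for the single access point) reproduces the Figure 1 reasoning: the probabilistic rule's expected error is half the area under the minimum of the two densities, while the deterministic rule's error is given by the two tails on either side of the line bisecting the two means.

```python
# Numeric illustration of the two-location example in Figure 1 (assumed parameters).
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gauss_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mu1, s1, mu2, s2 = -60.0, 3.0, -52.0, 6.0   # dBm; illustrative values only

# Horus-style rule: expected error = half the area under the minimum of the two densities.
xs = [x * 0.01 for x in range(-10000, 1)]   # integrate numerically over [-100, 0] dBm
horus_err = 0.5 * sum(min(gauss_pdf(x, mu1, s1), gauss_pdf(x, mu2, s2)) * 0.01 for x in xs)

# Radar-style rule: decide by the closer average; error = half the two tails beyond the midpoint.
mid = (mu1 + mu2) / 2.0
radar_err = 0.5 * ((1.0 - gauss_cdf(mid, mu1, s1)) + gauss_cdf(mid, mu2, s2))

print(f"probabilistic rule expected error: {horus_err:.3f}")
print(f"deterministic rule expected error: {radar_err:.3f}")
```

Running the sketch prints a smaller error probability for the probabilistic rule, in line with the theorems.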
4 Experiments
4.1 Testbed
We performed our experiment on a floor covering an area of 20,000 square feet. The layout of the floor
is shown in Figure 2.

Fig. 2: Plan of the floor where the experiment was conducted.
Readings were collected in the corridors (shown in gray).
Both techniques were tested in the Computer Science Department wireless network. The
entire wing is covered by 12 access points installed in the third and fourth floors of the
building. For building the radio map, we took the radio map locations on the corridors on a
grid with cells placed 5 feet apart (the corridor width is 5 feet). We have a total of 110
locations along the corridors. On the average, each location is covered by 4 access points. We
used the mwvlan driver and the MAPI API [1] to collect the samples from the access points.
4.2 Simulation Experiments
In this section, we validate our analytical results through simulation experiments. For this
purpose, we chose to implement the Radar system [2] from Microsoft as a deterministic
technique and the Horus system [14], [15] from the University of Maryland as a probabilistic
technique that satisfies the optimality criteria described in Theorem 2.
We start by describing the experimental testbed that we use to validate our analytical results
and evaluate the systems.
4.2.1 Simulator
We built a simulator that takes as an input the following parameters:
The co-ordinates of the radio map locations.
The signal strength distributions at each location from each access point.
The distribution over the radio map locations that represents the steady-state
probability of the user being at each location (user profile).
The simulator then chooses a location based on the user location distribution and generates a
signal strength vector according to the signal strength distributions at this location. The
simulator feeds the generated signal strength vector to the location determination technique.
The estimated location is compared to the generated location to determine the distance error.
The next sections analyze the effect of the uniform user profile on the performance of the
location determination systems and validate our analytical results. The results for the
heterogeneous profiles can be found in [11].
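A compact sketch of this simulation loop is given below; the data layout (per-location, per-access-point signal strength histograms) and all names are assumptions of ours, not the authors' simulator code.

```python
# Sketch of the simulator loop: sample a location from the user profile, sample a
# signal strength vector from that location's distributions, run the technique,
# and record the distance error.
import math
import random

def simulate(coords, radio_map, profile, technique, trials=10000, seed=0):
    """radio_map[l][ap] is a list of (signal_value, probability) pairs for access point ap."""
    rng = random.Random(seed)
    locations = list(profile)
    weights = [profile[l] for l in locations]
    errors = []
    for _ in range(trials):
        l = rng.choices(locations, weights=weights)[0]          # user profile
        s = tuple(rng.choices([v for v, _ in radio_map[l][ap]],
                              [p for _, p in radio_map[l][ap]])[0]
                  for ap in sorted(radio_map[l]))               # signal strength vector
        est = technique(s)                                      # estimated location
        errors.append(math.dist(coords[est], coords[l]))        # distance error
    return sum(errors) / len(errors)
```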
4.2.2 Uniform user Location Distribution
This is similar to the assumption taken by the Horus system. Therefore, the Horus system
should give optimal results. Figure 3 shows the probability of error and the average distance
error (analytical and simulation results), respectively, for the Radar and Horus systems.
The error bars represent the 95% confidence interval for the simulation experiments. The
figure shows that the analytical expressions obtained are consistent with the simulation
results. Moreover, the performance of the Horus system is better than that of the Radar
system, as predicted by Theorem 2; under a uniform user location distribution, the Horus
system's performance is optimal.

Fig. 3: Performance of the Horus and Radar systems under a uniform user profile (profile 1).
4.3 Measurements Experiments
In our simulations, we assumed that the test data follows the signal strength distributions
exactly. This can be considered the ideal case, since in a real environment the received signal
may differ slightly from the stored signal strength distributions. Our results, however, are still
valid and can be considered an upper bound on the performance of the simulated systems. In
order to confirm this, we tested the Horus system and the Radar system in an environment
where the test set was collected on different days, at different times of day, and by different
persons than the training set. Figure 4 shows the CDF of the distance error for the two
systems. The figure shows that the Horus system (a probabilistic technique) significantly
outperforms the Radar system (a deterministic technique), which conforms to our analytical
results.

Fig. 4: CDF for the Distance Error for the Two Systems.
5 Conclusions and Future Work
We presented an analysis method for studying the performance of WLAN location
determination systems. The method can be applied to any of the WLAN location
determination techniques and does not make any assumptions about the signal strength
distributions at each location, the independence of access points, or the user profile. We also
studied the effect of the user profile on the performance of WLAN location determination
systems.
We used the analytical method to obtain the optimal strategy for selecting the user location.
The optimal strategy must take into account the signal strength distributions at each location
and the user profile. We validated the analytical results through simulation experiments. In
our simulations, we assumed that the test data follows the signal strength distributions
exactly. This can be considered as the ideal case since in a real environment, the received
signal may differ slightly from the stored signal strength distributions. Our results however
are still valid and can be considered as an upper bound on the performance of the simulated
systems. We confirmed this through an actual implementation in a typical environment. For
future work, the method can be extended to include other factors that affect the location
determination process such as averaging multiple signal strength vectors to obtain better
accuracy, using the user history profile, usually taken as the time average of the latest
location estimates, and the correlation between samples from the same access points.
References
[1] http://www.cs.umd.edu/users/moustafa/Downloads.html.
[2] P. Bahl and V. N. Padmanabhan. RADAR: An In-Building RF-based User Location and Tracking System.
In IEEE Infocom 2000, volume 2, pages 775-784, March 2000.
[3] P. Castro, P. Chiu, T. Kremenek, and R. Muntz. A Probabilistic Location Service for Wireless Network
Environments. Ubiquitous Computing 2001, September 2001.
[4] G. Chen and D. Kotz. A Survey of Context-Aware Mobile Computing Research. Technical Report
Dartmouth Computer Science Technical Report TR2000-381, 2000.
[5] P. Enge and P. Misra. Special issue on GPS: The Global Positioning System. Proceedings of the IEEE,
pages 3172, January 1999.
[6] A. M. Ladd, K. Bekris, A. Rudys, G. Marceau, L. E. Kavraki, and D. S. Wallach. Robotics-Based Location
Sensing using Wireless Ethernet. In 8th ACM MOBICOM, Atlanta, GA, September 2002.
[7] T. Roos, P. Myllymaki, H. Tirri, P. Misikangas, and J. Sievanen. A Probabilistic Approach to WLAN User
Location Estimation. International Journal of Wireless Information Networks, 9(3), July 2002.
[8] A. Smailagic, D. P. Siewiorek, J. Anhalt, D. Kogan, and Y. Wang. Location Sensing and Privacy in a
Context Aware Computing Environment. Pervasive Computing, 2001.
[9] W. Stallings. Wireless Communications and Networks. Prentice Hall, first edition, 2002.
[10] The Institute of Electrical and Electronics Engineers, Inc. IEEE standard 802.11 Wireless LAN Medium
Access Control (MAC) and Physical Layer (PHY) specifications, 1999.
[11] M. Youssef and A. Agrawala. On the Optimality of WLAN Location Determination Systems. Technical
Report UMIACS-TR 2003-29 and CS-TR 4459, University of Maryland, March 2003.
[12] M. Youssef and A. Agrawala. Small-Scale Compensation for WLAN Location Determination Systems. In
WCNC 2003, March 2003.
[13] M. Youssef and A. Agrawala. Handling Samples Correlation in the Horus System. In IEEE Infocom 2004,
March 2004.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Multi-Objective QoS Based Routing Algorithm
for Mobile Ad-hoc Networks

Shanti Priyadarshini Jonna Ganesh Soma
JNTU College of Engineering JNTU College of Engineering
Anantapur, India Anantapur, India
shantipriya.jona@gmail.com soma.ganesh@gmail.com

Abstract

Mobile Ad-Hoc NETwork (MANET) is a collection of wireless nodes that can
dynamically be set up anywhere and anytime without using any pre-existing
network infrastructure. The dynamic topology of the nodes poses more
routing challenges in MANETs than in infrastructure-based networks.
Most current routing protocols in MANETs try to achieve a single routing
objective using a single route selection metric. As the various routing
objectives in Mobile Ad-hoc Networks are not completely independent, an
improvement in one objective can only be achieved at the expense of others.
Therefore, efficient routing in MANETs requires selecting routes that meet
multiple objectives. In addition, the routing algorithm must be capable of giving
different priorities to different QoS parameters as needed by the application, since
these priorities vary from one application to another. We develop a Hybrid
Routing Algorithm for MANETs which uses the advantages of both reactive
and proactive routing approaches to find stable routes, reduce initial
route delay and minimize bandwidth usage. We have proposed a generic
Multi-Objective Hybrid Algorithm to find the best available routes
considering multiple QoS parameters, achieving multiple objectives by
evaluating the different alternatives. The algorithm also supports a variable
number of QoS parameters, which is needed to support any kind of application,
since each application has different priorities for the QoS parameters.
Keywords: Mobile Ad-hoc Networks.
1 Introduction
Future Mobile Ad-hoc Networks are expected to support applications with diverse Quality
of Service requirements. QoS routing is an important component of such networks. The
objective of QoS routing is two-fold: to find a feasible path for each transaction; and to
optimize the usage of the network by balancing the load.
Routing in mobile ad-hoc networks depends on many factors, including modeling of the
topology, selection of routers, initiation of requests, and specific underlying characteristics
that could serve as heuristics in finding a path efficiently. The routing problem in mobile
ad-hoc networks concerns how mobile nodes can communicate with one another over the
wireless medium without any support from fixed infrastructure components. Several
routing algorithms have been proposed in the literature for mobile ad-hoc networks with the
goal of achieving efficient routing.
These algorithms can be classified into three main categories based on the way the algorithm
finds a path to the destination.
They are: 1. Proactive Routing Algorithms
2. Reactive Routing Algorithms
3. Hybrid Routing Algorithms
Proactive protocols perform routing operations between all source-destination pairs
periodically, irrespective of the need for such routes, whereas reactive protocols are designed
to minimize routing overhead. Instead of tracking the changes in the network topology to
continuously maintain shortest path routes to all destinations, Reactive protocols determine
routes only when necessary. The use of Hybrid Routing is an approach that is often used to
obtain a better balance between the adaptability to varying network conditions and the
routing overhead. These protocols use a combination of reactive and proactive principles,
each applied under different conditions, places, or regions.

Fig. 1: Classification and examples of ad hoc routing protocols.
In this paper we propose a generic Multi-Objective Hybrid Routing Algorithm which uses the
advantages of both reactive and proactive routing approaches to find the best available routes
by considering multiple QoS parameters and achieving multiple objectives.
2 Proposed Algorithm
The proposed algorithm is a Multi-Objective Hybrid Routing Algorithm for Mobile Ad Hoc
Networks. It tries to achieve multiple objectives, where each objective depends upon one or
more QOS parameters. We consider n QOS parameters, which together account for achieving
the multiple objectives. Depending upon the parameters considered and how they are used,
different objectives can be achieved, and the parameters can be varied depending upon the
application. Since every application has different QOS requirements and thus assigns
different priorities to each parameter, we have proposed a flexible generic scheme in which
the user can select a different set of n QOS parameters to achieve multiple objectives.
The algorithm uses three Cartesian co-ordinates in determining the expected and request
zones, introducing the third co-ordinate z of the geographic (earth-centered) Cartesian
co-ordinate system. Route recovery with local route repair, based on a distance metric over
the path length, is also added to the algorithm to support real-time applications. The algorithm
has five phases: Neighbor Discovery, Route Discovery, Route Selection, Route Establishment
and Route Recovery. The Route Discovery Phase consists of two sub-modules: Intra Zone
Routing and Inter Zone Routing.
A Neighbor Discovery Phase
The Neighbor Discovery Algorithm looks after the maintenance of Neighbor Tables and Zone
Routing Tables; every node maintains both. Along with the neighbor node addresses, the
Neighbor Table also stores the available QOS parameter values of the link between the node
and each of its neighbors. These parameters are used for selecting the best available routes by
the Intra Zone Routing Protocol (used to select routes within the zone). In this phase, every
node periodically transmits beacons to its neighbors. On receiving these packets, every node
updates its Neighbor Table with the appropriate values. Each node then exchanges its
Neighbor Table with its neighbors and constructs a Zone Routing Table from the collected
Neighbor Tables using a link state algorithm. A zone is the set of nodes which lie within a
limited region at most two hops from a given node.
B Route Discovery Phase
This phase is used to find all the available alternate routes. It has two sub-modules: the Intra
Zone Routing Protocol and the Inter Zone Routing Protocol. Any node which requires a route
to a destination constructs a Route Request Packet (RREQ) that carries the desired QOS
metrics [Q1, Q2, ..., Qn] and the set of parameters [P1, P2, ..., Pn] to be calculated during the
Route Discovery Phase. The source node S first checks whether the destination node D
belongs to its zone; if so, it finds the desired route using the Intra Zone Routing module.
The Intra Zone Routing Protocol selects the path to any destination which is present within
the zone: the source node selects a path from the Zone Routing Table only when the desired
QOS metrics are satisfied. The Inter Zone Routing Protocol selects all the available routes to
any destination node which lies outside the zone. S broadcasts the RREQ packets. On
reception of an RREQ, every node checks whether it is a member of the request zone; if it is,
it then checks whether the link between itself and its predecessor node satisfies the QOS
constraints. If the QOS requirements can be satisfied, the node broadcasts the request further,
processing the carried parameters according to each metric and including its own details;
otherwise it discards the request, thereby reducing unnecessary routing traffic (a sketch of
this per-node check is given at the end of this section).
The destination D may receive RREQ packets along alternate paths; these are the different
alternatives available at D. The Route Selection Algorithm is used to select the best available
route, and a Route Reply Packet (RREP) is constructed at D and sent back along the chosen
path. Each intermediate node processes the RREQ packets and stores the route request details
in its route table, along with a pointer to a Local Route Repair Table (LRRT) in which the
QOS parameter values attained up to that node are stored for later use in local route recovery
if needed. Every node retains this LRRT only if it has to repair the route locally, which is
decided by the distance metric in the Route Establishment Phase.
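The per-node RREQ check described above can be sketched as follows; the names, packet fields, and the simplification that every QOS metric is treated as "larger is better" and bottlenecked along the path are our assumptions, not the paper's implementation.

```python
# Hedged sketch of the intermediate-node RREQ handling in the Inter Zone Routing Protocol.
def handle_rreq(node, rreq, request_zone, link_qos):
    """link_qos: measured QOS values on the link from the RREQ's predecessor to this node.
    Simplified: every metric is 'larger is better' and is bottlenecked (min) along the path;
    additive metrics such as delay would instead be summed and compared with an upper bound."""
    if node not in request_zone:
        return None                                    # not in the request zone: discard
    desired = rreq["desired_qos"]                      # {metric: Q_i}
    if any(link_qos[m] < desired[m] for m in desired):
        return None                                    # link violates the QOS constraints: discard
    forwarded = dict(rreq)                             # update the carried parameters [P1 ... Pn]
    forwarded["params"] = {m: min(rreq["params"][m], link_qos[m]) for m in desired}
    forwarded["path"] = rreq["path"] + [node]
    return forwarded                                   # caller re-broadcasts this packet
```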
C Route Selection Phase
In this phase, the best available route among k alternatives has to be selected, taking the
decision based on m attributes (the multiple QOS parameters). Not all of the k alternatives are
optimal solutions; the Pareto-optimal solutions have to be found. Finding the Pareto-minimum
vectors among r given vectors, each of dimension m, is a fundamental problem in
multi-objective optimization. Multi-objective optimization is used in the Route Selection
Phase, where the multi-objective problem is transformed into a single-objective problem by
the weighting method. The goal of such single-objective optimization problems is to find the
best solution, which corresponds to the minimum or maximum value of an objective function.
In this algorithm, the multiple objectives are reformulated as a single-objective problem by
combining the different objectives into one (that is, by forming a weighted combination of the
different objectives). First, all the objectives need to be either minimized or maximized. This
is done by multiplying some of them by -1 (i.e., max f2 is equivalent to min(-f2) = min f2').
Next, these objectives are lumped together to create a single objective function. Weighting
(conversion) factors w1, w2, ..., wn are used to obtain the single, combined objective function:

Maximize F = (+/-)w1·f1(x) + (+/-)w2·f2(x) + ... + (+/-)wn·fn(x)

To find the relative performance of each objective, each objective function value obtained is
divided by the corresponding desired QOS value. The relative efficiency of each route is then
obtained by calculating the F value of all valid paths (those which satisfy the QOS
requirements) from source to destination. Finally, given this single objective function, one
can find a single optimal solution (the optimal route).
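A minimal sketch of this weighted-sum route selection is given below; the data layout, weights and example values are illustrative assumptions only.

```python
# Weighted-sum selection: normalize each achieved value by the desired QOS value,
# combine with weights (sign chosen so every objective is maximized), pick the best route.
def select_route(routes, desired_qos, weights, maximize):
    """routes: {route_id: {metric: achieved_value}}; maximize[m] is True if larger values
    of metric m are better (e.g. bandwidth), False otherwise (e.g. delay)."""
    def score(metrics):
        f = 0.0
        for m, w in weights.items():
            rel = metrics[m] / desired_qos[m]           # relative performance
            f += w * rel if maximize[m] else -w * rel   # max f is equivalent to min(-f)
        return f
    return max(routes, key=lambda r: score(routes[r]))

# Illustrative usage with two candidate routes and two QOS parameters.
best = select_route(
    routes={"R1": {"delay": 40.0, "bandwidth": 2.0}, "R2": {"delay": 25.0, "bandwidth": 1.2}},
    desired_qos={"delay": 50.0, "bandwidth": 1.0},
    weights={"delay": 0.5, "bandwidth": 0.5},
    maximize={"delay": False, "bandwidth": True},
)
```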
D Route Establishment Phase
This phase sets up the reverse path, i.e. the route is established from the destination back to
the source. After the optimal route is selected by the Route Selection Phase, Route Reply
(RREP) packets are sent along the selected path, backtracking from destination to source.
Each node on the path sets the status field of the corresponding route table entry from Route
Not Established (NE) to Established and updates NextNode_ID to P.Current_ID (the node
from which the RREP packet was received), then forwards the RREP towards the source,
selecting the next hop from the route table stored during the forward path setup. This phase
also decides whether an intermediate node is capable of handling local route recovery for this
path, depending upon the distance metric explained in the next section. If the node is capable
of local recovery, it retains its LRRT entries; otherwise it clears them, thus saving space.
E Route Recovery Phase
Every routing algorithm that supports real-time applications must have an efficient route
recovery mechanism. This algorithm has a local route repair mechanism; supporting this
feature requires extra overhead (since each node has to store QOS requirements per route),
but at the same time the feature is necessary. To trade off efficient route recovery against
space overhead, the path length is used as a distance metric and the nodes are divided into
two categories: those which can handle local route recovery, and those which notify the
source or destination to handle route recovery. The path length is divided into the (0 to 25)%,
(25 to 75)% and (75 to 100)% portions. The middle 50% of the nodes, which lie in the (25 to
75)% portion of the path length, handle local route recovery, and the nodes in the remaining
portions notify the source/destination. In the Route Establishment Phase, every node
calculates in which portion of the path length it lies, so it knows how to handle route
recovery. Every node which receives an RERR message checks whether it is capable of
handling route recovery locally by checking whether its LRRT entries are available. If they
are available, it constructs RREQ packets locally with the entries from the LRRT and
broadcasts them; otherwise it sends RERR packets towards the source or destination.
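The distance-metric rule for deciding which nodes keep their LRRT entries amounts to a one-line check; the sketch below is an illustrative reading of the 25%-75% rule, not the paper's code.

```python
# A node repairs routes locally only if it lies in the middle 50% of the path length.
def handles_local_recovery(position, path_length):
    """position: this node's hop index along the path (0 = source, path_length = destination)."""
    fraction = position / path_length
    return 0.25 <= fraction <= 0.75
```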
3 Complexity Analysis of the Algorithm
Let N be the total number of nodes in the network, n be the number of neighbor nodes of a
particular node, z be the number of nodes in its zone and N1 be the number of nodes in the
request zone.
A Space Complexity
For each node, a Neighbor Table and a Zone Routing Table are required. The size of each
entry in the Neighbor Table is (9 bytes + k bytes), where k is the number of QOS parameters
considered. The size of each entry in the Zone Routing Table is also (9 bytes + k bytes). The
total size required by each node is ((9+k)*n + (9+k)*z), so the total amount of space required
by the overall network is
N*(9*n + 9*z + k*(n+z)).
Case 1: Average Case
The space complexity
= O(N*(9*n + 9*z + k*(n+z)))
= O(N*(c1*n + c2*z + c3*(n+z))), where c1, c2, c3 are constants
= O(N*(c2+c3)*z), since z ≥ n
= O(N*z), which is at most O(N²).
This is for average space complexity
Case 2: Best Case
In the best case, the number of nodes in the zone equals the number of neighbor nodes. So the
best case space complexity of the network becomes O (N*n).
Case 3: Worst Case
In the worst case, the number of neighbor nodes or the number of nodes in the zone equals
the total number of nodes in the network. In that case the overall complexity becomes O(N²)
(since z = N).
B Time Complexity
For neighbor table maintenance, the proposed algorithm uses the link state algorithm. Each
node receives the Neighbor Tables from all of its neighboring nodes and computes the Zone
Routing Table. As the number of neighboring nodes is n, the complexity of computing the
Zone Routing Table is of order O(n²).
Case 1: Average Case
In the average case the route is found from the routing table using the binary search
algorithm in O(log z) time. The average-case time complexity of the algorithm for the entire
network is
= N*(O(n²) + O(log z)),
which is O(N*log z) when the log z term dominates, and O(N*n²) otherwise.
Case 2: Best Case
In the best case the required route is directly found from the routing table in one step, i.e. in
O(1) time. So, the best case time complexity of the algorithm for the entire network is
N*(O(1) + O(n²)) = O(N*n²).
Case 3: Worst Case
In the worst case, the route request has to go through the entire request zone, so the
per-request complexity becomes O(N1*log z). Let m be the number of possible routes
satisfying the QOS constraints at the destination node. Selecting the k Pareto-optimal
solutions from the m alternatives takes O(m²) time, and selecting the best route from the k
alternatives takes O(k²) time. In the worst case k = m, so this is O(m²), and the total time
complexity for selecting the route is O(m²) + O(m²) = O(m²).
The worst-case time complexity therefore becomes
N*(O(N1*log z) + O(m²))
= N*O(N1*log z), since m << N1
= O(N*N1*log z). If the request zone becomes the entire network, the complexity becomes
O(N²*log z).
4 Communication Complexity
Case 1: Average Case
This complexity is considered at steady state conditions of the network. The amount of data
transferred between the nodes is O(n²), as the nodes have to exchange their Neighbor Tables
with their neighbors.
Case 2: Best Case
In the best case, the route is found from the routing table. So the communication complexity
becomes
O(n²) + O(1) = O(n²).
Case 3: Worst Case
In the worst case, the route request and reply have to go through the entire request zone. So
the complexity becomes
O(n²) + O(N1²) = O(N1²).
If the request zone is the entire network, then it is O(N²). For on-demand types of algorithms,
this is O(N²) always.
Complexity Type             Best case    Average case    Worst case
Space complexity            O(N*n)       O(N*z)          O(N²)
Time complexity             O(N)         O(N*log z)      O(N²*log z)
Communication complexity    O(n²)        O(N1²)          O(N²)
5 Conclusion
MOHRA is an algorithm built on top of the New Hybrid Routing Algorithm (NHRA). NHRA
is a single-objective routing protocol, whereas this algorithm addresses multiple objectives
and has the advantages of both reactive and proactive routing approaches. The algorithm
selects the optimal available route while achieving multiple objectives, such as minimum
delay and highly stable routes with the desired bandwidth, each of which depends upon one
or more QOS parameters such as delay, associativity ticks (a metric used to measure link
stability), and bandwidth. Depending upon the parameters considered and how they are used,
different objectives can be achieved.
The algorithm supports a variable number of QOS parameters for achieving multiple
objectives, which makes it a flexible scheme that can support any kind of real-time
application, where each application has different priorities, and all of this is possible with
little computational effort.
Because associativity counts are used, long-lived routes are selected, ensuring a lower packet
loss rate arising from the movement of intermediate nodes and fewer route failures. This
accounts for the increase in packet delivery fraction and the reduction in end-to-end delay; in
addition, by using location co-ordinates, the search space for route discovery can be reduced.
One more major contribution of this work is efficient Route Recovery mechanism with local
route repair based on distance metric of the path length to support real time applications.
Based on the simulation study and comparative analysis of the routing algorithms, it is
observed that NHP works well with respect to end-to-end delay when compared with the
other algorithms, while DSDV works well with respect to packet delivery fraction when
compared with the other algorithms.
6 Future Scope
We have analyzed GPS and its use in finding location co-ordinates; various alternative
positioning systems can be studied in order to suggest one that is cost-efficient.
The proposed Multi-Objective Hybrid Routing Algorithm can be implemented using Network
Simulator 2.
Our QoS-aware hybrid routing algorithm only searches for a path with enough resources; it
does not reserve them. Resource reservation through a QoS signaling mechanism can be
incorporated into our algorithm.
The implemented New Hybrid Routing Algorithm uses only two-dimensional location
co-ordinates. It can be extended to three-dimensional co-ordinates to give completeness to the
algorithm.
References
[1] Highly Dynamic Destination-Sequenced Distance-Vector Routing (DSDV) for Mobile Computers,
Perkins C.E. and Bhagwat P., Computer Communications Review, Oct 1994, pp.234-244.
[2] Routing in Clustered Multihop, Mobile Wireless Networks with Fading Channel, C.C. Chiang,
Proceedings of IEEE SICON, pp. 197-211, April 1997.
[3] The Landmark Hierarchy: a new hierarchy for routing in very large networks, P.F.Tsuchiya, In Computer
Communication Review, vol.18, no.4, Aug. 1988, pp. 35-42.
[4] Perkins C.E. and Royer, E.M., Ad-hoc on-demand distance vector routing, WMCSA 99. Second IEEE
Workshop on Mobile Computing Systems and Applications, pp: 90-100, 1999.
[5] Johnson, D.B. and Maltz, D.A., Dynamic Source Routing Algorithm in Ad-Hoc Wireless Networks,
Mobile Computing, Chapter 5, Kluwer Academic, Boston, MA, 1996, pp.153-181.
[6] A Highly Adaptive Distributed Routing Algorithm for Mobile and wireless networks, V.D. Park and M.S.
Corson, Proceedings of IEEE INFOCOM '97, Kobe, Japan, pp. 103-112, April 1997.
[7] Nicklas Beijar, Zone Routing Protocol (ZRP), www.netlab.tkk.fi/opetus/s38030/k02/Papers/08
Nicklas.pdf
[8] Analysis of the Zone Routing protocols, John Schaumann, Dec 8, 2002
http://www.netmeister.org/misc/zrp/zrp.pdf
[9] DDR-Distributed Dynamic Routing Algorithm for Mobile Ad hoc Networks, Navid Nikaein, Houda
Labiod and Christian Bonnet, International Symposium on Mobile Ad Hoc Networking & Computing, pp:
19-27, 2000.
[10] M. Joa-Ng and I-Tai Lu, A peer-to-peer zone-based two-level link state routing for mobile ad hoc net-
works, IEEE Journal on Selected Areas in Communications, vol. 17, no. 8, pp. 1415-1425, 1999.
[11] Ko Young-Bae, Vaidya Nitin H., Location-Aided Routing in mobile ad hoc networks, Wireless Networks
6, 2000, pp.307-321.
[12] S. Basagni, I. Chlamtac, V. Syrotiuk and B. Woodward, A Distance Routing Effect Algorithm for
Mobility.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
A Neural Network Based Router

D.N. Mallikarjuna Rao V. Kamakshi Prasad
Jyothishmathi Institute of Technology Jawaharlal Nehru Technological
and Science, Karimnagar University, Hyderabad
dnmrao@yahoo.com kamakshiprasad@yahoo.com

Abstract

In this paper we describe a router (in a communication network) which takes routing
decisions by using a neural network that has to be trained regularly. We construct a
multi-layer feed-forward neural network and train it using the data collected by the ACO
(Ant Colony Optimization) algorithm [2]. Given the destination node as input, the Neural
Network gives as output the next node to which the packet has to be forwarded. This
experiment shows that we can replace routing tables with a Neural Network, and that no
search algorithm is required to find the next node, given the destination.
1 Introduction
Routing has a profound effect on the performance of communication networks, as it involves
a decision-making process that consults a routing table. The size of the routing table is
proportional to the number of routers in the network. A routing algorithm should take
minimum average response time to find the optimum path(s) for transporting data or
messages; in doing so it must satisfy the users' demands for fast service. In today's world,
networks are growing by leaps and bounds, so storing and updating the information about the
routers is a tedious task. The routers must also adapt to changes in the network
environment.
Research on neural network based routers has used global information about the
communication network (Hopfield and Tank, 1989). Lee and Chang (1993) used a complex
neural network to make the routing decisions. Chiu-Che Tseng and Max Garzon used local
information for updating the Neural Network [1].
The rest of the paper is organized as follows. Section 2 describes the model and the Neural
Network component. Section 3 describes the JavaNNS simulator. Section 4 describes the
experimental setup and the results.
2 The Model
The model consists of a Neural Network router which has to be trained using routing table
information. We have used the routing table information obtained from the Ant algorithm
simulation [2], and the Neural Network has been trained on this information.
The Neural Network
In our communication network, a Neural Network is part of every router and replaces the
routing table. The destination node address is given as the input to this Neural Network and it
provides the next node as output. The Ant algorithm provides routing table information to
every node. This information, taken offline, is used to train the multi-layer feed-forward
neural network. First we randomly initialize the weights of the Neural Network; then, using
the information provided by the Ant algorithm, the weights are updated to reflect the patterns
(the routing table information).
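To make the idea concrete, the following self-contained NumPy sketch trains a small feed-forward network by backpropagation to map a one-hot destination address to a one-hot next hop, using the Node 3 routing table reported in Section 4 (Fig. 2). The hidden-layer size (8) and the number of training cycles are our assumptions; the authors used JavaNNS, not this code.

```python
import numpy as np

# Node 3's routing table (destination -> next hop), as in Fig. 2 of this paper.
routing_table = {0: 6, 1: 2, 2: 2, 4: 6, 5: 6, 6: 6, 7: 7, 8: 6, 9: 6, 10: 6, 11: 7}
NODES = 12

X = np.eye(NODES)[list(routing_table.keys())]      # one-hot destination addresses
Y = np.eye(NODES)[list(routing_table.values())]    # one-hot next hops

rng = np.random.default_rng(0)
W1 = rng.uniform(-1, 1, (NODES, 8))                # input -> hidden (8 hidden units assumed)
W2 = rng.uniform(-1, 1, (8, NODES))                # hidden -> output
lr = 0.3                                           # learning rate, as reported in the paper

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):                              # training cycles (our choice for this sketch)
    H = sigmoid(X @ W1)                            # hidden activations
    O = sigmoid(H @ W2)                            # output activations
    err = Y - O
    dO = err * O * (1 - O)                         # backpropagate the squared error
    dH = (dO @ W2.T) * H * (1 - H)
    W2 += lr * H.T @ dO
    W1 += lr * X.T @ dH

def next_hop(dest):
    out = sigmoid(sigmoid(np.eye(NODES)[dest] @ W1) @ W2)
    return int(np.argmax(out))                     # node to which the packet is forwarded

learned = {d: next_hop(d) for d in routing_table}
print(learned)   # after enough training this should reproduce the table above
```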
The Ant Algorithm
ACO algorithms take inspiration from the behaviour of real ants in finding paths between
their nest and food sources. The ants leave pheromone (a chemical substance) on the path to
the destination, and other ants follow the path with the higher concentration of pheromone.
This behaviour of ants has been applied to solving heuristic problems, in which a colony of
artificial ants collectively communicate indirectly and arrive at a solution. Although we have
not implemented the Ant algorithm ourselves, we have used its routing table information [2]
to train the neural network.
In [2] the authors simulated the ACO algorithm, wherein a Forward Ant is launched from
every node periodically. This Forward Ant pushes the address of each node it visits onto the
memory stack it carries. When it reaches the destination, a Backward Ant is generated which
follows the same path as the Forward Ant, updating the routing table information while
moving from the destination back to the source. This routing table information has been used
for our simulation purposes.
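The backward-ant update can be pictured with the small sketch below; the reinforcement constant and table layout are illustrative assumptions, and the authors used the ns-2 implementation of [2] rather than this code.

```python
# Hedged sketch: for every node on the path (except the destination), the backward ant
# reinforces the pheromone value of the next hop toward the destination and renormalizes.
def backward_ant_update(routing_tables, path, dest, reinforcement=0.1):
    """routing_tables[node][dest][next_hop] holds a pheromone value (probability)."""
    for hop, node in enumerate(path[:-1]):
        next_hop = path[hop + 1]
        entry = routing_tables[node][dest]
        entry[next_hop] = entry.get(next_hop, 0.0) + reinforcement
        total = sum(entry.values())
        for k in entry:                                # renormalize to probabilities
            entry[k] /= total
```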
3 JavaNNS Simulator
The Stuttgart Neural Network Simulator (SNNS) was developed at the University of
Stuttgart. SNNS, developed for Unix workstations and Unix PCs, is an efficient universal
simulator of neural networks; its simulator kernel is written in ANSI C and its graphical user
interface for X11R6.
JavaNNS is the successor of SNNS. Its graphical user interface, written in Java, is much more
comfortable and user friendly, and platform independence is also improved.
JavaNNS is available for the following operating systems: Windows NT/Windows 2000,
Solaris and RedHat Linux. JavaNNS is freely available and can be downloaded from the link
provided in the references [3].
4 Experimental Setup and the Results
A 12-node communication network has been used for the Ant Algorithm simulation [2]. The
authors of the simulation used ns-2 for simulating the ACO algorithm. The algorithm updates
the routing table every time a Backward Ant traces back to the source. The routing table
contains multiple paths with different probabilities (pheromone values). We have taken one
such routing table, for node no. 3, and normalized the paths, i.e. we have kept the best path
only. We constructed the Neural Network in the JavaNNS simulator, which provides a
graphical user interface for this purpose. In the interface we can specify the number of layers,
the type of each layer, the number of nodes in each layer, the activation function and the type
of connections (feed-forward, auto-associative, etc.). After constructing the Neural Network
we initialized the weights randomly, again using a control function,
between -1 and +1. We converted the normalized routing table information file into the
input/output pattern file; the same file has also been used for validation. The Neural Network
was then trained for 100 cycles, after which the error value fell below our desired value.
Various figures produced while simulating the Neural Network are included here. The
12-node communication network used for the Ant Algorithm simulation [2] is shown below.

Fig. 1: Network Topology used for Simulation on ns-2[2]
Once the Ant Algorithm has been simulated, it generates a routing table at every node. We
have taken one such routing table (for Node 3) for implementing the Neural Router. The
routing table for Node 3 is given below; it actually specifies multiple paths for every
destination, of which we have taken the best path for the purpose of simulating the Neural
Network.
Dest Node Next Node
0 6
1 2
2 2
4 6
5 6
6 6
7 7
8 6
9 6
10 6
11 7
Fig. 2: Routing table at Node 3
The following figure shows the three layer feed forward Neural Network after initializing
with random weights but before training.

Fig. 3: Neural Network after initializing the weights.
The following figure shows the three-layer feed-forward Neural Network after it has been
trained. We used the Backpropagation algorithm for training the network, with a learning rate
parameter of 0.3.

Fig. 4: The three-layer feed-forward Neural Network after training
In the figure, the upper layer is the input layer, the middle one is the hidden layer and the
bottom one is the output layer. After initializing the weights randomly, we converted the
routing table information into an input-output pattern file compatible with the JavaNNS
simulator, and used the same file for training as well as validation.
While training, the simulator has a facility to plot the error graph. This graph indicates
whether the error is decreasing or increasing, in other words whether the Neural Network is
converging or not. The figure below indicates that the Neural Network has indeed converged
and the error has fallen well below the desired value.

Fig. 5: The error graph
We trained the network for 100 cycles, and the error had fallen below 0.02 by the time the
100th cycle was applied.
5 Conclusions and Future Work
We conclude that a feed-forward Neural Network can replace a routing table. In this paper we
have used the routing table produced by the simulated Ant Algorithm for training the Neural
Network. The two can be combined so that the information is fed dynamically to the Neural
Network, which can then adapt dynamically to changes in the communication network.
References
[1] Chiu-Che Tseng, Max Garzon, Hybrid Distributed Adaptive Neural Router, Proceedings
of ANNIE, 98.
[2] V. Laxmi, Lavina Jain and M.S. Gaur, Ant Colony Optimization based Routing on ns-2, International
Conference on Wireless communication and Sensor Networks(WSCN), India, December 2006.
[3] University of Tubingen, JavaNNS, Java Neural Network Simulator. The url is http://www.ra.cs.uni-
tuebingen.de/software/JavaNNS/welcome_e.html.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Spam Filter Design Using HC, SA, TA
Feature Selection Methods

M. Srinivas Supreethi K.P. E.V. Prasad
Dept. of CSE, JNTUCEA JNTUCE, Anantapur JNTUCE, Kakinada
sreenu2521@gmail.com supreethi.pujari@gmail.com

Abstract

Feature selection is an important research problem in different statistical
learning problems including text categorization applications such as spam
email classification. In designing spam filters, we often represent the email by
vector space model (VSM), i.e., every email is considered as a vector of word
terms. Since there are many different terms in the email, and not all classifiers
can handle such a high dimension, only the most powerful discriminatory
terms should be used. Another reason is that some of these features may not be
influential and might carry redundant information which may confuse the
classifier. Thus, feature selection, and hence dimensionality reduction, is a
crucial step to get the best out of the constructed features. There are many
feature selection strategies that can be applied to produce the resulting feature
set. In this paper, we investigate the use of Hill Climbing, Simulated
Annealing, and Threshold Accepting optimization techniques as feature
selection algorithms. We also compare the performance of the above three
techniques with Linear Discriminant Analysis. Our experimental results
show that all these techniques can be used not only to reduce the dimensionality
of the e-mail representation, but also to improve the performance of the classification filter.
1 Introduction
The junk email problem is rapidly becoming unmanageable and threatens to destroy email as
a useful means of communication. A tide of unsolicited emails floods into corporate and
consumer inboxes every day. Most spam is commercial advertising, often for dubious
products, get-rich-quick schemes, or quasi-legal services. People waste increasing amounts of
their time reading and deleting junk emails. According to a recent European Union study, junk
email costs all of us billions of US dollars per year, and many major ISPs say that spam adds
to the cost of their service. There is also the fear that such emails could hide viruses
which can then infect the whole network. Future mailing systems will require more capable
filters to help us select what to read and to avoid spending more time processing incoming
messages.
Many commercial and open-source products exist to accommodate the growing need for
spam classifiers, and a variety of techniques have been developed and applied toward the
problem, both at the network and user levels. The simplest and most common approaches are
to use filters that screen messages based upon the presence of words or phrases common to
junk e-mail. Other simplistic approaches include black-listing (i.e., automatic rejection of
messages received from the addresses of known spammers) and white-listing (i.e., automatic
acceptance of messages received from known and trusted correspondents). In practice,
effective spam filtering uses a combination of these three techniques. In this paper, we only
discuss how to classify the junk emails and legitimate emails based on the words or features.
From the machine learning view point, spam filtering based on the textual content of email
can be viewed as a special case of text categorization, with the categories being spam or
nonspam. In text categorization [5], the text can be represented by vector space model
(VSM). Each email can be transformed into the vector space model. This means every email is
considered as a vector of word terms. Since there are many different words in the email and
not all classifiers can handle such a high dimension, we should choose only the most
powerful discriminatory terms from the email terms. Another reason of applying feature
selection is that the reduction of feature space dimension may improve the classifiers'
prediction accuracy by alleviating the data sparseness problem.
In this paper, we investigate the use of Hill Climbing (HC), Simulated Annealing (SA), and
Threshold Accepting (TA) local search optimization techniques [8] as feature selection
algorithms. We also compare the performance of the above three techniques with Linear
Discriminant Analysis (LDA) [3]. Our results indicate that, using a K-Nearest Neighbor
(KNN) classifier [1], spam filters using any of the above strategies achieve higher accuracy
than those obtained without feature selection. Among the four approaches, SA reaches the
best performance. The rest of the paper is organized as follows. Section 2 introduces the
experimental settings and the related feature selection strategies. In Section 3 we report the
experimental results obtained, and finally Section 4 concludes the paper.
2 Experimental Settings
In our experiments, we first transform the emails into vectors using the TF-IDF formulas [5].
We then apply the proposed feature selection strategies. Finally we compare the accuracy obtained with the four
strategies.
2.1 Data Sets
Unlike general text categorization tasks, where many standard benchmark collections exist, it
is very hard to collect legitimate e-mails, for the obvious reason of protecting personal
privacy. In our experiment we use the PU1 corpus [10]. This corpus consists of 1099
messages, 481 of which are marked as spam and 618 of which are labeled as legitimate,
giving a spam rate of 43.77%. The messages in the PU1 corpus have header fields and HTML
tags removed, leaving only the subject line and mail body text. To address privacy, each token
was mapped to a unique integer. The corpus comes in four versions: with or without stemming
and with or without stop-word removal. In our experiment we use the lemmatizer-enabled,
stop-list-enabled version of the PU1 corpus. This corpus has already been parsed and
tokenized into individual words, with binary attachments and HTML tags removed. We
randomly chose 62 legitimate e-mails and 48 spam e-mails for testing and the remaining
e-mails for training.
2.2 Classifiers
K-nearest neighbor classification is an instance-based learning algorithm that has been shown
to be very effective in text classification. The success of this algorithm is due to the
availability of an effective similarity measure among the K nearest neighbors. The algorithm starts by
calculating the similarity between the test e-mail and all e-mails in the training set. It then
picks the K closest instances and assigns the test e-mail to the most common class among
these nearest neighbors. Thus, after transforming the training e-mails and test e-mails into
vectors, the second step is to find the K vectors in the training set which are most similar to
the test vector. In this work, we used the Euclidean distance as the measure of similarity
between vectors.
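A minimal sketch of this KNN step, assuming plain Python lists of TF-IDF weights as vectors, is:

```python
# KNN classification with Euclidean distance and majority vote among the K closest e-mails.
import math
from collections import Counter

def knn_classify(test_vec, train_vecs, train_labels, k=5):
    order = sorted(range(len(train_vecs)),
                   key=lambda i: math.dist(test_vec, train_vecs[i]))  # Euclidean distance
    top = [train_labels[i] for i in order[:k]]                        # K closest instances
    return Counter(top).most_common(1)[0][0]                          # most common class
```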
2.3 Transforming E-mails into Vectors
In text categorization, text can be represented by the vector space model. Since terms
appearing frequently in many e-mails have limited discrimination power, we use the Term
Frequency-Inverse Document Frequency (TF-IDF) representation for the e-mails [5].
Accordingly, the more often a term appears in an e-mail, the more important this word is for
that e-mail. In our experiment, we sorted the features by their document frequency (DF), i.e.,
the number of e-mails that contain the i-th feature, and chose 100 features whose DF lies in
the range 0.02 to 0.5 [2]. Thus the input to the feature selection algorithms is a feature vector
of length 100. We then applied the feature selection algorithms to find the most powerful
discriminatory terms among the 100 features and tested the performance of the e-mail filter.
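The representation step can be sketched as follows; tokenization is assumed to have been done already, and the DF range and feature count follow the values quoted above.

```python
# TF-IDF vectorization restricted to terms whose document frequency lies in [0.02, 0.5],
# truncated to the 100 most frequent of those terms.
import math
from collections import Counter

def build_features(tokenized_emails, low=0.02, high=0.5, max_features=100):
    n = len(tokenized_emails)
    df = Counter(t for email in tokenized_emails for t in set(email))  # document frequency
    candidates = [t for t, c in df.items() if low <= c / n <= high]
    vocab = sorted(candidates, key=lambda t: -df[t])[:max_features]
    vectors = []
    for email in tokenized_emails:
        tf = Counter(email)
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])   # TF * IDF weight
    return vocab, vectors
```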
2.4 Performance Measures
We now introduce the performance measures used in this paper. Let N=A+B+C+D be the
total number of test e-mails in our corpus.
Table 1: Confusion Matrix
                              Actual: Spam    Actual: Non-Spam
Filter Decision: Spam              A                  B
Filter Decision: Non-Spam          C                  D
If Table 1 denotes the confusion matrix of the e-mail classifier, then we define the accuracy,
precision, recall, and F1 for spam e-mails as follows:

ACCURACY = (A + D) / N,   PRECISION (P) = A / (A + B),
RECALL (R) = A / (A + C),   F1 = 2PR / (P + R)
Similar measures can be defined for legitimate e-mails.
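For concreteness, these measures can be computed directly from the confusion-matrix counts:

```python
# Accuracy, precision, recall and F1 for the spam class, from the counts A, B, C, D of Table 1.
def spam_measures(A, B, C, D):
    N = A + B + C + D
    accuracy = (A + D) / N
    precision = A / (A + B)
    recall = A / (A + C)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```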
2.5 Feature Selection Strategies
The output of vector space modeling (VSM) is a relatively long feature vector that may have
some redundant and correlated features (the curse of dimensionality). This is the main
motivation for using feature selection techniques. The proposed feature selection algorithms
are classifier dependent: different possible feature subsets are examined by the algorithm, the
performance of a prespecified classifier is tested for each subset, and finally the best
discriminatory feature subset is chosen. There are many feature selection strategies (FSS) that
can be applied to produce the resulting feature set. In what follows, we describe our proposed
FSS and report the results obtained with them.
2.5.1 Hill Climbing (HC)
The basic idea of HC is to choose, from the neighborhood of a given solution, the solution
which improves it most, and to stop if the neighborhood does not contain an improving
solution [8]. The hill climbing used in this paper can be summarized as follows:
1. Randomly create an initial solution S1. This solution corresponds to a binary vector of
length equal to the total number of features in the feature set under consideration. The
1's positions denote the set of features selected by this particular individual. Set I* = S1
and calculate its corresponding accuracy Y(S1).
2. Generate a random neighboring solution S2 based on I* and calculate its
corresponding accuracy Y(S2).
3. Compare the two accuracies. If the accuracy Y(S2) of the neighboring solution is
higher than that of I*, set I* = S2.
4. Repeat steps 2 to 3 for a pre-specified number of iterations (or until a certain criterion
is reached).
Although hill climbing has been applied successfully to many optimization problems, it has
one main drawback. Since only improving solutions are chosen from the neighborhood, the
method stops if the first local optimum with respect to the given neighborhood has been
reached. Generally, this solution is not globally optimal and no information is available on
how much the quality of this solution differs from the global optimum. A first attempt to
overcome the problem of getting stuck in a local optimum was to restart iterative
improvement several times using different initial solutions (multiple restart). All the resulting
solutions are still only locally optimal, but one can hope that the next local optimum found
improves the best found solution so far. In our experiments, we used 10 different initial
solutions.
2.5.2 Simulated Annealing (SA)
Kirkpatrick et al. [6] proposed SA, a local search technique inspired by the cooling process
of molten metals. It merges HC with the probabilistic acceptance of non-improving moves.
Similar to HC, SA iteratively constructs a sequence of solutions in which two consecutive
solutions are neighbors. However, for SA the next solution does not necessarily have a
better objective value than the current solution, which makes it possible to leave local
optima. First, a solution is chosen from the neighborhood of the current solution. Afterwards, depending on the
difference between the objective values of the chosen and the current solution, it is decided
whether we move to the chosen solution or stay with the current solution. If the chosen
solution has a better objective value, we always move to this solution. Otherwise we move to
this solution with a probability which depends on the difference between the two objective
values. More precisely, if S1 denotes the current solution and S2 is the chosen solution, we
move to S2 with probability:
p(S1, S2) = exp( -max{Y(S1) - Y(S2), 0} / T )          (1)
The parameter T is a positive control parameter (temperature) which decreases with
increasing number of iterations and converges to 0. As the temperature is lowered, it becomes
ever more difficult to accept worsening moves. Eventually, only improving moves are
allowed and the process becomes 'frozen'. The algorithm terminates when the stopping
criterion is met. [7]. Furthermore, the probability above has the property that large
deteriorations of the objective function are accepted with lower probability than small
deteriorations. The simulated annealing used in this paper can be summarized as follows:
1. Randomly create an initial solution S1. This solution corresponds to a binary vector of
length equal to the total number of features in the feature set under consideration; the
1-positions denote the set of features selected by this particular individual. Set I* = S1
and calculate its corresponding accuracy Y(S1).
2. Set the initial temperature T and a constant cooling factor a, 0 < a < 1.
3. Generate a random neighboring solution S2 based on I* and calculate its
corresponding accuracy Y(S2).
4. Compare the two accuracies. If the accuracy Y(S2) of the neighboring solution is
higher than Y(S1), set I* = S2. Otherwise, generate U = rand(0, 1) and compare
U with p(S1, S2); if U < p(S1, S2), set I* = S2.
5. Decrease the temperature by T = T * a.
6. Repeat steps 3 to 5 for a pre-specified number of iterations (or until a certain criterion is
reached).
For comparison with HC, we used the same 10 initial solutions and recorded the best solutions.
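The following sketch mirrors steps 1-6, using the standard acceptance test in which a worsening neighbour is kept when a uniform random number falls below the probability of equation (1); the cooling factor, iteration count and toy objective are illustrative placeholders, not the settings used in the experiments.

import math
import random

def simulated_annealing(num_features, accuracy, t0=1.0, alpha=0.95, iterations=200, seed=0):
    """SA over 0/1 feature masks with acceptance probability exp(-max(Y(S1)-Y(S2), 0) / T)."""
    rng = random.Random(seed)
    cur = [rng.randint(0, 1) for _ in range(num_features)]     # S1
    cur_acc = accuracy(cur)
    best, best_acc, t = cur[:], cur_acc, t0
    for _ in range(iterations):
        cand = cur[:]                                          # S2: random neighbour
        cand[rng.randrange(num_features)] ^= 1
        cand_acc = accuracy(cand)
        if cand_acc > cur_acc or rng.random() < math.exp(-max(cur_acc - cand_acc, 0.0) / t):
            cur, cur_acc = cand, cand_acc                      # move (possibly downhill)
        if cur_acc > best_acc:
            best, best_acc = cur[:], cur_acc                   # remember the best solution seen
        t *= alpha                                             # cooling schedule
    return best, best_acc

print(simulated_annealing(10, lambda m: sum(m[:5]) - 0.5 * sum(m[5:])))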
2.5.3 Threshold Accepting (TA)
A variant of simulated annealing is the threshold accepting method. It was designed by
Dueck and Scheuer [8] as a partially deterministic version of simulated annealing. The only
difference between simulated annealing and threshold accepting is the mechanism for
accepting the neighboring solution: where simulated annealing uses a stochastic acceptance
rule, threshold accepting uses a deterministic one. If the difference between the objective
values of the chosen and the current solution is smaller than a threshold T, we move to the
chosen solution; otherwise we stay at the current solution [8]. Again, the threshold is a positive control
parameter which decreases with increasing number of iterations and converges to 0. Thus, in
each iteration, we allow moves which do not deteriorate the current solution more than the
current threshold T and finally we only allow improving moves. The steps of the threshold
accepting algorithm used in this paper are identical to the SA except that step 4 above is
replaced by the following:
4. Compare the two accuracies Y(S2) and Y(S1). If the accuracy Y(S2) of the neighboring
solution is higher than Y(S1), set I* = S2. Otherwise, set Δ = Y(S1) - Y(S2) and compare
Δ with T: if T > Δ, set I* = S2.
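Read for a maximized accuracy, the modified step amounts to the small deterministic rule sketched below: improvements are always kept, and a worse neighbour is kept only while its deterioration stays below the current threshold. Substituting it for the stochastic test in the SA sketch above yields threshold accepting.

def threshold_accept(cur_acc, cand_acc, threshold):
    """TA rule: keep improvements, and keep a worse neighbour only if its
    deterioration Y(S1) - Y(S2) is smaller than the current threshold T."""
    return cand_acc > cur_acc or (cur_acc - cand_acc) < threshold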
2.5.4 Linear Discriminant Analysis (LDA)
LDA is a well-known technique for dealing with the class separation problem. LDA can be
used to determine the set of most discriminant projection axes. After projecting all the
samples onto these axes, the projected samples will form the maximum between-class scatter
and the minimum within-class scatter in the projective feature space [3].
Let X1 = {x1_1, ..., x1_l1} and X2 = {x2_1, ..., x2_l2} be samples from two different classes and,
with some abuse of notation, let X = X1 ∪ X2 = {x_1, ..., x_l}. The linear discriminant is given by
the vector W that maximizes [9]
J(W) = (W^T S_B W) / (W^T S_W W),
where S_B = (m_1 - m_2)(m_1 - m_2)^T and S_W = sum_{i=1,2} sum_{x in X_i} (x - m_i)(x - m_i)^T
are the between-class and within-class scatter matrices, respectively, and m_i is the mean of
class i. The intuition behind maximizing J(W) is to find a direction that maximizes the
separation of the projected class means (the numerator) while minimizing the class variance
along that direction (the denominator). After we find the linear transformation W, the data set
can be transformed by y = W^T x. For the c-class problem, the natural generalization of the linear
discriminant involves c - 1 discriminant functions; thus, the projection is from a d-dimensional
space to a (c - 1)-dimensional space [1].
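A compact two-class illustration with NumPy is given below; the direction w proportional to Sw^-1 (m1 - m2) is the maximizer of J(W), and the two Gaussian point clouds are invented stand-ins for e-mail feature vectors.

import numpy as np

def fisher_direction(x1, x2):
    """Two-class Fisher LDA: w ~ Sw^-1 (m1 - m2) maximizes J(W)."""
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    sw = np.zeros((x1.shape[1], x1.shape[1]))
    for x, m in ((x1, m1), (x2, m2)):
        d = x - m
        sw += d.T @ d                        # within-class scatter S_W
    w = np.linalg.solve(sw, m1 - m2)         # solve S_W w = m1 - m2
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
spam = rng.normal([2.0, 0.0], 0.5, size=(40, 2))   # toy "spam" cloud
ham = rng.normal([0.0, 1.0], 0.5, size=(50, 2))    # toy "legitimate" cloud
w = fisher_direction(spam, ham)
print(w, (spam @ w).mean(), (ham @ w).mean())      # projected means y = w^T x are well separated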
3 Experiment Results
We conducted experiments to compare the performance of the proposed feature selection
algorithms. Throughout our experiments, we used a KNN classifier with K=30. It is clear that
the system performance with the feature selection strategies is better than the system
performance without the feature selection strategies. For LDA, only 9 spam e-mails among
48 spam cases and 2 legitimate e-mails of 62 legitimate cases were misclassified. For HC,
only 4 spam e-mails among 48 spam cases and 3 legitimate e-mails of 62 legitimate cases
were misclassified. For TA, only 3 spam e-mails among 48 spam cases and 3 legitimate e-
mails of 62 legitimate cases were misclassified. For SA, only 4 spam e-mails among 48 spam
cases and 1 legitimate e-mail of 62 legitimate cases were misclassified. Among all the four
strategies, SA reached the best performance. Accuracy ordering: SA> TA> HC > LDA.
4 Conclusion
In this paper, we proposed the use of three different local search optimization techniques as
feature selection strategies for application in spam e-mail filtering. The experimental results
show that the proposed strategies not only reduce the dimensions of the e-mail, but also
improve the performance of the classification filter. We obtained a classification accuracy of
90.0% for LDA, 93.6% for HC, 94.6% for TA and 95.5% for SA as compared to 88.1% for
the system without feature selection.
References
[1] R. Duda, P. Hart and D. Stork, "Pattern Classification," John Wiley and Sons, 2001.
[2] N. Soonthornphisaj, K. Chaikulseriwat and P. Tng-on, "Anti-Spam Filtering: A Centroid-Based Classification Approach," 6th IEEE International Conference on Signal Processing, pp. 1096-1099, 2002.
[3] L.F. Chen, H.Y.M. Liao, M.T. Ko, J.C. Lin and G.J. Yu, "A new LDA-based face recognition system which can solve the small sample size problem," Pattern Recognition, vol. 33, pp. 1713-1726, 2000.
[4] C. Lai and M. Tsai, "An empirical performance comparison of machine learning methods for spam email categorization," Proceedings of the 4th International Conference on Hybrid Intelligent Systems (HIS'04), 2004.
[5] J. F. Pang, D. Bu and S. Bai, "Research and Implementation of Text Categorization System Based on VSM," Application Research of Computers, 2001.
[6] S. Kirkpatrick, C. D. Gelatt and M. P. Vecchi, "Optimization by Simulated Annealing," Science, pp. 671-680, May 1983.
[7] J. A. Clark, J. L. Jacob and S. Stepney, "The Design of S-Boxes by Simulated Annealing," Evolutionary Computation, vol. 2, pp. 1533-1537, June 2004.
[8] J. Hurink, "Introduction to Local Search."
[9] S. Mika, G. Ratsch, J. Weston, B. Scholkopf and K. Mullers, "Fisher discriminant analysis with kernels," Neural Networks for Signal Processing IX, pp. 41-48, Madison, WI, Aug. 1999.
[10] http://iit.demokritos.gr/skel/i-config/downloads.
Analysis & Design of a New Symmetric Key
Cryptography Algorithm and Comparison with RSA

Sadeque Imam Shaikh
Dept. of CSE, University of Science & Technology Chittagong(USTC), Bangladesh
spnusrat@yahoo.com

Abstract

Networking is the main technology for communication. There are various
types of networks, but all of them are vulnerable to attacks that threaten valuable
information. So far, cryptography is the main weapon that can reduce unauthorized
attacks on valuable information. In the first phase of this paper, a literature review of
various types of cryptographic algorithms is presented. In the second phase, as part of
generating new ideas, a new symmetric key cryptography algorithm has been developed.
Although this algorithm is based on a symmetric key, it has a few similarities with RSA;
as a result, a comparison between the two algorithms using an example is given in the
third phase. Finally, source code for this algorithm has been written using the Turbo C++
compiler, which successfully encrypts and decrypts information as output.
Keywords: Information Security, Cryptography, Algorithm, Keys, Encryption, Decryption.
1 Proposed Methodologies
Mathematics and programming are the main parts of this paper, since cryptography depends
completely on mathematics. To practically implement the encryption and decryption of the
algorithm, programming in Turbo C++ has been used. Books and research papers have also
been used to develop the new concept.
2 Literature Review
During 1994 the Internet Architecture Board (IAB) published a report that clearly indicated
that information transmitted through a network or the Internet requires better, more effective
security. It also described the vulnerabilities of information systems to unauthorized access and
control of network traffic. Those reports were confirmed by the Computer Emergency
Response Team (CERT) Coordination Center (CERT/CC), which reported that over the last ten
years attacks on the Internet and networks have increased rapidly. That is why a wide range of
technologies and tools are needed to face this growing threat, and a strong cryptographic
algorithm is the main weapon that can meet this challenge. Cryptographic systems are basically
of two types: public-key and secret-key cryptography. Asymmetric cryptography uses two
separate keys for encoding and decoding and provides a robust mechanism for key
transportation. On the other hand, private-key cryptography uses an identical key for both
encoding and decoding, which is more efficient for large amounts of data [Shin and Huang, 2007].
Suppose there are 4 entities; then there are 6 pairwise relationships. From the symmetric point of
view, to maintain security this system will require 6 secret keys, but from the asymmetric point
of view the 4 entities will require only 4 key pairs. From a networking point of view, each network
may have many pairs of relationships, so for symmetric key cryptography it is a big challenge to
maintain security for so many secret keys compared with asymmetric key cryptography. The
inventors of public key algorithms stated that the limits of public key cryptography for key
organization and signature submission are almost generally established [Diffie, 1988]. In 1976
Diffie and Hellman explained the new approach of public key cryptography and challenged
mathematicians to devise better methods of public key cryptography [Diffie and
Hellman, 1976]. The first response to this challenge was introduced in 1978 by Ron Rivest,
Adi Shamir and Len Adleman of MIT. These three scientists introduced a new technique of
public key cryptography, known as RSA, which is still one of the best public key cryptographic
techniques [Rivest et al., 1978].
Cryptanalysis of RSA is largely based on factoring n into its two prime factors. Determining
φ(n) for a given n is equivalent to factoring n. If the available algorithms are used to calculate d
from e and n, it is as time consuming as the factoring problem [Kaliski and Robshaw, 1995]. A
problem with RSA, however, is that it always recommends the use of large prime numbers; for
small prime numbers it may not be effective. This is another important point that has been
considered in designing a new algorithm to compare with RSA. Stronger security for public key
distribution can be achieved by providing tighter control over the allocation of public keys from
the directory [Popek and Kline, 1979]. The first alternative approach was suggested by
Kohnfelder, who proposed the use of certificates that users can employ to exchange keys
without contacting the public key authority. In this technique the key transfer takes place so
securely that it is as if the keys were transferred directly by the public key authority. Although
one of the main advantages of secret-key cryptography is speed and efficiency, in July 1998 the
security of DES failed when the Electronic Frontier Foundation (EFF) announced that a
DES-encrypted message had been broken [Sebastopol and Reilly, 1998]. In November 2001, the
National Institute of Standards and Technology (NIST) announced the Advanced Encryption
Standard (AES) as a replacement for the Data Encryption Standard (DES) [Mucci et al., 2007]. A
crucial point underlying RSA-based cryptographic schemes is the assumption that it is difficult
to factor large numbers which are the product of two prime factors. A list of challenge numbers
documents the capabilities of known factoring algorithms; the current world record is a
193-decimal-digit number that was factored in 2005. Common minimum requirements suggest the
use of numbers with at least 1,024 bits, which corresponds to 309 decimal digits [Geiselmann and
Steinwandt, 2007].
3 Designing New Algorithm
Encryption
1. Choose two prime numbers P and Q.
2. Calculate N = P * Q.
3. Find the set S of numbers relatively prime to N.
4. Randomly choose one number from S, say S1.
5. Calculate S1^P and S1^Q.
6. Find the largest prime number between S1^P and S1^Q, say X.
7. Let the plain text be TEXT.
8. Find the next prime number greater than X, say Y.
9. Calculate V = X + Y.
10. For encryption, calculate the transition value T as follows: T = TEXT XOR Y.
11. Now find the cipher text as follows: CT = T XOR V.
12. The key would be PK = X * strlen(CT).
Decryption
1. Calculate the length of the cipher text, say L.
2. Then compute H = PK / L.
3. Find the next prime number greater than H, i.e. H1.
4. Compute SUM = H + H1.
5. Calculate CT = CT XOR SUM.
6. Finally, the receiver gets the plain text as follows: PT = CT XOR H1.
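The sketch below is one reading of these steps in Python: it assumes the plain text is a non-negative integer and takes strlen(CT) to be the number of decimal digits of the cipher text (the worked example in the next section uses the plain-text length 3 instead; either convention works as long as the sender and receiver use the same one). The helper names are ours, not part of the paper.

from math import gcd

def is_prime(n):
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def next_prime(n):
    n += 1
    while not is_prime(n):
        n += 1
    return n

def prev_prime(n):
    while not is_prime(n):
        n -= 1
    return n

def encrypt(text, p, q, s1):
    n = p * q
    assert gcd(s1, n) == 1                  # s1 must come from the relative-prime set of N
    x = prev_prime(max(s1 ** p, s1 ** q))   # largest prime between S1^P and S1^Q
    y = next_prime(x)                       # next prime greater than X
    v = x + y
    t = text ^ y                            # transition value T = TEXT XOR Y
    ct = t ^ v                              # cipher text CT = T XOR V
    pk = x * len(str(ct))                   # key PK = X * strlen(CT)
    return ct, pk

def decrypt(ct, pk):
    h = pk // len(str(ct))                  # recover X
    h1 = next_prime(h)                      # recover Y
    return (ct ^ (h + h1)) ^ h1             # undo both XOR layers

ct, pk = encrypt(688, 5, 3, 13)
print(ct, pk, decrypt(ct, pk))              # the round trip recovers 688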
4 Comparison with RSA for Same Input Prime Numbers(5 and 3)
RSA Algorithm [Kahate, 2003]
1. Choose two large prime numbers, say P = 5 and Q = 3.
2. Calculate N = P * Q = 5 * 3 = 15.
3. Select the public key (i.e. the encryption key) E such that it is not a factor of (P - 1) and
(Q - 1). As we can see, (P - 1) * (Q - 1) = 4 * 2 = 8. The factors of 8 are 2, 2 and 2; therefore
our public key E must not have a factor of 2. Let us choose the public key value E = 5.
4. Select the private key D such that (D * E) mod (P - 1) * (Q - 1) = 1. Let us choose D = 5,
because (5 * 5) mod 8 = 1, which satisfies the condition.
5. Let the plain text be PT = 688.
6. For encryption, calculate the cipher text CT from the plain text as follows:
CT = PT^E mod N = 688^5 mod 15 = 13.
7. Send CT as the cipher text to the receiver.
8. For decryption, calculate the plain text PT from the cipher text CT as follows:
PT = CT^D mod N = 13^5 mod 15 = 13.
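For these toy parameters the modular exponentiations can be checked directly. Note that because N = 15 is smaller than the plain text 688, what decryption actually returns is 688 mod 15 = 13, which illustrates the paper's point that RSA is not effective with such small primes.

p, q, e, d = 5, 3, 5, 5
n = p * q                          # 15
pt = 688
ct = pow(pt, e, n)                 # 688^5 mod 15 = 13
print(ct, pow(ct, d, n))           # decryption gives 13, i.e. 688 mod 15, not 688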
New Algorithm (Encryption)
1. Choose two distinct prime numbers, say P = 5 and Q = 3.
2. Calculate N = P * Q = 5 * 3 = 15.
3. Find the set of numbers relatively prime to N = 15: S = {1, 2, 4, 7, 8, 11, 13, 14}.
4. Randomly choose one number from S, say S1 = 13.
5. Calculate S1^P = 13^5 = 371293 and S1^Q = 13^3 = 2197.
6. Find the largest prime number X between S1^P and S1^Q: X = 371291.
7. Let the plain text be 688.
8. Find the next prime number Y greater than X, i.e. Y = 371299.
9. Calculate V = X + Y = 371291 + 371299 = 742500.
10. For encryption, calculate the transition value T as follows:
T = TEXT XOR Y = 688 XOR 371299 = 370899.
11. Now, for encryption, find the cipher text as follows:
CT = T XOR V = 370899 XOR 742500 = 982199.
12. The private key would be PK = X * strlen(CT) = 371291 * 3 = 1113873.
New Algorithm (Decryption)
1. Calculate the length L of the cipher text CT, i.e. L = 3.
2. Then compute H = PK / L = 1113873 / 3 = 371291.
3. Find the next prime number H1 greater than H, i.e. H1 = 371299.
4. Compute SUM = H + H1 = 371291 + 371299 = 742500.
5. Calculate CT = CT XOR SUM = 982199 XOR 742500 = 370899.
6. Finally, the receiver gets the plain text as follows:
PT = CT XOR H1 = 370899 XOR 371299 = 688.
5 Advantages of New Algorithm Over RSA
1. RSA recommends using large prime numbers, i.e. it is not very effective with small prime
numbers. For example, with two small prime numbers, say 5 and 3, both the encryption
and the decryption key become 5, which one can obtain with no difficulty.
On the other hand, with the same prime numbers given above, the new algorithm produces
the private key 1113873. In this case the new algorithm is better.
2. The RSA algorithm needs the public key directly to encode the plain text, whereas the new
algorithm does not use the key directly but rather a transitional value. For example, taking
the example above, RSA encrypts the plain text PT as CT = PT^E mod N, where N is 15 and
E is 5. In the new algorithm the key depends on the length of the encrypted text:
PK = X * strlen(CT),
where PK is the key, X is a prime number and CT is the cipher text.
So, finally, we can say that in the new algorithm one cannot obtain the actual key directly,
which is possible with RSA.
3. If we study the RSA algorithm, we find that the effectiveness of RSA depends mostly on the
size of the two prime numbers, a dependence that is absent in the new algorithm. In the new
algorithm, even with small prime numbers one can obtain effectively encrypted data.
5.1 Disadvantages of New Algorithm Over RSA
1. The new algorithm is a symmetric key algorithm whereas RSA is an asymmetric key
algorithm; hence it has a few limitations compared with RSA. Moreover, RSA can accept
two identical prime numbers, while the new algorithm never accepts two identical prime
numbers for the same input.

6 Output of Encryption Window
Fig. 1
6.1 Output of Decryption Window
Fig. 2
7 Conclusion
Cryptography, especially public key cryptography, is one of the hot topics in information
security. If the security, privacy and integrity of an information system are to be maintained,
there is no alternative to cryptography; that is why, even for satellite communication, both the
ground stations and the satellite in its distant orbit send and receive information using
encryption and decryption to ensure security and privacy for all subscribers. With the passage
of time the techniques of cryptography change, because a cipher that was earlier considered
effective later becomes insecure. That is why there is always scope for developing and
researching cryptographic algorithms. From this point of view, the new symmetric algorithm
described in this paper may be helpful for further research aimed at enhancing information
security.
References
[1] [Diffie,1988] Diffie.W, The first ten years of public key cryptography, Proceedings of IEEE,May 1988.
[2] [Diffie and Hellman,1976] Diffie, W, Hellman, M Multi-user cryptographic technique, IEEE transactions
on information theory, November1976.
[3] [Geiselmann and Steinwandt,2007] Willi Geiselmann, Rainer Steinwandt, Special-Purpose Hardware in
Cryptanalysis, The case of 1024 bit RSA, IEEE computer society 2007.
[4] [Kahate,2003] Atul Kahate, Cryptography and Network security, Tata McGraw-Hill publishing company
Limited, 2003, pp115-119.
[5] [Kaliski and Robshaw,1995] Kaliski, B, Robshaw. M, The secure use of RSA,Crypto Bytes, Autumn
1995.
[6] [Mucci et al., 2007] C. Mucci, L. Vanzolini, A. Lodi, A. Deledda, R. Guerrieri, F. Campi, M. Toma,
Implementation of AES/Rijndael on a dynamically reconfigurable architecture Design, Automation &
Test in Europe Conference & Exhibition IEEE, 2007.
[7] [Popek and Kline,1979] Popek,G and Kline,C Encryption and secure computer networks, ACM Computer
surveys, December 1979.
[8] [Rivest et al.,1978] Rivest, R; Shamir, A; and Adleman, L A method for obtaining digital signatures and
public key cryptosystems, Communication of the ACM, February 1978.
[9] [Sebastopol and Reilly, 1998] Sebastopol, C, A; O Reilly, Electronic Frontier Foundation, Cracking DES,
Secrets of encryption research, wiretap Politics and chip design, Electronic Frontier Foundation, Cracking
DES, 1998.
[10] [Shin and Huang, 2007] Shin-Yi Lin and Chih-Tsun Huang, A High-Throughput Low-Power AES Cipher
for Network Applications, Asia and South Pacific Design Automation Conference 2007, IEEE computer
society.

An Adaptive Multipath Source Routing
Protocol for Congestion Control and Load
Balancing in MANET

Rambabu Yerajana A. K. Sarje
Department of ECE Department of Computer ECE
Indian Institute of Technology Indian Institute of Technology
Roorkee Roorkee
rambabuwhy@gmail.com sarjefec@iitr.ernet.in

Abstract

In this paper, we propose a new Multipath routing protocol for ad hoc wireless
networks, which is based on the DSR (Dynamic Source Routing)-On demand
Routing Protocol. Congestion is the main reason for packet loss in mobile ad
hoc networks. If the workload is distributed among the nodes in the system,
based on the congestion of the paths, the average execution time can be
minimized and the lifetime of the nodes can be maximized. We propose a
scheme to distribute load between multiple paths according to the congestion
status of the path. Our simulation results confirm that the proposed protocol, CCSR,
improves the throughput and reduces the number of collisions in the network.
Keywords: Ad hoc networks, congestion control and load balancing, routing protocols.
1 Introduction
A mobile Ad hoc network is a collection of wireless mobile hosts forming a temporary
network without the aid of any fixed infrastructure and centralized administration. All nodes
can function, if needed, as relay stations for data packets to be routed to their final
destination. Routing in mobile environments is challenging due to the constraints existing on
the resources (transmission bandwidth, CPU time, and battery power) and the required ability
of the protocol to effectively track topological changes.
Routing protocols for Ad hoc networks can be classified into three categories: proactive, on-
demand also called reactive, and hybrid protocols [7, 8]. The primary characteristic of
proactive approaches is that each node in the network maintains a route to every other node in
the network at all times. In Reactive routing techniques, also called on-demand routing,
routes are only discovered when they are actually needed. When a source node needs to send
data packets to some destination, it checks its route table to determine whether it has a route
to that destination. If no route exists, it performs a route discovery procedure to find a path to
the destination. Hence, route discovery is performed on demand. Dynamic Source Routing
(DSR) and Ad hoc On-demand Distance Vector (AODV) are on-demand routing protocols;
our proposed protocol is based on the DSR protocol [1, 3, 4, 5].
The rest of this paper is organized as follows. Section 2 gives a brief introduction to DSR
protocol and our proposed routing protocol CCSR. In section 3, the performance comparisons
between CCSR and DSR are discussed. Section 4 concludes the routing algorithm.
2 Dynamic Source Routing Protocol
In DSR protocol, if a node has a packet to transmit to another node, it checks its Route Cache
for a source route to the destination [1, 6, 7 and 8]. If there is already an available route, then
the source node will just use that route immediately. If there is more than one source route,
the source node will choose the route with the shortest hop-count, Source Route at the
source node includes list of all intermediate traversing nodes in the packet header, when it
desires to send the packet to destination in an ad hoc network. Source node initiates route
discovery, if there are no routes in its cache. Each route request may discover multiple routes
and all routes are cached at the source node.
The Route Reply packet is sent back by the destination to the source by reversing the
received node list accumulated in the Route Request packet. The reversed node list forms the
Source Route for the Route Reply packet. The DSR design includes loop-free discovery of
routes, and discovering multiple paths in DSR is possible because paths are stored in the cache [6,
8]. Due to the dynamic topology of ad hoc networks, a single path is easily broken, requiring the
route discovery process to be performed again. In ad hoc networks, multipath routing is better
suited than single-path routing in terms of stability and load balance.
3 Cumulative Congestion State Routing Protocol Based on Delimiters
Our motivation is that congestion is a dominant cause for packet loss in MANETs. Unlike
well-established networks such as the Internet, in a dynamic network like a MANET, it is
expensive, in terms of time and overhead, to recover from congestion. Our proposed CCSR
protocol tries to prevent congestion from occurring in the first place. CCSR uses the congestion
status of the whole path (the congestion status of all nodes participating in the route), and the
source node maintains a table called the Congestion Status table (Cst), which contains the
congestion status of every path from the source node to the destination node.
[Figure: source S reaches destination D over three paths through nodes 1-5; the Ccsp packet accumulates cs(D), then cs(D)+cs(5) or cs(D)+cs(4)+cs(3), on its way back towards S]
Fig. 1: Using Ccsp Packets
A simplified example is illustrated in Fig. 1. Three possible routes, S->1->2->D, S->5->D
and S->3->4->D, are the multiple paths between source node S and destination node D.
Source node S maintains a special table called the Congestion Status Table, which stores the
congestion status of every path; note that here we calculate the congestion status not for a
single node but for all the nodes of the path (the cumulative congestion status).
3.1 Load Distribution
In CCSR, the destination node sends Cumulative Congestion Status Packets (Ccsp)
periodically towards the source node. After receiving the Ccsp packets, the source node
updates the Cst table. The distribution procedure at the source node distributes the available
packets according to the delimiters used. The CCSR protocol uses three delimiters to decide
how many packets should be sent to congested paths. According to the Cst table, the source
node distributes the packets such that more packets are sent on paths with a lower congestion
status and fewer packets on paths with a higher congestion status. Table 1 shows the Cst table
of source node S, and the congestion status of a path is calculated as follows:
Cs(A) denotes the congestion status of node A. Ccs(B) denotes the cumulative congestion
status of node B, calculated as the congestion status of node B plus the congestion status of
its previous nodes. The congestion status of a particular node is calculated from the available
buffer size (queue length) and the number of packets: the ratio of data to the available queue
length gives the congestion status of that node.
Ccs(D): cumulative congestion status of node D on the path {S, 1, 2, D} = Cs(D).
Ccs(1): cumulative congestion status of node 1 on the path {S, 1, 2, D} = Cs(D) + Cs(1).
Ccs(3): cumulative congestion status of node 3 on the path {S, 3, 4, D} = Cs(D) + Cs(4) + Cs(3).
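A small sketch of this accumulation is shown below; the per-node congestion values are hypothetical and only illustrate how the Ccsp packet builds up the cumulative status on its way back to the source.

def cumulative_congestion(path, cs):
    """Cumulative congestion status accumulated hop by hop from the destination
    back towards the source; cs maps node -> congestion status."""
    total, table = 0.0, {}
    for node in reversed(path[1:]):          # D first, then the intermediate nodes
        total += cs[node]
        table[node] = round(total, 2)
    return table

cs = {"1": 0.2, "2": 0.1, "3": 0.3, "4": 0.25, "5": 0.15, "D": 0.05}
print(cumulative_congestion(["S", "3", "4", "D"], cs))   # {'D': 0.05, '4': 0.3, '3': 0.6}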
The typical Cst table of the source node S is shown in Table 1, where Path ID indicates the
nodes involved in routing and Congestion Status indicates the cumulative congestion status of
all nodes involved in the route path. After updating to the latest congestion status, the source
node chooses the paths and distributes the packets.
The load distribution procedure is shown below; a Python sketch follows it.
/*
a, b and c denote the numbers of packets at the nodes of the corresponding paths A, B and C,
and x, y and z denote the queue lengths of those paths; a/x, b/y and c/z then give the
congestion statuses of the paths.
NOPACK is the amount of data available at the source node.
L (Low), M (Medium) and H (High) are the congestion-status delimiters for load distribution:
a low congestion status means traffic towards the path is light, a high status means it is very heavy.
CL is the minimum value and CU the maximum value in the congestion status table.
*/
// Begin load distribution procedure
Procedure LoadDIST (NOPACK, A, B, C, L, M, H)
  Repeat until NOPACK = 0
    For each path X in the path list:
      IF Cs{X} <= L
        Send more packets towards this path
      IF Cs{X} >= H
        Stop sending packets towards this path
      IF L < Cs{X} < M
        Send CU / Cs{X} packets towards path X
      ELSE IF M < Cs{X} < H
        Send CU / Cs{X} packets towards path X
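One way to realize this delimiter-based distribution in Python is sketched below; it collapses the L/M/H branches into an inverse-proportional split with a hard cut-off at the high delimiter, so the exact shares are an assumption rather than the authors' implementation.

def distribute_load(num_packets, cst, low, high):
    """Split num_packets across the paths in the Cst table inversely to their
    cumulative congestion status; paths at or above `high` receive nothing."""
    usable = {path: cs for path, cs in cst.items() if cs < high}
    if not usable:
        return {}
    # Paths at or below `low` are the least congested and get the largest weight.
    weights = {path: 1.0 / max(cs, low) for path, cs in usable.items()}
    total = sum(weights.values())
    # Integer shares; rounding may shift the total by a packet or two.
    return {path: round(num_packets * w / total) for path, w in weights.items()}

cst = {("S", "1", "2", "D"): 0.6, ("S", "5", "D"): 0.2, ("S", "3", "4", "D"): 0.9}
print(distribute_load(100, cst, low=0.1, high=0.85))   # the 0.9 path is skipped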
3.2 Additional Analysis
If the congestion status Cs{X} of a path remains very high for a long period, the path is
removed from the list. This reduces overhead, since maintaining many such multipaths is
difficult, and deleting paths that have a high congestion status for a long time reduces the
processing time at the source node.
3.3 Congestion State Table
Source node maintains a separate table to keep track of congestion status of the available
paths. The Congestion State Table of the Source node S is shown in Table 1.
Table 1
Path ID        Congestion Status
{S,1,2,D}      Cs(S+1+2+D)
{S,5,D}        Cs(S+5+D)
{S,3,4,D}      Cs(S+3+4+D)
...            ...
4 Simulation
CCSR protocol was simulated in GloMoSim Network Simulator. Number of nodes present in
the network was varied from 20 to 60. Nodes moved in an area of (1000x300) m2 in
accordance with random waypoint mobility model, with a velocity of 20 m/s and a pause time
of 0 second. Simulation time was set as 700 seconds.
We considered the following important metrics for the evaluation: Packet delivery ratio,
number of collisions and end-to end delay.
Data Throughput (kilobits per second Kbps) - describes the average number of bits received
successfully at the destination per unit time (second). This metric was chosen to measure the
resulting network capacity in the experiments.
End-to-end delay (seconds): the average of the sum of delays (including latency) at each
destination node from source to destination. The CCSR protocol gives better results in terms
of delay and throughput than DSR and AODV. As shown in Figures 2 and 3 we compare the
results with DSR, and in Figures 4 and 5 we compare the proposed protocol with AODV; both
DSR and AODV suffer the worst delay, because of high congestion load at the network nodes
and the absence of a load balancing mechanism. The simulation shows a 5 to 25 percent
improvement in packet delivery ratio and delay.

Fig. 2: End-to-End Delay

Fig. 3: Throughput
[Plot: average end-to-end delay (sec) versus number of nodes (20-60) for AODV and CCSR, 600-second simulation]
Fig. 4: End-to-End Delay

Fig. 5: Throughput
5 Conclusion
In this paper, we proposed a new routing protocol called CCSR to improve the performance
of multipath routing in ad hoc wireless networks. CCSR uses the cumulative congestion
status of the whole path rather than the congestion status of the neighborhood only. According
to the congestion status values of the paths, stored in a separate table maintained by the source
node, the source node distributes the packets so that more packets are sent on paths with less
congestion. It is evident from the simulation results that CCSR outperforms both AODV and
DSR, because it balances the load according to the state of the network and the source node
adaptively changes its decisions.
References
[1] [David B. Johnson, David A.Maltz,Yih Chun Hu] The Dynamic Source Routing Protocol for Mobile Ad
Hoc Networks (DSR), Internet Draft,draftietfManetdsr09.txt.April,2003.
URL://www.ietf.org/internetdrafts/draftietf- manet-dsr-09.txt.
[2] [Yashar Ganjali and Abtin Keshavarzian] Load Balancing in Ad Hoc Networks: Single-path routing vs.
Multi-path Routing, IEEE INFOCOM 2004, Twenty-third annual joint conference of the IEEE computer
communications society, volume 2, March 2004, pp: 1120-1125.
[3] [Salma Ktari and Houda Labiod and Mounir Frikha] Load Balanced Multipath Routing in Mobile Ad hoc
Networks, Communication Systems 2006, ICCS2006 10th IEEE Singapore international conference, Oct
2006, pp: 1-5.
[4] [Wen Song and Xuming Fang] Routing with Congestion Control and Load Balancing in Wireless Mesh
Networks, ITS Telecommunications Proceedings 2006, 6th International Conference, pp. 719-724.
[5] [Neeraj NEhra R.B. Patel and V.K.Bhat] Routing with Load Balancing in Ad Hoc Network: A Mobile
Agent Approach, 6th IEEE/ACIS International Conference on Computer and Information science (ICIS
2007), pp: 480-486.
[6] [Mahesh K. Marina and Samir R. Das] Performance of Route Caching Strategies in Dynamic Source
Routing, Distributed Computing Systems Workshop, 2001 International Conference, 16-19, April 2001
pp. 425-432.
[7] [A. Nasipuri and S. R. Das] On-demand Multipath routing for mobile ad hoc networks, Proc. IEEE
ICCCN, Oct. 1999, pp. 64-70.
[8] [S.J. Lee, C.K. Toh, and M. Gerla] Performance Evaluation of Table-Driven and On-Demand Ad Hoc
Routing Protocols, Proc. IEEE Symp. Personal, Indoor and Mobile Radio Comm, Sept. 1999, pp. 297-301.
Spam Filtering Using Statistical Bayesian
Intelligence Technique

Lalji Prasad RashmiYadav Vidhya Samand
SIMS (RGPV) University SIMS (RGPV) University SIMS (RGPV) University
Indore-453002 Indore-453002 Indore-453002
lalji_prasad@rediffmail.com rasneeluce@gmail.com vidhya_samand@yahoo.ac.in

Abstract

This paper describes how Bayesian mathematics can be applied to the spam
problem, resulting in an adaptive, statistical intelligence technique that is
much harder to circumvent by spammers. It also explains why the Bayesian
approach is the best way to tackle spam once and for all, as it overcomes the
obstacles faced by more static technologies such as blacklist checking,
databases of known spam and keyword checking. Spam is an ever-increasing
problem. The number of spam mails is increasing daily. Techniques currently
used by anti-spam software are static, meaning that it is fairly easy to evade by
tweaking the message a little. To effectively combat spam, an adaptive new
technique is needed. This method must be familiar with spammers' tactics as
they change over time. It must also be able to adapt to the particular
organization that it is protecting from spam. The answer lies in Bayesian
mathematics, which can be applied to the spam problem, resulting in an
adaptive, statistical intelligence technique that is much harder to circumvent
by spammers. The Bayesian approach is the only and best way to tackle spam
once and for all, as it overcomes the obstacles faced by more static
technologies such as blacklist checking, databases of known spam and
keyword checking.
1 Introduction
Every day we receive many times more spam than legitimate correspondence when checking
mail. On average, we probably get ten spam messages for every legitimate e-mail. The
problem with spam is that it tends to swamp desirable e-mail. Junk e-mail courses through
the Internet, clogging our computers and diverting attention from the mail we really want.
Spammers waste the time of millions of people. In the future, spam should, like OS crashes,
viruses and pop-ups, become one of those plagues that only afflict people who don't bother to
use the right software.
The problem of unsolicited e-mail has been increasing for years. Spam encompasses all the e-
mail that we do not want and that is only very loosely directed at us. Unethical e-mail senders
bear little or no cost for mass distribution of messages; yet normal e-mail users are forced to
spend time and effort purging fraudulent and otherwise unwanted mail from their mailboxes.
Bayesian filters are advantageous because they take the whole context of a message into
consideration. Unlike other filtering techniques that look for spam-identifying words in
subject lines and headers, a Bayesian filter uses the entire context of an e-mail when it looks
for words or character strings that will identify the e-mail as spam. A Bayesian filter is
constantly self-adapting. Bayesian filters are adaptable in that the filter can train itself to
identify new patterns of spam. The Bayesian technique learns the e-mail habits of the
organization it protects and adapts to them. It lets each user define spam, so the filter is highly
personalized. Bayesian filters also update automatically and are self-correcting as they
process new information and add it to the database.
2 What is Spam?
Spam is somewhat broader than the category "unsolicited commercial automated e-mail";
Spam encompasses all the e-mail that we do not want and that is only very loosely directed at
us.
2.1 How Spam Creates Problem
The problem of unsolicited e-mail has been increasing for years. Spam encompasses all the e-
mail that we do not want and that is very loosely directed at us. Normal e-mail users are
forced to spend time and effort purging fraudulent mail from their mailboxes. The problem
with spam is that it tends to swamp desirable e-mail.
2.2 Looking at Filtering Algorithm
2.2.1 Basic Structured Text Filters
The e-mail client has the capability to sort incoming e-mail based on simple strings found in
specific header fields, the header in general, and/or in the body. Its capability is very simple
and does not even include regular expression matching. Almost all e-mail clients have this
much filtering capability.
2.2.2 White List Filter
The "white list plus automated verification" approach. There are several tools that implement
a white list with verification: TDMA is a popular multi-platform open source tool; Choice
Mail is a commercial tool for Windows. A white list filter connects to an MTA and passes
mail only from explicitly approved recipients on to the inbox. Other messages generate a
special challenge response to the sender. The white list filter's response contains some kind of
unique code that identifies the original message, such as a hash or sequential ID. This
challenge message contains instructions for the sender to reply in order to be added to the
white list (the response message must contain the code generated by the white list filter.
2.2.3 Distributed Adaptive Blacklists
Spam is delivered to a large number of recipients. And as a matter of practice, there is little if
any customization of spam messages to individual recipients. Each recipient of a spam,
however, in the absence of prior filtering, must press his own "Delete" button to get rid of the
message. Distributed blacklist filters let one user's Delete button warns millions of other users
as to the spamminess of the message. Tools such as Razor and Pyzor operate around servers
that store digests of known spam. When a message is received by an MTA, a distributed
blacklist filter is called to determine whether the message is a known spam. These tools use
clever statistical techniques for creating digests, so that spam with minor or automated
mutations is still recognized. In addition, maintainers of distributed blacklist servers frequently
create "honey-pot" addresses specifically for the purpose of attracting spam (but never for any
legitimate correspondence).
2.2.4 Rule-Based Rankings
The most popular tool for rule-based Spam filtering, by a good margin, is Spam Assassin.
Spam Assassin (and similar tools) evaluates a large number of patterns, mostly regular
expressions, against a candidate message. Some matched patterns add to a message's score,
while others subtract from it. If a message's score exceeds a certain threshold, it is filtered as
spam; otherwise it is considered legitimate.
2.2.5 Bayesian Word Distribution Filters
The general idea is that some words occur more frequently in known spam, and other words
occur more frequently in legitimate messages. Using well-known mathematics, it is possible
to generate a "spam-indicative probability" for each word. It can generate a filter
automatically from corpora of categorized messages rather than requiring human effort in
rule development. It can be customized to individual users' characteristic spam and legitimate
messages.
2.2.6 Bayesian Trigram Filters
Bayesian techniques built on a word model work rather well. One disadvantage of the word
model is that the number of "words" in e-mail is virtually unbounded: the number of
word-like character sequences is nearly unlimited, and new text keeps producing new
sequences. This is particularly true of e-mails, which contain random strings in Message-IDs,
content separators, UU and base64 encodings, and so on. There are various ways to throw out
words from the model. A trigram filter uses trigrams (smaller units than words) for the
probability model rather than words. Among all the techniques described above, we have
chosen the Bayesian approach for implementation.
2.3 Algorithm for the Bayesian Probability Model of Spam and Non-Spam Words
When a user logs in, the administrator checks all new mails for spam and sets the status as
Spam, Non-Spam, Blacklist for a blacklisted sender, or White list for a white-listed sender. The
sender is checked against a sender XML file which maintains the status of the sender
(blacklisted, white-listed, or no status). If the sender is blacklisted or white-listed, no spam
check is applied. If the sender has no status, we use the following algorithm to check for spam.
We have used Graham's Bayesian statistical approach. Steps of statistical filtering:
We started with a corpus of spam and non-spam tokens, mapping each token to the probability
that an e-mail containing it is spam; these probabilities are stored in a probability XML file.
We scan the entire text of each message, including the subject header. We currently consider
alphanumeric characters, exclamation marks and dollar signs to be part of tokens, and
everything else to be a token separator.
When a new e-mail arrives, we extract all the tokens and find at most fifteen whose
probabilities p1, ..., p15 are furthest (in either direction) from 0.5. The factor used for
extracting the 15 interesting words is calculated as follows:
a. For words having probability greater than 0.5 in the probability XML: Factor = token
probability - 0.5.
b. For words having probability less than 0.5 in the probability XML: Factor = 0.5 - token
probability.
c. One question that arises in practice is what probability to assign to a token we have
never seen, i.e. one that does not occur in the probability XML file. We assign 0.4
to such a token.
d. The combined probability that the mail is spam is
p1 p2 ... p15 / (p1 p2 ... p15 + (1 - p1)(1 - p2) ... (1 - p15)).
e. We treat a mail as spam if the algorithm above gives it a probability of more than 0.9
of being spam. At this stage we maintain two XML files (a Spam XML file and a
No-spam XML file), one for each corpus, mapping tokens to numbers of occurrences.
f. According to the combined probability, if a message is spam we count the number of
times each token (ignoring case) occurs in the message and update the Spam XML file;
if it is not spam we update the No-spam XML file.
g. The number of spam or non-spam mails is also updated based on the message status.
h. We look through all of the user's e-mail and, for each token, calculate the ratio of spam
occurrences to total occurrences: Pi = spam occurrences / total occurrences.
For example, if "cash" occurs in 200 of 1000 spam e-mails and in 3 of 500 non-spam
e-mails, its spam probability is (200/1000) / (3/500 + 200/1000), or 0.971.
i. Whenever the number of spam and ham mails reaches 1000, the probability XML is
updated according to the above formula. We want to bias the probabilities slightly to
avoid false positives.
Bias Used
There is the question of what probability to assign to words that occur in one corpus but not
the other. We choose 0.01 (for words not occurring in the Spam XML) and 0.99 (for words not
occurring in the No-spam XML). We consider each corpus to be a single long stream of text for
the purpose of counting occurrences, and we use their combined length for calculating
probabilities. This adds another slight bias to protect against false positives. The token
probability is calculated if and only if the number of both spam and ham mails reaches 1000.
Here we have used 1000, but an even larger corpus of messages can be used. Until the number
of messages for both spam and ham reaches 1000, messages having a probability greater than
0.6 are treated as spam; afterwards 0.9 is used.
We use a very large corpus of token probabilities in the probability XML instead of a corpus
of spam and ham messages. If the user marks a non-spam mail as spam, the Spam and Non-spam
XML files are updated accordingly and all of its words are assigned a high probability in the
probability XML, until the number of mails reaches 1000.
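A minimal sketch of the scoring side of this procedure (steps b-e above) is given below; the probability table and the message are made up, and the function name is ours.

def spam_probability(tokens, token_prob, unknown=0.4, n_interesting=15):
    """Combine the probabilities of the most 'interesting' tokens, Graham-style."""
    probs = [token_prob.get(t, unknown) for t in set(tokens)]
    # Keep the tokens whose probability is furthest (in either direction) from 0.5.
    interesting = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n_interesting]
    num = den = 1.0
    for p in interesting:
        num *= p
        den *= (1.0 - p)
    return num / (num + den)

token_prob = {"cash": 0.971, "viagra": 0.99, "meeting": 0.02, "friday": 0.1}
msg = "cash prize meeting viagra friday cash".split()
print(spam_probability(msg, token_prob))   # compare the result against the 0.9 threshold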
3 Bayesian Model of Spam and Non Spam Words
The spam filtering technique implemented in the software is a Bayesian statistical probability
model of spam and non-spam words. The general idea is that some words occur more
frequently in known spam, and other words occur more frequently in legitimate messages.
Using well-known mathematics, it is possible to generate a "spam-indicative probability" for
each word. Another simple mathematical formula can then be used to determine the overall
"spam probability" of a new message based on the collection of words it contains. Bayesian
e-mail filters take advantage of Bayes' theorem. Bayes' theorem, in the context of spam, says
that the probability that an e-mail is spam, given that it contains certain words, is equal to the
probability of finding those words in spam e-mail, times the probability that any e-mail is
spam, divided by the probability of finding those words in any e-mail. In the simplified
per-token form used here,
Pi = spam occurrences / total occurrences.
3.1 Process
Particular words have particular probabilities of occurring in spam email and in legitimate
email. For instance, most email users will frequently encounter the word Viagra in spam
email, but will seldom see it in other email. The filter doesn't know these probabilities in
advance, and must first be trained so it can build them up.
To train the filter, the user must manually indicate whether a new email is spam or not. For
all words in each training email, the filter will adjust the probabilities that each word will
appear in spam or legitimate email in its database. For instance, Bayesian spam filters will
typically have learned a very high spam probability for the words "Viagra" and "refinance",
but a very low spam probability for words seen only in legitimate email, such as the names of
friends and family members. After training, the word probabilities (also known as likelihood
functions) are used to compute the probability that an email with a particular set of words in
it belongs to either category. Each word in the e-mail contributes to the e-mail's spam
probability. This contribution is called the posterior probability and is computed using Bayes'
theorem. Then the e-mail's spam probability is computed over all the words in the e-mail, and
if the total exceeds a certain threshold (say 95%), the filter marks the e-mail as spam.
E-mail marked as spam can then be automatically moved to a "Spam" folder, or even
deleted outright.
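The training side can be sketched as below; the counters, the 0.01/0.99 bias for tokens seen in only one corpus and the 0.4 default for unseen tokens follow Section 2.3, while the function and variable names are our own.

from collections import Counter

spam_counts, ham_counts = Counter(), Counter()
spam_mails = ham_mails = 0

def train(tokens, is_spam):
    """Update the per-corpus token counts when the user labels a message."""
    global spam_mails, ham_mails
    (spam_counts if is_spam else ham_counts).update(set(tokens))
    if is_spam:
        spam_mails += 1
    else:
        ham_mails += 1

def token_probability(token):
    """Spam rate over combined rate, with the single-corpus bias and unseen default."""
    s, h = spam_counts[token], ham_counts[token]
    if s == 0 and h == 0:
        return 0.4                           # never-seen token
    if h == 0:
        return 0.99                          # only ever seen in spam
    if s == 0:
        return 0.01                          # only ever seen in legitimate mail
    spam_rate, ham_rate = s / spam_mails, h / ham_mails
    return spam_rate / (spam_rate + ham_rate)

train("cheap cash offer".split(), True)
train("meeting agenda cash".split(), False)
print(token_probability("cash"), token_probability("offer"))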
3.2 Advantages
1. A statistical model basically just works better than a rule-based approach.
2. Feature-recognizing filters like Spam Assassin assign a spam "score" to email. The
Bayesian approach assigns an actual probability.
3. Makes the filters more effective.
4. Lets each user decide their own precise definition of spam.
5. Perhaps best of all makes it hard for spammers to tune mails to get through the filters.
4 Conclusion
As more and more people use e-mail as their everyday communication tool, more and
more spam, viruses, phishing and fraudulent e-mails are sent to our inboxes. Several
email systems use filtering techniques that seek to identify emails and classify them by some
simple rules. However, these email filters employ conventional database techniques for
pattern matching to achieve the objective of junk email detection. There are several
fundamental shortcomings for this kind of junk email identification technique, for example,
the lack of a learning mechanism, ignorance of the temporal localization concept, and poor
description of the email data.
Spam Filter Express is a powerful spam filter that quickly identifies and separates hazardous
and annoying spam from your legitimate e-mail. Based on Bayesian filtering technology,
Spam Filter Express adapts itself to your e-mail automatically, filtering out the junk mail
with close to 100% accuracy: no adding rules, no complex training, no forcing your friends
and colleagues to jump through hoops to communicate with you.
5 References
[1] [M. Sahami, S. Dumais, D. Heckerman, E. Horvitz (1998)] "A Bayesian approach to filtering junk e-mail".
AAAI'98 Workshop on Learning for Text Categorization.
[2] [BOW] Bowers, Jeremy, Spam Filtering Last Stand, http://www.jerf.org/iri/2002/11/18.html, November 2002.
[3] [JGC] Graham-Cumming, John, 2004 MIT Spam Conference: How to beat an adaptive spam filter,
http://www.jgc.org/SpamConference011604.pps, January 2004.
[3] [ROB3] Robinson, Gary, Spam Filtering: Training to Exhaustion, http://www.garyrobinson.net/2004/02/
spam_filtering_.html, February 2004.
[4] [Paul Graham] Better Bayesian filtering http://www.paulgraham.com/better.html
[5] [Graham (2002) Paul Graham] A plan for spam. WWW Page, 2002. URL
http://www.paulgraham.com/spam.html.
[6] [Spam Cop FAQ.] "On what type of email should I (not) use Spam Cop?" (FAQ). Iron Port Systems, Inc..
Retrieved on 2007-01-05.
[7] [Scott Hazen Mueller] "What is spam?". Information about spam. spam.abuse.net. Retrieved on 2007-01-
05.
[8] [Center for Democracy and Technology (March 2003)] "Why Am I Getting All This Spam? Unsolicited
Commercial E-mail Research Six Month Report" Retrieved on 2007-06-05. (Only 31 sites were sampled,
and the testing was done before CAN-SPAM was enacted.)
[9] ["Spamhaus Statistics : The Top 10"] Spamhaus Blocklist (SBL) database. The Spamhaus Project Ltd.
(dynamic report). Retrieved on 2007-01-06.
[10] [Shawn Hernan; James R. Cutler; David Harris (1997-11-25)] "I-005c: E-Mail Spamming countermeasures:
Detection and prevention of E-Mail spamming". Computer Incident Advisory Capability Information
Bulletins. United States Department of Energy. Retrieved on 2007-01-06.
[11] [Gary Robinson] Spam detection. URL
http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html accessed 18 November 2002,
22:00 UTC.
[12] [Yerazunis, W.] "The Spam Filtering Accuracy Plateau", MIT Spam Conference 2003
[13] http://crm114.sourceforge.net/Plateau_Paper.pdf
[14] [Meyer, T.A., and Whateley, B., (2004)] SpamBayes: Effective open-source, Bayesian based, email
classification system. Conference on Email and Anti-Spam, July 30 and 31, 2004.
[15] <http://ceas.cc/papers-2004/136.pdf>.
[16] [I. Androutsopoulos, G. Paliouras, V. Karkaletsis,G.Sakkis, C.D. Spyropoulos, and P. Stamatopoulos]
Learning to filter spam e-mail: A comparison of a Naive Bayesian and a memory-based approach,
Proceedings of the workshop: Machine Learning and Textual Information Access, 2000, pp. 1-13.
[17] [Cohen, W. W. (1996)] Learning rules that classify e-mail. In AAAI Spring Symposium on Machine
Learning in Information Access.
Ensure Security on Untrusted Platform
for Web Applications
Surendrababu K. Surendra Gupta
Computer Engineering Department Computer Engineering Department
SGSITS, Indore-452003 SGSITS, Indore-452003
its_surendra18@yahoo.com

Abstract

The web is an indispensable part of our lives. Every day, millions of users
purchase items, transfer money, retrieve information, and communicate over
the web. Although the web is convenient for many users because it provides
anytime, anywhere access to information and services, at the same time, it has
also become a prime target for miscreants who attack unsuspecting web users
with the aim of making an easy profit. The last years have shown a significant
rise in the number of web-based attacks, highlighting the importance of
techniques and tools for increasing the security of web applications. An
important web security research problem is how to enable a user on an
untrusted platform (e.g., a computer that has been compromised by malware)
to securely transmit information to a web application. Solutions that have been
proposed to date are mostly hardware-based and require (often expensive)
peripheral devices such as smartcard readers and chip cards. In this paper, we
discuss some common aspects of client-side attacks (e.g., Trojan horses)
against web applications and present two simple techniques that can be used
by web applications to enable secure user input. We also conducted two
usability studies to examine whether the techniques that we propose are
feasible.
1 Introduction
Since the advent of the web, our lives have changed irreversibly. Web applications have
quickly become the most dominant way to provide access to online services. For many users,
the web is easy to use and convenient because it provides anytime, anywhere access to
information and services. Today, a significant amount of business is conducted over the web,
and millions of web users purchase items, transfer money, retrieve information, and
communicate via web applications. Unfortunately, the success of the web and the lack of
technical sophistication and understanding of many web users have also attracted miscreants
who aim to make easy financial profits. The attacks these people have been launching range
from simple social engineering attempts (e.g., using phishing sites) to more sophisticated
attacks that involve the installation of Trojan horses on client machines (e.g., by exploiting
vulnerabilities in browsers in so-called drive-by attacks [19]).
An important web security research problem is how to effectively enable a user who is
running a client on an untrusted platform (i.e., a platform that may be under the control of an
attacker) to securely communicate with a web application. More precisely, can we ensure the
confidentiality and integrity of sensitive data that the user sends to the web application even if
the user's platform is compromised by an attacker? Clearly, this is an important but difficult
problem. Ensuring secure input to web applications is especially relevant for online services
such as banking applications where users perform money transfers and access sensitive
information such as credit card numbers. Although the communication between the web
client and the web application is typically encrypted using technologies such as Transport
Layer Security [9] (TLS) to thwart sniffing and man-in-the-middle attacks, the web client is
the weakest point in the chain of communication. This is because it runs on an untrusted
platform, and thus, it is vulnerable to client-side attacks that are launched locally on the user's
machine. For example, a Trojan horse can install itself as a browser plugin and then easily
access, control, and manipulate all sensitive information that flows through the browser.
Malware that manipulates bank transactions has already appeared in the wild. This year, for
example, several Austrian banks were explicitly targeted by Trojan horses that were used by
miscreants to perform illegal money transactions [13, 21]. In most cases, the victims did not
suspect anything, and the resulting financial losses were significant. Note that even though
the costs of such an attack are covered by insurance companies, it can still easily harm the
public image of the targeted organization. A number of solutions have been proposed to date
to enable secure input on untrusted platforms for web-based applications. The majority of
these solutions are hardware-based and require integrated or external peripheral devices such
as smart-card readers [10, 23] or mobile phones [15]. Such hardware-based solutions have
several disadvantages. They impose a financial and organizational burden on users and on
service providers, they eliminate the anytime, anywhere advantage of web applications and
they often depend on the integrity of underlying software components which may be replaced
with tampered versions [12, 24, 25].
In this paper, we discuss some common aspects of client side attacks against web applications
and present two simple techniques that can be used by web applications to enable secure
input, at least for a limited quantity of sensitive information (such as financial transaction
data). The main advantage of our solutions is that they do not require any installation or
configuration on the user's machine. Additionally, in order to evaluate the feasibility of our
techniques for mainstream deployment, we conducted usability studies. The main
contributions of this paper are as follows:
We present a technique that extends graphical input with CAPTCHAs [3] to protect
the confidentiality and integrity of the user input even when the user platform is under
the control of an automated attack program (such as a Trojan horse).
We present a technique that makes use of confirmation tokens that are bound to the
sensitive information that the user wants to transmit. This technique helps to protect
the integrity of the user input even when the user platform is under the control of the
attacker.
We present usability studies that demonstrate that the two techniques we propose in
this paper are feasible in practice.
2 A Typical Client-Side Attack
In a typical client-side web attack, the aim of the attacker is to take control of the user's web
client in order to manipulate the client's interaction with the web application. Such an attack
typically consists of three phases. In the first phase, the attacker's objective is to install
malware on the user's computer. Once this has been successfully achieved, in the second
phase, the installed malware monitors the user's interaction with the web application. The
third phase starts once the malware detects that a security-critical operation is taking place;
the malware then attempts to manipulate the flow of sensitive information to the web application
to fulfill the attacker's objectives.
Imagine, for example, that John Smith receives an email with a link to a URL. This email has
been sent by attackers to thousands of users. John is naive and curious, so he clicks on the
link. Unfortunately, he has not regularly updated his browser (Internet Explorer in this case),
which contains a serious parsing-related vulnerability that allows malicious code to be
injected and executed on his system just by visiting a hostile web site. As a result, a Trojan
horse is automatically installed on John's computer when his browser parses the contents of
the web page. The Trojan horse that the attackers have prepared is a Browser Helper Object
(BHO) for the Internet Explorer (IE). This BHO is automatically loaded every time IE is
started. With the BHO, the attackers have access to all events (i.e., interactions) and HTML
components (i.e., DOM objects) within the browser. Hence, they can easily check which web
sites the user is surfing, and they can also modify the contents of web pages. In our example,
the attackers are interested in web sessions with a particular bank (the Bank Austria).
Whenever John is online and starts using the Bank Austria online banking web application,
the Trojan browser plugin is triggered. It then starts analyzing the contents of the bank web
pages. When it detects that he is about to transfer money to another account, it silently
modifies the target account number.
Note that the imaginary attack we described previously is actually very similar to the attacks
that have been recently targeting Austrian banks. Clearly, there can be many technical
variations of such an attack. For example, instead of using a BHO, the attackers could also
inject Dynamic Link Libraries (DLLs) into running applications or choose to intercept and
manipulate Operating System (OS) calls. The key observation here is that the online banking
web application has no way to determine whether the client it is interacting with has been
compromised. Furthermore, when the client has indeed been compromised, all security
precautions the web application can take to create a secure communication channel to the
client (e.g., TLS encryption) fail. That is, the web application cannot determine whether it is
directly interacting with a user, or with a malicious application performing illegitimate
actions on behalf of a user.
3 Our Solution
As described in the previous section, the web application must assume that the user's web
client (and platform) is under the control of an attacker. There are two aspects of the
communication that an attacker could compromise: the confidentiality or the integrity of
input sent from the client to the web application. The confidentiality of the input is
compromised when the attacker is able to eavesdrop on the entered input and intercept
sensitive information. Analogously, the integrity of the input is compromised when the
attacker is able to tamper with, modify, or cancel the input the user has entered. As far as the user
is concerned, there are cases in which the integrity of input may be more important than its
confidentiality. For example, as described in Section 2, only when the attacker can effectively
modify the account number that has been typed, an illegitimate money transaction causing
financial damage can be performed. In this section, we present two techniques that web
applications can apply to protect sensitive user input. We assume a threat model in which the
attacker has compromised a machine and installed malicious code. This code has complete
control of the client's machine, but must perform its task in an autonomous fashion (i.e.,
without being able to consult a human). Our solutions are implemented on the server and are
client-independent. The first solution we discuss aims to protect the integrity of user input.
The second solution we discuss aims to protect the confidentiality and integrity of the user
input, but only against automated attacks (i.e., the adversary is not a human).
3.1 Solution 1: Binding Sensitive Information to Confirmation Tokens
3.1.1 Overview
The first solution is based on confirmation tokens. In principle, the concept of a confirmation
token is similar to a transaction number (TAN) commonly used in online banking.
TANs are randomly generated numbers that are sent to customers as hardcopy letters via
regular (snail) mail. Each time a customer would like to confirm a transaction, she selects a
TAN entry from her hardcopy list and enters it into the web application. Each TAN entry can
be used only once. The idea is that an attacker cannot perform transactions just by knowing a
customer's login name and password. Obviously, TAN-based schemes rely on the
assumption that an attacker will not have access to a user's TAN list and, hence, will not be able to
perform illegitimate financial transactions at a time of his choosing. Unfortunately, TAN-
based schemes are easily defeated when an attacker performs a client-side attack (e.g., using
a Trojan horse as described in Section 2). Furthermore, such schemes are also vulnerable to
phishing attempts in which victims are prompted to provide one (or more) TAN numbers on
the phishing page. The increasing number of successful phishing attacks prompted some
European banks to switch to so called indexed TAN (i-TAN) schemes, where the bank server
requests a specific i-TAN for each transaction. While this partially mitigated the phishing
threat, i-TANs are as vulnerable to client-side attacks as traditional TANs. In general, the
problem with regular transaction numbers is that there is no relationship between the data
that is sent to the web application and the (a-priori shared) TANs. Thus, when the bank
requests a certain TAN, malicious code can replace the user's input without invalidating this
transaction number. To mitigate this weakness and to enforce integrity of the transmitted
information, we propose to bind the information that the user wants to send to our
confirmation token. In other words, we propose to use confirmation tokens that (partially)
depend on the user data. Note that when using confirmation tokens, our focus is not the
protection of the confidentiality, but the integrity of this sensitive information.
3.1.2 Details
Imagine that an application needs to protect the integrity of some input data x. In our solution,
the idea is to specify a function f (.) that the user is requested to apply to the sensitive input x.
The user then submits both her input data x and, as a confirmation token, f(x). Suppose that in
an online banking scenario, the bank receives the account number n together with a
confirmation token t from the user. The bank will then apply f(.) to n and verify that f(n) = t.
If the value x, which the user desires to submit, is the same as the input n that the bank
receives (x = n), then the computation of f(n) by the bank will equal the computation of f(x)
by the user. That is, f(x) = f(n) holds. If, however, the user input is modified, then the bank's
computation will yield f(n) ≠ f(x), and the bank will know that the integrity of the user's
input is compromised. An important question that needs to be answered is how f(.) should
be defined. Clearly, f(.) has to be defined so that malicious software installed on a
user's machine cannot easily compute it. Otherwise, the malware could automatically
compute f(x) for any input x that it would like to send, and the proposed solution fails. Also,
f(.) has to remain secret from the attacker.
We propose two schemes for computing f(x). For both schemes, the user will require a code
book. This code book will be delivered via regular mail, similar to TAN letters described in
the previous section. In the first scheme, called token calculation, the code book contains a
collection of simple algorithms that can be used by users to manually compute confirmation
tokens (similar to the obfuscation and challenge-response idea presented in [4] for secure
logins). All algorithms are based on the input that the user would like to transmit.

Fig. 1: Sample token calculation code book (excerpt: Token ID 5: create a number using the 3rd and 4th digits of the target account and add 262 to it; Token ID 6: create a number using the 2nd and 8th digits of the target account and add 540 to it)
Suppose that the user has entered the account number 980.243.276, but a Trojan horse has
actually sent the account number 276.173.862 to the bank (unnoticed by the user). In the first
scheme, the bank would randomly choose an algorithm from the user's code book. Clearly, in
order to make the scheme more resistant against attacks, a different code book would have to
be created for each user (just like different TANs are generated for different users). Figure 1
shows an excerpt from our sample token calculation code book. Suppose the bank asks the
user to apply algorithm ID 6 to the target account number. That is, the user would have to
multiply the 4th and 8th digits of the account number and add 17 to the result. Hence, the user
would type 31 as the confirmation token. The bank, however, would compute 23 and,
because these confirmation values do not match, it would not execute the transaction,
successfully thwarting the attack.
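As a concrete illustration of this verification step, the following Python sketch reproduces the worked example above; the account numbers and the rule for algorithm ID 6 (multiply the 4th and 8th digits and add 17) come from the text, while the function and variable names are our own and not part of the scheme itself.

def algorithm_6(account: str) -> int:
    """Sample code-book entry (algorithm ID 6 in the text): multiply the 4th
    and 8th digits of the account number and add 17."""
    digits = [int(c) for c in account if c.isdigit()]
    return digits[3] * digits[7] + 17              # 1-based positions 4 and 8

def bank_verifies(received_account: str, token: int) -> bool:
    """The bank recomputes f(n) on the account number n it actually received
    and executes the transaction only if the result matches the user's token."""
    return algorithm_6(received_account) == token

user_account = "980.243.276"        # what the user typed
tampered_account = "276.173.862"    # what the Trojan horse sent instead

token = algorithm_6(user_account)               # user computes 2*7 + 17 = 31
print(bank_verifies(tampered_account, token))   # False: bank computes 23, rejects
print(bank_verifies(user_account, token))       # True: untampered input passes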
For our second scheme to implement f(.), called token lookup, users are not required to
perform any computation. In this variation, the code book would consist of a large number of
random tokens that are organized in pages. The bank and the user previously and secretly
agree on which digits of the account number are relevant for choosing the correct page. The
bank then requests the user to confirm a transaction by asking her to enter the value of a
specific token on that page. For example, suppose that the relevant account digits are 2 and 7
for user John and that the bank asks John to enter the token with the ID 20. In this case, John
would determine the relevant code page by combining the 2nd and 7th digits of the account
number and look up the token on that page that has the ID 20. Suppose that the user is faced
with the same attack that we discussed previously. That is, the user enters 980.243.276, but
the malicious application sends 276.173.862 to the bank. In this case, the user would look up
the token with ID 20 on page 82, while the bank would consult page 78. Thus, the transmitted
token would not be accepted as valid.
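A minimal sketch of the token-lookup variant follows, again using the account numbers and digit positions from the example; the code book here is generated randomly, and its page and token counts are our own assumptions made purely for illustration.

import secrets

def build_code_book(pages=100, tokens_per_page=50):
    """Hypothetical code book: a list of pages, each a list of random tokens.
    In the scheme this book is printed and mailed to the user; here it is
    generated only to illustrate the lookup."""
    return [[f"{secrets.randbelow(10**6):06d}" for _ in range(tokens_per_page)]
            for _ in range(pages)]

def page_for(account, secret_positions=(2, 7)):
    """Combine the secretly agreed digits of the account number (the 2nd and
    7th in the example) into a page number."""
    digits = [c for c in account if c.isdigit()]
    return int("".join(digits[p - 1] for p in secret_positions))

book = build_code_book()
token_id = 20
user_page = page_for("980.243.276")    # 2nd digit 8, 7th digit 2 -> page 82
bank_page = page_for("276.173.862")    # 2nd digit 7, 7th digit 8 -> page 78

user_token = book[user_page][token_id]
print(user_page, bank_page, user_token == book[bank_page][token_id])  # 82 78 False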
3.2 Solution 2: Using CAPTCHAs for Secure Input
3.2.1 Overview
Graphical input is used by some banks and other institutions to prevent eavesdropping of
passwords or PINs. Instead of using the keyboard to enter sensitive information, an image of
a keypad is displayed, and the user enters data by clicking on the corresponding places in the
image. Unfortunately, these schemes are typically very simple. For example, the letters and
numbers are always located at the same window coordinates, or the fonts can be easily
recognized with optical character recognition (OCR). As a result, malware can still recover
the entered information. The basic idea of the second solution is to extend graphical input
with CAPTCHAs [3]. A CAPTCHA, which stands for Completely Automated Public Turing
test to tell Computers and Humans Apart, is a type of challenge-response test that is used in
computing to determine whether or not the user is human. Hence, a CAPTCHA test needs to
be solvable by humans, but not solvable (or very difficult to solve) for computer applications.
CAPTCHAs are widely employed for protecting online services against automated (mis)use
by malicious programs or scripts. For example, such programs may try to influence online
polls, or register for free email services with the aim of sending spam. Figure 2 shows a
graphical CAPTCHA generated by Yahoo when a user tries to subscribe to its free email
service.

Fig. 2: A graphical CAPTCHA generated by Yahoo.
An important characteristic of a CAPTCHA is that it has to be resistant to attacks. That is, it
should not be possible for an algorithm to automatically solve the CAPTCHA. Graphical
CAPTCHAs, specifically, need to be resistant to optical character recognition [18]. OCR is
used to translate images of handwritten or typewritten text into machine-editable text. To
defeat OCR, CAPTCHAs generally use background clutter (e.g., thin lines, colors, etc.), a
large range of fonts, and image transformations. Such properties have been shown to make
OCR analysis difficult [3]. Usually, the algorithm used to create a CAPTCHA is made public.
The reason for this is that a good CAPTCHA needs to demonstrate that it can only be broken
by advances in OCR (or general pattern recognition) technology and not by the discovery of a
secret algorithm. Note that although some commonly used CAPTCHA algorithms have
already been defeated (e.g., see [17]), a number of more sophisticated CAPTCHA algorithms
[3, 7] are still considered resistant against OCR and are currently being widely used by
companies such as Yahoo and Google.
3.2.2 Details
Although CAPTCHAs are frequently used to protect online services against automated
access, to the best of our knowledge, no one has considered their use to enable secure input to
web applications. In our solution, whenever a web application needs to protect the integrity
and confidentiality of user information, it generates a graphical input field with randomly
placed CAPTCHA characters. When the user wants to transmit input, she simply uses the
mouse to click on the area that corresponds to the first character that should be sent. Clicking
on the image generates a web request that contains the coordinates on the image where the
user has clicked with the mouse. The key idea here is that only the web application knows
which character is located at these coordinates. After the first character is transmitted, the
web application generates another image with a different placement of the characters, and the
process is repeated. By using CAPTCHAs to communicate with the human user, a web
application can mitigate client-side attacks that intercept or modify the sensitive information
that users type. Because the CAPTCHA characters cannot be identified automatically, a
malware program has no way to know which information was selected by the user, nor does
it have a way to meaningfully select characters of its own choosing.
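The server-side bookkeeping behind this idea can be sketched as follows; the actual CAPTCHA rendering and distortion (the hard part) is omitted, and the class name and grid layout are hypothetical choices of ours rather than part of the proposal.

import random
import string

class CaptchaInputField:
    """Server-side bookkeeping for one CAPTCHA keypad: the characters are
    placed in randomly chosen cells, rendered as a distorted image (omitted
    here), and only the server keeps the coordinate-to-character mapping."""

    def __init__(self, charset=string.digits, cell=40, cols=5, rows=2):
        self.cell, self.cols = cell, cols
        chosen_cells = random.sample(range(cols * rows), len(charset))
        self.layout = dict(zip(chosen_cells, charset))   # secret, server-side

    def resolve_click(self, x, y):
        """Map the click coordinates reported by the browser back to the
        character the user selected; a client-side Trojan only sees (x, y)."""
        cell_index = (y // self.cell) * self.cols + (x // self.cell)
        return self.layout.get(cell_index)

# A fresh layout is generated for every character, so coordinates observed
# for earlier clicks reveal nothing about later ones.
entered = []
for _ in range(4):                       # e.g. a 4-digit PIN
    field = CaptchaInputField()
    x, y = 45, 10                        # coordinates sent by the client
    entered.append(field.resolve_click(x, y))
print("".join(entered))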
4 Related Work
Client-side sensitive information theft (e.g., spyware, keyloggers, Trojan horses, etc.) is a
growing problem. In fact, the Anti-Phishing Working Group has reported over 170 different
types of keyloggers distributed on thousands of web sites [1]. Hence, the problem has been
increasingly gaining attention and a number of mitigation ideas have been presented to date.
Several client-side solutions have been proposed that aim to mitigate spoofed web-site-based
phishing attacks. PwdHash [22] is an Internet Explorer plug-in that transparently converts a
user's password into a domain-specific password.
A side-effect of the tool is some protection from phishing attacks. Because the generated
password is domain-specific, the password that is phished is not useful. SpoofGuard [5] is a
plug-in solution specifically developed to mitigate phishing attacks. The plug-in looks for
phishing symptoms such as similar-sounding domain names and masked links. Note that
both solutions focus on the mitigation of spoofed-web-site-based phishing attacks. That is,
they are vulnerable to client-side attacks as they rely on the integrity of the environment
they are running in. Similarly, solutions such as the recently introduced Internet Explorer
anti-phishing features [16] are ineffective when an attacker has control over the user's
environment. Spyblock [11] aims to protect user passwords against network sniffing and
dictionary attacks. It proposes to use a combination of password-authenticated key exchange
and SSL. Furthermore, as additional defense against pharming, cookie sniffing, and session
hijacking, it proposes a form of transaction confirmation over an authenticated channel. The
tool is distributed as a client-side system that consists of a browser extension and an
authentication agent that runs in a virtual machine environment that is protected from
spyware. A disadvantage of Spyblock is that the user needs to install and configure it, as
opposed to our purely server-side solution.
A number of hardware-based solutions have been proposed to enable secure input on
untrusted platforms. Chip cards and smart-card readers [10, 23], for example, are popular
choices. Unfortunately, it might be possible for the attacker to circumvent such solutions if
the implementations rely on untrusted components such as drivers and operating system calls
[12, 24, 25]. As an alternative to smart-card-based solutions, several researchers have
proposed using handhelds as a secure input medium [2, 15]. Note that although hardware-
based solutions are useful, unfortunately, they are often expensive and have the disadvantage
that they have to be installed and available to users.
A popular anti-keylogger technique that is already being deployed by certain security-aware
organizations is the graphical keyboard. Similar to our graphical input technique, the idea is
that the user types in sensitive data using a graphical keyboard. As a result, she is safe from
keyloggers that record the keys that are pressed. However, there have been increasing reports
of so-called screen scrapers that capture the user's screen and send the screenshot to a
remote phishing server for later analysis [6]. Also, with many graphical keyboard solutions,
sensitive information can be extracted from user elements that show the entered data to
provide feedback for the user. Finally, to the best of our knowledge, no graphical keyboard
solution uses CAPTCHAs. Thus, the entered information can be determined in a
straightforward fashion using simple OCR schemes.
The cryptographic community has also explored different protocols to identify humans over
insecure channels [8, 14, 27]. In one of the earliest papers [14], a scheme is presented in
which users have to respond to a challenge after having memorized a modest secret of
ten characters and five digits. The authors present a security analysis, but no usability
study is provided (actually, the authors defer the implementation of their techniques to future
work). The importance of usability studies is shown in a later paper by Hopper and Blum [8].
In their work, the authors develop a secure scheme for human identification, but after
performing user studies with 54 persons, conclude that their approach is impractical for use
by humans. In fact, a transaction takes on average 160 seconds, and can only be performed
by 10% of the population. Our scheme, on the other hand, takes less than half of this time,
and 95% of the transactions completed successfully.
Finally, client-side attacks could be mitigated if the user could easily verify the integrity of
the software running on her platform. Trusted Computing (TC) [20] initiatives aim to achieve
this objective by means of software and hardware. At this time, however, TC solutions
largely remain prototypes that are not widely deployed in practice.
5 Conclusion
Web applications have become the most dominant way to provide access to online services.
A growing class of problems is client-side attacks in which malicious software is
automatically installed on the user's machine. This software can then easily access, control,
and manipulate all sensitive information in the user's environment. Hence, an important web
security research problem is how to enable a user on an untrusted platform to securely
transmit information to a web application.
Previous solutions to this problem are mostly hardware based and require peripheral devices
such as smart-card readers and mobile phones. In this paper, we present two novel server-side
techniques that can be used to enable secure user input. The first technique uses confirmation
tokens that are bound to sensitive data to ensure data integrity. Confirmation tokens can
either be looked up directly in a code book or they need to be calculated using simple
algorithms. The second technique extends graphical input with CAPTCHAs to protect the
confidentiality and integrity of user input against automated attacks. The usability studies that
we conducted demonstrate that, after an initial learning step, our techniques are understood
and can also be applied by a non-technical audience.
Our dependency on the web will certainly increase in the future. At the same time, client-side
attacks against web applications will most likely continue to be a problem, as such attacks are
easy to perform and profitable. We hope that the techniques we present in this paper will be
useful in mitigating such attacks.
References
[1] Anti-phishing Working Group. http://www.antiphishing.org.
[2] D. Balfanz and E. Felten. Hand-Held Computers Can Be Better Smart Cards. In Proceedings of the 8th
Usenix Security Symposium, 1999.
[3] Carnegie Mellon University. The CAPTCHA Project. http://www.captcha.net.
[4] W. Cheswick. Johnny Can Obfuscate: Beyond Mother's Maiden Name. In Proceedings of the 1st USENIX
Workshop on Hot Topics in Security (HotSec), 2006.
[5] N. Chou, R. Ledesma, Y. Teraguchi, and J. C. Mitchell. Client-side defense against web-based identity
theft. In Proceedings of the Network and Distributed Systems Security (NDSS), 2004.
[6] FinExtra.com. Phishers move to counteract bank security programmes. http://www.finextra.com/fullstory.asp?id=14149.
[7] S. Hocevar. PWNtcha - Captcha Decoder. http://sam.zoy.org/pwntcha.
[8] N. Hopper and M. Blum. Secure Human Identification Protocols. In AsiaCrypt, 2001.
[9] IETF Working Group. Transport Layer Security (TLS). http://www.ietf.org/html.charters/tls-charter.html,
2006.
[10] International Organization for Standardization (ISO). ISO 7816 Smart Card Standard. http://www.iso.org/.
[11] C. Jackson, D. Boneh, and J. C. Mitchell. Stronger Password Authentication Using Virtual Machines.
http://crypto.stanford.edu/SpyBlock/spyblock.pdf.
[12] A. Josang, D. Povey, and A. Ho. What You See is Not Always What You Sign. In Annual Technical
Conference of the Australian UNIX and Open Systems User Group, 2002.
[13] I. Krawarik and M. Kwauka. Attacken aufs Konto (in German).
http://www.ispa.at/www/getFile.php?id=846, Jan 2007.
[14] T. Matsumoto and H. Imai. Human Identification Through Insecure Channel. In EuroCrypt, 1991.
[15] J. M. McCune, A. Perrig, and M. K. Reiter. Bump in the Ether: A Framework for Securing Sensitive User
Input. In Proceedings of the USENIX Annual Technical Conference, June 2006.
[16] Microsoft Corporation. Internet Explorer 7 features.
http://www.microsoft.com/windows/ie/ie7/about/features/default.mspx.
[17] G. Mori and J. Malik. Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA. In
Proceedings of the IEEE Computer Vision and Pattern Recognition Conference (CVPR). IEEE Computer
Society Press, 2003.
[18] S. Mori, C. Y. Suen, and K. Yamamoto. Historical review of OCR research and development. Document
image analysis, pages 244-273, 1995.
[19] A. Moshchuk, T. Bragin, S. D. Gribble, and H. M. Levy. A Crawler-based Study of Spyware on the Web.
In Proceedings of the 13th Annual Network and Distributed System Security Symposium (NDSS), February
2006.
[20] S. Pearson. Trusted Computing Platforms. Prentice Hall, 2002.
[21] Pressetext Austria. Phishing-Schaden bleiben am Kunden hangen (in German).
http://www.pressetext.at/pte.mc?pte=061116033, Nov 2006.
[22] B. Ross, C. Jackson, N. Miyake, D. Boneh, and J. C. Mitchell. Stronger Password Authentication Using
Browser Extensions. In Proceedings of the 14th Usenix Security Symposium, 2005.
[23] Secure Information Technology Center Austria (A-SIT). The Austrian Citizen Card.
http://www.buergerkarte.at/index en.html, 2005.
[24] A. Spalka, A. Cremers, and H. Langweg. Protecting the Creation of Digital Signatures with Trusted
Computing Platform Technology Against Attacks by Trojan Horse. In IFIP Security Conference, 2001.

Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
A Novel Approach for Routing Misbehavior
Detection in MANETs

Shyam Sunder Reddy K.
Dept. of Computer Science, JNTU University, Anantapur
shyamd4@gmail.com

C. Shoba Bindu
Dept. of Computer Science, JNTU University, Anantapur
shoba_bindu@yahoo.co.in

Abstract

A mobile ad hoc network (MANET) is a temporary infrastructureless network,
formed by a set of mobile hosts that dynamically establish their own network
without relying on any central administration. By definition, the nature of
ad hoc networks is dynamically changing. However, due to the open structure
and scarcely available battery-based energy, node misbehaviors may exist.
The network is vulnerable to routing misbehavior, due to faulty or malicious
nodes. Misbehavior detection systems aim at removing this vulnerability. In
this approach we built a system to detect misbehaving nodes in a mobile ad
hoc network. Each node in the network monitored its neighboring nodes and
collected one DSR protocol trace per monitored neighbor. The network simulator
GloMoSim is used to implement the system. The parameters collected for
each node represent the normal behavior of the network. In the next
step we incorporate misbehavior into the system and capture the behavior of the
network, which serves as input to our detection system. The detection system is
implemented based on the 2ACK concept. Simulation results show that the system
has good detection capabilities in finding malicious nodes in the network.
Keywords: Mobile Ad Hoc Networks, routing misbehavior, network security.
1 Introduction
A Mobile Ad Hoc Network (MANET) is a collection of mobile nodes (hosts) which
communicate with each other via wireless links either directly or relying on other nodes as
routers. In some MANET applications, such as the battlefield or rescue operations, all
nodes have a common goal and their applications belong to a single authority, thus they are
cooperative by nature. However, in many civilian applications, such as networks of cars and
provision of communication facilities in remote areas, nodes typically do not belong to a
single authority and they do not pursue a common goal. In such self-organized networks,
forwarding packets for other nodes is not in the direct interest of anyone, so there is no good
reason to trust nodes and assume that they always cooperate. Indeed, each node tries to save
its resources, particularly its battery power, which is a precious resource. Recent studies show
that most of a node's energy in MANETs is likely to be devoted to forwarding packets for other
nodes. For instance, Buttyan and Hubaux's simulation studies show that when the average
number of hops from a source to a destination is around 5, almost 80% of the
transmission energy will be devoted to packet forwarding. Therefore, to save energy, nodes
may misbehave and tend to be selfish. A selfish node regarding the packet forwarding
process is a node which takes advantage of the forwarding service and asks others to forward
its own packets but does not actually participate in providing this service. Several techniques
have been proposed to detect and alleviate the effects of such selfish nodes in MANETs. In
particular, two techniques were introduced, namely, watchdog [4] and pathrater [3], to detect and mitigate
the effects of the routing misbehavior, respectively. The watchdog technique identifies the
misbehaving nodes by overhearing on the wireless medium. The pathrater technique allows
nodes to avoid the use of the misbehaving nodes in any future route selections. The watchdog
technique is based on passive overhearing. Unfortunately, it can only determine whether or
not the next-hop node sends out the data packet. The reception status of the next-hop links
receiver is usually unknown to the observer. In order to mitigate the adverse effects of routing
misbehavior, the misbehaving nodes need to be detected so that these nodes can be avoided
by all well-behaved nodes. In this paper, we focus on the following problem:
Misbehavior Detection and Mitigation
In MANETs, routing misbehavior can severely degrade the performance at the routing layer.
Specifically, nodes may participate in the route discovery and maintenance processes but
refuse to forward data packets. How do we detect such misbehavior? How can we make such
detection processes more efficient (i.e., with less control overhead) and accurate (i.e., with
low false alarm rate and missed detection rate)?
We propose the 2ACK scheme to mitigate the adverse effects of misbehaving nodes. The
basic idea of the 2ACK scheme is that, when a data packet has been transmitted successfully
over the next hop, the destination node of the next-hop link will send back a special two-hop
acknowledgment called 2ACK to indicate that the data packet has been received successfully.
Such a 2ACK transmission takes place for only a fraction of data packets, but not all. Such a
selective acknowledgment is intended to reduce the additional routing overhead caused by the
2ACK scheme. Judgment on node
behavior is made after observing its behavior for a certain period of time.
In this paper, we present the details of the 2ACK scheme and our evaluation of the 2ACK
scheme as an add-on to the Dynamic Source Routing (DSR) protocol.
2 Related Work
Malicious network nodes that participate in routing protocols but refuse to forward messages
may corrupt a MANET. These problems can be circumvented by implementing a reputation
system. The reputation system is used to instruct correct nodes of those that should be
avoided in messages routes. However, as is, the system rewards selfish nodes, who benefit
from not forwarding messages while being able to use the network. In modern society,
services are usually provided in exchange for an amount of money previously agreed between
both parties. The Terminodes project defined a virtual currency named beans used by nodes to
pay for the messages. Those beans would be distributed by the intermediary nodes that
forwarded the message. Implementations of digital cash systems supporting fraud detection
require several different participants and the exchange of a significant number of messages.
To reduce this overhead, Terminodes assumes that hosts are equipped with a tamper resistant
security module, responsible for all the operations over the beans counter, that would refuse
to forward messages whenever the number of beans available is not sufficient to pay for the
service. The modules use a Public Key Infrastructure (PKI) to ensure the authentication of the
tamper resistant modules. This infrastructure can be used with two billing models. In the
Packet Purse Model, the sender pays to every intermediary node for the message, while in the
Packet Trade Model it is the receiver that is charged. In both models, hosts are charged as a
function of the number of hops traveled by the message.
The CONFIDANT protocol implements a reputation system for the members of MANETs.
Nodes with a bad reputation may see their requests ignored by the remaining participants,
thereby excluding them from the network. When compared with the previous system,
CONFIDANT shows two interesting advantages. It does not require any special hardware
and avoids the self-inflicted punishment that could be the exploitation point for malicious
users. The system tolerates certain kinds of attacks by being suspicious of the incoming
selfishness alerts that other nodes broadcast and relying mostly on its own experience.
These systems show two approaches that conflict in several aspects. The number of requests
received by hosts depends on their geographical position. Hosts may become overloaded with
requests because they are positioned at a strategic point in the MANET. A well-behaved
node that temporarily supports a huge amount of requests should later be rewarded for this
service. CONFIDANT has no memory, in the sense that the services provided by some host
are quickly forgotten by the reputation system. On the other hand, beans can be kept
indefinitely by hosts. In MANETs, it is expected that hosts move frequently, therefore
changing the network topology. The number of hops that a message must travel is a function
of the instantaneous positions of the sender and the receiver and varies with time. Terminodes
charges the sender or the receiver of a message based on the number of hops traveled, which
may seem unfair since either of them will pay based on a factor that is outside their control.
3 Routing Misbehavior Model
We present the routing misbehavior model [1] considered in this paper in the context of the
DSR protocol. Due to DSR's popularity, we use it as the basic routing protocol to illustrate
our proposed add-on scheme. We focus on the following routing misbehavior: A selfish node
does not perform the packet forwarding function for data packets unrelated to itself.
However, it operates normally in the Route Discovery and the Route Maintenance phases of
the DSR protocol. Since such misbehaving nodes participate in the Route Discovery phase,
they may be included in the routes chosen to forward the data packets from the source. The
misbehaving nodes, however, refuse to forward the data packets from the source. This leads
to the source being confused.
In guaranteed services such as TCP, the source node may either choose an alternate route
from its route cache or initiate a new Route Discovery process. The alternate route may again
contain misbehaving nodes and, therefore, the data transmission may fail again. The new
Route Discovery phase will return a similar set of routes, including the misbehaving nodes.
Eventually, the source node may conclude that routes are unavailable to deliver the data
packets. As a result, the network fails to provide reliable communication for the source node
even though such routes are available. In best-effort services such as UDP, the source simply
sends out data packets to the next-hop node, which forwards them on. The existence of a
misbehaving node on the route will cut off the data traffic flow. The source has no knowledge
of this at all.
In this paper, we propose the 2ACK technique to detect such misbehaving nodes. Routes
containing such nodes will be eliminated from consideration. The source node will be able to
choose an appropriate route to send its data. In this work, we use both UDP and TCP to
demonstrate the adverse effect of routing misbehavior and the performance of our proposed
scheme.
The attackers (misbehaving nodes) are assumed to be capable of performing the following
tasks:
dropping any data packet,
masquerading as the node that is the receiver of its next-hop link,
sending out fabricated 2ACK packets,
sending out fabricated hn, the key generated by the 2ACK packet senders, and
claiming falsely that its neighbor or next-hop links are misbehaving.
4 The New Approach
4.1 Solution Overview
To mitigate the watchdog problem related to power control usage, we propose a new
approach. Like the watchdog, we suggest that each node in the route monitors the forwarding
of each packet it sends. To explain the concepts, we suppose, without loss of generality, that A
sends packets to B and monitors its forwarding to C. A source routing protocol is also
assumed to be used.
We define a new kind of feedback we call the two-hop ACK [4]: an ACK that travels two
hops. Node C acknowledges packets sent from A by sending A, via B, a special ACK.
Node B could, however, escape from the monitoring without being detected by sending A a
falsified two-hop ACK. Note that behaving this way is economical for B in terms of power, since
sending a short packet like an ACK consumes far less energy than sending a data packet. To
avoid this vulnerability, we use an asymmetric-cryptography-based strategy as follows:

Fig. 1: Solution framework
Node A generates a random number and encrypts it with C's public key (PK), then appends it
to the packet's header along with A's address. When C receives the packet, it recovers the number
by decrypting it with its secret key (SK), encrypts it using A's PK, and puts it in a two-hop
ACK which is sent back to A via B. When A receives the ACK, it decrypts the random
number and checks whether the number within the packet matches the one it has generated, in
order to validate B's forwarding of the corresponding packet. However, if B does not forward the
packet, A will not receive the two-hop ACK, and it will be able to detect this misbehavior
after a timeout. This strategy needs a security association between each pair of nodes to
ensure that nodes share their PKs with each other. This requires a key distribution mechanism,
which is out of the scope of this paper. Another problem arises when node C
misbehaves. If C neither forwards the packet nor sends the two-hop ACK back to A, B
could be suspected by A of not forwarding the packet even if it actually does. To overcome this
problem, we propose that the sending of the two-hop ACK is provided implicitly upon the
reception of the packet at the MAC layer, and we assume that the lower layers (the MAC and
physical layers) are robust and tamper resistant. This can be ensured by the hardware and the
operating system of each node; that is, the operations of the lower layers cannot be modified
by any node, and node C cannot avoid sending the two-hop ACK back to A upon
reception of the packet, so B's monitoring is performed accurately. However, the
upper layers, including the network layer, may be tampered with by a selfish or malicious node,
and falsified packets can be sent. Our solution is composed of two parts: the first is located at
the network layer and can be viewed as a sub-layer at the bottom of that layer, whereas the
second is located over the MAC layer and is a sub-layer at the top of the latter. Figure 1
illustrates this framework.
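A rough sketch of the nonce exchange is given below using RSA-OAEP from the Python cryptography package; the scheme does not prescribe a particular cipher, key size, or library, so these choices (and the omission of routing, timeouts, and key distribution) are our own assumptions made for illustration.

import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

OAEP = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Long-term key pairs; the scheme assumes the public keys were already
# distributed through some key-distribution mechanism (out of scope here).
sk_a = rsa.generate_private_key(public_exponent=65537, key_size=2048)
sk_c = rsa.generate_private_key(public_exponent=65537, key_size=2048)
pk_a, pk_c = sk_a.public_key(), sk_c.public_key()

# At node A: generate a random nonce, encrypt it for C, and append the
# resulting challenge (plus A's address) to the data packet header.
nonce = os.urandom(16)
challenge = pk_c.encrypt(nonce, OAEP)
pending = {challenge: nonce}                # A remembers what it expects

# At node C (after B forwarded the packet): recover the nonce, re-encrypt
# it for A, and send it back via B as the two-hop ACK.
recovered = sk_c.decrypt(challenge, OAEP)
two_hop_ack = pk_a.encrypt(recovered, OAEP)

# Back at node A: a matching nonce validates B's forwarding of this packet.
if sk_a.decrypt(two_hop_ack, OAEP) == pending[challenge]:
    print("two-hop ACK verified: B forwarded the packet")
else:
    print("invalid ACK: possible misbehavior")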

Fig. 1.1: The 2ACK scheme.
4.2 Details of the 2ACK Scheme
The 2ACK scheme is a network-layer technique to detect misbehaving links and to mitigate
their effects. It can be implemented as an add-on to existing routing protocols for MANETs,
such as DSR. The 2ACK scheme detects misbehavior through the use of a new type of
acknowledgment packet, termed 2ACK. A 2ACK packet is assigned a fixed route of two
hops (three nodes) in the opposite direction of the data traffic route.
Fig. 1.1 illustrates the operation of the 2ACK scheme. Suppose that N1, N2, and N3 are three
consecutive nodes (triplet) along a route. The route from a source node, S, to a destination
node, D, is generated in the Route Discovery phase of the DSR protocol. When N1 sends a
data packet to N2 and N2 forwards it to N3, it is unclear to N1 whether N3 receives the data
packet successfully or not. Such an ambiguity exists even when there are no misbehaving
nodes. The problem becomes much more severe in open MANETs with potential
misbehaving nodes.
The 2ACK scheme requires an explicit acknowledgment to be sent by N3 to notify N1 of its
successful reception of a data packet: When node N3 receives the data packet successfully, it
sends out a 2ACK packet over two hops to N1 (i.e., the opposite direction of the routing path
as shown), with the ID of the corresponding data packet. The triplet [N1 → N2 → N3] is
derived from the route of the original data traffic. Such a triplet is used by N1 to monitor the
link N2 → N3. For convenience of presentation, we term N1 in the triplet [N1 → N2 → N3]
the 2ACK packet receiver or the observing node and N3 the 2ACK packet sender.
Such a 2ACK transmission takes place for every set of triplets along the route. Therefore,
only the first router from the source will not serve as a 2ACK packet sender. The last router
just before the destination and the destination will not serve as 2ACK receivers.
To detect misbehavior, the observing node maintains a list of IDs of data packets that
have been sent out but have not been acknowledged. For example, after N1 sends a data
packet on a particular path, say, [N1 → N2 → N3] in Fig. 1.1, it adds the data ID to LIST
(refer to Fig. 2, which illustrates the data structure maintained by the observing node), i.e., to
its list corresponding to N2 → N3. A counter of forwarded data packets, Cpkts, is incremented
simultaneously. At N1, each ID stays on the list for the duration of the 2ACK reception
timeout. If a 2ACK packet corresponding to this ID arrives before the timer expires, the ID
will be removed from the list. Otherwise, the ID will be removed at the end of its timeout
interval and a counter called Cmis will be incremented.
When N3 receives a data packet, it determines whether it needs to send a 2ACK packet to
N1. In order to reduce the additional routing overhead caused by the 2ACK scheme, only a
fraction of the data packets will be acknowledged via 2ACK packets. Such a fraction is
termed the acknowledgment ratio, Rack. By varying Rack, we can dynamically tune the
overhead of 2ACK packet transmissions. Node N1 observes the behavior of link N2 → N3 for
a period of time termed Tobs. At the end of the observation period, N1 calculates the ratio of
missing 2ACK packets as Cmis/Cpkts and compares it with a threshold Rmis. If the ratio is
greater than Rmis, link N2 → N3 is declared misbehaving and N1 sends out an RERR (or the
misbehavior report) packet. The data structure of RERR is shown in Fig. 3. Since only a
fraction of the received data packets are acknowledged, Rmis should satisfy Rmis > 1 - Rack
in order to eliminate false alarms caused by such a partial acknowledgment technique.

Fig. 2: Data structure maintained by the observing node.

Fig. 3: Data structure of the RERR packet
Each node receiving or overhearing such an RERR marks the link N2 → N3 as misbehaving
and adds it to the blacklist of such misbehaving links that it maintains. When a node starts its
own data traffic later, it will avoid using such misbehaving links as a part of its route.
The 2ACK scheme can be summarized in the pseudocode provided in the appendix for the
2ACK packet sender side (N3) and the observing node side (N1).
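The pseudocode itself is in the appendix; as a rough, self-contained illustration of the observing-node side only, the following Python sketch keeps the per-link LIST, Cpkts, and Cmis counters and applies the Cmis/Cpkts versus Rmis test described above. The class layout and the default values of Rack, Rmis, and the timeout are our own assumptions, not values from the paper.

import time
from collections import defaultdict

class ObservingNode:
    """Per-link bookkeeping done by the observing node N1 (illustrative only).
    Rack, Rmis, and the timeout follow the names used in the text; the default
    values here are assumptions."""

    def __init__(self, r_ack=0.2, r_mis=0.85, timeout=2.0):
        assert r_mis > 1 - r_ack          # required to avoid false alarms
        self.r_ack, self.r_mis, self.timeout = r_ack, r_mis, timeout
        self.lists = defaultdict(dict)    # link -> {packet_id: send_time}
        self.c_pkts = defaultdict(int)    # Cpkts: forwarded data packets
        self.c_mis = defaultdict(int)     # Cmis: missing 2ACKs

    def data_sent(self, link, packet_id):
        """N1 forwarded a data packet over `link`; add its ID to LIST."""
        self.lists[link][packet_id] = time.time()
        self.c_pkts[link] += 1

    def two_ack_received(self, link, packet_id):
        """A 2ACK arrived in time; remove the ID from LIST."""
        self.lists[link].pop(packet_id, None)

    def end_of_observation(self, link):
        """Called at the end of Tobs: expire timed-out IDs, then declare the
        link misbehaving if Cmis/Cpkts exceeds Rmis (triggering an RERR)."""
        now = time.time()
        for pid, sent in list(self.lists[link].items()):
            if now - sent > self.timeout:
                del self.lists[link][pid]
                self.c_mis[link] += 1
        ratio = self.c_mis[link] / max(self.c_pkts[link], 1)
        return ratio > self.r_mis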
5 Simulation Results
GloMoSim (Tool for Simulating Misbehavior in Wireless Ad Hoc Networks)
Global Mobile Information System Simulator (GloMoSim) [7] provides a scalable simulation
environment for large wireless and wireline communication networks. Its scalable
architecture supports up to a thousand nodes linked by a heterogeneous communications
capability that includes multihop wireless communications using ad hoc networking.
Provisions exist for setting the general simulation parameters, scenario topology, mobility,
radio and propagation models, MAC protocol, and routing protocol. Using the application
configuration file, the following traffic generators are supported: TELNET and CBR. The
following parameters are used in the simulation: simulation time 150 seconds, area
1000 x 1000 m^2, number of nodes 30, number of connections 8, transmission power
15 dBm, and a variable number of malicious nodes (1 to 10). In defining the degree of
membership function for each input parameter of the fuzzy inference system, these
parameters were taken into account. The MAC layer protocol used in the simulations was
the IEEE 802.11 standard. Traffic is generated as constant bit rate, with packets of length
512 B sent every 0.21 s.
Misbehavior Implementation
Malicious nodes simulate the following types of active attacks:
1. Modification attack: These attacks are carried out by adding, altering, or deleting IP
addresses in the ROUTE REQUEST and ROUTE REPLY packets that pass through the
malicious nodes.
2. No-forwarding attack: This attack is carried out by dropping control packets or data
packets that pass through the malicious nodes.
6 Conclusion
MANETs are particularly sensitive to unexpected behaviors. The generalization of wireless
devices will soon turn MANETs into one of the most important connection methods to the
Internet. However, the lack of a common goal in MANETs without a centralized human
authority will make them difficult to maintain: each user will attempt to get the most out of
the network while expecting to pay as little as possible. In human communities, this kind of
behavior is called selfishness. While prohibiting selfishness proves to be impossible over a
decentralized network, applying punishments to those that present this behavior may be
beneficial. As we have seen, the watchdog technique, used by almost all the solutions
currently proposed to detect nodes that misbehave in packet forwarding in MANETs, fails
when power control is employed. In this paper, we have proposed a new approach that
overcomes this problem. We have proposed and evaluated a technique, termed 2ACK, to
detect and mitigate the effect of such routing misbehavior. The 2ACK technique is based on a
simple 2-hop acknowledgment packet that is sent back by the receiver of the next-hop link.
Compared with other approaches to combat the problem, such as the overhearing technique,
the 2ACK scheme overcomes several problems including ambiguous collisions, receiver
collisions, and limited transmission powers. The 2ACK scheme can be used as an add-on
technique to routing protocols such as DSR in MANETs.
Simulation results also show that there is always a possibility of false detection. Consequently,
one monitoring node cannot immediately accuse another of being selfish when detecting that a
packet has been dropped at the latter. Instead, a threshold should be used, as in the
watchdog, and the monitored node will be considered selfish as soon as the number of
packets dropped at it exceeds this threshold, whose value should be configured carefully to
account for dropping caused by collisions and node mobility.
These results show that we can gain the benefits of an increased number of routing nodes
while minimizing the effects of misbehaving nodes. In addition we show that this can be done
without a priori trust or excessive overhead.
References
[1] Kejun Liu, Jing Deng, Pramod K. Varshney, and Kashyap Balakrishnan. An Acknowledgment-Based
Approach for the Detection of Routing Misbehavior in MANETs. IEEE Transactions on Mobile
Computing, vol. 6, no. 5, May 2007.
[2] H. Miranda and L. Rodrigues. Preventing Selfishness in Open Mobile Ad Hoc Networks. October 2002.
[3] S. Marti, T. Giuli, K. Lai, and M. Baker. Mitigating Routing Misbehavior in Mobile Ad Hoc Networks.
August 2000.
[4] Djamel Djenouri and Nadjib Badache. New Approach for Selfish Nodes Detection in Mobile Ad hoc
Networks.
[5] J.-P. Hubaux, T. Gross, J.-Y. LeBoudec, and M. Vetterli. Toward Self-Organized Mobile Ad Hoc
Networks: The Terminodes Project. IEEE Comm. Magazine, Jan. 2001.
[6] K. Balakrishnan, J. Deng, and P.K. Varshney. TWOACK: Preventing Selfishness in Mobile Ad Hoc
Networks. Proc. IEEE Wireless Comm. and Networking Conf. (WCNC 05), Mar. 2005.
[7] GloMoSim. Available at: http://pcl.cs.ucla.edu/projects/glomosim

Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Multi Layer Security Approach for Defense Against
MITM (Man-in-the-Middle) Attack

K.V.S.N. Rama Rao
Satyam Computer Services Ltd
kvsn_ramarao@satyam.com

Shubham Roy Choudhury
Satyam Computer Services Ltd
shubham_roy@satyam.com

Manas Ranjan Patra
Berhampur University
mrpatra12@gmail.com

Moiaz Jiwani
Satyam Computer Services Ltd
moiaz.jiwani@gmail.com

Abstract

Security threats are the major deterrent for the widespread acceptability of
web applications. Web applications have become a universal channel used by
many people, which has introduced potential security risks and challenges.
Though hardening web server security is one way to secure data on
servers, it fails to handle attackers who target the client side by tapping the
connection between client and server and thereby gaining access to sensitive data,
an attack commonly known as a Man-in-the-Middle (MITM) attack. This paper provides
a multi-layer security approach to further protect HTTPS from MITM,
which is the only attack possible on an HTTPS connection. In this paper we
propose security at different OSI layers and also provide an approach to
designing various topologies on LAN and WAN to enhance the security
mechanism.
1 Introduction
Web applications are becoming the dominant way to provide access to online services such as
webmail, e-commerce, etc. Unfortunately, not all users use the Internet in a positive way.
Along with the usage of the Internet, security issues are also increasing every day. So there is an
urgent need for tighter security measures. Web server security is tightened nowadays,
and so attackers are targeting the client side. Attackers now try to hack sensitive data
by intruding into the connection between client and server. HTTP communications are fine for
the average web server, which just contains informational pages. But in the case of running
an e-commerce site that requires secure transactions, the connection between client and web
server should be secure. The most common means is to use HTTPS over Secure Sockets Layer
(SSL), which uses public key cryptography to protect confidential user information. But HTTPS
provides security only at the top layers of the OSI protocol stack (the application and presentation
layers) and ignores security at the lower layers. Hence, attackers can use the lower layers of the
OSI stack to gain access to the connection through a MITM (Man-in-the-Middle) attack. Users on
a LAN as well as on a WAN are vulnerable to MITM attacks. This paper provides an approach to
multi-layer security to protect HTTPS from MITM, which is the only attack possible on an
HTTPS connection. It also discusses security concerns at different OSI layers and provides an
approach to design various topologies on LAN and WAN to enhance the security mechanism.
The rest of the paper is organized as follows. In Section 2 we introduce several attacks on
HTTP and HTTPS. In Section 3 we describe the MITM attack on a LAN. In Section 4 we present
our multi-layer security approach to prevent the MITM attack on a LAN. We then describe the
MITM attack on a WAN and present our approach to prevent such an attack, and finally we
briefly conclude.
2 Attacks on HTTP/HTTPS
Attacks on HTTP: Attacks on the HTTP protocol can be broadly classified into three types.
1. The basic attack is sniffing the request and response parameters over the network.
With this attack, an attacker can gain access to confidential information such as credit card
numbers and passwords, since this information travels as plain text.
2. An attacker can manipulate request and response parameters.
3. An attacker can gain access to a user's account without knowing the username and password
through session hijacking and cookie cloning.
In order to circumvent these attacks HTTPS was introduced, which was considered to be
secure.
Attacks on HTTPS
There are two ways to attack any communication secured via HTTPS.
1. By sniffing the HTTPS packets over the network using software such as Wireshark. The
sniffed packets are then decrypted and the attacker can extract the hidden information
if the encryption is weak. But when the information is protected with strong encryption
(e.g., 128-bit session keys negotiated over RSA), it is difficult to decrypt.
2. The most prevalent attack on HTTPS is the MITM (Man-in-the-Middle) attack, which is
described in the next section.
3 Man in the Middle Attack (MITM) on SSL
To access any secure website (HTTPS) over the Internet, a secure connection is first
established, which is done by exchanging public keys. It is during this exchange of
public keys that the client is most exposed to a MITM attack. Protocols that rely
on the exchange of public keys to protect communications are often the target of these types
of attacks.
3.1 MITM on LAN
A host on a LAN is more prone to MITM because the victim is on the same physical network as
the attacker. The steps involved in the attack are given below.
Step 1: ARP CACHE POISONING/ARP SPOOFING. Consider three hosts in a switched
environment as shown in figure 1, where one of the hosts is an attacker.

Fig. 1: Three Hosts in a switched environment
In a switched network, when Host A sends data to Host B, the switch receives
the packet from Host A, reads the destination address from the header and forwards the
packet to Host B by establishing a temporary connection between Host A and Host B. Once
the transfer of data is complete, the connection is terminated. Because of this behavior of a
switch, sniffing the traffic flowing between Host A and Host B is normally not
possible, so the attacker uses the ARP poisoning technique to capture the traffic.
Step 2: GIVING THE CLIENT A FAKE CERTIFICATE. Since all the traffic flows through the attacker,
he has full access to the victim's requests. Whenever the victim requests a secure connection via
SSL (HTTPS) and waits for a digital certificate (public key), the attacker generates and sends a fake
certificate to the victim and makes the victim believe that a secure connection is established. From the
above steps it is clear that these attacks take advantage of protocols across the OSI layers:
HTTP works on layer 7 and TCP on layer 4, whereas ARP works on layer 2. Hence we use a multilayer
security approach to secure the vulnerable layers.
4 Multi Layer Security Approach to Prevent MITM Attack on LAN
Security measures generally concentrate on the application layer, with little
emphasis on the lower layers. So the first step is to prevent ARP cache poisoning/ARP spoofing,
which occurs at layer 2. To protect a host's ARP cache from being poisoned it is possible to
make it static. A static ARP cache will not process any ARP replies and
will not broadcast any ARP requests, unlike a dynamic ARP cache. Static ARP entries are
not practical for large networks, so for larger networks we propose the following steps to
secure against ARP spoofing. The first step is to change the network topology, i.e., when
designing the network, add more subnets if feasible. The more we subnet the LAN,
the fewer static ARP entries need to be maintained. In addition, at each entry/exit node of a subnet we
place an IDS (Intrusion Detection System). The IDS monitors its subnet (a small network)
for any change in the MAC-address-to-IP-address association and raises an alert, as shown in figure
3.
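As a rough illustration of the IDS behaviour just described (this sketch is not part of the original proposal; the class and method names are our own, and the mechanism that captures ARP replies is assumed to exist elsewhere), the following Java snippet remembers the MAC address first observed for each IP address and raises an alert whenever that binding changes:

import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the ARP-monitoring idea: remember which MAC address
// was first seen for each IP and alert when that binding changes.
public class ArpBindingMonitor {
    private final Map<String, String> ipToMac = new HashMap<>();

    // Called for every ARP reply observed on the subnet (the capture
    // mechanism itself is outside the scope of this sketch).
    public void observe(String ipAddress, String macAddress) {
        String known = ipToMac.get(ipAddress);
        if (known == null) {
            ipToMac.put(ipAddress, macAddress);      // first sighting of this IP
        } else if (!known.equals(macAddress)) {
            alert(ipAddress, known, macAddress);     // binding changed: possible spoofing
        }
    }

    private void alert(String ip, String oldMac, String newMac) {
        System.err.println("Possible ARP spoofing: " + ip
                + " moved from " + oldMac + " to " + newMac);
    }
}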
This is how we protect layer 2. The next layer involved is layer 3, the network layer.
To secure this layer we use the IPSec (IP Security) protocol. IPSec can supply access
control, authentication, data integrity and confidentiality for each IP packet exchanged between two
participating network nodes. After securing layer 3, the remaining layer involved is layer 7,
which can be secured using HTTPS. However, users should be careful about
accepting/installing certificates: they should verify that certificates are signed by a trusted
Certificate Authority and pay attention to browser warnings.

Fig. 3: IDS on each subnet
5 MITM on WAN and Defense
MITM on a WAN is generally used for traffic analysis. In traffic analysis an attacker
intercepts and examines packets over a public network. This lets the attacker learn
the victim's surfing profile and track the victim's behaviour over the Internet. The data
payload carries the actual message, whereas the header contains information
about the source, destination, size and other details of the packet. Even if the data payload is
encrypted, traffic analysis reveals a lot of information about the data, source and destination,
which sit in the header and are not encrypted. Traffic analysis can be performed using tools such as
i2, Visual Analytics, etc. In order to minimize the risk of traffic being intercepted and
analyzed, we propose the following solution.
5.1 Defence Against MITM in WAN
Random Routing: To protect against MITM on a WAN we propose the concept of Random
Routing. Here the gateway also acts as a directory server, maintaining a list of
different routes through which packets can be routed to the destination. Figure 5 illustrates
the MITM attack on a WAN, where the traffic between the gateway and the service/server is intercepted by the
MITM.

Fig. 5: MITM on WAN
We first find all available paths from our gateway to the destination server. These paths
are then ranked by network congestion, with the least congested first. The traffic is
divided into small chunks, and each chunk is sent over a different path according to
this ranking.
Algorithm
Step 1: Let the total number of nodes on the network be N.
The number of possible paths (P) that can be taken by traffic from source to destination is
P = N! (where N = total number of nodes).
Let the time taken through path 1 (pt1) be t1, the time taken through path 2 (pt2) be t2, ...,
and the time taken through path n (ptn) be tn.
Hence, Time_taken_through_each_path[] = {t1, t2, t3, t4, ..., tn} and Available_Paths[] =
{pt1, pt2, pt3, pt4, ...}.
Step 2: Using a sorting algorithm, sort the times in ascending order and arrange the
paths according to the time taken, so that the path taking the least time is at the top. The output
of Step 2 is the array of paths sorted by round-trip time.
Step 3: The outgoing traffic (T) from source or destination is divided into segments
T1, T2, T3, T4, ..., Ts such that Traffic[] = {T1, T2, T3, T4, ..., Ts}.
Step 4: Send the traffic segments through the optimized paths obtained in
Step 2. For example, segment T1 is sent through the path with the least round-trip
time, T2 through the path with the next least round-trip time, and so on.
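The following Java sketch illustrates Steps 2 to 4 under the assumption that the round-trip time of each candidate path has already been measured; the Path type and the sendOverPath routine are hypothetical placeholders for illustration, not part of the proposal.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the random-routing idea: sort candidate paths by measured
// round-trip time and spread the traffic segments across them.
public class RandomRouting {

    static class Path {
        final String name;          // hypothetical path identifier
        final double roundTripMs;   // measured round-trip time
        Path(String name, double roundTripMs) {
            this.name = name;
            this.roundTripMs = roundTripMs;
        }
    }

    // Step 2: sort paths by ascending round-trip time.
    static List<Path> sortByRoundTrip(List<Path> paths) {
        List<Path> sorted = new ArrayList<>(paths);
        sorted.sort(Comparator.comparingDouble(p -> p.roundTripMs));
        return sorted;
    }

    // Steps 3 and 4: divide the outgoing traffic into segments and send
    // segment i over the i-th fastest path (wrapping around if needed).
    static void dispatch(byte[] traffic, List<Path> sortedPaths, int segments) {
        int chunk = (traffic.length + segments - 1) / segments;
        for (int i = 0; i < segments; i++) {
            int from = i * chunk;
            int to = Math.min(from + chunk, traffic.length);
            if (from >= to) break;
            byte[] segment = java.util.Arrays.copyOfRange(traffic, from, to);
            Path p = sortedPaths.get(i % sortedPaths.size());
            sendOverPath(segment, p);   // the transmission itself is out of scope
        }
    }

    // Placeholder standing in for the actual per-path transmission.
    static void sendOverPath(byte[] segment, Path p) {
        System.out.println("segment of " + segment.length + " bytes via " + p.name);
    }
}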
Example Scenario: Assume the number of available nodes is 6, which gives 720
possible paths (6! = 720). Assume that we divide our outgoing traffic into four segments (sg1,
sg2, sg3, sg4) and send them through four different paths, choosing the four paths with the
least round-trip times.

Fig 6: Demonstrates the above scenario
Table 1. Tabular Representation of Figure 6
Traffic Path N1 N2 N3 N4 N5 N6
T1 Path1 * * *
T2 Path2 *
T3 Path3 * *
T4 Path4 * *
Therefore, in order to protect the connection from being intercepted or analysed, we divert
the traffic via random routes as directed by the directory server. The directory server holds a list of
available server nodes and the routes traffic should take, thus reducing the chances of an attacker
obtaining complete information about the traffic and hence the data.
6 Conclusion
Since web server security has been hardened, attackers are targeting the client side. Attackers now
try to steal sensitive data by intruding into the connection between client and server.
The most common way to secure the connection is to use HTTPS, but it provides security
only at the top layers of the OSI stack, ignoring lower layer security. Hence attackers can use the lower layers
of the OSI stack to gain access to the connection through a MITM (Man-in-the-Middle) attack. Users on
a LAN as well as on a WAN are vulnerable to MITM attacks. In this paper we proposed an
approach for LAN and WAN to protect the connection from MITM attacks, ensuring that
the data is also secured at the lower layers. In the case of a LAN, we believe that the use of network
topology together with an IDS that monitors changes to static ARP entries can reduce the chances of ARP
poisoning and hence prevent MITM attacks. In the case of a WAN, we can divide the
traffic and ensure that each segment takes a different optimized path, so that we
minimize the risk of the traffic being analyzed.
References
[1] The Evolution of Cross-Site Scripting Attacks by David Endler, http://www.cgisecurity.com/lib/XSS.pdf
[2] Analysis of SSL 3.0 protocol, http://www.schneier.com/paper-ssl.pdf
[3] SSL Man-in-the-Middle Attacks by Peter Burkholder,
http://www.sans.org/reading_room/whitepapers/threats/480.php
[4] Security3 by Nick Parlente, http://www.stanford.edu/class/cs193i/handouts2002/39Security3.pdf
[5] IETF, RFC2616: Hypertext Transfer Protocol -- HTTP/1.1, http://www.ietf.org/rfc/rfc2616.txt
[6] IETF, RFC2109: HTTP State Management Mechanism, http://www.ietf.org/rfc/rfc2109.txt
[7] The Open Web Application Security Project, Cross Site Scripting,
http://www.owasp.org/asac/input_validation/css.shtml
[8] The Open Web Application Security Project, Session Hijacking, http://www.owasp.org/asac/auth-
session/hijack.shtml

Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Video Streaming Over Bluetooth

M. Siddique Khan Rehan Ahmad
DCE, Zakir Husain College of DCE, Zakir Husain College of
Engineering & Technology Engineering & Technology
Aligarh Muslim University Aligarh Muslim University
Aligarh-202002, India Aligarh-202002, India
siddiquekhan@zhcet.ac.in rehanahmad@zhcet.ac.in
Tauseef Ahmad Mohammed A. Qadeer
DCE, Zakir Husain College of DCE, Zakir Husain College of
Engineering & Technology Engineering & Technology
Aligarh Muslim University Aligarh Muslim University
Aligarh-202002, India Aligarh-202002, India
tauseefahmad@zhcet.ac.in maqadeer@zhcet.ac.in

Abstract

The Bluetooth specification describes a robust and powerful technology for
short-range wireless communication. Unfortunately, the specification is immense
and complicated, presenting a formidable challenge for novice developers.
This paper is concerned with recording video from handhelds (mobile phones)
to desktop computers, and playing video on handhelds from servers using real-
time video streaming. Users can record large amounts of data and store it on
computers within range of the Bluetooth dongle. The videos on the
server can be played on handhelds through real-time streaming over the
Bluetooth network. We can create a Bluetooth PAN (piconet) in which mobile
computers dynamically connect to the master and communicate with other
slaves, and we can dynamically select any mobile computer and transfer data to
it. Handhelds have limited storage capacity compared with computers,
so computers are preferred for storing the recorded data.
1 Introduction
1.1 Problem Statement
A Bluetooth network has no fixed networking infrastructure [Bluetooth.com]. It consists of
multiple mobile nodes which maintain network connectivity through wireless
communication, and it is completely dynamic, so such networks are easily deployable.
A mobile phone has limited storage compared with a computer. Effort has therefore been put into
transferring recorded video to a computer and also playing prerecorded video from a PC on handhelds
through real-time streaming over Bluetooth.
1.2 Motivation
The widespread use of Bluetooth and mobile devices has generated the need to provide
services which are currently possible only in wired networks. The services that are
provided over wired networks need to be explored for Bluetooth, and we expect that in the near
future a Bluetooth PAN will provide all of these services. Mobile phones are very
common gadgets and most of them have a camera and audio recording facility, so they
can be used for many purposes. However, their storage is limited, so the data can be transferred to
a computer, allowing many hours of recording, and the recorded data can later be played back on
the mobile phone. This increases the usability of mobile phones.
1.3 Approach
Transferring video data between a mobile device and a personal computer is not trivial. On the
mobile side, Java ME (J2ME) [Prabhu and Reddi, 2004] is used to continuously take the
camera input for video and the microphone input for audio. The recorded audio and video are
converted into a byte array and the byte stream is written to the output stream over Bluetooth. On
the PC side, Java SE (J2SE) [Deitel and Deitel, 2007] is used for the server programming. The server
program opens an input stream, connects it to the client's output stream, and whatever data is
written arrives at the server input stream in byte form and is redirected to a file; the file can later be saved in
any desired video format. This saved data can subsequently be streamed back in real time for
playback on the mobile in the same manner.
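A minimal sketch of the PC-side byte redirection just described is given below; it assumes that a Bluetooth (or other) stream connection has already been opened elsewhere and handed to us as a plain java.io.InputStream, so the connection setup itself is omitted and the class name is our own.

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Copies whatever the handheld writes on its output stream into a local
// file, which can later be saved/renamed in the desired video format.
public class StreamReceiver {

    public static void receiveToFile(InputStream in, String fileName) throws IOException {
        try (FileOutputStream out = new FileOutputStream(fileName)) {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {   // -1 means the client closed the stream
                out.write(buffer, 0, read);
            }
        }
    }
}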
2 Mobile System Architecture
2.1 Overview
The convergence of computing, multimedia and mobile communications is well underway.
Mobile users are now able to benefit from a broad spectrum of multimedia features and
services including capturing, sending and receiving images, videos and music. To deliver
such data-heavy, processing-intensive services, portable handheld systems must be optimized
for high performance at low power, space and cost. Several processors on the
market are used in mobile phones today; among them, the STn8815
processor platform from STMicroelectronics combines advances in video
coding efficiency, inventive algorithms and chip implementation schemes and is used
in many NOKIA mobile phones and PDAs. It enables smart phones, wireless PDAs,
Internet appliances and car entertainment systems to play back media content, record pictures
and video clips, and perform bidirectional audio-visual communication with other systems in
real time. The general architecture of a mobile device using such a processor is shown in
figure 1.

Fig. 1: Typical system architecture using the STn8815
3 Video Streaming Over Bluetooth
Traditional video streaming over wired/wireless networks typically has bandwidth, delay
and loss requirements due to its real-time nature. Moreover, there are many potential factors,
including time-varying link quality, out-of-range devices, and interference from other devices or
external sources, that make Bluetooth links more challenging for video streaming. Recent research
has addressed these challenges. To present the various issues and give a clear picture of the field of
video streaming over Bluetooth, we discuss three major areas, namely video compression, Quality of Service
(QoS) control and intermediate protocols [Xiaohang]. Each of these areas is one of the basic
components in building a complete architecture for streaming video over Bluetooth. The
relations among them are illustrated in Figure 3, which shows the functional components
for video streaming over Bluetooth links [Xiaohang] together with the layer or layers at which
each component works. The aim of video compression is to remove redundant
information from a digitized video sequence. Raw data must be compressed before
transmission to achieve efficiency; this is critical for wireless video streaming since the
bandwidth of wireless links is limited. Upon the client's request, the media server retrieves the
compressed video, and the QoS control module adapts the media bit-stream or adjusts the
transmission parameters of the intermediate layer based on the current link status and the QoS
[Xiaohang] requirements. After the adaptation, the compressed video stream is partitioned into
packets of the chosen intermediate layer (e.g., L2CAP, HCI, IP), where the packets are
packetized and segmented, and the segmented packets are then passed to the Bluetooth module for
transmission. On the receiving side, the Bluetooth module receives the media packets from the air,
reassembles them in the intermediate protocols, and sends them to the decoder for
decompression.

As shown in figure 3, QoS control can be further categorized into congestion control and
error control [Feamster and Balakrishnan]. Congestion control in Bluetooth is employed to
prevent packet loss and reduce delay by regulating the transmission rate or reserving bandwidth
according to changing link status and QoS requirements. Error control, on the other hand,
improves video quality in the presence of packet loss.
4 USB Programming
For USB port programming we use an open source API called the jUSB API, since no
API for USB programming is available in any Java SDK, not even in j2sdk1.5.0.02. The
design approach for implementing the usb.windows package of the Java USB API is separated
into two parts. One part deals with the enumeration and monitoring of the USB bus, while the
other part handles communication with USB devices in general. Both parts
are implemented using the Java Native Interface (JNI) to access native operations on the Windows
operating system. The jUSB dynamic link library (DLL) provides the native functions that
realize the JNI interface of the Java usb.windows package.

Fig. 3: Architecture for Streaming Over Bluetooth
Communication with a USB device is managed by the jUSB driver. The structure and
important aspects of the jUSB driver are introduced in section 5, which is only a
summary and covers a fraction of the driver implementation. A lot of useful
information about driver writing and the internal structures can be found in Walter
Oney's book Programming the Microsoft Windows Driver Model [Oney]. This is
illustrated in Figure 5: the original USB driver stack is as shown in Figure 4, but
the other drivers cannot be accessed by the programmer. Once the Java USB API is
installed you are ready to program your own USB ports to detect USB devices as well as read from
and write to these devices. The Java USB API is an open source project
carried out at the Institute for Information Systems, ETH Zürich, by Michael Stahl. For details of
how to write the code, refer to the Java USB API for Windows by Michael
Stahl [Stahl]. The basic classes used in this API are listed below.
DeviceImpl class: basic methods used are
Open Handle
Close Handle
Get Friendly Device Name
Get Attached Device Name
Get Num Ports
Get Device Description
Get Unique Device ID
jUSB class: basic methods used are
JUSBReadControl
getConfigurationBuffer
doInterruptTransfer

Fig. 4: USB driver stack for Windows
Fig. 5: Java USB API layer for Windows
5 Design

Fig. 6: Architectural design and data flow diagram

Fig. 7: Interface Design
6 Conclusion
In this paper, we have described a system for compressing and streaming live video over
networks, with the objective of designing an effective solution for mobile access. We developed a
J2ME application on the mobile side and a J2SE application on the PC side. Three major aspects
have to be taken into consideration, namely video compression, Quality of Service (QoS) control and
intermediate protocols. Video compression removes redundancy to achieve efficiency over
a limited-bandwidth network. QoS includes congestion control and error control, which
limit packet loss, reduce delay and improve video quality. On the server side, the USB
port has to be programmed for enumerating, monitoring and communicating with USB devices.
7 Future Enhancement
The developed application uses Bluetooth as the transmission medium. In future, EDGE/GPRS [Fabri et
al., 2000] or Wi-Fi could also be used: EDGE/GPRS provides less
bandwidth, while Wi-Fi provides much more bandwidth than Bluetooth.
References
[1] [Bluetooth.com] Specification of Bluetooth System Core vol.1,ver1.1 www.bluetooth.com
[2] [Chia and Salim, 2002] Chong Hooi Chia and M. Salim Beg, MPEG-4 video transmission over Bluetooth
links, Proc. IEEE International Conf. on Personal Wireless Communication, New Delhi 15-18 Dec 2002
[3] [Deitel and Deitel, 2007] Deitel & Deitel, JAVA How to Program, sixth edition, Prentice Hall (2007)
[4] [Fabri et al., 2000] Simon N. Fabri, Stewart Worrall, Abdul Sadka, Ahmet Kondoz, Real-Time Video
Communications over GPRS, 3G Mobile Communication Technologies, Conference Publication No. 471,
IEE 2000
[5] [Feamster and Balakrishnan] Nick Feamster and Hari Balakrishnan, Packet Loss Recovery for Streaming
Video, http://nms.lcs.mit.edu/projects/videocm/
[6] [Johansson et al., 2001] P. Johansson, M. Kazantzidls, R. Kapoor and M. Gerla, Bluetooth: An Enabler for
Personal Area Networking, Network, IEEE, Vol. 15, Issue 5, Sept.-Oct. 2001, p.p. 28-37
[7] [Lansford and Stephens, 2001] J. Lansford, A. Stephens, R. Nevo, Wi-Fi (802.11b) and Bluetooth:
enabling coexistence, Network, IEEE, vol. 15, issue 5, Sept.-Oct. 2001, p.p. 20-27
[8] [Oney] Walter Oney, Programming the Microsoft Windows Driver Model
[9] [Prabhu and Reddi, 2004] C.S.R. Prabhu, A. Prathap Reddi, BLUETOOTH TECHNOLOGY and its
Application with Java and J2ME, Prentice Hall India (2004)
[10] [Stahl] Michael Stahl, Java USB API for Windows
[11] [Xiaohang] Wang Xiaohang, Video Streaming over Bluetooth: A Survey

Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Role of SNA in Exploring and Classifying Communities
within B-Schools through Case Study

Dhanya Pramod Krishnan R. Manisha Somavanshi
IIM, Pune IIM, Pune IIM, Pune
dhanyas@mailcity.com gckrish@hotmail.com manisha.somavanshi@gmail.com

Abstract

The facets of organizational behavior have changed since the advent of the
Internet. The World Wide Web has become not only a platform for communication
but also a facilitator of knowledge sharing. This paper focuses on how people
within an academic organization behave as communicators. Social network
analysis (SNA) has emerged as a powerful method for understanding the
importance of relationships in networks. This paper presents a study that
examines the mode and frequency of communication and the use of web
technology for communication within and between departments of academic
institutes. Few studies have examined the current organizational
models of B-Schools and the use of social networks to improve the relationships and
communication among members and between different communities. Here
we identify communities and key roles in a B-School using SNA. Formal
and informal communication are both found to have a great influence on the
social network.
1 Introduction
We have created a model to describe the relationships among the different departments in
a B-School, their communities, and how these communities affect productivity and
the working environment in the organization. The model considers different types of
relationships among members of different communities: competition, communication,
exchange of information, and education.
Social network analysis is about understanding the flows of communication between people,
groups, organizations and other information- or knowledge-processing entities. The social network
is one of the most important real-life networks. A typical feature
of a social network is its dense structure, which is essential for understanding the network's
internal structure and function. Traditional social network analysis usually focuses on the
principle of centralization and the power of a single individual or entity; however, in people's
daily life a group or an organization often holds a more influential position and plays a more
important role. Therefore, in this paper we first present a scenario in which an academic institute's
social network is useful for investigating and identifying the structure of the institute. A typical
academic institute consists of departments such as administration, accounts, library, examination,
placements and canteen. Cross-functional flow of information between the departments is
essential for the smooth functioning of the institute and for providing updated information
on various day-to-day matters. People communicate with
each other in various ways, e.g. emails, intranet, meetings or MIS reports. The number, size and connections
among the sub-groupings in a network can tell us a lot about the likely behavior of the
network as a whole, for example how fast things move across the actors in the network.
The rest of the paper is organized as follows. Section 2 studies the flow of communication in B-Schools,
where different categories of organization are discussed. Section 3 applies SNA to a case study
and reports the findings. We then propose a social network model, touch upon related work, and end the
paper with conclusions and future work.
2 Study of Communication Flow in B-Schools
We have studied the organizational structure and processes of some top B-Schools in Pune
and analyzed their communication flows. Based on the communication pattern, the
organizations are categorized from A to F. The following parts of this section describe the various types of
communication flow.

Fig. 1: Category A
In a category A organization the power of decision making is centralized at the top of the
organization's structure and subsequently delegated to various departmental heads, who in turn
communicate with the respective departments under their purview. [Fig 1] shows the
organization chart for this category. The various departments are divided under 7 main
responsibility centres, viz. Administration, Academics, Library, Training & Placements,
Hostel, Gymkhana and Alumni. The means of communication are meetings, emails,
interoffice memos, intraoffice memos and phone calls. Regular feedback is communicated to
the concerned departmental heads in the form of reports, which are then communicated to the
top level management in meetings. Since the activities of the various departments are
interdependent, a horizontal flow of information exists between departmental heads.
In category B the policies and rules governing the functioning of the organization are
decided by the management, and the director is in charge of planning, execution and control of
the process. Academic and administrative responsibilities are delegated to the HOD and
registrar respectively, while library and placement activities are carried out separately. In this
category of organization all administrative responsibilities lie with the registrar, and for
academic matters the HOD has to coordinate with him for the daily functioning of the departments. A
category C institute does not have a separate academic head; the director is involved in academic
activities. A category F organization has more vertical levels, as every type of
responsibility has a coordinating officer for monitoring. All academic activities such as
seminars, workshops and faculty development programs have different coordinators, who
in turn delegate work to faculty members. Category E has more horizontal levels, and
vertical communication is coordinated by separate cells.

Fig. 2: Category F
A category F organization follows the principle of decentralization: every
department has its own head and its own working procedures. This suits organizations that have diverse
activities and require a skilled workforce for the respective tasks. After analyzing the communication
patterns of the A to F organization structures above, we found that the category F [Fig 2]
organization has the most decentralized structure, and hence a social network
is best suited to this kind of organization. We have therefore considered a category F type
of organization for our case study.
3 Social Network Model
For our case study we considered a B-School which falls under category F and has IT
and Management departments. In this organization the faculty community has various sub-communities
for handling activities, to list a few: reception committees, hall committees, food
committees, technical committees, transport committees, etc. for different events. To handle day-to-day
activities, coordinators and staff liaise with each other and thus form a community.
The learning facilitator community, class teachers or mentors, etc. are other communities found.
These kinds of organizations end up with a large number of communities, and a person
is typically part of many of them. We therefore found that there is tremendous scope for a social network
in this kind of organization, so that members can communicate on a standard, common
platform and preserve the information exchanged for further reference. It also enables
easy handing over of responsibility when a person leaves the organization.
3.1 Data Analysis
The IT department of the organization studied has 14 faculty members and the management
department has 20 faculty members. The management department is further divided into 5
specializations having 5, 4, 6, 3 and 2 faculty members respectively. According to the
analysis done using SNA, the organization has very strong communication among faculty members
within each department. We have calculated the inside communication degree and inside
communication strength, with frequency of communication as an additional parameter.
Inside communication degree (ICD) = No. of edges / (No. of vertices * (No. of vertices - 1))
Inside communication strength (ICS) = Total weight of edges / (No. of vertices * (No. of vertices - 1))
where the weight of an edge = frequency of communication (average number of communications per
month).
Outside communication degree (OCD) = No. of edges between A and B / (No. of vertices in A * No.
of vertices in B)
Outside communication strength (OCS) = Total weight of edges between A and B / (No. of vertices in
A * No. of vertices in B)
where the weight of an edge = frequency of communication (average number of communications per
month).
For the IT department [Table 1]
ICD = 173 / (14 * (14 - 1)) = 173 / 182 = 0.95
ICS = 443 / (14 * (14 - 1)) = 443 / 182 = 2.43
For the Management department [Table 2]
ICD = 380 / (20 * (20 - 1)) = 380 / 380 = 1
ICS = 801 / (20 * (20 - 1)) = 801 / 380 = 1.80
Interdepartmental Communication between IT and Management [Table 3]
OCD = 12 / 280 = 0.04
OCS = 180 / 280 = 0.64
OCS of IT & Admin = 123 / 42 = 2.92
OCS of IT & Placement = 60 / 28 = 2.14
OCS of IT & Library = 17 / 28 = 0.60
OCS of IT & Lab = 65 / 42 = 1.54
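As an illustrative sketch (not the authors' tooling), the following Java snippet computes the metrics defined above from edge counts and total edge weights; the class and method names are our own, and the example values in main are the IT department figures reported above.

// Sketch of the inside/outside communication metrics defined above.
public class SnaMetrics {

    // ICD = edges / (V * (V - 1))
    static double insideCommunicationDegree(int edges, int vertices) {
        return (double) edges / (vertices * (vertices - 1));
    }

    // ICS = total edge weight / (V * (V - 1)),
    // where an edge weight is the average number of communications per month.
    static double insideCommunicationStrength(int totalEdgeWeight, int vertices) {
        return (double) totalEdgeWeight / (vertices * (vertices - 1));
    }

    // OCD = edges between A and B / (V_A * V_B)
    static double outsideCommunicationDegree(int edgesBetween, int verticesA, int verticesB) {
        return (double) edgesBetween / (verticesA * verticesB);
    }

    // OCS = total edge weight between A and B / (V_A * V_B)
    static double outsideCommunicationStrength(int weightBetween, int verticesA, int verticesB) {
        return (double) weightBetween / (verticesA * verticesB);
    }

    public static void main(String[] args) {
        // IT department figures reported above: 173 edges, total weight 443, 14 members.
        System.out.println(insideCommunicationDegree(173, 14));    // ~0.95
        System.out.println(insideCommunicationStrength(443, 14));  // ~2.43
    }
}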
Table 1: IT Department Communication Chart


Fig. 3: IT Department Social Network



Fig. 4: IT Dept - Central Nodes Community

Fig. 5: IT-Mgt-Community

Fig. 6: Mgt Dept- Community


Fig. 7: IT-Mgt- Community Density
Table 2: Management Department
Sub Community Node 1 Node 2 Edge Density
ACADEMICS MD 1 MD 2 E 22 100
MD 3 E 23 60
MD 6 E 24 80
MD 5 E 25 15
MD 4 E 26 10
MKTG MD 2 MD 9 E 27 10
MD 10 E 28 10
MD 11 E 29 10
MD 12 E 30 10
FIN MD 3 MD 18 E 31 10
MD19 E 32 5
MD 20 E 33 0
ECO MD 5 MD 17 E 34 9
IT MD 7 MD 8 E 35 30
HR MD 4 MD 13 E 36 6
MD 14 E 37 10
MD 15 E 38 6
MD 16 E 39 20
MD 21 E 40 0
Table 3: Interdepartmental Communication Strength
Library Admin Placement Lab Mgt Dept
IT 17 123 60 65 180
According to the above figures, it is clear that the strength of communication within a
department is greater than that of inter-departmental communication. The various
interdepartmental communication strengths are shown in [Table 3]; the data considered include
all communication media, such as email, phone calls, memos, etc. The various factors that affect
interdepartmental communication are shown in [Table 4].
Table 4: Community wise communication
Community                    Contribution %
Common event organizers      7.69
Friendship                   46.15
Common Interest              46.15
3.2 Web based communication Analysis
We have identified the percentage utilization of the web as a communication medium [Table 5], as well as
the formal [Table 6] and informal [Table 7] communication that happens on the web.
Table 8 shows the web based communication of the IT department with the other departments.
Table 5: Web Based Communication Analysis
Total Communities on
Web
% of web based
communication
IT department 30
Management department 15
Interdepartmental 30

Table 6: Formal Web Based Communication
Departments                  % of formal web based communication
IT department                20
Management department        10
Interdepartmental            15

Table 7: Informal Web Based Communication Analysis
Departments                  % of informal web based communication
IT department                10
Management department        5
Interdepartmental            35

Table 8: Interdepartmental Web Based Analysis

Library Admin Placement
IT 80% 10% 90%
3.3 Major Findings from the Above Statistical Analysis
SNA proved to be very powerful in identifying the centrality [Fig 7] of the social
network existing in the B-School; the roles identified from the network are Node 1 as the IT
HOD, Node 14 as the Director and Node 15 as the Deputy Director.
The bridge between the IT and Management departments is the edge between Node 1
and Node 15 [Fig 7], i.e. the IT HOD and the Management Deputy Director.
Nodes 1, 14 and 15 carry dense communication in the whole network.
The informal communication between the peer departments (IT & Management) is strong
due to friendship and common interests.
Web based communication is less used within a department.
The IT and Management departments utilize the web for 50% of their total communication.
Web based communication from the IT department to the non-academic departments is strong.
Informal web based communication is slightly larger due to the common-interest
communities that exist across departments.
4 Proposed Model
We have proposed a social network model in which communication can happen in a more
structured and powerful manner. The model would also provide archives of exchanged
information and thus enhance traceability.
We have identified three categories of communities:
Role based community: In every academic organization there exist communities of
people who play the same role. In the case study above, directors, learning
facilitators, class teachers, coordinators, etc. fall into this category. The same community
may exist throughout the lifetime of the organization even though its members
change as people leave or join. The proposed model therefore allows role based
communities to be created, and a community can evolve at any time as the organization
undergoes structural changes. This is a formal community.
Friends: Friendship, as in any other organization, can be a reason for
communication. This community is useful for understanding the kinds of information
people share. It is an informal community, and a new community can evolve at
any time.
Special/common interest groups: The organization may have groups of faculty
members who teach similar subjects and thus share knowledge. This could be a
formal community or an informal one.
5 Related Work
Zuoliang Chen and Shigeyoshi Watanabe, in their paper A Case Study of Applying SNA to
Analyze CSCL Social Network [8], discuss and show that group structure, the members'
physical location distribution and the members' social position have a great impact on a web based
social network. Enhanced Professional Networking and its Impact on Personal Development
and Business Success by J. Chen [3] describes how professional networking events cultivate
new cross-divisional business collaborations and help to improve individual skills. Chung-Yi
Weng and Wei-Ta Chu conducted an analysis of the social networks existing in movies in the paper
titled Movie Analysis Based on Roles' Social Network [2]. They showed that, based on the
roles' social network communities, the storyline can be detected. The framework they
proposed for determining the leading roles and identifying communities does not address role
recognition for all characters but is efficient in community identification.
6 Conclusion and Future Work
There is tremendous scope for SNA in identifying the patterns of communication within an
academic institute. The current mode of communication is not easily archivable as it is not
used much in organization-wide processes. We have come to the conclusion that informal
communication, which is currently limited in the organization, should be encouraged so that more knowledge
sharing and unanimity can be achieved. Our proposed social network model provides a
common, standardized communication framework for the organization. Our work will be
further extended to analyze how individual productivity is affected and how it influences the
accomplishment of individual and organizational goals. Implementation of the
framework will be done subsequently.
References
[1] [Breslin, 2007]The Future of Social Networks on the Internet: The Need for Semantics Breslin, J.; Decker,
S.; Internet Computing, IEEE Volume 11, Issue 6, Nov.-Dec. 2007 Page(s):86 90
[2] [Chung, 2007] Movie Analysis Based on Roles' Social Network Chung-Yi Weng; Wei-Ta Chu; Ja-Ling
Wu; Multimedia and Expo, 2007 IEEE International Conference on 2-5 July 2007 Page(s):1403 1406
[3] [Chen, 2006]Enhanced Professional Networking and its Impact on Personal Development and Business
Success J. Chen1; C.-H. Chen-Ritzo1
[4] http://domino.research.ibm.com/cambridge/research.nsf/
[5] [Hussain, 2007] Terrorist Networks Analysis through Argument Driven Hypotheses Model Hussain, D. M.
Akbar; Availability, Reliability and Security, 2007. ARES 2007. The Second International Conference on
10-13 April 2007 Page(s):480 492
[6] [Jamali, 2006] Different Aspects of Social Network Analysis Jamali, M.; Abolhassani, H.; Web
Intelligence, 2006. WI 2006. IEEE/WIC/ACM International Conference on 18-22 Dec. 2006 Page(s):66
72
[7] [Saltz, 2007] Increasing Participation in Distance Learning Courses Saltz, J.S.; Hiltz, S.R.; Turoff, M.;
Passerini, K.; Internet Computing, IEEE Volume 11, Issue 3, May-June 2007 Page(s):36 44
[8] [Zuoliang 2007] A Case Study of Applying SNA to Analyze CSCL Social Network Zuoliang Chen;
Watanabe, S.; Advanced Learning Technologies, 2007. ICALT 2007. Seventh IEEE International
Conference on 18-20 July 2007 Page(s):18 20
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Smart Medium Access Control (SMAC)
Protocol for Mobile Ad Hoc Networks
Using Directional Antennas

P. Sai Kiran
School of Computer Science & Informatics, SreeNidhi Institute of Science and Technology
Hyderabad, Andhra Pradesh, India
psaikiran@hotmail.com

Abstract

This paper proposes a Smart Medium Access Control (SMAC) protocol for
Mobile Ad Hoc Networks (MANETs) using directional antennas. The SMAC
protocol exploits the directional transmission and sensing capability of
directional antennas, thereby increasing the performance of the MANET. SMAC
uses a dual channel approach for data and control information
to overcome the deafness and hidden terminal problems, and it introduces a new node
mobility update model for handling node mobility. SMAC
also uses an alternative to the backoff timer: when the data packet at the front of the
transmission queue finds the channel busy in its direction, packets queued for other
directions are processed instead. SMAC has an advantage over other MAC
protocols proposed for directional antennas, as it addresses all the issues such as
node mobility, deafness and the hidden terminal problem.
1 Introduction
According to the IEEE 802.11 definition, an ad hoc network is a network composed solely of stations within
mutual communication range of each other via the wireless medium (WM).
Traditional MAC protocols such as IEEE 802.11 DCF (Distributed Coordination Function)
and IEEE 802.11 Enhanced DCF are designed for omni-directional antennas and cannot achieve
high throughput in ad hoc networks, as they waste a large portion of the network capacity. On
the other hand, smart antenna technology can improve spatial reuse of the wireless channel,
which allows nodes to communicate simultaneously without interference.
The capabilities of directional antennas are not exploited by conventional MAC
protocols such as IEEE 802.11. In fact, the network performance may even deteriorate due to
issues specific to directional antennas. Many protocols have been proposed that exploit
directional antenna capabilities while addressing the issues specific to MAC protocols
using directional antennas. We propose a MAC protocol using directional antennas
that concentrates not only on spatial reuse but also on the throughput and performance of the
protocol.
This paper is organized as follows: Section 2 deals with the design considerations for the
proposed SMAC protocol, Section 3 gives the working of the proposed SMAC protocol, and
Section 4 concludes the paper.
2 Design Considerations for SMAC Protocol
2.1 Antenna Model
The preferred antenna for a MAC protocol using directional antennas is the smart
antenna. Although this protocol considers the smart antenna as the design choice,
it also supports nodes with other directional antenna models such as switched beam
antennas.
2.2 Directionality
If the antenna model is a smart antenna, transmission is very accurate in the
direction of transmission. In this protocol, the 360° coverage is divided into a number of
segments, based on the directionality, numbered in the clockwise direction.
If the antenna type is a switched beam antenna, then the number of segments equals
the number of directional antennas in the switched beam, and the segments are again
numbered in the clockwise direction.
For an example switched beam antenna with 6 directional antennas, the
segments are numbered as indicated in Figure 1.
Figure 2 considers the use of a smart antenna, where the number of segments is
chosen based on the level of spatial reuse required and to reduce interference.
2.3 Sensing the Medium
The design consideration for sensing the medium in this protocol is directional carrier
sensing. If data is to be transmitted into a segment, the protocol requires not only the
segment in the direction of the target to be free or idle, but also the immediately neighboring
segments to be free of transmission.
For example, considering Figure 1, if the direction of the intended transmission is
segment 2, then the node initiates transmission only when segments 1, 2 and 3 are
found to be free of transmission. This design option accounts for node mobility (of the source or
destination), which was neglected by many previous protocols for MANETs using
directional antennas.


Fig. 1: Segment Numbering for Switched Beam Antenna
Fig. 2: Segment Numbering for Smart Antenna
2.4 Information at Each Node
Every node maintains information indicating the direction of its neighboring nodes as well as the
status of its segments, i.e. whether each segment is free of transmission and can be used for
transmission in that direction.
This information is maintained in two tables at each node.
Table 1: Neighbor Node Information
Node    Source Segment Number    Destination Segment Number    Status
2.4.1 Neighboring Node Information Table
This table indicates the direction of the neighboring nodes in the form of segment numbers, and this
information is used to communicate with those nodes via the indicated segments. The
source segment number indicates the segment used by this node to
reach the destination node. The destination segment number indicates the segment through
which the destination node receives the packet. The status indicates whether the
node is currently busy in a transmission. This status information is needed
in this protocol to avoid the deafness problem: it indicates
that a node is busy transmitting and is therefore not responding to RTS requests.
An entry for a node in this table indicates that the node is within the coverage area of
the current node.
2.4.2 Segment Table
This table holds, for each segment, the status of the segment
(busy/idle).
The structure of the table is shown in Table 2.
Table 2: Segment Table
Segment Number    Status    Waiting
The segment number identifies the segment and the status is either Busy or Idle. The
waiting field is a single bit, 0 or 1; a 1 indicates that one or more packets are
waiting in the backoff queue for the segment to become free.
Need for two tables: We maintain two different tables, one for the node status
and another for the segment status. The segment table is needed because we must check
whether the required segments are free of transmission. The waiting bit in the segment
table, in turn, reduces the time needed to determine whether any packets are waiting for a segment to
become free.
In Figure 1, as mentioned, if we use segment 2 for transmission we also block any
transmission in segments 1 and 3, and thus mark segments 1, 2 and 3 as busy in
the segment table.
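A minimal Java sketch of the two tables is given below (this is our own illustration, not the authors' implementation; segments are indexed 0 to N-1 for simplicity, and the class and field names are assumed):

import java.util.HashMap;
import java.util.Map;

// Sketch of the two per-node tables: the neighbor node information table
// and the segment table described in Section 2.4.
public class NodeTables {

    enum Status { IDLE, BUSY }

    // One row of the neighbor node information table.
    static class NeighborEntry {
        int sourceSegment;        // segment this node uses to reach the neighbor
        int destinationSegment;   // segment on which the neighbor receives
        Status status;            // whether the neighbor is busy transmitting
    }

    // One row of the segment table.
    static class SegmentEntry {
        Status status = Status.IDLE;
        boolean waiting = false;  // true if packets wait in this segment's backoff queue
    }

    final Map<String, NeighborEntry> neighborTable = new HashMap<>();
    final SegmentEntry[] segmentTable;

    NodeTables(int numberOfSegments) {
        segmentTable = new SegmentEntry[numberOfSegments];
        for (int i = 0; i < numberOfSegments; i++) {
            segmentTable[i] = new SegmentEntry();
        }
    }

    // Marks a segment and its two immediate neighbors as busy, as required
    // before transmitting in that segment's direction.
    void reserveSegment(int segment) {
        int n = segmentTable.length;
        segmentTable[(segment - 1 + n) % n].status = Status.BUSY;
        segmentTable[segment].status = Status.BUSY;
        segmentTable[(segment + 1) % n].status = Status.BUSY;
    }
}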
2.5 Channel
The channel is divided into two sub-channels: a control channel and a data channel. The
control channel is used to send control information, i.e. the node updates used for
maintaining the tables.
The data channel is used to transmit RTS, CTS, data, etc. A node can simultaneously
transmit on both channels without interference.
2.6 Mobility Updates
Every node in the network transmits an update message to all its one-hop
neighbors in directional mode.
The update packets are transmitted to one segment at a time. The update packet format is
shown in figure 3.
The fields in the packet are:
Source node ID: the ID of the node transmitting the packet.
Segment number: the segment used by the sender node to transmit the update
packet.
Status: this field is used by the node to inform its one-hop neighbors about the status of
transmission in the different segments of the node. For example, if the sender node is transmitting
data using segment 1 of a 6-segment node, then since we want to reserve segments 2 and 6, i.e. the
immediately neighboring segments, the status field indicates that segments 1, 2 and 6 are busy
by setting a 1 in those bit positions.
2.6.1 Sensing the Medium
A node periodically senses the medium in a particular segment direction and
transmits the update packet in that direction. If the medium is found to be busy, it
switches to the next segment after waiting in omni-directional mode for a certain period of
time. The node shifts to omni-directional mode to listen for update messages, if any,
transmitted by other nodes through any other segment.
A node maintains a single bit of information for each segment recording whether it has transmitted an update
packet in that segment direction. If it has already transmitted the update packet
in a segment direction, it sets the bit for that segment to 1. If it
failed to sense the medium as idle, it keeps the bit value at zero.
Source Node ID | Segment Number (M bits to represent N segments) | No. of Segments | Status (N bits to
represent the status of the N segments)
Fig. 3: Update Packet Format
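As a small illustration of the update packet of Fig. 3 (the field widths and class name are assumptions made for the sketch; the paper only fixes the logical fields), the status field can be packed as one bit per segment:

import java.util.BitSet;

// Sketch of the update packet: one status bit per segment, set to 1 when
// that segment is busy at the sender.
public class UpdatePacket {
    final int sourceNodeId;
    final int segmentNumber;     // segment used by the sender for this update
    final boolean[] segmentBusy; // one entry per segment of the sender node

    UpdatePacket(int sourceNodeId, int segmentNumber, boolean[] segmentBusy) {
        this.sourceNodeId = sourceNodeId;
        this.segmentNumber = segmentNumber;
        this.segmentBusy = segmentBusy;
    }

    // Packs the status field: bit i is 1 when segment i is busy.
    BitSet statusBits() {
        BitSet bits = new BitSet(segmentBusy.length);
        for (int i = 0; i < segmentBusy.length; i++) {
            if (segmentBusy[i]) bits.set(i);
        }
        return bits;
    }
}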
2.6.2 Transmission
When a node finds the medium to be idle in a particular direction, it transmits
a Request to Send (RTS) packet in that direction. The node does not address any particular
node, as it cannot be sure of the identity of the nodes in that direction.
Any node receiving this RTS packet responds with a Clear to Send (CTS) packet carrying
its ID. The RTS packet carries the sender's ID, so the receiving node updates its neighbor
node table with the sender node's location. The CTS packet transmitted by the receiving node
carries its own ID, allowing the sender node to update its table with that node's location. After
receiving the CTS, the sender transmits the update packet describing the status of the node in
that direction.
After receiving the update packet, the receiving node piggybacks its acknowledgement together with its own
update message to the sender. Thus both nodes update their tables once one node
successfully accesses the channel.
2.6.3 Delivery of Update Messages
A node maintains, for each segment, a single bit indicating whether an update
packet has been delivered; the bit is set to 1 after a successful transmission. The transmission
may be initiated by the node itself in that segment direction or may occur by
piggybacking a response to an update packet received from another node. Before sensing the next
segment in increasing order, the node checks the update delivery status of the
previous segments and gives them another chance. A new round starting from segment 1 is
initiated only after update messages have been successfully transmitted in all directions.
2.6.4 Timeouts and Retransmission
When a node senses the medium in a particular direction or segment and finds it to be free, it
transmits an RTS packet. A node that has transmitted an RTS packet will not receive a CTS packet
if there is no node in that direction, or if the node in that direction does not respond because of the
deafness problem. The sender node then times out without receiving a CTS packet from the
other node. The node retransmits the RTS message up to three consecutive times before
giving up; after three attempts it sets the update bit to 1.
Because a node not only initiates the transmission of update packets
but also piggybacks on update packets received from other nodes, the probability that an update
packet misses transmission in a particular direction is very low.
2.7 Queue Model
The SMAC protocol uses several queues in addition to a ready queue that holds the
packets ready to transmit. SMAC maintains a queue called nodenotfound to store packets
for which the destination node is not found. SMAC also maintains N backoff queues,
where N is the number of segments of the node. For example, a node with the antenna of Figure 1
maintains 6 backoff queues in addition to the nodenotfound queue and the ready queue.
3 Working of Proposed Protocol
The proposed SMAC uses a directional transmission scheme. The SMAC protocol uses an alternative
method [1] to backoff when the direction in which a packet is to be transmitted is found
busy.
3.1 Carrier Sensing and Backoff
Instead of backing off, SMAC processes the data packets queued for
transmission in other directions when the data packet at the front of the queue
finds the channel busy in its direction.
Initially a node reads the packets that are ready for transmission from the ready queue. It
reads the destination node ID from the packet header and checks for the existence of the
node in the neighbor node table. SMAC processes the packet further if it finds the node in the
table; otherwise it places the packet in the nodenotfound queue. Packets placed in the
nodenotfound queue are given two more chances for
transmission before the routing protocol is informed of the unavailability of the node. For
this, a single bit is added to the header of the packet and initialized to 0.
If SMAC finds the destination node information, it obtains the destination node's segment.
SMAC then checks the status of that segment and its neighboring segments. If
the segments are idle, SMAC proceeds with the packet transmission
after an RTS/CTS exchange. If the segments are busy, the packet is placed in
the backoff queue for that segment, and SMAC sets the waiting bit of the segment to 1 after
placing the packet in the backoff queue.
SMAC then proceeds to the next packet in the ready queue. It checks whether the next
packet has the same destination as the previous packet. If the previous packet
was transmitted successfully then this packet is also transmitted; if the previous packet was
placed into a backoff queue then this packet is placed into the same backoff queue without
further processing.
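A rough Java sketch of the ready-queue handling described in this subsection is given below; it builds on the NodeTables sketch of Section 2.4, queue names follow the paper, and everything else (the Packet type and the helper methods) is assumed for illustration only.

import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of SMAC's alternative to the backoff timer: packets whose segment
// is busy are parked in that segment's backoff queue instead of blocking
// the whole ready queue.
public class SmacQueues {

    static class Packet {
        String destinationId;
        boolean retried;          // the single retry bit added to the packet header
        byte[] payload;
    }

    final Deque<Packet> readyQueue = new ArrayDeque<>();
    final Deque<Packet> nodeNotFoundQueue = new ArrayDeque<>();
    final Deque<Packet>[] backoffQueues;   // one per segment
    final NodeTables tables;               // neighbor and segment tables (Section 2.4 sketch)

    @SuppressWarnings("unchecked")
    SmacQueues(NodeTables tables, int segments) {
        this.tables = tables;
        backoffQueues = new Deque[segments];
        for (int i = 0; i < segments; i++) backoffQueues[i] = new ArrayDeque<>();
    }

    void processReadyQueue() {
        while (!readyQueue.isEmpty()) {
            Packet p = readyQueue.poll();
            NodeTables.NeighborEntry neighbor = tables.neighborTable.get(p.destinationId);
            if (neighbor == null) {
                nodeNotFoundQueue.add(p);                  // destination currently unknown
            } else if (segmentAndNeighborsIdle(neighbor.sourceSegment)) {
                transmit(p, neighbor.sourceSegment);       // RTS/CTS exchange then data
            } else {
                backoffQueues[neighbor.sourceSegment].add(p);
                tables.segmentTable[neighbor.sourceSegment].waiting = true;
            }
        }
    }

    // Placeholders standing in for directional carrier sensing and transmission.
    boolean segmentAndNeighborsIdle(int segment) { return true; }
    void transmit(Packet p, int segment) { /* RTS-CTS-Data exchange omitted */ }
}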
3.2 Data Transmission
Once the carrier is sensed to be idle in the intended direction, the sender node transmits an
RTS (Request to Send) message announcing the transmission. Before transmitting the RTS message,
the sender updates its segment table with the segment to be used for the transmission. The
destination node updates its own segment table accordingly and responds with a CTS message.
Once the sender receives the CTS message, it starts the data transmission.
The sender and destination nodes update their segment tables again after completing the
transmission.
3.3 Processing Backoff Queues
The SMAC protocol processes the backoff queues when the next packet in the ready queue has
a different destination node from the previous packet. SMAC checks the segment
table for any segment whose waiting bit is set to 1 and whose status is idle. This means that a
segment which was previously busy is now idle and there are packets in the backoff queue
for that segment.
SMAC identifies such segments, processes the packets in their
backoff queues and transmits all the packets of each segment. Before transmitting the
packets, SMAC checks the neighboring node table to see whether the destination node is still in
that segment or has moved to a new segment. If SMAC does not find the node information in the
neighboring node table, the packets for that node are shifted to the nodenotfound queue.
If the node has moved to a new segment because of node mobility, the segment to which
the node moved is checked to see whether its status is idle. If the status of the segment
is idle, the packets are transmitted in that segment direction; if it is busy, the packets are
placed in the new segment's backoff queue. Once all the
packets in a segment's backoff queue have been processed, the waiting bit of the segment is set to
0.
3.4 Processing Nodenotfound Queue
The SMAC protocol also processes the nodenotfound queue after processing the backoff queues. A
packet in the nodenotfound queue is tried for transmission twice before
the node-not-found message is issued to the routing protocol. All the packets in the
nodenotfound queue are processed before leaving the queue. If the destination of a packet is
found in the neighbor node table, the packet is either transmitted or placed in the backoff
queue of the corresponding segment, depending on the status of the destination node's segment.
If the node information is still not available, the node bit is set to 1 if it is currently 0.
If the node bit is already 1, indicating that the packet is being tried for transmission for
the second time, the packet is discarded from the nodenotfound queue and the network layer is
informed that the node is not available.
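A minimal sketch of this two-try rule, assuming a single node bit in the packet header as described above; the enum and function names are hypothetical, not taken from the paper.

    /* Illustrative two-try rule for packets in the nodenotfound queue. */
    typedef enum { RETRY_LATER, TRANSMIT_OR_BACKOFF, REPORT_NODE_NOT_FOUND } nnf_action_t;

    nnf_action_t process_nodenotfound_packet(int dst_known, int *node_bit)
    {
        if (dst_known)                    /* neighbor table now has the node      */
            return TRANSMIT_OR_BACKOFF;   /* transmit, or park in segment backoff */
        if (*node_bit == 0) {             /* first failed attempt                 */
            *node_bit = 1;                /* mark: one retry already consumed     */
            return RETRY_LATER;
        }
        return REPORT_NODE_NOT_FOUND;     /* second failure: discard, inform routing */
    }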
4 Conclusion
This paper proposed a new MAC protocol using directional antennas. We tried to exploit the
directional transmission capabilities of directional antennas to improve throughput, spatial
reuse and the overall performance of communication. The paper introduced a node mobility model
that uses control information transmitted over a control channel. The node mobility model is
designed to be used not only by SMAC but also by routing protocols and transport layer services
such as TCP, Quality of Service connection establishment and IP address configuration. The
paper also included a method, alternative to the backoff timer, to increase throughput. The
protocol overcomes problems such as the hidden terminal problem, deafness, information
staleness and node mobility that are specific to MAC protocols for MANETs using directional
antennas.
This protocol can be further extended to include service differentiation for providing Quality
of Service.
References
[1] [P. Sai Kiran, 2006] P. Sai Kiran, "Increasing throughput using directional antennas in wireless ad hoc networks," in Proc. IEEE ICSCN 2006.
[2] [P. Sai Kiran, 2006] P. Sai Kiran, "A survey on mobility support by MAC protocols using directional antennas for wireless ad hoc networks," in Proc. IEEE ISAHUC 2006.
[3] [P. Sai Kiran, 2007] P. Sai Kiran, "Statefull Addressing Protocol (SAP) for Mobile Ad Hoc Networks," in Proc. IASTED CIIT 2007.
[4] [C. Siva Ram Murthy, 2004] C. Siva Ram Murthy and B. S. Manoj, Ad Hoc Wireless Networks: Architectures and Protocols, Prentice Hall, 2004.
[5] [Hongning Dai, 2006] Hongning Dai, Kam-Wing Ng and Min-You Wu, "An Overview of MAC Protocols with Directional Antennas in Wireless Ad Hoc Networks," ICWMC 2006, Bucharest, Romania, July 29-31, 2006.
[6] [Masanori Takata, 2005] Masanori Takata, Masaki Bandai and Takashi Watanabe, "Performance Analysis of a Directional MAC for Location Information Staleness in Ad Hoc Networks," ICMU 2005, pp. 82-87, April 2005.
[7] [Ram Ramanathan, 2005] Ram Ramanathan, Jason Redi, Cesar Santivanez, David Wiggins and Stephen Polit, "Ad Hoc Networking with Directional Antennas: A Complete System Solution," IEEE Communications, March 2005.
[8] [Jungmin So, 2004] Jungmin So and Nitin Vaidya, "Multi-Channel MAC for Ad Hoc Networks: Handling Multi-Channel Hidden Terminals Using a Single Transceiver," in Proc. MobiHoc '04, May 24-26, 2004.
[9] [Romit Roy Choudhury, 2004] Romit Roy Choudhury and Nitin H. Vaidya, "Deafness: A MAC Problem in Ad Hoc Networks when using Directional Antennas," in Proc. ICNP 2004.
[10] [Tetsuro Ueda, 2004] Tetsuro Ueda, Shinsuke Tanaka, Siuli Roy, Dola Saha and Somprakash Bandyopadhyay, "Location-Aware Power-Efficient Directional MAC Protocol in Ad Hoc Networks Using Directional Antenna," IEICE 2004.
[11] [Michael Neufeld] Michael Neufeld and Dirk Grunwald, "Deafness and Virtual Carrier Sensing with Directional Antennas in 802.11 Networks," Technical Report CU-CS-971-04, University of Colorado.
[12] [Ajay Chandra V, 2000] Ajay Chandra V. Gummalla and John O. Limb, "Wireless Medium Access Control Protocols," IEEE Communications Surveys and Tutorials, Second Quarter 2000.
[13] [Romit Roy Choudhury, 2002] Romit Roy Choudhury, Xue Yang, Ram Ramanathan and Nitin H. Vaidya, "Using Directional Antennas for Medium Access Control in Ad Hoc Networks," in Proc. MOBICOM '02, Sep. 23-28, 2002.
[14] [Z. Huang, 2002] Z. Huang, C.-C. Shen, C. Srisathapornphat and C. Jaikaeo, "A busy-tone based directional MAC protocol for ad hoc networks," in Proc. IEEE MILCOM, 2002.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Implementation of TCP Peach
Protocol in Wireless Network

Rajeshwari S. Patil Satyanarayan K. Padaganur
Dept of CSE Dept of CSE Dept of ECE
Bagalkot Bagalkot B.L.D.E.As CET
Bijapur, Karnataka

Abstract

Improving throughput, or goodput, in an IP network is one of the most significant issues in
data communication and networking. Various IP congestion control protocols have been suggested
over the years. However, in networks where the bit error rate is very high due to link
failures, as in satellite networks or ad hoc networks, congestion control at the network layer
alone is not advisable; in such networks, transport layer control of the frame transfer rate is
adopted. One of the preliminary classes of transmission control adopted to improve goodput in
such networks is TCP-Peach.
The objective of this work is to improve the TCP-Peach congestion control scheme and extend it
to TCP-Peach+ in order to further improve goodput performance for satellite IP networks. In
TCP-Peach+, two new algorithms, Jump Start and Quick Recovery, are proposed for congestion
control.
These algorithms are based on low-priority segments, called NIL segments, which are used to
probe the availability of network resources as well as for error recovery. The objective is to
develop a simulation environment to test the algorithms.
Keywords: TCP-Peach, Jump start, Quick recovery
1 Introduction
This work implements a strategy to improve the performance of networks through transport layer
rate control in networks that are more vulnerable to link failures. The cause of link failures
could be fading, noise, interference, energy loss or external signals. We therefore discuss the
network architecture and the transport layer in detail in this section.
1.1 End-to-End Network Concept
The data communication requirements of many advanced space missions involve seamless,
transparent connectivity between space-based instruments, investigators, ground-based
instruments and other spacecraft. The key to an architecture that can satisfy these
requirements is the use of applications and protocols that run on top of the Internet Protocol
(IP). IP is the technology that drives the public Internet and therefore draws billions of dollars
annually in research and development funds. Most private networks also utilize IP as their
underlying protocol. IP provides a basic standardized mechanism for end-to-end
communication between applications across a network. The protocol provides for automated
routing of data through any number of intermediate network nodes without affecting the
endpoints.
2 Role of TCP
As discussed, TCP provides flow control through acknowledgements and frame processing. However,
the extra bandwidth these mechanisms consume introduces considerable channel delay, which in
turn becomes a cause of congestion in the network. TCP employs common policies for handling
congestion, delay and loss. Hence, when data is lost due to a link failure, the node attempts
retransmission of the packet, assuming that the loss was caused by congestion. The window size
management is based purely on this assumption; bandwidth and link status feedback is not
employed. Therefore, when the probability of error is high, the performance of the network
degrades. We therefore utilize the NIL segments of TCP packets to carry suitable information
about the link status and adjust the transmission rate based on a link status calculation
obtained through channel condition estimation. Normally, at the beginning of a transmission,
the window size is small and is increased slowly; even if enough bandwidth is available, nodes
cannot utilize it appropriately. Hence a sudden increase of the window is facilitated. Under
link error conditions, by monitoring the channel status, retransmission of lost packets is
prohibited, because retransmitted packets would still be lost; instead, a quick recovery by
decrementing the window size is proposed. Overall, the problem can be stated as: simulation of
TCP-Peach+ for better network resource management and performance improvement in networks with
high loss probability.
3 Methodology
For any protocol implementation it is important to first set up a network architecture. Our
network architecture contains N static nodes spread over a 400 x 400 sq. m area. A source and a
destination are randomly selected and a shortest path between them is obtained. All the nodes
participating in the routing are allocated the available bandwidth. The more channel capacity
is available, the lower the probability of error; hence the probability of error can be
considered a function of the link bandwidth between the nodes.
Conceptually, the transport layer and the network layer are implemented. The transport layer
prepares the queue in which the packet buffers are stored. The transmitting node creates a
transmission window (the congestion window), and a buffer of the size of the window is
transferred. Routing is done through the network layer, while the acknowledgement and
retransmission policies are managed in the transport layer.
The receiver acknowledges the reception of each packet; once the acknowledgement for a packet
is received, the packet is removed from the buffer. In TCP-Peach+, some segments are made
low-priority NIL segments with which a transmitter can ask the receiver to report the channel
state. If enough bandwidth is left, the receiver notifies the
transmitter about the state; otherwise it does not. Based on the reply to the NIL segment, the
transmitting node adjusts its window size.
The TCP-Peach and TCP-Peach+ protocols are implemented in order to provide QoS support to the
transmission control protocol; Jump Start and Quick Recovery are implemented in TCP-Peach to
extend the work. For implementing the protocols, we have adopted data structures for the
protocol packets, and a priority queue with status flags is implemented for data buffering. A
custom data length and link error probability are given as input by the user. The simulation
model depicts the usage of NIL segments in the case of Jump Start and Quick Recovery, and
finally the loss, retransmissions and delay are calculated. Load versus throughput and link
error versus throughput are plotted for both cases in order to analyze their behavior.
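The window adjustment driven by the NIL-segment replies can be pictured with the following C sketch. It is only an interpretation of the description above (grow the window when the receiver reports spare capacity, decrement it for quick recovery when it does not); every name in it is an assumption and it is not the authors' simulator.

    /* Illustrative congestion-window update from a NIL-probe reply (names assumed). */
    typedef struct {
        int spare_bandwidth;          /* 1 if the receiver reported spare capacity */
    } nil_reply_t;

    int adjust_cwnd(int cwnd, int rwnd, nil_reply_t reply)
    {
        if (reply.spare_bandwidth) {
            cwnd = cwnd + 1;          /* capacity is free: open the window quickly  */
            if (cwnd > rwnd)
                cwnd = rwnd;          /* never exceed the receiver-advertised window */
        } else {
            cwnd = cwnd - 1;          /* quick recovery: shrink instead of retransmitting */
            if (cwnd < 1)
                cwnd = 1;
        }
        return cwnd;
    }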
4 TCP-Peach
TCP protocols have performance problems in satellite networks because:
1. The long propagation delays cause longer duration of the Slow Start phase during
which the TCP sender may not use the available bandwidth.
2. The congestion window, cwnd, cannot exceed a certain maximum value, rwnd,
provided by the TCP receiver. Thus, the transmission rate of the sender is bounded.
Note that the higher the round trip time, RTT, the lower is the bound on the
transmission rate for the sender.
3. The TCP protocol was initially designed to work in networks with low link error rates,
i.e., where all segment losses are mostly due to network congestion. On detecting a loss,
the TCP sender therefore decreases its transmission rate; however, this causes unnecessary
throughput degradation if segment losses occur due to link errors.
TCP-Peach is a new flow control scheme for satellite networks. It is an end-to-end solution
whose main objective is to improve throughput performance in satellite networks. TCP-Peach
assumes that all the routers on the connection path apply some priority mechanism. In fact,
it is based on the use of low-priority segments, called dummy segments. TCP-Peach senders
transmit dummy segments to probe the availability of network resources on the connection
path in the following cases:
a. At the beginning of a new connection, when the sender has no information about the
current traffic load in the network.
b. When a segment loss is detected and the sender has no information about the nature of
the loss, i.e., whether it is due to network congestion or link errors.
c. When the sender needs to detect network congestion before it actually occurs.
As dummy segments have low priority, their transmission does not affect the transmission of
traditional data segments. TCP-Peach contains the following algorithms: Sudden Start,
Congestion Avoidance, Fast Retransmit, Rapid Recovery and Over Transmit. Sudden Start, Rapid
Recovery and Over Transmit are new algorithms; Fast Retransmit is the same as in TCP-Reno,
and Congestion Avoidance has some modifications.
5 System Design

6 Result


7 Conclusion
The performance of TCP-Peach+ shows that the algorithm performs very well under demanding
network conditions, such as the high link error probabilities found in satellite or ad hoc
networks. Normally, TCP does not provide independent rules for congestion and link errors;
hence recovery from link breakdowns is very slow. Our proposed technique shows that the
protocol performs well and that recovery from failure is fast. The retransmission results show
that, under severe link breaks and segment losses, the nodes do not retransmit the lost
segments because the failures are identified as such; hence resources are not unnecessarily
wasted. If congestion control algorithms such as sliding windows are integrated with the
proposed protocol, we may obtain better results as far as overall performance is concerned.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
A Polynomial Perceptron Network
for Adaptive Channel Equalisation

Gunamani Jena, BVC Engg College (JNTU), g_jena@rediffmail.com
R. Baliarsingh, CSE, NIT Rourkela, rbsingh@nitrkl.ac.in
G.M.V. Prasad, BVCEC

Abstract

Application of artificial neural network structures (ANN) to the problem of
channel equalisation in a digital communication system has been considered in
this paper. The difficulties associated with channel nonlinearities can be
overcome by equalisers employing ANN. Because of nonlinear processing of
signals in an ANN, it is capable of producing arbitrarily complex decision
regions. For this reason, the ANN has been utilized for the channel
equalisation problem. A scheme based on the polynomial perceptron network
(PPN) structure has been proposed for this task. The performance of the
proposed PPN, along with that of another ANN structure (FLANN), has been
compared with the conventional LMS-based channel equaliser. The effect of the
eigenvalue ratio of the input correlation matrix on the performance of the
equalisers has been studied. From the simulation results it is observed that
the proposed PPN-based equaliser outperforms the other two in terms of bit
error rate (BER) and attainable MSE level over a wide range of eigenvalue
spread, signal-to-noise ratio and channel nonlinearities.
Keywords: PPN: polynomial perceptron network, FLANN: functional link artificial neural
network, MSE, SNR and EVR
1 Introduction
The rapidly increasing need for digital communication has been met primarily by higher-speed
data transmission over the widespread network of voice-bandwidth channels. These channels
deliver at the receiver corrupted and transformed versions of their input waveforms. The
corruption of the waveform may be statistically additive, multiplicative or both, because of
possible background thermal noise, impulse noise and fading. Transformations performed by the
channels include frequency translation, nonlinear and harmonic distortion and time dispersion.
By using an adaptive channel equaliser at the front end of the receiver, the noise introduced
in the channel is nullified and hence the signal-to-noise ratio at the receiver improves. This
paper deals with the design of an adaptive channel equaliser based on the polynomial perceptron
network (PPN) architecture and the study of its performance. The performance of a linear
channel equaliser employing a linear filter with an FIR or lattice structure and using a least
mean square (LMS) or recursive least squares (RLS) algorithm is limited, especially when the
nonlinear distortion is severe. In such cases nonlinear equaliser structures may be
conveniently employed, with added advantages in terms of lower bit error rate (BER), lower mean
square error (MSE) and higher convergence rate than those of a linear equaliser.
Artificial neural networks (ANNs) can perform complex mappings between their input and output
spaces and are capable of forming complex regions with nonlinear decision boundaries. Further,
because of the nonlinear characteristics of ANNs, networks of different architectures have
found successful application in the channel equalisation problem. One of the earliest
applications of ANNs in digital communication channel equalisation was reported by Siu et al.
[7]. They proposed a multilayer perceptron (MLP) structure for channel equalisation with
decision feedback and showed that the performance of this network is superior to that of a
linear equaliser trained with the LMS algorithm. Using MLP structures for the channel
equalisation problem, quite satisfactory results have been reported for Pulse Amplitude
Modulation (PAM) and Quadrature Amplitude Modulation (QAM) signals [1-5].
The PPN structure is well suited to developing an efficient equaliser for nonlinear channels.
The performance of the PPN equaliser has been obtained through computer simulation for
different nonlinear channels. The convergence characteristics and bit error rate (BER) are
obtained through simulation and the results are analyzed. It is observed that the PPN equaliser
offers superior performance in terms of BER compared to its LMS counterpart and outperforms the
LMS equaliser, particularly for nonlinear channels, with respect to higher convergence rate and
lower mean square error (MSE).
2 Data Transmission System
Consider a synchronous data communication link with a 4-QAM signal constellation used to
transmit a sequence of complex-valued symbols t(k) at time kT, where 1/T denotes the symbol
rate and the unmodulated information sequence x takes statistically independent and
equiprobable values from {1, -1}. The transmitted symbol t(k) may be written in terms of its
in-phase and quadrature components as t(k) = t_{k,I} + j t_{k,Q}. A discrete-time model for the
digital transmission system with equaliser is shown in Fig. 1. The combined effect of the
transmitter filter, the transmission medium and other components is included in the channel. A
widely used model for a linear dispersive channel is an FIR model whose output at time instant
k may be written as

a(k) = \sum_{i=0}^{N_h - 1} t(k-i)\, h(i)        (1)

Fig. 1: Digital transmission system with equaliser
where h(i) are the channel tap values and N_h is the length of the FIR channel. If the
nonlinear distortion caused by the channel is to be considered, then the channel model should
be treated as nonlinear and its output may be expressed as

a'(k) = \psi\big(t(k), t(k-1), \dots, t(k-N_h+1);\; h(0), h(1), \dots, h(N_h-1)\big)        (2)

where \psi(.) is some nonlinear function generated by the NL block. The channel output is
corrupted with additive white Gaussian noise q(k) of variance \sigma^2 to produce r(k), the
signal received at the receiver. The received signal r(k) may be represented by its in-phase
and quadrature components as r(k) = r_{k,I} + j r_{k,Q}. The purpose of the equaliser is to
recover the transmitted symbol t(k), or t(k - \tau), from the knowledge of the received signal
samples without any error, where \tau is the transmission delay associated with the physical
channel.
3 Channel Equaliser as a Pattern Classifier
The channel output is passed through a time delay to produce the equaliser input vector as
shown in Fig. 1.
For this section, consider a K-ary PAM system with signal constellation given by
s_i = 2i - K - 1, 1 \le i \le K. The arguments for the channel equaliser as a pattern
classifier may be extended to a QAM signal constellation. The equaliser input at the kth time
instant is denoted as U_k and is given by

U_k = [u_1\; u_2\; \dots\; u_M]^T

where u_i = r(k-i+1) for i = 1, 2, \dots, M. The equaliser order is denoted by M and [.]^T
denotes matrix transpose. The ANN utilizes this information to produce an output \hat{y}(k),
which is an estimate of the desired output y(k) = t(k - \tau). The delay parameter of the
equaliser is denoted by \tau. Depending on the channel output vector U_k, the equaliser tries
to estimate an output which is close to one of the transmitted values s_i, for
i = 1, 2, \dots, K. In other words, the equaliser seeks to classify the vector U_k into any one
of the K classes.
Depending on the values of M and N_h, and on the current and past J - 1 transmitted symbols,
the classification decision of the equaliser is effected. The J-dimensional transmitted symbol
vector at time instant k is given by

T_k = [t(k)\; t(k-1)\; \dots\; t(k-J+1)]^T        (3)

where J = M + N_h - 1. The total number of possible combinations of T_k is given by N_t = K^J.
When the additive Gaussian noise q(k) is zero, the M-dimensional channel output vector is given
by

B_k = [b(k)\; b(k-1)\; \dots\; b(k-M+1)]^T        (4)

Corresponding to each transmitted symbol vector T_k there will be one channel output vector
B_k. Thus, B_k will also have N_t possible combinations, called desired channel states. These
N_t states are to be partitioned into K classes C_i, i = 1, 2, \dots, K, depending on the value
of the desired signal y(k). The states belonging to class C_i are given by B_k \in C_i if
y(k) = s_i.
When white Gaussian noise is added to the channel, B_k becomes U_k, which is a stochastic
vector. Since each of the s_i is assumed to be equiprobable, the number of channel states in
each class is given by N_t/K. The observation vectors form clusters around the desired channel
states and thus the statistical means of these data clusters are the desired states.
Therefore, determining the transmitted symbol t(k) with knowledge of the observation vector U_k
is basically a classification problem. For this purpose, a decision function may be formed as
follows [8]

DF(U_k) = w_0 + w_1 u_1 + w_2 u_2 + \dots + w_M u_M        (5)

Here, w_i, i = 0, 1, \dots, M, are the weight parameters. Ignoring the time index k, the
decision function may be written as DF(U) = W^T U, where U is the current channel observation
vector augmented by 1 and W = [w_0\; w_1\; \dots\; w_M]^T is the weight parameter vector. For
the K classes, K decision functions are found with the property

DF_i(U) = W_i^T U \ge 0 \text{ if } U \in C_i, \quad < 0 \text{ otherwise}        (6)

for i = 1, 2, \dots, K. Here, W_i is the weight vector associated with the i-th decision
function.
A generalized nonlinear decision function is needed to take care of many practical, linearly
non-separable situations, and this can be formed as

DF(U) = \sum_{i=0}^{N_f} w_i\, \phi_i(U)        (7)

where the \phi_i(U), i = 1, 2, \dots, N_f, are real, single-valued functions of the input
pattern U, \phi_0(U) = 1, and N_f is the number of terms used in the expansion of the decision
function DF(U). Let us define a vector U* whose components are the functions \phi_i(U), given
by U* = [1\; \phi_1(U)\; \phi_2(U)\; \dots\; \phi_{N_f}(U)]^T. The decision function may then
be expressed as

DF(U) = W^T U*        (8)

Thus, using the \phi_i(U), the (M+1)-dimensional augmented channel observation vector U may be
transformed into an (N_f + 1)-dimensional vector U*. Using this decision function, complex
decision boundaries can be formed to carry out nonlinear classification problems. This may be
achieved by employing different ANN structures for the channel equalisation problem, as
described in the following sections.
4 ANN Structures for Equalisation
In this paper we have employed two ANN structures (FLANN and PPN) for the channel
equalisation problem and their performance in terms of convergence speed, MSE level, and
BER is compared by taking different linear and nonlinear channel models. Brief description
of each of the ANN structures is given below.
4.1 Functional Link ANN
The FLANN is a single layer network in which the hidden layers are removed. In contrast to
the linear weighting of the input pattern by the linear links of an MLP, the functional link acts
on an element of a pattern or on the entire pattern itself by generating a set of linearly
independent functions, and then evaluating these functions with the pattern as the argument.
Thus, separability of input patterns is possible in the enhanced space [5]. Further, the FLANN
structure offers less computational complexity and higher convergence speed than those of
MLP because of its single layer structure. The FLANN structure considered for the channel
equalisation problem is depicted in Fig. 2. Here, the functional expansion block makes use of a
functional model comprising a subset of orthogonal sin and cos basis functions and the original
pattern along with its outer products. For example, considering a two-dimensional input pattern
X = [x_1\; x_2]^T, the enhanced pattern obtained using the trigonometric functions is
X* = [x_1\; \cos(\pi x_1)\; \sin(\pi x_1)\; x_2\; \cos(\pi x_2)\; \sin(\pi x_2)\; \dots\; x_1 x_2]^T,
which is then used by the network for the equalisation purpose. The BP algorithm, which is used
to train the network, becomes very simple because of the absence of any hidden layer.

Fig. 2: The FLANN Structure
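As a small illustration of the trigonometric functional expansion just described, the following C sketch (ours, not the paper's code) enhances a two-dimensional pattern; the exact ordering of terms and the use of \pi inside the sin/cos arguments are assumptions.

    /* Illustrative FLANN-style trigonometric expansion of X = [x1 x2]^T. */
    #include <math.h>

    #define ENHANCED_DIM 7

    void flann_expand(double x1, double x2, double out[ENHANCED_DIM])
    {
        out[0] = x1;
        out[1] = cos(M_PI * x1);
        out[2] = sin(M_PI * x1);
        out[3] = x2;
        out[4] = cos(M_PI * x2);
        out[5] = sin(M_PI * x2);
        out[6] = x1 * x2;          /* outer-product (cross) term */
    }

The enhanced vector is then weighted and passed through the output nonlinearity exactly as a single-layer perceptron would do, which is why no hidden layer is needed.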
4.2 The PPN Structure
Weierstrass approximation theorem states that any function, which is continuous in a closed
interval, can be uniformly approximated within any prescribed tolerance over that interval by
some polynomial. Based on this, the PPN structure was proposed and shown in Fig. 3. Here,
the original input pattern dimension is enlarged and this enhancement is carried out by
polynomial expansion in which higher-order and cross-product terms of the elements of the
input pattern are used. This enhanced pattern is then used for channel equalisation. It is a
single layer ANN structure possessing higher rate of convergence and lesser computational
load than those of an MLP structure.

Fig. 3: The PPN Structure
The behavior and mapping ability of a PPN and its application to channel equalisation are
reported by Xiang et al. [10]. A PPN of degree d with a weight vector W produces an output
\hat{y} given by

\hat{y} = \tanh\big(P_w^d(X)\big)        (9)

where tanh(.) is the nonlinear activation function and X = [x_1, x_2, \dots, x_n]^T is the
n-dimensional input pattern vector. P_w^d is a polynomial of degree d with the weight vector
W = [w_0\; w_1\; w_2\; \dots] and is given by

P_w^d(X) = w_0 + \sum_{i_1=1}^{n} w_{i_1} x_{i_1}
         + \sum_{i_1=1}^{n} \sum_{i_2=i_1}^{n} w_{i_1 i_2}\, x_{i_1} x_{i_2}
         + \dots
         + \sum_{i_1=1}^{n} \sum_{i_2=i_1}^{n} \dots \sum_{i_d=i_{d-1}}^{n} w_{i_1 i_2 \dots i_d}\, x_{i_1} x_{i_2} \dots x_{i_d}        (10)

When d \to \infty, P_w^d(X) becomes the well-known Volterra series. A structure of a PPN with
degree d = 2 and pattern dimension n = 2 is shown in Fig. 3.
The same BP algorithm may be used to train the network. However, in this network the number of
terms needed to describe a polynomial decision function grows rapidly as the polynomial order
and the pattern dimension increase, which in turn increases the computational complexity.
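For concreteness, the terms of P_w^d(X) for the case d = 2, n = 2 shown in Fig. 3 can be generated as in the sketch below (ours, not the paper's code; the ordering of terms is an assumption).

    /* Illustrative PPN polynomial expansion for d = 2, n = 2: the quantities
     * that multiply the weights w0, w1, ... in Eq. (10). */
    #define PPN_TERMS 6

    void ppn_expand(double x1, double x2, double out[PPN_TERMS])
    {
        out[0] = 1.0;        /* bias term multiplying w0      */
        out[1] = x1;         /* first-order terms             */
        out[2] = x2;
        out[3] = x1 * x1;    /* second-order and cross terms  */
        out[4] = x1 * x2;
        out[5] = x2 * x2;
    }

The PPN output of Eq. (9) is then the tanh of the weighted sum of these terms.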

Fig. 4: The generalized FLANN structure.

Fig. 5: Structure of an ANN-based equaliser.
5 Computational Complexity
A comparison of the computational complexity of the FLANN and PPN structures, both trained with
the BP algorithm, is presented here. The computational complexity of the FLANN and the PPN is
similar if the dimension of the enhanced pattern is the same in both cases. However, in the
case of the FLANN with trigonometric functions, extra computations are required for calculating
the sin and cos functions, whereas in the PPN only multiplications are needed to compute the
higher-order and outer-product terms. Consider an L-layer MLP with n_l nodes (excluding the
threshold unit) in layer l (l = 0, 1, \dots, L), where n_0 and n_L are the number of nodes in
the input layer and the output layer, respectively. Three basic computations, i.e., addition,
multiplication and computation of tanh(.), are involved in updating the weights of an MLP and
of a PPN. The major computational burden of the MLP is due to error propagation for the
calculation of the square-error derivative of each node in all hidden layers. In one iteration,
all computations in the network take place in three phases, i.e.,
a. forward calculation to find the activation value of all nodes of the entire network;
b. back error propagation for calculation of the square-error derivatives;
c. updating of the weights of the entire network.
The total number of weights to be updated in one iteration of an MLP structure is given by
\sum_{l=0}^{L-1} (n_l + 1)\, n_{l+1}, whereas in the FLANN and the PPN it is given by
(n_0 + 1). Since no hidden layer exists in the FLANN and PPN, their computational complexity is
drastically lower than that of the MLP structure. A comparison of the computational complexity
of the FLANN and the PPN using the BP algorithm in one iteration is provided in Table 1.
Table 1: Computational complexity of the FLANN and the PPN per BP iteration (n0+ = n0 + 1)

Operation          FLANN              PPN
Addition           2n1.n0+ + n1       2n1.n0+ + n1
Multiplication     3n1.n0+ + n0       3n1.n0+ + n0
tanh(.)            n1                 n1
cos(.)/sin(.)      n0                 -
6 ANN-Based Channel Equalisation
An equalisation scheme for 4-QAM signals is shown in Fig. 6. Each of the in-phase and
quadrature components of the received signal at time instant k, r_{k,I} and r_{k,Q}, passes
through a tapped delay line. These delayed signals constitute the input pattern to the ANN and
are given by U(k) = [u_1\; u_2\; \dots\; u_M]^T = [r_{k,I}\; r_{k,Q}\; r_{k-1,I}\; r_{k-1,Q}\; \dots]^T.
At time instant k, U(k) is applied to the ANN, and the network produces two outputs, \hat{y}_1
and \hat{y}_2, corresponding to the estimated values of the in-phase and quadrature components
of the transmitted symbol or its delayed version, respectively. For the equalisation problem,
two ANN structures, i.e., a PPN and a FLANN, along with a linear equaliser trained with the LMS
algorithm, are employed for the simulation studies. The BP algorithm is employed for all the
ANN-based equalisers. Further, in all the ANN structures, all the nodes except those of the
input layer have the tanh(.) nonlinearity as their activation function. Since equalisation here
is basically a four-category classification problem, and nonlinear decision boundaries can be
formed by using an ANN, the ANN may be conveniently employed to form discriminant functions
that classify the input pattern into any one of the four categories.

Fig. 6.a: Equalisers for Channel 6 at SNR of 15 dB, NL=0.
Fig. 6.b: Equalisers for Channel 6 at SNR of 15 dB, NL=1.
7 Simulation Studies
Simulation studies have been carried out for the channel equalisation problem described by
Fig. 1, using the two ANN structures discussed above (PPN and FLANN) with the BP algorithm and
a linear FIR equaliser with the LMS algorithm. The impulse response of the channel considered
here is given by [4]

h(i) = \frac{1}{2}\left[1 + \cos\left(\frac{2\pi (i-2)}{W}\right)\right], \quad i = 1, 2, 3; \qquad h(i) = 0 \text{ otherwise}        (11)

The parameter W determines the eigenvalue ratio (EVR) of the input correlation matrix
R = E[U(k) U^T(k)], where E is the expectation operator. The EVR is defined as
\lambda_{max}/\lambda_{min}, where \lambda_{max} and \lambda_{min} are the largest and the
smallest eigenvalues of R, respectively.
The digital message used a 4-QAM signal constellation of the form \{\pm 1 \pm j\}, in which
each symbol was obtained from a uniform distribution. A zero-mean Gaussian noise was added to
the channel output; the received signal power is normalized to unity so as to make the SNR
equal to the reciprocal of the noise variance at the input of the equaliser. To study the
performance of the equaliser under different EVR conditions of the channel, the parameter W was
varied from 2.9 to 3.5 in steps of 0.2. The corresponding EVR values are 6.08, 11.12, 21.71 and
46.82 for W equal to 2.9, 3.1, 3.3 and 3.5, respectively. The detailed structure of the ANNs
and the various parameter values, including the learning rate, the momentum rate, the
polynomial order and the number of functions used in the FLANN and PPN, were determined by
numerous experiments to give the best results in the respective ANN structures. Polynomial and
trigonometric functions were used for the functional expansion of the input pattern in the
equalisers based on the PPN and the FLANN, respectively. To have a fair comparison between the
PPN- and FLANN-based equalisers, in both cases the input pattern was expanded to an
18-dimensional pattern from r(k) and r(k-1). Thus both the FLANN and the PPN have 19 nodes in
the input layer and two nodes in the output layer, respectively.
Further, a linear FIR equaliser of order eight trained with the LMS algorithm was also
simulated. In the case of the FLANN and PPN, the learning rate and momentum rate values were
0.3 and 0.5, respectively. The MSE floor corresponds to the steady-state value of the MSE,
which was obtained after averaging over 500 independent runs, each consisting of 3000
iterations. To study the BER performance, each of the equaliser structures was trained with
3000 iterations to obtain the optimal weight solution; after completion of the training, the
test iterations of the equaliser were carried out. The BER was calculated over 100 independent
runs, each consisting of 10^4 data samples. Six different channels were studied, with the
following normalized transfer functions given in z-transform form:

CH = 1: 1.0
CH = 2: 0.447 + 0.894 z^{-1}
CH = 3: 0.209 + 0.995 z^{-1} + 0.209 z^{-2}
CH = 4: 0.260 + 0.930 z^{-1} + 0.260 z^{-2}
CH = 5: 0.304 + 0.903 z^{-1} + 0.304 z^{-2}
CH = 6: 0.341 + 0.876 z^{-1} + 0.341 z^{-2}        (12)

CH = 1 corresponds to a channel without any ISI, since it has a unity impulse response. CH = 2
corresponds to a non-minimum phase channel [1]. CH = 3, CH = 4, CH = 5 and CH = 6 correspond to
W values of 2.9, 3.1, 3.3 and 3.5, respectively. Three different nonlinear channel models with
the following types of nonlinearity were introduced:

NL = 0: b(k) = a(k)
NL = 1: b(k) = tanh(a(k))
NL = 2: b(k) = a(k) + 0.2 a^2(k) - 0.1 a^3(k)
NL = 3: b(k) = a(k) + 0.2 a^2(k) - 0.1 a^3(k) + 0.5 \cos(\pi a(k))        (13)
NL = 0 corresponds to a linear channel model. NL = 1 corresponds to a nonlinear channel
which may occur in the channel due to saturation of amplifiers used in the transmitting
system. NL = 2 and NL = 3 are two arbitrary nonlinear channels.
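To make the channel setup concrete, the C sketch below (ours, not the paper's code) generates the raised-cosine taps of Eq. (11) for a given W and applies the NL = 2 distortion of Eq. (13); the reconstructed form of these equations, including the parameter name W, follows the cited channel model [4] and is an assumption where the extraction lost symbols.

    /* Illustrative channel model helpers for Eqs. (11) and (13). */
    #include <math.h>

    /* h(i) = 0.5 * (1 + cos(2*pi*(i-2)/W)) for i = 1, 2, 3; 0 otherwise */
    double channel_tap(int i, double W)
    {
        if (i < 1 || i > 3)
            return 0.0;
        return 0.5 * (1.0 + cos(2.0 * M_PI * (i - 2) / W));
    }

    /* NL = 2: b(k) = a(k) + 0.2 a^2(k) - 0.1 a^3(k) */
    double nl2(double a)
    {
        return a + 0.2 * a * a - 0.1 * a * a * a;
    }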
8 MSE Performance Study
Here, the MSE performance of the two ANN structures (FLANN and PPN), along with that of a
linear LMS-based equaliser, is reported for different channels with linear as well as nonlinear
channel models. The convergence characteristics for CH = 6 at an SNR level of 15 dB are plotted
in Figs. 6.a-6.d. It may be observed that the ANN-based equalisers show a much better
convergence rate and a lower MSE floor than the linear equaliser for linear as well as
nonlinear channel models. Of the ANN structures, the PPN-based equaliser maintains its superior
performance over the other structures in terms of convergence speed and steady-state MSE level
for the linear and nonlinear channel models.
8.1 BER Performance Study
The bit error rate (BER) provides the true picture of the performance of an equaliser. The
computation of the BER was carried out for channel equalisation with 4-QAM signal
constellations using the ANN-based and the linear LMS-based structures.

Fig. 6.c: Equalisers for Channel 6 at SNR of 15 dB, NL=2.
Fig. 6.d: Equalisers for Channel 6 at SNR of 15 dB, NL=3.
8.2 Variation of BER with SNR
The BER performance for CH = 2 is plotted in Figs. 7.a-7.d. The performance of the PPN-based
equaliser is superior to that of the FLANN-based equaliser for both linear and nonlinear
channel models. Especially for the severely nonlinear channel model (NL = 3), the PPN-based
equaliser outperforms the other structures.
9 Effect of EVR on BER Performance
The BER was computed for channels with different EVR values, for linear as well as nonlinear
channel models, at an SNR value of 12 dB. The results obtained are plotted in Figs. 8.a-8.d. As
the EVR increases, the BER performance of all three equalisers degrades. However, the
performance degradation due to the increase in EVR is much smaller in the ANN-based equalisers
than in the linear LMS-based equaliser. The performance degradation is least in the PPN-based
equaliser for the linear and the three nonlinear channel models over a wide variation of EVR
from 1 to 46.8.


Fig. 7: BER performance of FLANN-based, PPN-based and LMS-based equalisers for Channel 2 with variation of SNR: (a) NL=0, (b) NL=1, (c) NL=2, (d) NL=3.

Fig. 8: Effect of EVR on the BER performance of the three ANN-based and linear LMS-based equalisers for CH=2 with variation of SNR: (a) NL=0, (b) NL=1, (c) NL=2, (d) NL=3.
10 Conclusion
It is shown that ANN-based equalisers provide substantial improvement in terms of convergence
rate, MSE floor level and BER. In a linear equaliser the performance degrades drastically with
an increase in EVR, especially when the channel is nonlinear; however, it is shown that in the
ANN-based equalisers the performance degradation with increasing EVR is not as severe. A
PPN-based equaliser structure for adaptive channel equalisation has been studied; because of
its single-layer structure it offers advantages over the other two structures. Of the two ANN
equaliser structures (PPN and FLANN), the performance of the PPN is found to be the best in
terms of MSE level, convergence rate, BER, effect of EVR and computational complexity for
linear as well as nonlinear channel models over a wide range of SNR and EVR variations. The
performance of the PPN and the MLP is similar, but the single-layer PPN structure is preferable
to the FLANN as it offers less computational complexity, and it may be used in other signal
processing applications.
References
[1] [Chen et al., 1990] S. Chen, G. J. Gibson, C. F. N. Cowan and P. M. Grant, "Adaptive channel equalisation of finite nonlinear channels using multilayer perceptrons," Signal Processing, vol. 20, pp. 107-119, 1990.
[2] [Soraghan et al., 1992] W. S. Gan, J. J. Soraghan and T. S. Durrani, "A new functional link based equaliser," Electron. Lett., vol. 28, pp. 1643-1645, Aug. 1992.
[3] [Gibson et al., 1991] G. J. Gibson, S. Siu and C. F. N. Cowan, "The application of nonlinear structures to the reconstruction of binary signals," IEEE Trans. Signal Processing, vol. 39, pp. 1877-1884, Aug. 1991.
[4] [Haykin, 1991] S. Haykin, Adaptive Filter Theory, 2nd ed., Englewood Cliffs, NJ: Prentice Hall, 1991.
[5] [Meyer et al., 1993] M. Meyer and G. Pfeiffer, "Multilayer perceptron based decision feedback equalisers for channels with intersymbol interference," Proc. IEE, vol. 140, pt. 1, pp. 420-424, Dec. 1993.
[6] [Pao, 1989] Y. H. Pao, Adaptive Pattern Recognition and Neural Networks, Reading, MA: Addison-Wesley, 1989.
[7] [Siu et al., 1990] S. Siu, G. J. Gibson and C. F. N. Cowan, "Decision feedback equalisation using neural network structures and performance comparison with standard architecture," Proc. Inst. Elect. Eng., vol. 137, pt. 1, pp. 221-225, Aug. 1990.
[8] [Tou et al., 1981] J. T. Tou and R. C. Gonzalez, Pattern Recognition Principles, Reading, MA: Addison-Wesley, 1981.
[9] [Widrow et al., 1990] B. Widrow and M. A. Lehr, "30 years of adaptive neural networks: Perceptron, madaline and back propagation," Proc. IEEE, vol. 78, pp. 1415-1442, Sept. 1990.
[10] [Xiang et al., 1994] Z. Xiang, G. Bi and T. L. Ngoc, "Polynomial perceptrons and their applications to fading channel equalisation and co-channel interference suppression," IEEE Trans. Signal Processing, vol. 42, pp. 2470-2479, Sept. 1994.

Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Implementation of Packet Sniffer
for Traffic Analysis and Monitoring

Arshad Iqbal, Mohammad Zahid
Department of Computer Engineering, Zakir Husain College of Engineering & Technology
Aligarh Muslim University, Aligarh-202002, India
arshadiqbal@zhcet.ac.in, dzahid@zhcet.ac.in
Mohammed A Qadeer
Department of Computer Engineering, Zakir Husain College of Engineering & Technology
Aligarh Muslim University, Aligarh-202002, India
maqadeer@zhcet.ac.in

Abstract

Computer software that can intercept and log traffic passing over a digital network, or part of
a network, is better known as a packet sniffer. The sniffer captures these packets by setting
the NIC card into promiscuous mode and eventually decodes them. The decoded information can be
used in any way, depending upon the intention of the person who decodes the data (i.e., for
malicious or for beneficial purposes). Depending on the network structure (hub or switch), one
can sniff all or just part of the traffic from a single machine within the network; however,
there are some methods to avoid the traffic narrowing done by switches and so gain access to
traffic from other systems on the network. This paper focuses on the basics of a packet sniffer
and its working, and on the development of such a tool by an individual on the Linux platform.
It also discusses ways to detect the presence of such software on the network and to handle it
efficiently. Focus has also been laid on analyzing the bottleneck scenarios arising in the
network using this self-developed packet sniffer. Before the development of this indigenous
software, detailed observation was made of the working behavior of already existing sniffer
software such as WIRESHARK (formerly known as ETHEREAL), TCPDUMP and SNORT, which served as the
base for the development of our sniffer software. For the capture of the packets, a library
known as LIBPCAP has been used. The development of such software gives the developer a chance
to incorporate additional features that are not in the existing tools.
1 Introduction
A packet sniffer is a program running in a network-attached device that passively receives all
data link layer frames passing by the device's network adapter. It is also known as a Network
or Protocol Analyzer, or Ethernet Sniffer. The packet sniffer captures the data that is
addressed to other machines, saving it for later analysis. It can be used legitimately by a
network or system administrator to monitor and troubleshoot network traffic. Using the
information captured by the packet sniffer, an administrator can identify erroneous packets and
use the data to pinpoint bottlenecks and help maintain efficient network data transmission.
Packet sniffers were never made to hack or steal information; they had a different goal, to
make things secure. Figure 1 shows how the data travels from the application layer to the
network interface card.

Fig. 1: Flow of packets
2 Library: LIBPCAP
Pcap consists of an application programming interface (API) for capturing packets on the
network. UNIX-like systems implement pcap in the libpcap library; Windows uses a port of
libpcap known as WinPcap. LIBPCAP is a widely used, standard packet capture library that was
developed for use with the BPF (Berkeley Packet Filter) kernel device. BPF can be considered an
OS kernel extension; it is BPF which enables communication between the operating system and the
NIC. Libpcap is a C language library that builds on the BPF constructs and is used to capture
packets on the network directly from the network adapter. The library is commonly shipped with
the operating system and provides packet capturing and filtering capability. It was originally
developed by the tcpdump developers in the Network Research Group at Lawrence Berkeley
Laboratory [Libpcap]. If the library is missing from the operating system, it can be installed
later, as it is available as open source.
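For concreteness, a minimal libpcap capture loop is sketched below; it is not taken from the paper, and the interface name "eth0", the snapshot length and the packet handler are example choices only.

    /* Minimal illustrative libpcap capture loop (not the authors' code). */
    #include <pcap.h>
    #include <stdio.h>

    static void handler(u_char *user, const struct pcap_pkthdr *h, const u_char *bytes)
    {
        (void)user; (void)bytes;
        printf("captured %u bytes\n", h->caplen);   /* bytes actually captured */
    }

    int main(void)
    {
        char errbuf[PCAP_ERRBUF_SIZE];
        /* open "eth0" in promiscuous mode (3rd argument = 1), 65535-byte snaplen */
        pcap_t *p = pcap_open_live("eth0", 65535, 1, 1000, errbuf);
        if (p == NULL) {
            fprintf(stderr, "pcap_open_live: %s\n", errbuf);
            return 1;
        }
        pcap_loop(p, 10, handler, NULL);            /* capture 10 packets */
        pcap_close(p);
        return 0;
    }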
3 Promiscuous Mode
The network interface card works in two modes:
a. Non-promiscuous mode (normal mode)
b. Promiscuous mode
When a packet is received by a NIC, it first compares the MAC address of the packet to its own.
If the MAC address matches, it accepts the packet; otherwise it discards it. The network card
thus discards all the packets that do not contain its own MAC address, an operation mode called
non-promiscuous, which basically means that each network card minds its own business and reads
only the frames directed to it. In order to capture the
packets, the NIC has to be set into promiscuous mode. A packet sniffer does its sniffing by
setting the NIC card of its own system into promiscuous mode, and hence receives all packets
even if they are not intended for it. So, a packet sniffer captures the packets by setting the
NIC card into promiscuous mode. To set a network card to promiscuous mode, all we have to do is
issue a particular ioctl() call on an open socket bound to that card, and the packets are then
passed to the kernel.
4 Sniffer Working Mechanism
Packets sent from one node to another (i.e., from source to destination) on the network have to
pass through many intermediate nodes. A node whose NIC is set into promiscuous mode receives
all such packets. A packet arriving at the NIC is copied into the device driver's memory and
then passed to the kernel buffer, from where it is used by the user application. In the Linux
kernel, libpcap uses a PF_PACKET socket, which bypasses most of the packet protocol processing
done by the kernel [Dabir and Matrawy, 2007]. Each socket has two kernel buffers associated
with it, one for reading and one for writing. By default in Fedora Core 6, the size of each
buffer is 109568 bytes. In our packet sniffer, at user level the packets are copied from the
kernel buffer into a buffer created by libpcap when a live capture session is created. A single
packet is handled by the buffer at a time for application processing before the next packet is
copied into it [Dabir and Matrawy, 2007]. The new approach taken in the development of our
packet sniffer to improve its performance with libpcap is to use the same buffer space between
kernel space and the application. Figures 2 and 3 show the interface of our packet sniffer
while capturing packets.

Fig. 2: Packet sniffer while capturing session.

Fig. 3: Shows the details of selected packet
5 Basic Steps for the Development of Packet Sniffer on Linux Platform
We now discuss the basic steps that we have taken during the development of our packet sniffer.
The remaining steps only deal with interpreting the headers and formatting the data. The steps
we have taken are as follows.
5.1 Socket Creation
A socket is a bi-directional communication abstraction via which an application can send and
receive data.
There are many types of socket:
SOCK_STREAM: TCP (connection oriented, guaranteed delivery)
SOCK_DGRAM: UDP (datagram based communication)
SOCK_RAW: allows access to the network layer. This can be used to build ICMP messages or
custom IP packets.
SOCK_PACKET: allows access to the link layer (e.g. Ethernet). This can be used to build an
entire frame (for example, to build a user-space router).
When a socket is created, a socket stream, similar to a file stream, is created, through which
data is read [Ansari et al., 2003].
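The list above mentions the older SOCK_PACKET interface; on current Linux kernels the equivalent capture socket is opened with the AF_PACKET family (the same PF_PACKET facility libpcap uses, as noted earlier). The sketch below is illustrative only and is not the authors' code; it requires root privileges.

    /* Illustrative creation of a Linux packet-capture socket. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <arpa/inet.h>       /* htons()   */
    #include <linux/if_ether.h>  /* ETH_P_ALL */

    int open_capture_socket(void)
    {
        /* AF_PACKET + SOCK_RAW delivers whole link-layer frames for every protocol */
        int sock = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (sock < 0)
            perror("socket");
        return sock;
    }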
5.2 To Set NIC in Promiscuous Mode
To enable the packet sniffer to capture packets, the NIC of the node on which the sniffer
software is running has to be set into promiscuous mode. In our packet sniffer this was
implemented by issuing an ioctl() call on an open socket on that card. The ioctl system call
takes three arguments:
a. the socket stream descriptor;
b. the request that ioctl is supposed to perform; here the macros used are SIOCGIFFLAGS to
read the interface flags and SIOCSIFFLAGS to write them back;
c. a reference to the ifreq structure [Ansari et al., 2003].
Since this is a potentially security-threatening operation, the call is only allowed for the
root user. Supposing that ``sock'' contains an already open socket, the following instructions
will do the trick:
ioctl(sock, SIOCGIFFLAGS, &ethreq);
ethreq.ifr_flags |= IFF_PROMISC;
ioctl(sock, SIOCSIFFLAGS, &ethreq);
The first ioctl reads the current value of the Ethernet card flags; the flags are then ORed
with IFF_PROMISC, which enables promiscuous mode, and are written back to the card with the
second ioctl, which uses SIOCSIFFLAGS.
5.3 Protocol Interpretation
In order to interpret a protocol, the developer should have some basic knowledge of the
protocol he wishes to sniff. In our sniffer, which we developed on the Linux platform, we
interpreted
protocols such as IP, TCP, UDP and ICMP by including the headers <linux/tcp.h>, <linux/udp.h>,
<linux/ip.h> and <linux/icmp.h>. Figures 4, 5 and 6 below show some of the packet header
formats.
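As a small example of such protocol interpretation, the sketch below (ours, not the paper's code) casts the bytes that follow the Ethernet header to the struct iphdr defined in <linux/ip.h> and prints a few fields; it assumes the frame really carries an IP packet.

    /* Illustrative interpretation of the IP header of a captured frame. */
    #include <stdio.h>
    #include <arpa/inet.h>       /* inet_ntoa(), struct in_addr */
    #include <linux/if_ether.h>  /* ETH_HLEN                    */
    #include <linux/ip.h>        /* struct iphdr                */

    void print_ip_header(const unsigned char *frame)
    {
        const struct iphdr *ip = (const struct iphdr *)(frame + ETH_HLEN);
        struct in_addr src = { .s_addr = ip->saddr };
        struct in_addr dst = { .s_addr = ip->daddr };

        printf("IP %s -> ", inet_ntoa(src));    /* inet_ntoa reuses a static buffer, */
        printf("%s proto=%u ttl=%u\n",          /* so the two calls are kept apart   */
               inet_ntoa(dst), ip->protocol, ip->ttl);
    }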
6 Linux Filter
As network traffic increases, the sniffer will start losing packets since the PC will not be able
to process them quickly enough. The solution to this problem is to filter the packets you
receive, and print out information only on those you are interested in. One idea would be to
insert an ``if statement'' in the sniffer's source; this would help polish the output of the sniffer,
but it would not be very efficient in terms of performance. The kernel would still pull up all
the packets flowing on the network, thus wasting processing time, and the sniffer would still
examine each packet header to decide whether to print out the related data or not. The
optimal solution to this problem is to put the filter as early as possible in the packet-
processing chain (it starts at the network driver level and ends at the application level, see
Figure 7). The Linux kernel allows us to put a filter, called an LPF, directly inside the
PF_PACKET protocol-processing routines, which are run shortly after the network card
reception interrupt has been served. The filter decides which packets shall be relayed to the
application and which ones should be discarded.
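With libpcap, which this sniffer already uses, the same kernel-level filtering can be requested through pcap_compile() and pcap_setfilter(), which install a BPF program for the capture handle. The sketch below is illustrative only; the "tcp port 80" expression is just an example.

    /* Illustrative attachment of a BPF filter to a libpcap handle. */
    #include <pcap.h>
    #include <stdio.h>

    int attach_filter(pcap_t *p)
    {
        struct bpf_program prog;

        /* compile the human-readable expression into BPF byte code (netmask 0) */
        if (pcap_compile(p, &prog, "tcp port 80", 1, 0) == -1) {
            fprintf(stderr, "pcap_compile: %s\n", pcap_geterr(p));
            return -1;
        }
        /* install it; non-matching packets are dropped before reaching the application */
        if (pcap_setfilter(p, &prog) == -1) {
            fprintf(stderr, "pcap_setfilter: %s\n", pcap_geterr(p));
            pcap_freecode(&prog);
            return -1;
        }
        pcap_freecode(&prog);
        return 0;
    }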

Fig. 4: TCP protocol header fields


Fig. 5: UDP protocol header fields

Fig. 6: IP protocol header fields

Fig. 7: Filter processing chain
7 Methods to Sniff On Switch
An Ethernet environment in which the hosts are connected to a switch instead of a hub is
called a Switched Ethernet. The switch maintains a table keeping track of each computer's
MAC address and delivers packets destined for a particular machine to the port on which that
machine is connected. The switch is an intelligent device that sends packets to the destined
computer only and does not broadcast to all the machines on the network, as in the previous
case.
7.1 ARP Spoofing
As we know, ARP is used to obtain the MAC address of the destination machine with which we wish
to communicate. ARP is stateless: we can send an ARP reply even if one has not been asked for,
and such a reply will be accepted. Ideally, when you want to sniff the
traffic originating from a machine, you need to ARP-spoof the gateway of the network. The ARP
cache of that machine will then have a wrong entry for the gateway and is said to be
"poisoned". This way, all the traffic from that machine destined for the gateway will pass
through your machine. Another trick that can be used is to poison a host's ARP cache by setting
the gateway's MAC address to FF:FF:FF:FF:FF:FF (also known as the broadcast MAC). There are
various utilities available for ARP spoofing; an excellent tool is the arpspoof utility that
comes with the dsniff suite.
7.2 MAC Flooding
Switches keep a translation table that maps various MAC addresses to the physical ports on
the switch. As a result of this, a switch can intelligently route packets from one host to
another, but it has a limited memory for this work. MAC flooding makes use of this
limitation to bombard the switch with fake MAC addresses until the switch can't keep up. The
switch then enters into what is known as a `failopen mode', wherein it starts acting as a hub
by broadcasting packets to all the machines on the network. Once that happens sniffing can
be performed easily. MAC flooding can be performed by using macof, a utility which comes
with dsniff suite.
8 Bottleneck Analysis
With the increase of traffic in the network, the rate of packets received by the node also
increases. When packets arrive at the NIC, they have to be transferred to main memory for
processing, and each packet is transferred individually over the bus. The effective throughput
of the PCI bus is no more than about 40 to 50 Mbps, because a device may hold the bus only
for a certain number of cycles before it has to hand over control. Moreover, the slowest
component of a PC is the disk drive, so on a traffic-intensive network a bottleneck is created
when writing the packets to disk. To ease this bottleneck we can use buffering in the user-level
application: a portion of RAM is used as a buffer that absorbs bursts before the packets are
written to disk.
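One minimal way to realise such a RAM buffer, offered here only as a sketch of the idea rather than anything from the paper, is a fixed-size ring shared between the capture thread and a slower disk-writer thread; the sizes and the drop-when-full policy are illustrative assumptions.

    #include <pthread.h>
    #include <string.h>

    #define SLOTS    4096          /* number of buffered packets (assumed) */
    #define SNAPLEN  2048          /* bytes stored per packet (assumed)    */

    struct slot { unsigned len; unsigned char data[SNAPLEN]; };

    static struct slot ring[SLOTS];
    static unsigned head, tail;     /* producer / consumer indices */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Called from the capture loop: copy the packet into RAM and return quickly. */
    int ring_put(const unsigned char *pkt, unsigned len) {
        int ok = 0;
        pthread_mutex_lock(&lock);
        if ((head + 1) % SLOTS != tail) {           /* drop if the buffer is full */
            ring[head].len = len < SNAPLEN ? len : SNAPLEN;
            memcpy(ring[head].data, pkt, ring[head].len);
            head = (head + 1) % SLOTS;
            ok = 1;
        }
        pthread_mutex_unlock(&lock);
        return ok;
    }

    /* Called from the disk-writer thread: take one packet out, or return 0 if empty. */
    int ring_get(struct slot *out) {
        int ok = 0;
        pthread_mutex_lock(&lock);
        if (tail != head) {
            *out = ring[tail];
            tail = (tail + 1) % SLOTS;
            ok = 1;
        }
        pthread_mutex_unlock(&lock);
        return ok;
    }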
9 Detection of Packet Sniffer
Although the packet sniffer has been designed as a solution to many network problems, its
malicious use cannot be ignored. Sniffers are very hard to detect because they are passive,
but there is always a way. Some of the detection techniques are:
9.1 ARP Detection Technique
A sniffing host receives all packets, including those that are not destined for it, and it can
give itself away by responding to packets that should have been filtered out. So, if an ARP
packet is sent to every host, configured so that its destination address is not the broadcast
address, and some host responds to it, then that host has its NIC set to promiscuous mode
[Sanai]. Windows is not an open-source OS, so we cannot analyse its software filter behaviour
the way we can in Linux, where the filter can be studied by examining the kernel source code.
Instead, the following probe addresses can be used with Windows (see the sketch after this
list):
a. FF-FF-FF-FF-FF-FF (broadcast address): a packet carrying this address is received by
all nodes and answered by them.
b. FF-FF-FF-FF-FF-FE (fake broadcast address): the last bit differs from the true
broadcast address; it checks whether the filter examines all bits of the address before
responding.
c. FF-FF-00-00-00-00 (fake 16-bit broadcast address): only the first 16 bits are the same
as the broadcast address.
d. FF-00-00-00-00-00 (fake 8-bit broadcast address): only the first 8 bits are the same as
the broadcast address [Sanai].
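The probe addresses above could be tabulated as follows in a detector; this is only an illustrative C fragment, and the structure layout is our assumption.

    /* Probe destination MACs used when testing each host for promiscuous mode. */
    static const struct {
        const char *name;
        unsigned char mac[6];
    } arp_probes[] = {
        { "broadcast",           {0xff,0xff,0xff,0xff,0xff,0xff} },
        { "fake broadcast (FE)", {0xff,0xff,0xff,0xff,0xff,0xfe} },
        { "fake 16-bit",         {0xff,0xff,0x00,0x00,0x00,0x00} },
        { "fake 8-bit",          {0xff,0x00,0x00,0x00,0x00,0x00} },
    };
    /* For each host, send an ARP request to every probe MAC; a reply to any
     * non-broadcast probe suggests the host's NIC is in promiscuous mode. */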
9.2 RTT Detection
RTT stands for Round Trip Time: the time a packet takes to reach the destination plus the
time the response takes to return to the source. In this technique, packets are first sent to the
host while it is in normal mode and the RTT is recorded. The same host is then set to
promiscuous mode, the same set of packets is sent, and the RTT is recorded again. The idea
behind this technique is that the measured RTT increases when the host is in promiscuous
mode, because it captures and processes all packets rather than only those addressed to it
[Trabelsi et al., 2004].
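As a rough illustration of the measurement step, and not the authors' tool, the following C sketch times a series of probes to the host under test. It uses TCP connect() time as a simple stand-in for RTT (the cited technique uses ping probes), and the target address and port are placeholders.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <time.h>
    #include <unistd.h>

    static double now_ms(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
    }

    int main(void) {
        struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(80) };
        inet_pton(AF_INET, "192.168.1.10", &dst.sin_addr);   /* host under test (placeholder) */
        double total = 0;
        int probes = 10;
        for (int i = 0; i < probes; i++) {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            double t0 = now_ms();
            connect(fd, (struct sockaddr *)&dst, sizeof(dst)); /* one round trip, roughly */
            total += now_ms() - t0;
            close(fd);
        }
        /* Compare this average with the host in normal mode vs. promiscuous mode. */
        printf("average RTT proxy: %.3f ms over %d probes\n", total / probes, probes);
        return 0;
    }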
10 Future Enhancement
This packet sniffer can be enhanced in the future by making the program platform
independent, filtering packets by means of a filter table, filtering suspect content out of the
network traffic, and gathering and reporting network statistics.
11 Conclusion
A packet sniffer is not just a hacker's tool. It can be used for network traffic monitoring,
traffic analysis, troubleshooting and other useful purposes. At the same time, a user can
employ a number of techniques, as discussed in this paper, to detect sniffers on the network
and protect data from being sniffed.
References
[1] [Ansari et al., 2003] S. Ansari, Rajeev S. G. and Chandrasekhar H. S., Packet Sniffing: A Brief Introduction,
IEEE Potentials, Dec 2002-Jan 2003, Volume 21, Issue 5, pp. 17-19.
[2] [Combs, 2007] G. Combs, "Ethereal", available at http://www.wireshark.org (Aug 15, 2007).
[3] [Dabir and Matrawy, 2007] A. Dabir, A. Matrawy, Bottleneck Analysis of Traffic Monitoring Using
Wireshark, 4th International Conference on Innovations in Information Technology (Innovations '07),
18-20 Nov. 2007, pp. 158-162.
[4] [Drury, 2000] J. Drury, Sniffers: What are they and how to protect from them, November 11, 2000,
http://www.sans.org/infosecFAQ/switchednet/sniffers.htm
[5] [Kurose, 2005] Kurose, James & Ross, Keith, Computer Networking, Pearson Education, 2005.
[6] [Libpcap] Libpcap, http://wikipedia.com
[7] [Sanai] Daiji Sanai, Detection of Promiscuous Nodes Using ARP Packet, http://www.securityfriday.com/
[8] [Sniffing FAQ] Sniffing FAQ, http://www.robertgraham.com
[9] [Sniffer] Sniffer resources, http://packetstorm.decepticons.org
[10] [Stevens, 2001] Richard Stevens, TCP/IP Illustrated, 2001.
[11] [Stevens and Richard, 2001] Stevens, Richard, UNIX Network Programming, Prentice Hall India, 2001.
[12] [Stones et al., 2004] Stones, Richard & Matthew, Neil, Beginning Linux Programming, Wrox Publishers, 2004.
[13] [Trabelsi et al., 2004] Zouheir Trabelsi, Hamza Rahmani, Kamel Kaouech, Mounir Frikha, Malicious
Sniffing System Detection Platform, Proceedings of the 2004 International Symposium on Applications
and the Internet (SAINT'04), IEEE Computer Society.

Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Implementation of BGP Using XORP

Quamar Niyaz, Department of Computer Engineering, Zakir Husain College of Engineering & Technology, Aligarh Muslim University, Aligarh-202002, India, Quamarniyaz@zhcet.ac.in
S. Kashif Ahmad, Department of Computer Engineering, Zakir Husain College of Engineering & Technology, Aligarh Muslim University, Aligarh-202002, India, syedkashif@zhcet.ac.in
Mohammad A. Qadeer
Department of Computer Engineering, Zakir Husain College of Engineering & Technology
Aligarh Muslim University, Aligarh-202002, India
maqadeer@zhcet.ac.in

Abstract

In this paper we present an approach to implementing BGP and discuss XORP
(eXtensible Open Routing Platform), an open-source routing platform that we use for
routing in our designed networks. Linux-based PCs running XORP act as the routers;
each such PC must have two or more NICs (network interface cards). We implement
the BGP routing protocol, which is used for routing between different Autonomous
Systems.
1 Introduction
With the continuous growth of the Internet, efficient routing, traffic engineering and QoS
(Quality of Service) have become challenges for the network research community. Routers
from established vendors such as Cisco are so architecture dependent that they do not provide
APIs allowing third-party applications to run on their hardware. This creates the need for
open-source routing software that gives researchers access to APIs and documentation, so
that they can develop their own EGP (Exterior Gateway Protocol) and IGP (Interior Gateway
Protocol) routing protocols and new techniques for QoS. XORP [xorp.org] is one such effort
in this direction. The goal of XORP is to develop an open-source router platform that is stable
and fully featured enough for production use, and flexible and extensible enough to enable
network research. Currently XORP implements routing protocols for IPv4 and IPv6,
including BGP, OSPF, RIP, PIM-SM and IGMP, and a unified means to configure them. The
best part of XORP is that it also provides an extensible programming API. XORP runs on
many UNIX flavors. In this paper we first discuss the design and architecture of XORP and
then the implementation of routing protocols in our designed network using XORP.
2 Architecture of XORP
The XORP design philosophy stresses extensibility, performance and robustness and
traditional router features. For routing and management modules, the primary goals are
extensibility and robustness. These goals are achieved by carefully separating functionality
into independent modules, running in separate UNIX processes, with well-defined APIs
between them.
2.1 Design Overview
XORP can be divided into two subsystems. The higher-level (user-space) subsystem
consists of the routing protocols and management mechanisms. The lower-level (kernel)
provides the forwarding path, and provides APIs for the higher-level to access. User-level
XORP uses a multi-process architecture with one process per routing protocol, and a novel
inter-process communication mechanism known as XORP Resource Locators (XRLs)[Xorp-
ipc]. XRL communication is not limited to a single host, and so XORP can in principle run in
a distributed fashion. For example, we can have a distributed router, with the forwarding
engine running on one machine, and each of the routing protocols that update that forwarding
engine running on a separate control processor system. The lower-level subsystem can use
traditional UNIX kernel forwarding, the Click modular router [Kohler et al., 2000] or
Windows kernel forwarding (Windows Server 2003). The modularity and minimal
dependencies between the lower-level and user-level subsystems allow for many future
possibilities for forwarding engines. Figure 1 shows the processes in XORP, which we
describe in Section 2.2, although it should be noted that some of these modules use separate
processes to handle IPv4 and IPv6. For simplicity, the arrows show only the main
communication flows used for routing information [Handley et al., 2002].

Fig. 1: XORP High-level Processes
2.2 XORP Process Description
As shown in Figure 1, there are several processes in the XORP system: some implement
routing protocols (e.g. OSPF, BGP4+, RIP) and some implement management and
forwarding mechanisms (e.g. FEA, RIB, SNMP). Among these are four core processes,
namely the FEA, the RIB, the Router Manager (rtrmgr) and the IPC finder, which we
describe in the following sections.
2.2.1 FEA (Forward Engine Abstraction)
The role of the Forwarding Engine Abstraction (FEA) in XORP is to provide a uniform
interface to the underlying forwarding engine. It shields XORP processes from concerns over
variations between platforms. The FEA performs four distinct roles: interface management,
forwarding table management, raw packet I/O, and TCP/UDP socket I/O[Xorp-fea].
2.2.2 RIB (Routing Information Base)
The RIB process takes routing information from multiple routing protocols, stores these
routes, and decides which routes should be propagated on to the forwarding engine. The RIB
performs the following tasks:
Stores routes provided by the routing protocols running on a XORP router.
If more than one routing protocol provides a route for the same subnet, the RIB
decides which route will be used.
Protocols such as BGP may supply to the RIB routes that have a next-hop that is not
an immediate neighbor. Such next hops are resolved by the RIB so as to provide a
route with an immediate neighbor to the FEA.
Protocols such as BGP need to know routing metric and reachability information to
next hops that are not immediate neighbors. The RIB provides a way to register
interest in such routing information, in such a way that the routing protocol will be
notified if a change occurs [Xorp-rib].
2.2.3 rtrmgr (Router Manager)
XORP tries to hide from the operator the internal structure of the software, so that the
operator only needs to know the right commands to use to configure the router. The operator
should not need to know that XORP is internally composed of multiple processes, nor what
those processes do. All the operator needs to see is a single router configuration file that
determines the startup configuration, and a single command line interface that can be used to
configure XORP. There is a single XORP process that manages the whole XORP router - this
is called the rtrmgr (XORP Router Manager). The rtrmgr is responsible for starting all
components of the router, configuring each of them, and monitoring and restarting any failing
process. It also provides a CLI (Command Line Interface) to change the router configuration
[Xorp-rtrmgr].
2.2.4 IPC (Inter Process Communication) Finder
The IPC finder is needed by the communication method used among all XORP components.
Each of the XORP components registers with the IPC finder. The main goals of XORP's IPC
scheme are:
To provide all of the IPC communication mechanisms that a router is likely to need,
e.g. sockets, ioctls, System V messages, shared memory.
To provide a consistent and transparent interface irrespective of the underlying
mechanism used.
To provide an asynchronous interface.
To potentially wrap communication with non-XORP processes, e.g. HTTP and
SNMP servers.
To be renderable in human readable form so XORP processes can read and write commands
from configuration files.
3 Implementation of BGP Routing Protocol
Routing protocols are classified into two categories: intra-Autonomous-System routing
protocols and inter-Autonomous-System routing protocols, the latter also known as Exterior
Gateway Protocols (EGPs). An AS (Autonomous System) corresponds to a routing domain
that is under one administrative authority and which implements its own routing policies.
BGP is an inter-AS routing protocol used for routing between different ASs. To implement
and design the protocol we use Linux-based PCs, on which XORP is installed, as routers.
Each of these PCs must have two or more NICs. To start XORP a configuration file is needed.
The XORP router manager process can be started with the command rtrmgr -b
my_config.boot, where my_config.boot is the configuration file. On startup, XORP configures
the specified interfaces and starts all the required XORP components such as the FEA, RIP,
OSPF and BGP. Figure 2 shows the interaction between the configuration files, the Router
Manager, the FEA, etc. In the following sections we discuss the BGP protocol, the network
topology for it, and the corresponding syntax in the configuration file.
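For orientation, a minimal sketch of what the interface part of such a boot file might look like is shown below, using one of the addresses from the topology in Section 3.3; the keywords follow the general XORP configuration syntax but should be checked against the XORP user manual for the installed version.

    interfaces {
        interface eth0 {
            vif eth0 {
                address 45.230.10.10 {
                    prefix-length: 24
                }
            }
        }
    }
    fea {
        unicast-forwarding4 {
            disable: false
        }
    }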


Fig. 2: Interaction Between Modules [Xorp-fea]
3.1 BGP (Border Gateway Protocol)
The Border Gateway Protocol is the routing protocol used to exchange routing information
across the Internet. It makes it possible for ISPs to connect to each other and for end-users to
connect to more than one ISP. BGP is the only protocol that is designed to deal with a
network of the Internet's size, and the only protocol that can deal well with having multiple
connections to unrelated routing domains [Kurose and Ross, 2007].
3.2 BGP Working
The main concept used in BGP is that of the Autonomous System (AS) which we described
earlier.
BGP is used in two different ways:
eBGP is used to exchange routing information between routers that are in different
ASs.
iBGP is used to exchange information between routers that are in the same AS.
Typically these routes were originally learned from eBGP.
Each BGP route carries with it an AS path, which essentially records the autonomous
systems through which the route has passed between the AS where it was originally
advertised and the current AS. When a BGP router passes a route to a router in a
neighbouring AS, it prepends its own AS number to the AS path. For example, a route
originated in AS 65040 and passed on by AS 65030 arrives at AS 65020 carrying the AS path
(65030, 65040). The AS path is used to prevent routes from looping, and it can also be used
in policy filters to decide whether or not to accept a route.
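The prepending step can be illustrated with a small C sketch, which is ours rather than XORP code, using the AS numbers from Section 3.3.

    /* Sketch: prepending the local AS number to an AS path before advertising
     * a route to an eBGP peer. */
    #include <stdio.h>

    #define MAX_AS_PATH 16

    struct as_path { int len; unsigned as[MAX_AS_PATH]; };

    static int prepend_local_as(struct as_path *p, unsigned local_as) {
        if (p->len >= MAX_AS_PATH) return -1;               /* path too long */
        for (int i = p->len; i > 0; i--) p->as[i] = p->as[i - 1];
        p->as[0] = local_as;
        p->len++;
        return 0;
    }

    int main(void) {
        struct as_path p = { .len = 1, .as = { 65040 } };   /* originated in AS 65040 */
        prepend_local_as(&p, 65030);                        /* advertised on by AS 65030 */
        for (int i = 0; i < p.len; i++) printf("%u ", p.as[i]);  /* prints: 65030 65040 */
        printf("\n");
        return 0;
    }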
When a router receives a route from an iBGP peer and decides that this route is the best
route to the destination, it will pass the route on to its eBGP peers, but it will not normally
pass the route on to another iBGP peer. This prevents routing information from looping
within the AS, but it means that by default every BGP router in a domain must be peered with
every other BGP router in the domain.
Routers typically have multiple IP addresses, with at least one for each interface, and often an
additional routable IP address associated with the loopback interface. When configuring an
iBGP connection, it is good practice to set up the peering to be between the IP addresses on
the loopback interfaces. This makes the connection independent of the state of any particular
interface. However, most eBGP peering will be configured using the IP address of the router
that is directly connected to the eBGP peer router. Thus if the interface to that peer goes
down, the peering session will also go down, causing the routing to correctly fail over to an
alternative path.
3.3 Network Topology
In our design we have created three Autonomous Systems: AS65030, AS65020 and AS65040.
There are two end systems: one is attached to AS 65020 and the other to AS 65040.
Configuration of the end systems:
End System 1: IP address 45.230.20.2, subnet mask 255.255.255.0
End System 2: IP address 45.230.30.2, subnet mask 255.255.255.0
In our topology, shown in Figure 3, all the routers are ordinary PCs on which XORP processes
run, enabling them to work as routers. The router in AS65030 has a BGP identifier of
45.230.10.10, which is the IP address of one of its interfaces. This router has two BGP
peerings configured, with peers at IP addresses 45.230.10.20 and 45.230.1.10. These peerings
are eBGP connections because the peers are in different ASs (65020 and 65040).
(Figure 3 shows ASs 65020, 65030 and 65040 and the interface addresses 45.230.10.10/24, 45.230.10.20/24, 45.230.1.10/24, 45.230.1.20/24, 45.230.20.1/24, 45.230.20.2/24, 45.230.30.1/24 and 45.230.30.2/24.)

Fig. 3: Network Topology for BGP
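A hedged sketch of the bgp block for the AS65030 router of Figure 3 is given below; the keyword names follow XORP configuration conventions and should be verified against the installed version, and the local address 45.230.1.20 on the link towards AS65040 is inferred from the figure rather than stated in the text.

    protocols {
        bgp {
            bgp-id: 45.230.10.10
            local-as: 65030
            peer 45.230.10.20 {
                local-ip: 45.230.10.10
                as: 65020
                next-hop: 45.230.10.10
            }
            peer 45.230.1.10 {
                local-ip: 45.230.1.20
                as: 65040
                next-hop: 45.230.1.20
            }
        }
    }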
References
[1] [Handley et al., 2002] M. Handley, O. Hodson, E. Kohler, XORP: An Open Platform for Network
Research, In Proc. of First Workshop on Hot Topics in Networks, Oct. 2002.
[2] [Kohler et al., 2000] E. Kohler, R. Morris, B. Chen, J. Jannotti, M. F. Kaashoek, The Click Modular
Router, ACM Trans. on Computer Systems, vol. 18, no. 3, Aug. 2000.
[3] [Kurose and Ross, 2007] James F. Kurose and Keith W. Ross, Computer Networking: A Top-Down
Approach Featuring the Internet, Pearson Education (2007).
[4] [Xorp-fea] XORP Forwarding Engine Abstraction XORP technical document. http://www.xorp.org/.
[5] [Xorp-ipc] XORP Inter-Process Communication library XORP technical document. http://www.xorp.org/
[6] [Xorp-rib] XORP Routing Information Base XORP technical document. http://www.xorp.org/.
[7] [Xorp-rtrmgr] XORP Router Manager Process (rtrmgr) XORP technical document. http://www.xorp.org/.
[8] [xorp.org] Extensible Open Routing Platform www.xorp.org
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Voice Calls Using IP enabled Wireless Phones
on WiFi / GPRS Networks

Robin Kasana, Department of Computer Engineering, Zakir Husain College of Engineering & Technology, Aligarh Muslim University, Aligarh-202002, India, robinkasana@zhcet.ac.in
Sarvat Sayeed, Department of Computer Engineering, Zakir Husain College of Engineering & Technology, Aligarh Muslim University, Aligarh-202002, India, sarvatsayeed@zhcet.ac.in
Mohammad A Qadeer
Department of Computer Engineering
Zakir Husain College of Engineering & Technology
Aligarh Muslim University, Aligarh-202002, India
maqadeer@zhcet.ac.in

Abstract

This paper discusses the technology and implementation of an IP phone based on a
WiFi network, including the network structure and the technology used in designing
the IP phone terminal. This technology is a form of telecommunication that allows
data and voice transmissions to be sent across a wide range of interconnected
networks. A WiFi-enabled IP phone preinstalled with the Symbian operating system
is used, and a software application developed in J2ME allows free and secured
communication between selected IP phones in the WiFi network. The communication
relies on routing tables maintained in the WiFi routers, and channels are established
over the unlicensed 2.4 GHz band. Because this band is free, the channel is vulnerable
to external attacks and hacking; the challenge of creating a secure channel is
addressed by using two different encryption mechanisms, with the payload and the
header of the voice data packets encrypted using two different algorithms. The
communication system is thereby made almost fully secure. The WiFi server can also
tunnel calls to the GPRS network using a UNC. The system is cost effective, allows
easier communication, suits international usage, and can be very useful for large
corporations. In time this will become a cheap and secure way to communicate and
will have a large effect on university, business and personal communication.
1 Introduction
As humans became more civilized, the need for more advanced equipment grew. Many things
that were initially regarded as luxuries have become necessities of daily life; the telephone is
one such invention. However, the development of conventional telephony systems lags far
behind the development of today's Internet. Centralized architectures with dumb terminals
make the exchange of data very complex but provide very limited functions, and closed,
proprietary hardware systems hinder enterprises in choosing products from different vendors
and deploying voice functions to meet their business needs. Consequently, a Web-like
distributed IP phone architecture [Collateral] has been proposed to let enterprises and
individuals provide their own phone services. The advent of Voice over Internet Protocol
(VoIP) has been fundamentally transforming the way telecommunication evolves [Yu et al.,
2003]. This technology is a form of telecommunication that allows data and voice
transmissions to be sent across a wide variety of networks. VoIP allows businesses to talk to
other branches, using a PC phone, over corporate intranets. Driven by the ongoing deployment
of broadband infrastructure and the increasing demand for telecommunication services, VoIP
technologies and applications have led to economical IP phone equipment for the ever-growing
VoIP communication market [Metcalfe, 2000]; based on embedded systems, IP phone
applications can provide the necessary interfaces between telephony signals and IP networks
[Ho et al., 2003]. IP phone communication over data networks such as LANs exists, but these
IP phones are of the fixed type. We have tried to implement wireless IP phone communication
using the WiFi network. Because this network operates in the free bandwidth channel, it is
considered insecure and vulnerable to security threats and hacking, so the areas of concern are
the security and the running cost of the communication system. Since a lot of sensitive
information can be lost through an insecure communication system, much work is required in
this field to fill the lacuna. The basic idea is to unify voice and data onto a single network
infrastructure by digitizing the voice signals, converting them into IP packets and sending
them through an IP network together with the data, instead of using a separate telephony
network.
2 Related Work
The primary feature of a voice application is that it is extremely delay-sensitive rather than
error-sensitive. Several approaches have been developed to support delay-sensitive
applications on IP networks. In the transport layer, UDP can be used to carry voice packets
while TCP may be used to transfer control signals, since TCP introduces long delays through
its retransmission and three-way-handshake mechanisms. The Real-time Transport Protocol
(RTP) [Casner et al., 1996] compensates for the real-time deficiencies of packet networks by
operating on top of UDP and providing mechanisms that let real-time applications process
voice packets. The Real-Time Control Protocol (RTCP) [Metcalfe, 2000] provides quality
feedback for the quality improvement and management of the real-time network. Several
signaling protocols have been proposed for IP phone applications. SIP is a peer-to-peer
protocol; being simple and similar to HTTP, SIP [Rosenberg et al., 2002] brings the benefits
of the WWW architecture into IP telephony and readily runs wherever HTTP runs, offering a
gradual evolution from existing circuit-switched networks to IP packet-switched networks. A
lot of work has been done to implement IP phones over data networks, even over the Internet
(Skype), but almost all of it uses secure communication channels and fixed IP phones. Work
has also been done on connecting heterogeneous networks; for example, UMA (Unlicensed
Mobile Access) technology allows the use of both the GPRS network and (indoor) WiFi
networks for calling [Arjona and Verkasalo, 2007].
3 IP Phone Communication Over WiFi
IP-based phone communication within a particular WiFi network is free. Moreover, the
communication is secure, since the existing WiFi network is used rather than the services of
an external carrier. 128-bit-encrypted voice communication takes place between authorized
and authenticated IP phone users. If a user wants to call the outside world, he prefixes a
symbol, in this case '*', and the call is then routed to the outside world. Also, if the user moves
out of WiFi range, handover takes place and the mobile unit again starts working on the GPRS
network.
3.1 Architecture
IP-enabled cell phones are the mobile units capable of accessing the WiFi network. WiFi
routers have routing tables which are used to route the calls to the desired IP phone. A J2ME
application was developed which provides access to the IP phone in the WiFi network.
3.2 Connection Mechanism
Each IP phone registers its fixed IP with the WiFi router, which updates its routing table to
mark this IP phone as active (Figure 1).
The name and number of the phone with that particular IP are looked up in the database,
and the IP is associated with the name of the user in the WiFi routing table (a sketch of such
a table entry is given after this list).
If the number starts with a special symbol, say an asterisk '*', then the router tunnels the
call to the GPRS network using the UNC.
When the WiFi signal fades out, handover takes place and the mobile unit starts working on
the GPRS network.
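As a rough illustration only, since the paper does not describe the router's internal data structures, one entry of such a routing table and a lookup of a dialled number might look like this in C; all field names and sample values are assumptions.

    #include <stdio.h>
    #include <string.h>

    struct phone_entry {
        char number[16];   /* dialled number                 */
        char name[32];     /* user name from the database    */
        char ip[16];       /* fixed IP of the handset        */
        int  active;       /* registered on the router?      */
        int  busy;         /* currently in a call?           */
    };

    /* Linear search; a real router would index the table. */
    static const char *lookup_ip(const struct phone_entry *tab, int n,
                                 const char *number) {
        for (int i = 0; i < n; i++)
            if (tab[i].active && strcmp(tab[i].number, number) == 0)
                return tab[i].ip;
        return NULL;   /* unknown or offline: reject, or tunnel via the UNC */
    }

    int main(void) {
        struct phone_entry tab[] = {
            { "1001", "user1", "192.168.10.21", 1, 0 },   /* placeholder values */
            { "1002", "user2", "192.168.10.22", 1, 0 },
        };
        const char *ip = lookup_ip(tab, 2, "1002");
        printf("route call to %s\n", ip ? ip : "(not found)");
        return 0;
    }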

Fig. 1: Registering of IP phone in the Routing table
3.3 Management of Call Between WiFi to WiFi
A number (user 2) is dialed using the J2ME application on user 1's mobile unit. The
application then sends the number in 128-bit-encrypted form to the router, requesting that a
call be placed (Figure 2(a)). The router in the WiFi network searches its routing table for the
desired number and, if the number is active, a data packet signaling an incoming call is sent
to the corresponding IP on the WiFi network. The J2ME application on user 2's mobile unit
alerts the user to an incoming call, and the routing table is updated to mark both IP mobile
units as busy (Figure 2(b)). When user 2 accepts the incoming call, real-time transfer of voice
data packets starts between the two mobile units. The header of each packet is encrypted in
such a way that the router can decrypt it and route the packet to the required mobile unit,
while the actual voice data is encrypted in such a way that only the other mobile unit can
decrypt it; it cannot be decrypted at the router end. When the call is torn down, the routing
table is modified again and the busy status is cleared. If the user at the other end does not
want to take the call and presses the hang-up button, the user at the first end is sent a message
that the dialed user is busy.

Fig. 2(a): User 1 dialing User 2s number Fig. 2(b): User 2 receiving a call from User 1
3.4 Management of Call Between WiFi to Public Network
When a user wants to dial a call to the outside world (that is, to the public network), he has to
prefix an '*' to the number he wants to call. If he dials "*1234567890", the WiFi router
identifies the '*' and routes the call via the broadband connection to the UNC (UMA Network
Controller). Up to the UNC, IP is used to carry the voice data packets (refer to Figure 3);
beyond that point it depends on the UNC which technology is used to carry them. Also, if the
call has to be routed to the outside world, the packets have to be decrypted, as the UNC is
unaware of the encryption used by the WiFi network; moreover, the packet has to be
reorganized according to the needs of the UNC.
3.5 WiFi to GPRS Handover
In the case of WiFi-to-GPRS handover, the mobile unit first has to detect that the WiFi signal
has completely faded out and that the WiFi service is no longer acceptable. At this stage the
mobile unit sends a handover request to a neighboring GPRS cell; the selection of the cell
depends upon the SIM card present in the mobile unit at that time. The core network of the
service provider then handles the resource allocation procedure with the base station
controller (BSC) for the GPRS call. Once the allocation is complete, a signal is sent to the
mobile unit that the handover has taken place.


Fig. 3: Encryption Decryption mechanism of the channel
3.6 Implementation
A cell phone running the Symbian S60 (v9) operating system with Java capabilities,
equipped with WLAN 802.11 b/g.
J2ME software to place the calls and perform the encryption needed for secured
communication.
A router with routing tables to route the calls to specific online users. The router should be
authenticated in the WiFi environment and should also support WLAN 802.11 b/g.
3.7 Security
Security is one of the main areas of concern, especially when communicating over the free
2.4 GHz WiFi band. This is taken care of by using two different encryption methods: one
encrypts the header of the data packets and can be decrypted by both the WiFi router and the
mobile unit, while the payload is encrypted with a different method that can only be decrypted
at the other mobile unit (refer to Figure 3). There is very little chance of the signals being
tapped, as this whole communication system runs on a private network with authenticated and
limited connectivity, and the limited coverage area also makes tapping difficult. The system
also addresses the emerging threats that employers face, especially in defence and other
organizations sensitive about security and privacy, from highly sophisticated mobile devices
capable of audio and video recording: employees' own mobile phones can be confiscated
when they enter the organization, and they can instead be given Java-enabled mobile units
capable of accessing the WiFi network.
3.8 Cost Efficient
The cost involved in setting up and running a communication system is a major issue, and
this method of communication deals with it very effectively. The only major cost is the initial
setup of the communication system, which turns out to be much lower than for conventional
GPRS and CDMA networks. The running cost of the network comes only from calls routed
through the UNC to the GPRS network, which is the charge levied by the service provider,
while calls made within the WiFi network are free of cost. Hence the running cost can be
considered negligible compared with that of GPRS and CDMA networks, making this a very
cost-effective and cost-efficient communication system.
3.9 Coverage
The coverage area of the network depends upon the WiFi router coverage. Unlike a GPRS
network, we cannot simply deploy a number of WiFi hotspots to increase the coverage,
mainly because of the problems faced in handover. The mobile unit will not attempt a handoff
until the signal quality has deteriorated quite considerably. This is a problem when continuous
coverage is built, because the client will not attempt to change cell even if another cell is
providing better signal strength, which ends up in very late cell changes, poor voice quality
and dropped connections. Furthermore, in some cases even if the handover break is short, the
perceived voice quality can be very poor for several seconds due to the low signal quality
prior to the handover. Several methods have been developed to address the handover time,
but in practice they are not widely implemented nor supported by current devices [Arjona and
Verkasalo, 2007][IEEE HSP5, 1999][IEEE HSP2.4, 1999][IEEE QoS, 2004].

Fig. 4: WiFi Network access to Terrestrial and Cellular Network
3.10 Future Prospects
Through this paper we have tried to establish a new way of communication between two
wireless IP phones over the WiFi network. However there are many areas which remain
untouched and demand attention. There is a high potential for the development of
applications for this communication system which in turn will transform this system into a
full-fledged communication system. Applications like Short Messaging Service (SMS) can
also be developed. This service will function between two IP phones on the same WiFi
network or even a series of interconnected networks. Data exchange i.e. sharing and transfer
of information and files between two IP phones is another application waiting to be
developed. Again this service can function on the same WiFi network or a series of
interconnected networks. Accessing and surfing the internet on the wireless IP phone through
a single access point will be very cost efficient. Moreover, by acquiring a list of all the users
logged on to the network, a real-time chat application can be developed. The interconnected
IP phones can also be linked to a server such as Asterisk, making it possible to dial out of the
native network to the outside world. This is quite attractive, since only a single line out of the
network is needed to give access to all the connected IP phones.

4 Conclusion
In this paper we have described a new way to provide communication within a specified area.
Here we have proposed to use IP enabled mobile units which will be able to communicate to
each other via the WiFi network. With the help of a simple Java application the allowed IP
phones can automatically log on in the network and can communicate among themselves.
The WiFi bandwidth of 2.4 GHz acts as communication channel between the mobile unit and
the router. The same bandwidth is used as a communication channel between the different
WiFi networks thereby treating the whole network as one and creating a huge data cloud.
Since the bandwidth of the WiFi network is free, the only cost involved in this
communication system is the initial setup cost, which makes it very viable. Although it limits
the communication area, it also provides the flexibility to dial calls to the outside world by
tunneling them through the UNC to the public networks (terrestrial, GPRS and CDMA). At
the same time it addresses security issues and is an answer to 'no mobile' zones, i.e. zones
where organizations have prohibited the use of mobile phones because of security constraints
such as the fear of leakage of sensitive information outside a desired area. Security in the
communication channel is maintained as the data packets are 128-bit encrypted.
References
[1] [Arjona and Verkasalo, 2007] Andres Arjona, Hannu Verkasalo, Unlicensed Mobile Access (UMA)
Handover and Packet Data Analysis, Second International Conference on Digital Telecommunications
(ICDT '07).
[2] [Casner et al., 1996] Schulzrinne H., Casner S., Frederick R. and Jacobson V., RTP: A Transport Protocol
for Real-Time Applications, RFC 1889, January 1996.
[3] [Collateral] Pingtel Corp., Next Generation VoIP Services and Applications Using SIP and Java,
Technology Guide, http://www.pingtel.com/docs/collateral_techguide_final.pdf
[4] [Ho et al., 2003] Chian C. Ho, Tzi-Chiang Tang, Chin-Ho Lee, Chih-Ming Chen, Hsin-Yang Tu, Chin-Sung
Wu, Chao-His Chang, Chin-Meng Huan, H.323 VoIP Telephone Implementation Embedding A Low
Power SOC Processor, 0-7803-7749-4/03 IEEE, pp. 163-166.
[5] [IEEE HSP2.4, 1999] IEEE, "Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer
Specifications: High Speed Physical Layer in the 2.4 GHz Band", IEEE Standard 802.11b, 1999.
[6] [IEEE HSP5, 1999] IEEE, "Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer
Specifications: High Speed Physical Layer in the 5 GHz Band", IEEE Standard 802.11a, 1999.
[7] [IEEE QoS, 2004] IEEE, "Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer
Specifications Amendment: Medium Access Control (MAC) Enhancements of Quality of Service", IEEE
Standard P802.11e/D12.0, November 2004.
[8] [Metcalfe, 2000] B. Metcalfe, The Next Generation Internet, IEEE Internet Computing, vol. 4, pp. 58-59,
Jan-Feb 2000.
[9] [Rosenberg et al., 2002] Rosenberg J., Schulzrinne H., Camarillo G., Johnston A., Peterson J., Sparks R.,
Handley M. and Schooler E., SIP: Session Initiation Protocol, RFC 2543, The Internet Society,
February 21, 2002.
[10] [Yu et al., 2003] Jia Yu, Jan Newmarch, Michael Geisler, JINI/J2EE Bridge for Large-scale IP Phone
Services, Proceedings of the Tenth Asia-Pacific Software Engineering Conference (APSEC '03), 1530-1362/03.

Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Internet Key Exchange Standard for: IPSEC

Sachin P. Gawate, G.H.R.C.E., Nagpur, sachingawate@gmail.com
N. G. Bawane, G.H.R.C.E., Nagpur, narenbawane@rediffmail.com
Nilesh Joglekar, IIPL, Nagpur

Abstract

This paper describes the purpose, history, and analysis of IKE [RFC 2409], the
current standard for key exchange for the IPSec protocol. We discuss some
issues with the rest of IPSec, such as what services it can offer without
changing the applications, and whether the AH header is necessary. Then we
discuss the various protocols of IKE, and make suggestions for improvement
and simplification.
1 Introduction
IPSec is an IETF standard for real-time communication security. In such a protocol, Alice
initiates communication with a target, Bob. Each side authenticates itself to the other based
on some key that the other side associates with it, either a shared secret key between the two
parties, or a public key. Then they establish secret session keys (4 keys, one for integrity
protection and one for encryption, in each direction). The other major real-time
communication protocol is SSL, standardized with minor changes by the IETF as TLS.
IPSec is said to operate at layer 3 whereas SSL operates at layer 4. We discuss what this
means, and the implications of these choices, in Section 1.2.
1.1 ESP vs. AH
There are several pieces to IPSec. One is the IPSec data packet encodings of which there are
two: AH (authentication header), which provides integrity protection, and ESP (encapsulating
security payload), which provides encryption and optional integrity protection. Many people
argue [FS99] that AH is unnecessary, given that ESP can provide integrity protection. The
integrity protection provided by ESP and AH is not identical, however. Both provide integrity
protection of everything beyond the IP header, but AH also provides integrity protection for
some of the fields inside the IP header. It is unclear why it is necessary to protect the IP
header. If it were necessary, this could be provided by ESP in tunnel mode (where a new IP
header with ESP is prepended to the original packet, and the entire original packet, including
the IP header, is considered payload and is therefore cryptographically protected by ESP).
Intermediate routers cannot enforce AH's integrity protection, because they do not know the
session key for the Alice-Bob security association, so AH can at best be used by Bob to check
that the IP header was received as launched by Alice. Perhaps an attacker could change the
QoS fields, so that the packet would have received preferential or discriminatory treatment
unintended by Alice, but Bob would hardly wish to discard a packet from Alice if the
contents were determined cryptographically to be properly received, just because it traveled
by a different path, or according to different handling, than Alice intended. The one function
that AH offers that ESP does not provide is that with AH, routers and firewalls know the
packet is not encrypted, and can therefore make decisions based on fields in the layer 4
header, such as the ports. (Note: even if ESP is using null encryption, there is no way for a
router to be able to know this conclusively on a packet-by-packet basis.) This feature of
having routers and firewalls look at the TCP ports can only be used with unencrypted IP
traffic, and many security advocates argue that IPSec should always be encrypting the traffic.
Information such as TCP ports does divulge some information that should be hidden, even
though routers have become accustomed to using that information for services like
differential queuing. Firewalls also base decisions on the port fields, but a malicious user can
disguise any traffic to fit the firewall's policy database (e.g., if the firewall allows HTTP,
then run all protocols on top of HTTP), so leaving the ports unencrypted for the benefit of
firewalls is also of marginal benefit. The majority of our paper will focus on IKE, the part of
IPSec that does mutual authentication and establishes session keys.
1.2 Layer 3 vs. Layer 4
The goal of SSL was to deploy something totally at the user level, without changing the
operating systems, whereas the goal of IPSec was to deploy something within the OS and not
require changes to the applications. Since everything from TCP down is generally
implemented in the OS, SSL is implemented as a process that calls TCP. That is why it is said
to be at the Transport Layer (layer 4 in the OSI Reference Model). IPSec is implemented
in layer 3, which means it considers everything above layer 3 as data, including the TCP
header. The philosophy behind IPSec is that if only the OS needed to change, then by
deploying an IPSec-enhanced OS all the applications would automatically benefit from
IPSec's encryption and integrity protection services. There is a problem in operating above
TCP. Since TCP will not be participating in the cryptography, it will have no way of noticing
if malicious data is inserted into the packet stream. TCP will acknowledge such data and send
it up to SSL, which will discard it because the integrity check will indicate the data is bogus,
but there is no way for SSL to tell TCP to accept the real data at this point. When the real data
arrives, it will look to TCP like duplicate data, since it will have the same sequence numbers
as the bogus data, so TCP will discard it. So in theory, IPSec's approach of cryptographically
protecting each packet independently is a better approach. However, if only the operating
system changes, and the applications and the API to the applications do not change, then the
power of IPSec cannot be fully utilized. The API just tells the application what IP address is
on a particular connection. It can't inform the application of which user has been
authenticated. That means that even if users have public keys and certificates, and IPSec
authenticates them, there is no way for it to inform the application. Most likely after IPSec
establishes an encrypted tunnel, the user will have to type a name and password to
authenticate to the application. So it is important that eventually the APIs and applications
change so that IPSec can inform the application of something more than the IP address of the
tunnel endpoint, but until they do, IPSec accomplishes the following:
It encrypts traffic between the two nodes. As with firewalls, IPSec can access a policy
database that specifies which IP addresses are allowed to talk to which other IP addresses.
Some applications do authentication based on IP addresses, and the IP address from which
information is received is passed up to the application. With IPSec, this form of
authentication becomes much more secure because one of the types of endpoint identifiers
IPSec can authenticate is an IP address, in which case the application would be justified in
trusting the IP address asserted by the lower layer as the source.
2 Overview of IKE
IKE is incredibly complex, not because there is any intrinsic reason why authentication and
session key establishment should be complex, but due to unfortunate politics and the
inevitable result of years of work by a large committee. Because it is so complex, and
because the documentation is so difficult to decipher, IKE has not gotten significant review.
The IKE exchange consists of two phases. We argue that the second phase is unnecessary.
The phase 1 exchange is based on identities such as names, and secrets such as public key
pairs, or pre-shared secrets between the two identities. The phase 1 exchange happens once,
and then allows subsequent setup of multiple phase 2 connections between the same pair of
identities. The phase 2 exchanges rely on the session key established in phase 1 to do
mutual authentication and establish a phase 2 session key used to protect all the data in the
phase 2 security association. It would certainly be simpler and cheaper to just set up a
security association in a single exchange, and do away with the phases, but the theory is that
although the phase 1 exchange is necessarily expensive (if based on public keys), the phase 2
exchanges can then be simpler and less expensive because they can use the session key
created out of the phase 1 exchange. This reasoning only makes sense if there will be
multiple phase 2 setups inside the same phase 1 exchange. Why would there be multiple
phase 2-type connections between the same pair of nodes? Here are the arguments in favor of
having two phases:
It is a good idea to change keys periodically. You can do key rollover of a phase 2
connection by doing another phase 2 connection setup, which would be cheaper than
restarting the phase 1 connection setup.
You can set up multiple connections with different security properties, such as
integrity-only, encryption with a short (insecure, snooper-friendly) key, or encryption
with a long key.
You can set up multiple connections between two nodes because the connections are
application-to-application, and you'd like each application to use its own key, perhaps
so that the IPSEC layer can give the key to the application.
We argue against each of these points:
If you want perfect forward secrecy when you do a key rollover, then the phase 2
exchange is not significantly cheaper than doing another phase 1 exchange. If you are
simply rekeying, either to limit the amount of data encrypted with a single key, or to
prevent replay after the sequence number wraps around, then a protocol designed
specifically for rekeying would be simpler and less expensive than the IKE phase 2
exchanges.
It would be logical to use the strongest protection needed by any of the traffic for all
the traffic rather than having separate security associations in order to give weaker
protection to some traffic. There might be some legal or performance reasons to want
to use different protection for different forms of traffic, but we claim that this should
be a relatively rare case that we should not be optimizing for. A cleaner method of
doing this would be to have completely different security associations rather than
multiple security associations loosely linked together with the same phase 1 security
association.
This case (wanting to have each application have a separate key) seems like a rare
case, and setting up a totally unrelated security association for each application would
suffice. In some cases, different applications use different identities to authenticate. In
that case they would need to have separate Phase 1 security associations anyway. In
this paper we concentrate on the properties of the variants of Phase 1 IKE. Other than
arguably being unnecessary, we do not find any problems with security or
functionality with Phase 2 IKE.
3 Overview of Phase I IKE
There are two modes of IKE exchange. Aggressive mode accomplishes mutual
authentication and session key establishment in 3 messages. Main mode uses 6 messages,
and has additional functionality, such as the ability to hide endpoint identifiers from
eavesdroppers, and negotiate cryptographic parameters. Also, there are three types of keys
upon which a phase 1 IKE exchange might be based: pre-shared secret key, public encryption
key, or public signature key. The originally specified protocols based on public encryption
keys were replaced with more efficient protocols. The original ones separately encrypted
each field with the,other sides public key, instead of using the well known technique of
encrypting a randomly chosen secret key with the other sides public key, and encrypting all
the rest of the fields with that secret key. Apparently a long enough time elapsed before
anyone noticed this that they felt they needed to keep the old-style protocol in the
specification, for backward compatibility with implementations that. might have been
deployed during this time. This means there are 8 variants of the Phase 1 of IKE! That is
because there are 4 types of keys (old, style public encryption key, new-style public
encryption key, public signature key, and pre-shared secret key), and for each type of key, a
main mode protocol and an aggressive mode protocol. The variants have surprisingly
different characteristics. In main mode there are 3 pairs of messages. In the first pair Alice
sends a cookie (see Section 3.2) and requested cryptographic algorithms, and Bob responds
with his cookie value, and the cryptographic algorithms he will agree to. Message 3 and 4
consist of a Diffie-Hellman exchange. Messages 5 and 6 are encrypted with the Diffie-
Hellman value agreed upon in messages 3 and 4, and here each side reveals its identity and
proves it knows the relevant secret (e.g., private signature key or pre-shared secret key). In
aggressive mode there are only 3 messages. The first two messages consist of a Diffie-
Hellman exchange to establish a session key, and in the 2nd and 3rd messages each side
proves they know both the Diffie-Hellman value and their secret.
3.1 Key Types
We argue one simplification that can be made to IKE is to eliminate the variants based on
public encryption keys. It's fairly obvious why in some situations the pre-shared secret key
variants make sense: secret keys give higher performance. But why the two variants based on
public keys? There are several reasons we can think of for the signature-key-only variant.
Each side knows its own signature key, but may not know the other side's encryption key until
the other side sends a certificate. If Alice's encryption key were escrowed and her signature
key were not, then using the signature keys offers more assurance that you are talking to Alice
rather than to the escrow agent. In some scenarios people would not be allowed to have
encryption keys, but it is very unlikely that anyone who would have an encryption key would
not also have a signature key. But there are no plausible reasons we can come up with that
would require variants based on encryption keys. So one way of significantly simplifying IKE
is to eliminate the public encryption key variants.
3.2 Cookies
Stateless cookies were originally proposed in Photuris [K94] as a way of defending against
denial of service attacks. The server, Bob, has finite memory and computation capacity. In
order to prevent an attacker initiating
connections from random IP addresses, and using up all of the state Bob needs in order to
keep track of connections in progress, Bob will not keep any state or do any significant
computation unless the connect request is accompanied by a number, known as a "cookie",
that consists of some function of the IP address from which the connection is made and a
secret known to Bob. In order to connect to Bob, Alice first makes an initial request, and is
given a cookie. After telling Alice the cookie value, Bob does not need to remember anything
about the connect request. When Alice contacts Bob again with a valid cookie, Bob will be
able to verify, based on Alice's IP address, that Alice's cookie value is the one Bob would
have given Alice. Once he knows that Alice can receive from the IP address she claims to be
coming from, he is willing to devote state and significant computation to the remainder of the
authentication. Cookies do not protect against an attacker, Trudy, launching packets from IP
addresses at which she can receive responses. But in some forms of denial of service attacks
the attackers choose random IP addresses as the source, both to make it harder to catch them,
and to make it harder to filter out these attacking messages. So cookies are of some benefit. If
computation were the only problem, and Bob had sufficient state to keep track of the
maximum number of connect requests that could possibly arrive within the time window
before he is allowed to give up and delete the state for the uncompleted connection, it would
not be necessary for the cookie to be stateless. But memory is a resource at Bob that can be
swamped during a denial of service attack, so it is desirable for Bob not to need to keep any
state until he receives a valid cookie. OAKLEY [O98] allowed the cookies to be optional. If
Bob was not being attacked and therefore had sufficient resources, he could accept
connection requests without cookies. A round trip delay and two messages could be saved. In
Photuris the cookie (and the extra two messages) was always required. The idea behind the
OAKLEY stateless cookies is that the cookie is a function (for example, a keyed cryptographic
hash) of the initiator's IP address and a secret known only to Bob, so Bob can recognise a
valid cookie later without storing any per-connection state.

In the main mode variants, Bob is not forced to do a significant amount of computation
before the initiator has returned his cookie. However, IKE requires Bob to keep state from the first
message, before he knows whether the other side would be able to return a cookie. It would
be straightforward to add two messages to IKE to allow for a stateless cookie. However, we
claim that stateless cookies can be implemented in IKE main mode without additional
messages by repeating in message, 3 the information in message 1. Furthermore, it might be
nice, in aggressive mode, to allow cookies to be optional, turned on only by the server when
it is experiencing a potential denial of service attack, using the OAKLEY technique.
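A minimal sketch of how such a stateless cookie could be computed and verified is given below; it is our illustration rather than anything specified by IKE, Photuris or OAKLEY, and it assumes OpenSSL's HMAC is available, keying HMAC-SHA-256 with a secret known only to Bob over the initiator's IP address and port.

    #include <openssl/evp.h>
    #include <openssl/hmac.h>
    #include <string.h>

    #define COOKIE_LEN 16   /* truncate the MAC; the length is an arbitrary choice */

    /* Secret known only to the responder (Bob); shown hard-coded for brevity. */
    static const unsigned char bob_secret[32] = "change-me-regularly";

    /* cookie = HMAC-SHA-256(secret, initiator IP || initiator port), truncated. */
    static void make_cookie(const unsigned char ip[4], unsigned short port,
                            unsigned char cookie[COOKIE_LEN]) {
        unsigned char msg[6], mac[EVP_MAX_MD_SIZE];
        unsigned int mac_len = 0;
        memcpy(msg, ip, 4);
        msg[4] = (unsigned char)(port >> 8);
        msg[5] = (unsigned char)(port & 0xff);
        HMAC(EVP_sha256(), bob_secret, sizeof(bob_secret), msg, sizeof(msg),
             mac, &mac_len);
        memcpy(cookie, mac, COOKIE_LEN);
    }

    /* Bob keeps no state: he simply recomputes the cookie and compares. */
    static int cookie_is_valid(const unsigned char ip[4], unsigned short port,
                               const unsigned char cookie[COOKIE_LEN]) {
        unsigned char expected[COOKIE_LEN];
        make_cookie(ip, port, expected);
        return memcmp(expected, cookie, COOKIE_LEN) == 0;
    }

In practice the secret would be rotated periodically and the comparison done in constant time.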
3.3 Hiding Endpoint Identities
One of the main intentions of main mode was the ability to hide the endpoint identifiers.
Although it's easy to hide the identifier from a passive attacker, with some key types it is
difficult to design a protocol to prevent an active attacker from learning the identity of one
end or the other. If it is impossible to hide one side's identity from an active attacker, we
argue it would be better for the protocol to hide the initiator's identity rather than the
responder's (because the responder is likely to be at a fixed IP address so that it can be easily
found, while the initiator may roam and arrive from a different IP address each day). Keeping
that in mind, we'll summarize how well the IKE variants do at hiding endpoint identifiers. In
all of the aggressive mode variants, both endpoint identities are exposed, as would be
expected. Surprisingly, however, we noticed that the signature key variant of aggressive
mode could have easily been modified, with no technical disadvantages, to hide both
endpoint identifiers from an eavesdropper, and the initiator's identity even from an active
attacker! The relevant portion of that protocol is:

The endpoint identifiers could have been hidden by removing them from messages 1 and 2
and including them, encrypted with the Diffie-Hellman shared value, in message 2 (Bob's
identifiers) and message 3 (Alice's identifiers). In the next sections we discuss how the main
mode protocols hide endpoint identifiers.
3.3.1 Public Signature Keys
In the public signature key main mode, Bob's identity is hidden even from an active attacker,
but Alice's identity is exposed to an active attacker impersonating Bob's address to Alice.
The relevant part of the IKE protocol is the following:

An active attacker impersonating Bob's address to Alice will negotiate a Diffie-Hellman key
with Alice and discover her identity in msg 5. The active attacker will not be able to complete
the protocol since it will not be able to generate Bob's signature in msg 6.
The protocol could be modified to hide Alice's identity instead of Bob's from an active
attacker. This would be done by moving the information from msg 6 into msg 4. This even
completes the protocol in one fewer message. And as we said earlier, it is probably in practice
more important to hide Alice's identity than Bob's.
3.3.2 Public Encryption Keys
In this variant both sides' identities are protected even against an active attacker. Although
the protocol is much more complex, the main idea is that the identities (as well as the Diffie-
Hellman values in the Diffie-Hellman exchange) are transmitted encrypted with the other
side's public key, so they will be hidden from anyone that doesn't know the other side's
private key. We offer no optimizations to the public encryption key variants of IKE other
than suggesting their removal.
3.3.3 Pre-Shared Key
In this variant, both endpoints' identities are revealed, even to an eavesdropper! The relevant
part of the protocol is the following:

Since the endpoint identifiers are exchanged encrypted, it would seem as though both
endpoint identifiers would be hidden. However, Bob has no idea who he is talking to after
message 4, and the key with which messages 5 and 6 are encrypted is a function of the pre-
shared key between Alice and Bob. So Bob cannot decrypt message 5, which reveals Alice's
identity, unless he already knows, based on messages 1-4, who he is talking to!
The IKE spec recognizes this property of the protocol, and specifies that in this mode the
endpoint identifiers have to be the IP addresses! In which case, there's no reason to include
them in messages 5 and 6 since Bob (and an eavesdropper) already knows them!
Main mode with pre-shared keys is the only required protocol. One of the reasons you'd
want to use IPSec is in the scenario in which Alice, an employee traveling with her laptop,
connects into the corporate network from across the Internet. IPSec with pre-shared keys
would seem a logical choice for implementing this scenario. However, the protocol as
designed is completely useless for this scenario since by definition Alice's IP address will be
unpredictable if she's attaching to the Internet from different locations. It would be easy to fix
the protocol. The fix is to encrypt messages 5 and 6 with a key which is a function of the
shared Diffie-Hellman value, and not also a function of the pre-shared key. Proof of
knowledge of the pre-shared key is already done inside messages 5 and 6. In this way an
active attacker acting as a man-in-the-middle in the Diffie-Hellman exchange would
be able to discover the endpoint identifiers, but an eavesdropper would not. And, more
important than whether the endpoint identifiers are hidden, the change allows the use of true
endpoint identifiers, such as the employee's name, rather than IP addresses. This would make
this mode useful in the road-warrior scenario in which it would be most valuable.
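A minimal sketch of the suggested fix, using hypothetical helper names: the key protecting messages 5 and 6 is derived from the Diffie-Hellman result alone, while knowledge of the pre-shared key is proved separately inside the encrypted payload. This is only an illustration of the idea, not the key derivation specified in the IKE RFCs.

```python
import hashlib
import hmac

def kdf(label: bytes, *parts: bytes) -> bytes:
    """Toy key derivation: hash a label together with its inputs (illustrative only)."""
    h = hashlib.sha256(label)
    for p in parts:
        h.update(hashlib.sha256(p).digest())
    return h.digest()

def keys_as_in_ike(dh_shared: bytes, preshared: bytes) -> bytes:
    # As described in the text: the key for messages 5/6 depends on the PSK,
    # so Bob cannot decrypt message 5 until he already knows who Alice is.
    return kdf(b"msg56-key", dh_shared, preshared)

def keys_as_suggested(dh_shared: bytes, preshared: bytes, transcript: bytes):
    # Suggested fix: the encryption key depends only on the DH result ...
    enc_key = kdf(b"msg56-key", dh_shared)
    # ... and proof of the PSK is carried *inside* the encrypted payload.
    psk_proof = hmac.new(preshared, transcript + dh_shared, hashlib.sha256).digest()
    return enc_key, psk_proof
```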
4 Negotiating Security Parameters
IKE allows the two sides to negotiate which encryption, hash, integrity protection, and Diffie-
Hellman parameters they will use. Alice makes a proposal of a set of algorithms and Bob
chooses. Bob does not get to choose 1 from column A, 1 from column B, 1 from column C,
and 1 from column D, so to speak. Instead Alice transmits a set of complete proposals. While
this is more powerful in the sense that it can express the case where Alice can only support
certain combinations of algorithms, it greatly expands the encoding in the common case
where Alice is capable of using the algorithms in any combination. For instance, if Alice can
support 3 of each type of algorithm, and would be happy with any combination, she'd have to
specify 81 (3^4) sets of choices to Bob in order to tell Bob all the combinations she can
support! Each choice takes 20 bytes to specify: 4 bytes for a header and 4 bytes for each of
encryption, hash, authentication, and Diffie-Hellman.
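To make the encoding cost concrete, the short calculation below (our own worked example, using the byte counts quoted above) totals the bytes Alice needs to list every combination of three choices for each of the four algorithm types.

```python
# Each complete proposal: 4-byte header + 4 bytes for each of the four algorithm types.
choices_per_type = 3
algorithm_types = 4          # encryption, hash, authentication, Diffie-Hellman
bytes_per_proposal = 4 + 4 * algorithm_types     # 20 bytes

combinations = choices_per_type ** algorithm_types   # 3**4 = 81
total_bytes = combinations * bytes_per_proposal      # 81 * 20 = 1620

print(combinations, total_bytes)   # 81 proposals, 1620 bytes just to say "any mix is fine"
```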
5 Additional Functionality
Most of this paper dealt with simplifications we suggest for IKE. But in this section we
propose some additional functionality that might be useful.
5.1 Unidirectional Authentication
In some cases only one side has a cryptographic identity. For example, a common use case
for SSL is where the server has a certificate and the user does not. In this case SSL creates an
encrypted tunnel. The client side knows it is talking to the server, but the server does not
know who it is talking to. If the server needs to authenticate the user, the application typically
asks for a name and password. The one-way authentication is vital in this case because the
user has to know he is sending his password to the correct server, and the protocol also
ensures that the password will be encrypted when transmitted. In some cases security is
useful even if it is only one-way. For instance, a server might be disseminating public
information, and the client would like to know that it is receiving this information from a
reliable source, but the server does not need to authenticate the client. Since this is a useful
case in SSL, it would be desirable to allow for unidirectional authentication within IPSec.
None of the IKE protocols allow this.
5.2 Weak Pre-shared Secret Key
The IKE protocol for pre-shared secrets depends on the secret being cryptographically strong.
If the secret were weak, say because it was a function of a password, an active attacker
(someone impersonating one side to the other) could obtain information with which to do an
off-line dictionary attack. The relevant portion of the IKE protocols is that first the two sides
generate a Diffie- Hellman key, and then one side sends the other something which is
encrypted with a function of the Diffie-Hellman key and the shared secret. If someone were
impersonating the side that receives this quantity, they know the Diffie-Hellman value, so the
encryption key is a function of a known quantity (the Diffie-Hellman value) and the weak
Internet Key Exchange Standard for: IPSEC 281
Copyright ICWS-2009
secret. They can test a dictionary full of values and recognize when they have guessed the
user's secret. The variant we suggest at the end of Section 3.3.3 improves on the IKE pre-
shared secret protocol by allowing identities other than IP addresses to be authenticated, but it
is still vulnerable to dictionary attack by an active attacker in the case where the secret is
weak. Our variant first establishes an anonymous Diffie-Hellman value, and then sends
the identity, and some proof of knowledge of the pre-shared secret, encrypted with the Diffie-
Hellman value. Whichever side receives this proof first will be able to do a dictionary attack
and verify when they've guessed the user's secret. There is a family of protocols [BM92],
[BM94], [Jab96], [Jab97], [Wu98], [KP01], in which a weak secret, such as one derived from
a password, can be used in a cryptographic exchange in a way that is invulnerable to
dictionary attack, either by an eavesdropper or someone impersonating either side. The first
such protocol, EKE, worked by encrypting a Diffie-Hellman exchange with a hash of the
weak secret, and then authenticating based on the strong secret created by the Diffie-Hellman
exchange. The ability to use a weak secret such as a password in a secure way is very
powerful in the case where it is a user being authenticated. The current IKE pre-shared secret
protocol could be replaced with one of these protocols at no loss in security or performance.
For instance, a 3-message protocol based on EKE would look like:

The user types her name and password at the client machine, so that it can compute W. Alice
sends her name, and her Diffie-Hellman value encrypted with W. Bob responds with his
Diffie-Hellman value, and a hash of the Diffie-Hellman key, which could only agree with the
one computed by Alice if Alice used the same W as Bob has stored. In the third message,
Alice authenticates by sending a different hash of the Diffie-Hellman key. This protocol does
not hide Alice's identity from a passive attacker. Hiding Alice's identity could be
accomplished by adding two additional messages at the beginning, in which a separate Diffie-
Hellman exchange is done and the remaining three messages are encrypted with that initially
established Diffie-Hellman key.
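The following sketch mirrors the three-message EKE-style exchange just described, with a deliberately tiny Diffie-Hellman group and a toy stream cipher so the example stays self-contained; every parameter here is illustrative and far too weak for real use, and whether Bob's value is sent encrypted varies between EKE variants.

```python
import hashlib
import secrets

# Toy Diffie-Hellman group -- far too small for real use, illustration only.
P = 0xFFFFFFFB   # a small prime (2**32 - 5)
G = 5

def h(*parts: bytes) -> bytes:
    d = hashlib.sha256()
    for p in parts:
        d.update(p)
    return d.digest()

def stream_xor(key: bytes, data: bytes) -> bytes:
    """Toy 'encryption with W': XOR with a hash-derived keystream."""
    out, counter = bytearray(), 0
    while len(out) < len(data):
        out += h(key, counter.to_bytes(4, "big"))
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

W = h(b"alice", b"correct horse battery staple")   # weak secret both sides share

# Message 1: Alice -> Bob: "Alice", {g^a mod p} encrypted with W
a = secrets.randbelow(P - 2) + 1
ga = pow(G, a, P)
msg1 = (b"Alice", stream_xor(W, ga.to_bytes(4, "big")))

# Message 2: Bob -> Alice: g^b mod p, plus a hash proving Bob derived the same key
b = secrets.randbelow(P - 2) + 1
gb = pow(G, b, P)
ga_recovered = int.from_bytes(stream_xor(W, msg1[1]), "big")
k_bob = h(pow(ga_recovered, b, P).to_bytes(4, "big"))
msg2 = (gb, h(b"proof-bob", k_bob))

# Message 3: Alice -> Bob: a different hash of the shared key, authenticating Alice
k_alice = h(pow(msg2[0], a, P).to_bytes(4, "big"))
assert msg2[1] == h(b"proof-bob", k_alice)          # Alice checks Bob's proof
msg3 = h(b"proof-alice", k_alice)
assert msg3 == h(b"proof-alice", k_bob)             # Bob checks Alice's proof
```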
References
[1] [BM92] S. Bellovin and M. Merritt, Encrypted Key Exchange: Password-based protocols secure against
dictionary attacks, Proceedings of the IEEE Symposium on Research in Security and Privacy, May 1992.
[2] [BM94] S. Bellovin and M. Merritt, Augmented Encrypted Key Exchange: a Password-Based Protocol,
1994.
[3] [FS99] Ferguson, Niels, and Schneier, Bruce, A Cryptographic Evaluation of IPSec,
http://www.counterpane.com, April 1999.
[4] [Jab96] D. Jablon, Strong password-only authenticated key exchange, ACM Computer Communications
Review, October 1996.
[5] [Jab97] D. Jablon, Extended Password Protocols Immune to Dictionary Attack, Enterprise Security
Workshop, June 1997.
[6] [K94] Karn, Phil, The Photuris Key Management Protocol, Internet Draft draft-karn-photuris-00.txt,
December 1994.
[7] [KP01] Kaufman, Charlie, and Perlman, Radia, PDM: A New Strong Password-Based Protocol, Usenix
Security Conference, 2001.
[8] [O98] Orman, Hilarie, The OAKLEY Key Determination Protocol, RFC 2412, Nov 1998.
[9] [PK00] Perlman, R. and Kaufman, C., Key Exchange in IPSec: Analysis of IKE, IEEE Internet
Computing, Nov/Dec 2000.
[10] [R01] Rescorla, Eric, SSL and TLS: Designing and Building Secure Systems, Addison Wesley, 2001.
[11] [RFC2402] Kent, Steve, and Atkinson, Ran, IP Authentication Header, RFC 2402, Nov 1998.
[12] [RFC2406] Kent, Steve, and Atkinson, Ran, IP Encapsulating Security Payload (ESP), RFC 2406, Nov
1998.

Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Autonomic Elements to Simplify
and Optimize System Administration

K. Thirupathi Rao, Department of Computer Science and Engineering, Koneru Lakshmaiah College of Engineering, Green Fields, India-522502, ktr.klce@gmail.com
K.V.D. Kiran, Department of Computer Science and Engineering, Koneru Lakshmaiah College of Engineering, Green Fields, India-522502, kvd_kiran@yahoo.com
S. Srinivasa Rao, Department of Computer Science and Engineering, Koneru Lakshmaiah College of Engineering, Green Fields, India-522502, srinu_mtech05@yahoo.co.in
D. Ramesh Babu, Department of Computer Science and Engineering, Koneru Lakshmaiah College of Engineering, Green Fields, India-522502, rameshdamarla@yahoo.co.in
M. Vishnuvardhan, Department of Computer Science and Engineering, Koneru Lakshmaiah College of Engineering, Green Fields, India-522502, vishnumannava@gmail.com

Abstract

Computer systems are becoming increasingly large and complex, compounding
many reliability problems. Too often computer systems fail, become
compromised, or perform poorly. One of the most promising approaches to
improving system reliability is Autonomic Management, which offers a
potential solution to these challenging research problems. It is inspired by
nature and biological systems, such as the autonomic nervous system, which
have evolved to cope with the challenges of scale, complexity, heterogeneity
and unpredictability by being decentralized, context aware, adaptive and
resilient. Today, a significant part of system administration work specifically
involves the process of reviewing the results given by the monitoring system
and the subsequent use of administration or optimization tools. Due to the
sustained trend toward ever more distributed applications, this process
is much more complex in practice than it appears in theory. Each additional
component increases the number of possible adjustments enabling optimal
implementation of the services in terms of availability and performance. To
master this complexity, this paper presents a model that describes the chain
of actions and reactions needed to achieve a desirable degree of automation
through an autonomic element.
1 Introduction
With modern computing, consisting of new paradigms such as planetary-wide, pervasive,
and ubiquitous computing, systems are more complex than before. Interestingly, when chip
design became more complex we employed computers to design the chips, and today we
are at the point where humans have limited input to chip design. With systems becoming
more complex, it is a natural progression to have the system not only generate code
automatically but also build systems and carry out the day-to-day running and configuration
of the live system. Autonomic computing has therefore become inevitable and will become
more prevalent.
Dealing with the growing complexity of computing systems requires autonomic computing.
Autonomic computing, which is inspired by biological systems such as the autonomic
human nervous system [1, 2], enables the development of self-managing computing
systems and applications. These systems and applications use autonomic strategies and
algorithms to handle complexity and uncertainty with minimum human intervention. An
autonomic application or system is a collection of autonomic elements, which implement
intelligent control loops to monitor, analyze, plan and execute using knowledge of the
environment. A fundamental principle of autonomic computing is to increase the intelligence
of individual computer components so that they become self-managing, i.e., actively
monitor their state and take corrective actions in accordance with overall system-
management objectives. The autonomic nervous system of the human body controls bodily
functions such as heart rate, breathing and blood pressure without any conscious attention on
our part. The parallel notion when applied to autonomic computing is to have systems that
manage themselves without active human intervention. The ultimate goal is to create
autonomic computer systems that are self-managing and more powerful; users and
administrators benefit because they can concentrate on their work with little conscious
intervention. The paper is organized as follows. Section 2 deals with the characteristics of
autonomic computing systems, Section 3 presents an architecture for autonomic computing,
Section 4 deals with autonomic elements for simplifying and optimizing system
administration, and Section 5 concludes, followed by references.
2 Characteristics of Autonomic Computing System
The new era of computing is driven by the convergence of biological and digital computing
systems. To build tomorrow's autonomic computing systems we must understand and exploit
the characteristics of autonomic systems. Autonomic systems and applications exhibit the
following characteristics, some of which are discussed in [3, 4].
Self Awareness: An autonomic system or application knows itself and is aware of its state
and its behaviors.
Self Configuring: An autonomic system or application should be able to configure and
reconfigure itself under varying and unpredictable conditions without any detailed human
intervention in the form of configuration files or installation dialogs.
Self Optimizing: An autonomic system or application should be able to detect suboptimal
behaviors and optimize itself to improve its execution.
Self-Healing: An autonomic system or application should be able to detect and recover from
potential problems and continue to function smoothly.
Self Protecting: An autonomic system or application should be capable of detecting and
protecting its resources from both internal and external attack and maintaining overall system
security and integrity.
Context Aware: An autonomic system or application should be aware of its execution
environment and be able to react to changes in the environment.
Open: An autonomic system or application must function in a heterogeneous world and
should be portable across multiple hardware and software architectures. Consequently it must
be built on standard and open protocols and interfaces.
Anticipatory: An autonomic system or application should be able to anticipate, to the extent
possible, its needs and behaviors and those of its context, and be able to manage itself
proactively.
Dynamic: Systems are becoming more and more dynamic in a number of aspects, such as
dynamics from the environment, structural dynamics, large interaction dynamics and, from a
software engineering perspective, rapidly changing requirements for the system. Machine
failures and upgrades force the system to adapt to these changes. In such a situation, the
system needs to be very flexible and dynamic.
Distribution: Systems are becoming more and more distributed. This includes physical
distribution, due to the invasion of networks into every system, and logical distribution,
because there is more and more interaction between applications on a single system and
between entities inside a single application.
Situatedness: There is an explicit notion of the environment in which the system and entities
of the system exist and execute; environmental characteristics affect their execution, and they
often explicitly interact with that environment. Such an (execution) environment becomes a
primary abstraction that can have its own dynamics, independent of the intrinsic dynamics of
the system and its entities. As a consequence, we must be able to cope with uncertainty and
unpredictability when building systems that interact with their environment. This situatedness
often implies that only local information is available to the entities in the system, or to the
system itself as part of a group of systems.
Locality in control: When computing systems and components live and interact in an open
world, the concept of a global flow of control becomes meaningless. Independent
computing systems have their own autonomous flows of control, and their mutual
interactions do not imply any join of these flows. This trend is made stronger by the fact that
not only do independent systems have their own flow of control, but different entities in
a system also have their own flow of control.
Locality in interaction: Physical laws enforce locality of interaction automatically in a
physical environment. In a logical environment, if we want to minimize conceptual and
management complexity, we must also favor modeling the system in local terms and limiting
the effect of a single entity on the environment. Locality in interaction is a strong requirement
when the number of entities in a system increases, or as the scale of distribution
increases. Otherwise, tracking and controlling concurrent and autonomously initiated
interactions is much more difficult than in object-oriented and component-based applications.
The reason for this is that autonomously initiated interactions imply that we cannot know
what kind of interaction is performed and we have no clue about when a (specific) interaction
is initiated.
Need for global autonomy: The characteristics described so far make it difficult to understand
and control the global behaviors of the system or a group of systems. Still, there is a need for
coherent global behaviors. Some functional and non-functional requirements that have to be
met by computer systems are so complex that a single entity cannot provide them. We need
systems consisting of multiple, relatively simple entities whose global behavior provides the
functionality for the complex task.
3 Architecture for Autonomic Computing
Autonomic systems are composed of autonomic elements and are capable of carrying out
administrative functions, managing their behaviors and their relationships with other systems
and applications, with reduced human intervention, in accordance with high-level policies.
Autonomic computing systems can make decisions and manage themselves in three scopes.
These scopes are discussed in detail in [6].
Resource Element Scope: In resource element scope, individual components such as servers
and databases manage themselves.
Group of Resource Elements Scope: In the group of resource elements scope, pools of grouped
resources that work together perform self-management. For example, a pool of servers can
adjust its workload to achieve high performance.
Business Scope: The overall business context can be self-managing. It is clear that increasing
the maturity level of autonomic computing will affect the level of decision making.
3.1 Autonomic Element
Autonomic Elements (AEs) are the basic building blocks of autonomic systems, and their
interactions produce self-managing behavior. Each AE has two parts: the Managed Element
(ME) and the Autonomic Manager (AM), as shown in Figure 1.

Fig. 1: Building Blocks of Autonomic Systems
Sensors retrieve information about the current state of the environment of the ME, which is then
compared with the expectations held in the knowledge base of the AE. The required action
is executed by the effectors. Sensors and effectors are therefore linked together and create a
control loop.
The description of Figure 1 is as follows.
Managed Element: a component of the system; it can be hardware, application software,
or an entire system.
Autonomic Manager: executes according to administrator policies and implements
self-management. An AM uses a manageability interface to monitor and control the ME. It
has four parts: monitor, analyze, plan, and execute.
Monitor: The Monitoring Module provides different mechanisms to collect, aggregate, filter,
monitor and manage information collected by its sensors from the environment of the ME.
Analyze: The Analyze Module performs the diagnosis of the monitoring results and detects
any disruptions in the network or system resources. This information is then transformed into
events. It helps the AM to predict future states.
Plan: The Planning Module defines the set of elementary actions to perform in response to
these events. Plan uses policy information and the analysis results to achieve goals. Policies can
be a set of administrator ideas and are stored as knowledge to guide the AM. Plan assigns tasks
and resources based on the policies and adds, modifies, and deletes policies. AMs can
change resource allocation to optimize performance according to the policies.
Execute: It controls the execution of a plan and dispatches the recommended actions to the ME.
These four parts provide the control loop functionality; a minimal sketch of such a loop is given below.
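The sketch below is our own illustration of a monitor-analyze-plan-execute loop over shared knowledge; the class and method names are not part of any specific toolkit, and the policies are simple thresholds chosen only to keep the example small.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class AutonomicManager:
    """Minimal MAPE-K loop: monitor, analyze, plan, execute over shared knowledge."""
    sensors: Dict[str, Callable[[], float]]                   # name -> reading function
    policies: Dict[str, float]                                 # thresholds (the knowledge base)
    effectors: Dict[str, Callable[[], None]]                   # name -> corrective action
    knowledge: Dict[str, float] = field(default_factory=dict)

    def monitor(self) -> None:
        for name, read in self.sensors.items():
            self.knowledge[name] = read()

    def analyze(self) -> List[str]:
        # An "event" is raised whenever a reading exceeds its policy threshold.
        return [n for n, v in self.knowledge.items() if v > self.policies.get(n, float("inf"))]

    def plan(self, events: List[str]) -> List[str]:
        # Trivial planner: one corrective action per event, if an effector exists.
        return [e for e in events if e in self.effectors]

    def execute(self, actions: List[str]) -> None:
        for a in actions:
            self.effectors[a]()

    def step(self) -> None:
        self.monitor()
        self.execute(self.plan(self.analyze()))

# Example: keep disk usage below 90% by invoking a cleanup effector.
disk = {"used": 0.95}
am = AutonomicManager(
    sensors={"disk_usage": lambda: disk["used"]},
    policies={"disk_usage": 0.90},
    effectors={"disk_usage": lambda: disk.update(used=0.50)},
)
am.step()
print(disk["used"])   # 0.5 -- the manager acted without human intervention
```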
3.2 AC Toolkit
IBM assigns autonomic computing maturity levels to its solutions. There are five levels total
and they progressively work toward full automation [5].
Basic Level: At this level, each system element is managed by IT professionals. Configuring,
optimizing, healing, and protecting IT components are performed manually.
Managed Level: At this level, system management technologies can be used to collect
information from different systems. It helps administrators to collect and analyze
information. Most analysis is done by IT professionals. This is the starting point of
automation of IT tasks.
Predictive Level: At this level, individual components monitor themselves, analyze changes,
and offer advice. Therefore, dependency on persons is reduced and decision making is
improved.
Adaptive Level: At this level, IT components can individually or as a group monitor and analyze
operations and offer advice with minimal human intervention.
Autonomic Level: At this level, system operations are managed by business policies established
by the administrator. In fact, business policy drives overall IT management, whereas at the
adaptive level there is still interaction between human and system.
4 Autonomic Elements to Simplify and Optimize System Administration
Although computers are one of the main drivers for the automation and acceleration of almost
all business processes, maintaining such computer systems is still mostly manual labor. As this
seems ironic, new approaches are being presented in order to make the machine take care of
such support tasks itself, i.e., automatically. Automation means that predefined actions are
independently executed by the machine under specific conditions. Since carrying out
specified actions like scripts and programs is the primary task of most computer systems, the
challenge obviously is defining the conditions. The essence of such conditions is a set of
logical rules referring to measurement data. Yet, which values are to be collected, which
relationships must be represented and how will this create an automatically maintained and
documented IT landscape? These are the key questions for automation.
4.1 Present Status
Today, automation is left in the hands of the technician responsible for a specific system. The
focus of such an administrator is to relieve himself of repetitive, tedious tasks. Anyone who has
repaired the same minor item on a computer 50 times will write an appropriate script. That
this results in a positive reduction of workload is indisputable. But the actual benefit of
such private action is usually difficult to plan. Normally it cannot be transferred to other
systems or environments and it is rarely documented. Consequently, this procedure cannot be
considered automation in the conventional sense. Although such scripts may show good
results, it is certainly not possible to develop an IT strategy upon them.
4.2 Ideal Process
To achieve a desirable degree of automation, first the terms of the automation environment,
the results of the automation and the remaining manual labor must all be defined. The
processes in an automated environment can be described as follows:
1. A measurement process constantly monitors the correct functioning of the IT system.
2. Should a problem occur, a set of rules is activated that classifies the problem.
3. This rule set continues to initiate actions and analyze their results in combination with
the measurement data until either:
a. The problem is resolved, or
b. The set of rules can no longer initiate any action and passes the result of the work
done up to that point on to an intelligent person, who then attempts to solve the
problem.
The steps listed above are challenging, and even starting their execution requires first
determining the proper functioning of the IT systems through measurement; a minimal
sketch of this loop is given below.
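The sketch assumes hypothetical measurement, rule and escalation functions; a real deployment would plug in its own monitoring and administration tools.

```python
from typing import Callable, List, Optional

Rule = Callable[[dict], Optional[str]]   # inspects measurements, returns an action name or None

def automation_loop(measure: Callable[[], dict],
                    rules: List[Rule],
                    actions: dict,
                    escalate: Callable[[dict], None],
                    max_attempts: int = 3) -> None:
    """Measure, let the rule set classify and act, escalate if no rule can act."""
    for _ in range(max_attempts):
        data = measure()
        if data.get("ok", True):
            return                       # 1. the system functions correctly -- nothing to do
        chosen = next((r(data) for r in rules if r(data)), None)
        if chosen is None:
            escalate(data)               # 3b. rules exhausted: hand over to a person
            return
        actions[chosen]()                # 3. initiate the action, then re-measure

state = {"free_mb": 50}
automation_loop(
    measure=lambda: {"ok": state["free_mb"] > 100, "free_mb": state["free_mb"]},
    rules=[lambda d: "cleanup" if d["free_mb"] <= 100 else None],
    actions={"cleanup": lambda: state.update(free_mb=500)},
    escalate=lambda d: print("escalate to administrator:", d),
)
print(state["free_mb"])   # 500: the problem was classified and resolved automatically
```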
4.3 The Model
For the administration of servers, networks, applications, etc., tools have been available from
the very beginning to take care of traditional administration tasks. In a technical environment
these are divided into monitoring/debugging tools and administration/optimization
applications. Monitoring tools monitor the services and functionalities of the respective
servers, or aspects of them. They provide the administrator with information about the status
and performance of the services and processes. The information flow for administration and
optimization tools is usually reversed. The administrator decides on the actions he wants to
take. He will make his decisions based on, among other things, observations derived from
monitoring. These procedures can then be applied to the services using the available tools.
Today, a significant part of system administration work specifically involves this process of
reviewing the results given by the monitoring system and the subsequent use of
administration or optimization tools. Due to the sustained trend toward ever increasingly
distributed applications, this process is much more complex in practice than it appears in
theory. Each additional component increases the number of possible adjustments enabling
optimal implementation of the services in terms of availability and performance. To master
this complexity, it is necessary to clarify the dependencies between machines, applications,
resources and services. This makes it possible to identify the correct points for intervention
and to estimate the likely consequences of changes (Figure 2: the M-A-R-S model diagram). If
such a dependency model is adequately defined, it is possible to significantly optimize the
tasks involved in IT operations using a new class of applications referred to here as
Autonomic Elements.
As a rule, the function of today's tools is unidirectional, i.e. the tool either informs the
administrator about the need for intervention or the administrator initiates appropriate actions
in the target system via another tool. Autonomic Elements have the advantage, just like the
human administrator himself, of possessing a model of dependencies. For example, they can
use the information from monitoring to determine which possible intervention options are
appropriate and which areas are potentially affected. A preliminary selection like that saves
the administrator a good part of his day-to-day work, making it possible to achieve a faster
response, and the time saved can be put to good use in other areas. In addition, such a rule set
enables the definition of standard actions that make manual intervention in acute situations
completely unnecessary.

Fig. 2: M-A-R-S Model
4.4 Case Study
A mail server receives emails from applications, saves them in interim storage and dispatches
them to the Internet. This process produces a large number of log files that document the
server's processing. Due to the size of the incoming files and the necessity of archiving them,
they are automatically transferred to an archive server during the night but remain on the
server itself for research purposes in case of user questions.
Should the available space in the log partitions of the server reach a critical value, there is
hopefully a monitoring system that informs the administrator.
The administrator then checks which of the logs have been successfully transferred to the
archive server and removes them from the mail server using an appropriate tool. In our
example, we conduct two monitoring events (available space and transferred logs) for one
administrative event. Because it follows a consistent pattern, the same action will be
necessary whenever the level of available space reaches a critical value. As Figure 3
illustrates, the described chain of actions/reactions can be fully automated through an autonomic
element. This element effectively bridges the monitoring and administration tools and thus can
access the complete monitoring information (the level of available space and the transferred log
files) and all administrative options (the removal of files), plus complete information about a
model that describes the dependencies between servers, services etc. The autonomic element
knows the demand to retain as many log files as possible on the server and knows the
archiving conditions for these files on the archive server (and the status of the transfer). It is
therefore in a position to intelligently correlate the two monitoring events and automatically
delete the required number of log files that have already been transferred, restoring the
free-space level on the server.
The administrator first becomes involved when something in this chain of actions does not
function as defined, for example when an error occurs during archiving or the deletion process fails.

Fig. 3: Automated Problem Solving
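The case study can be expressed as a small autonomic element. The sketch below uses hypothetical monitoring inputs (free-space percentage, set of archived logs) and returns the deletions that an effector would carry out, correlating the two monitoring events automatically and escalating when it cannot act.

```python
def log_partition_element(free_space_pct: float,
                          local_logs: set,
                          archived_logs: set,
                          low_watermark: float = 10.0,
                          keep_newest: int = 2) -> list:
    """Return the log files to delete: only already-archived files, oldest first,
    and only when free space drops below the critical value."""
    if free_space_pct >= low_watermark:
        return []                                    # nothing to do; keep logs for research
    deletable = sorted(local_logs & archived_logs)   # correlate the two monitoring events
    to_delete = deletable[:max(0, len(deletable) - keep_newest)]
    if not to_delete:
        # Mirrors the case where the administrator must become involved.
        raise RuntimeError("escalate: partition critical but no archived logs to remove")
    return to_delete

# Example run: space is critical and three of the four local logs are safely archived.
print(log_partition_element(
    free_space_pct=4.0,
    local_logs={"mail-01.log", "mail-02.log", "mail-03.log", "mail-04.log"},
    archived_logs={"mail-01.log", "mail-02.log", "mail-03.log"},
))   # ['mail-01.log'] -- delete the oldest archived log, retain the rest
```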
As it is well known that in IT maintenance nothing is as constant as change itself, the system
administration team gains valuable time by applying the described approach; this gained time can
in turn be invested in the optimization of the dependency model and the set of rules. Under
ideal conditions, this would lead to a continuous improvement in IT services without
demanding a great effort from the administration team.
5 Conclusion
In this paper, we have presented the essence of autonomic computing and the development of
such systems, giving the reader a feel for the nature of these types of systems. A significant
part of system administration work specifically involves the process of reviewing the results
given by the monitoring system and the subsequent use of administration or optimization
tools. The model described in this paper simplifies and optimizes system administration,
making it possible to identify the correct points for intervention and to estimate the likely
consequences of changes. The case study presented uses an autonomic element which informs
the administrator when critical values are reached.
References
[1] S. Hariri and M. Parashar, Autonomic Computing: An Overview, Springer-Verlag Berlin Heidelberg,
pages 247-259, July 2005.
[2] Kephart J. O., Chess D. M., The Vision of Autonomic Computing, IEEE Computer, Volume 36, Issue 1,
pages 41-50, January 2003.
[3] Sterritt R., Bustard D., Towards an Autonomic Computing Environment, University of Ulster, Northern
Ireland.
[4] Bantz D. F. et al., Autonomic Personal Computing, IBM Systems Journal, Vol. 42, No. 1, January 2003.
[5] Bigus J. P. et al., ABLE: A Toolkit for Building Multiagent Autonomic Systems, IBM Systems Journal,
Vol. 41, No. 3, August 2002.







Image Processing
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
A Multi-Clustering Recommender System
Using Collaborative Filtering

Partha Sarathi Chakraborty
University Institute of Technology, The University of Burdwan, Burdwan
psc755@gmail.com

Abstract

Recommender systems have proved very useful for handling the
information overload on the Internet. Many web sites attempt to help users by
incorporating a recommender system that provides users with a list of items
and/or web pages that are likely to interest them. Content-based filtering and
collaborative filtering are usually applied to predict these recommendations.
Hybrids of these two approaches have also been proposed in many
research works. In this work a clustering approach is proposed to group users as
well as items. For generating the prediction score of an item, similarities between the
active user and all other users in the same user cluster are calculated,
considering only items belonging to the same item cluster as the target item. The
proposed system was tested on the MovieLens data set, yielding
recommendations of high accuracy.
1 Introduction
In many markets, consumers are faced with a wealth of products and information from which
they can choose. To alleviate this problem, many web sites attempt to help users by
incorporating a recommender system [Resnick and Varian, 1997] that provides users with a
list of items and/or WebPages that are likely to interest them. Once the user makes her
choice, a new list of recommended items is presented.
E-commerce recommender systems can be classified into three categories: the content
filtering based; the collaborative filtering based; the hybrid content filtering and collaborative
filtering based [Ansari et al, 2001]. The first one produces recommendations to target users
according to similarity between items, while the second provides recommendations
based on the purchase behaviors (preferences) of other like-minded users.
Clustering, on the other hand, is a method by which large sets of data are grouped into clusters
of smaller sets of similar data. It is a useful technique for the discovery of some knowledge
from a dataset. K-means clustering is one of the simplest and fastest algorithms, and is
therefore widely used. It is a non-hierarchical algorithm that starts by defining k points as
cluster centres, or centroids in the input space. The algorithm clusters the objects of a dataset
by iterating over the objects, assigning each object to one of the centroids, and moving each
centroid towards the centre of a cluster. This process is repeated until some termination
criterion is reached. When this criterion is reached, each centroid is located at a cluster centre,
and the objects that are assigned to a particular centroid form a cluster. Thus, the number of
centroids determines the number of possible clusters.
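For reference, a compact plain-Python k-means sketch over numeric rating vectors is shown below; it is purely illustrative (in practice a library implementation would be used, and the distance measure and initialization tuned to the data).

```python
import random
from typing import List

def kmeans(points: List[List[float]], k: int, iters: int = 100, seed: int = 0):
    """Basic k-means: returns (centroids, assignment) for a list of numeric vectors."""
    rng = random.Random(seed)
    centroids = [p[:] for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to the nearest centroid.
        for i, p in enumerate(points):
            assign[i] = min(range(k),
                            key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            new_centroids.append([sum(col) / len(members) for col in zip(*members)]
                                 if members else centroids[c])
        if new_centroids == centroids:
            break                      # termination criterion: centroids no longer move
        centroids = new_centroids
    return centroids, assign

# Tiny example: two obvious groups of 2-D "rating" vectors.
pts = [[5, 4], [4, 5], [1, 2], [2, 1]]
_, labels = kmeans(pts, k=2)
print(labels)   # e.g. [0, 0, 1, 1] -- the two groups are separated
```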
In this paper, we consider a collaborative filtering approach where items and users are
clustered separately. Neighbors of an active user are chosen from the user cluster to which the
active user belongs. On the other hand, similarity between two users is calculated not
over the whole set of items, but over only the items of a particular item cluster.
The rest of the paper is organized as follows. Section 2 provides a brief overview of
collaborative filtering. In Section 3, we present related works. Next, we describe the details of
our approach in Section 4. We present the experimental evaluation that we employ in order to
compare the algorithms and we end the paper with conclusions and further research in
Section 5 and 6 respectively.
2 Collaborative Filtering
Collaborative filtering systems are usually based on user-item rating data sets, whose format
is shown in Table 1. Ui is the ID of a user involved in the recommender system, and Ij is the ID of
an item rated by users. There are two general classes of collaborative filtering algorithms:
memory-based methods and model-based methods [Breese et al, 1998]. Memory based
algorithms use all the data collected from all users to make individual predictions, whereas
model-based algorithms first construct a statistical model of the users and then use that model
to make predictions.
Table-1 User-Item Rating Matrix

One major step of collaborative filtering technologies is to compute the similarity between
target user and candidate users so as to offer nearest neighbors to produce high-quality
recommendations. Two methods often used for similarity computation are: cosine-based and
correlation-based [Sarwar, 2001].
Vector Cosine method computes user similarity as the scalar product of the rating vectors:

s(a,u) = \frac{\sum_{i \in R(a,u)} r_{a,i}\, r_{u,i}}{\sqrt{\sum_{i \in R(a,u)} r_{a,i}^2}\;\sqrt{\sum_{i \in R(a,u)} r_{u,i}^2}}    (1)

in which s(a,u) is the similarity degree between user a and user u, R(a,u) is the set of items
rated by both user a and user u, and r_{x,i} is the rating that user x gives to item i.
Pearson Correlation method is similar to the Vector Cosine method, but before the scalar product
between the two vectors is computed, ratings are normalized as the difference between the real
ratings and the average rating of the user:

s(a,u) = \frac{\sum_{i \in R(a,u)} (r_{a,i}-\bar{r}_{a})(r_{u,i}-\bar{r}_{u})}{\sqrt{\sum_{i \in R(a,u)} (r_{a,i}-\bar{r}_{a})^2}\;\sqrt{\sum_{i \in R(a,u)} (r_{u,i}-\bar{r}_{u})^2}}    (2)

in which \bar{r}_{x} is the average rating of user x.
Once the nearest neighbors of a target user u are obtained, the following formula [Breese et
al, 1998] is used for calculating the prediction score:

P_{C_a,j} = \bar{P}_{C_a} + \frac{\sum_{i} r_{a,i}\,\left(P_{C_i,j} - \bar{P}_{C_i}\right)}{\sum_{i} \left|r_{a,i}\right|}    (3)

where r_{a,i} denotes the correlation between the active user C_a and its neighbor C_i who has
rated the product P_j, \bar{P}_{C_a} denotes the average rating of customer C_a, and P_{C_i,j} denotes the
rating given by customer C_i on product P_j.
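A small sketch of equations (2) and (3) in code, assuming ratings are held in per-user dictionaries (our own data layout, chosen only for brevity).

```python
from math import sqrt
from typing import Dict

Ratings = Dict[str, Dict[str, float]]   # user -> {item: rating}

def pearson(ratings: Ratings, a: str, u: str) -> float:
    """Equation (2): correlation over the items both users have rated."""
    common = set(ratings[a]) & set(ratings[u])
    if not common:
        return 0.0
    mean_a = sum(ratings[a].values()) / len(ratings[a])
    mean_u = sum(ratings[u].values()) / len(ratings[u])
    num = sum((ratings[a][i] - mean_a) * (ratings[u][i] - mean_u) for i in common)
    den = (sqrt(sum((ratings[a][i] - mean_a) ** 2 for i in common)) *
           sqrt(sum((ratings[u][i] - mean_u) ** 2 for i in common)))
    return num / den if den else 0.0

def predict(ratings: Ratings, a: str, item: str, neighbors) -> float:
    """Equation (3): mean rating of the active user plus weighted neighbor deviations."""
    mean_a = sum(ratings[a].values()) / len(ratings[a])
    num = den = 0.0
    for u in neighbors:
        if item in ratings[u]:
            w = pearson(ratings, a, u)
            mean_u = sum(ratings[u].values()) / len(ratings[u])
            num += w * (ratings[u][item] - mean_u)
            den += abs(w)
    return mean_a + num / den if den else mean_a

R = {"alice": {"i1": 5, "i2": 3}, "bob": {"i1": 4, "i2": 2, "i3": 5}}
print(round(predict(R, "alice", "i3", ["bob"]), 2))
```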
3 Related Works
Two popular model-based algorithms are clustering for collaborative filtering [Kohrs and
Merialdo, 1999] [Ungar and Foster] and the aspect models [Hofmann and Puzicha, 1999].
Clustering techniques have been used in different ways in designing recommender systems
based on collaborative filtering. [Sarwar et al] first partitions the users using a clustering
algorithm and then applies collaborative filtering by considering the whole partition to which a
user belongs as that user's neighborhood. In another paper [Zhang and Chang, 2006] a
genetic clustering algorithm is introduced to partition the source data, guaranteeing that the
intra-similarity will be high but the inter-similarity will be low, whereas [Yang et al, 2004]
uses CURE (Clustering Using Representatives) to transform the original user-product matrix
into a new user-cluster product matrix which is much denser and has much fewer rows
than the original one. Another attempt [Zhang et al, 2008] partitions the users, discovers
the localized preference in each part, and uses the localized preference of users to select
neighbors for prediction instead of using all items. The paper [Khanh Quan et al, 2006]
by Truong Khanh Quan, Ishikawa Fuyuki and Honiden Shinichi proposes a method of
clustering items so that, inside a cluster, similarity between users does not change
significantly. After that, when predicting the rating of a user towards an item, only the
ratings of users who have a high similarity degree with that user inside the cluster to which that
item belongs are aggregated.
4 Our Approach
Usually, in the collaborative filtering approach to recommender system design, the whole set of
items is considered when computing similarity between users. As stated in [Khanh
Quan et al, 2006], we also think that this process does not provide a good result because the
number and variety of items offered by an online store is very large. As a result, a set of
users can be said to be similar to each other for one type of item, but they may not be similar to that
extent when we consider a different type of item.
So, in our approach, we partition items into several groups using clustering algorithm. Items
which are rated similarly by different users are placed under the same cluster. We also
partition users into several groups. The user partitioning as well as item partitioning both
have been done using k-means algorithm. The clustering is done offline and clustering
information is stored in database. For an active user, the system first determines the cluster to
which he belongs. Neighbors of this active user will be chosen from this cluster only. For
generating the prediction score of an item for the active user, we first determine the cluster to
which the item belongs and consider only those items in calculating similarity between the
active user and other users belongs to the same user cluster as active user. We then take first
N neighbors and calculate prediction score for that item using formula (3)
The algorithm for calculating the prediction score is as follows:
1. Apply the clustering algorithm to produce p partitions of users using the training data set.
   Formally, the data set A is partitioned into A_1, A_2, ..., A_p, where A_i \cap A_j = \emptyset
   for 1 \le i, j \le p, i \ne j, and A_1 \cup A_2 \cup ... \cup A_p = A.
2. Apply the clustering algorithm to produce q partitions of items using the training data set.
   Formally, the data set A is partitioned into B_1, B_2, ..., B_q, where B_i \cap B_j = \emptyset
   for 1 \le i, j \le q, i \ne j, and B_1 \cup B_2 \cup ... \cup B_q = A.
3. For a given user, find the cluster to which he/she belongs; suppose this is A_m.
4. To calculate the prediction score R_{a,j} for a customer c_a on product p_j:
   a. Find the cluster to which the item belongs; suppose this is B_n.
   b. Compute the similarity between the given user and all other users belonging to cluster
      A_m, considering only items belonging to cluster B_n.
   c. Take the first N users with the highest similarity values as neighbors.
   d. Calculate the prediction score using formula (3).
A sketch of these steps in code is given below.
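The sketch assumes that the user and item cluster assignments have already been computed offline (for example with the k-means code shown earlier); for brevity a simple co-rating similarity and a weighted average stand in for equations (2) and (3), and all names are illustrative.

```python
def predict_multicluster(ratings, user_clusters, item_clusters,
                         active_user, target_item, n_neighbors=30):
    """Steps 3-4: restrict candidates to the active user's cluster and measure
    similarity only over items in the target item's cluster."""
    a_m = user_clusters[active_user]                 # step 3: user cluster A_m
    b_n = item_clusters[target_item]                 # step 4a: item cluster B_n
    item_subset = {i for i, c in item_clusters.items() if c == b_n}

    def sim(u):                                      # step 4b: similarity on B_n items only
        common = item_subset & set(ratings[active_user]) & set(ratings[u])
        if not common:
            return 0.0
        return sum(ratings[active_user][i] * ratings[u][i] for i in common) / len(common)

    candidates = [u for u, c in user_clusters.items()
                  if c == a_m and u != active_user and target_item in ratings[u]]
    neighbors = sorted(candidates, key=sim, reverse=True)[:n_neighbors]   # step 4c
    if not neighbors:
        return None
    # Step 4d: a simple similarity-weighted average stands in for formula (3) here.
    total = sum(sim(u) for u in neighbors)
    return (sum(sim(u) * ratings[u][target_item] for u in neighbors) / total
            if total else None)
```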
5 Experimental Evaluation
In this section, we report results of an experimental evaluation of our proposed techniques.
We describe the data set used, the experimental methodology, as well as the performance
improvement compared with traditional techniques.
5.1 Data Set
We performed experiments on a subset of movie rating data collected from the MovieLens
web-based recommender (movielens.umn.edu). MovieLens is a web-based research
recommender system that debuted in February 1997. The data set used contained 100,000 ratings
from 943 users on 1682 movies (items), with each user rating at least 20 items. The item
sparsity is easily computed as 0.9369, which is defined as
\text{sparsity} = 1 - \frac{\text{number of ratings}}{\text{number of users} \times \text{number of items}} = 1 - \frac{100000}{943 \times 1682} \approx 0.9369    (4)
The ratings in the MovieLens data are integers ranging from 1 to 5, entered by users.
We selected 80% of the rating data set as the training set and 20% of the data as the test
data.
5.2 Evaluation Metric
Mean Absolute Error (MAE) [Herlocker et al, 2004] is the most commonly applied evaluation metric
for collaborative filtering. It evaluates the accuracy of a system by comparing the numerical
recommendation scores against the actual user ratings for the user-item pairs in the test dataset. In our
experiment, we use MAE as our evaluation metric. We assume {p_1, p_2, ..., p_M} is the set of predicted
ratings for the given active users and {q_1, q_2, ..., q_M} is the set of their actual ratings, and the MAE
metric is given by

\text{MAE} = \frac{1}{M}\sum_{i=1}^{M} \left| p_i - q_i \right|    (5)
5.3 Experimental Results
Table 2 shows our experimental results. It can be observed that, although for neighborhood
sizes 10 and 20 the result of our approach is not satisfactory, for neighborhood size 30 our
approach shows a better result than collaborative filtering without clustering and collaborative
filtering with only user clustering. The same result can also be seen in the graph shown in
Figure 1, where C identifies clustering of users and IC identifies clustering of items.
Table 2: Results of Multi-Clustering (MAE)
Neighborhood Size   Without Clustering   User Clusters=10, Item Clusters=0   User Clusters=10, Item Clusters=10
10                  0.7746               0.7874                              0.9488
20                  1.0194               0.7821                              0.8214
30                  1.0426               0.8033                              0.8024
(Figure 1 plots MAE against neighbourhood size for the three cases C=0, C=10, and C=10 with IC=10.)

Fig. 1: Comparing Result of Multi Clustering with Other Cases
6 Conclusions and Future Work
In our approach we have shown that, when generating the prediction score of an item for the active
user, measuring similarity between the active user and other users in the same user cluster over only
the items which belong to the same item cluster as the target item (the item for which the score is
calculated) can produce better results than collaborative filtering without clustering or collaborative
filtering with only user clustering. However, we have used the basic k-means clustering algorithm for
clustering users as well as items. In future studies we will try to improve the quality of prediction by
investigating and using more sophisticated clustering algorithms.
References
[1] [Ansari et al, 2001] S Ansari, R Kohavi, L Mason, Z Zheng, Integrating Ecommerce and data mining:
architecture and challenges. In: Proceedings The 2001 IEEE International Conference on Data Mining.
California, USA: IEEE Computer Society Press, 2001, pages 27-34.
[2] [Breese et al, 1998] J Breese, D Heckerman, C Kadie. Empirical analysis of predictive algorithms for
collaborative filtering. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence. San
Francisco: Morgan Kaufmann, 1998. pages 43-52.
[3] [Herlocker et al,2004] J. Herlocker, J. Konstan, L. Terveen, and J. Riedl. Evaluating Collaborative Filtering
Recommender Systems. ACM Transactions on Information Systems 22 (2004), ACM Press, pages 5-53.
[4] [Hofmann and Puzicha, 1999] T. Hofmann and J. Puzicha, Latent Class Models for Collaborative Filtering.
In Proceedings of the 16th International Joint Conference on Artificial Intelligence, 1999, pages 688-693.
[5] [Khanh Quan et al, 2006] Truong Khanh Quan, Ishikawa Fuyuki and Honiden Shinichi, Improving
Accuracy of Recommender System by Clustering Items Based on Stability of User Similarity, International
Conference on Computational Intelligence for Modelling Control and Automation, and International
Conference on Intelligent Agents, Web Technologies and Internet Commerce (CIMCA-IAWTIC'06),2006
[6] [Kohrs and Merialdo,1999 ] A. Kohrs and B. Merialdo. Clustering for Collaborative Filtering Applications.
In Proceedings of CIMCA'99. IOS Press, 1999.
[7] [Resnick and Varian, 1997] P. Resnick and H. R. Varian. Recommender systems. Special issue of
Communications of the ACM, pages 56-58, March 1997.
[8] [Sarwar, 2001] B Sarwar, G Karypis, J Riedl. Item-based collaborative filtering recommendation algorithms.
In: Proceedings of the 10th international World Wide Web conference. 2001. pages 285-295.
[9] [Sarwar et al] Badrul M. Sarwar, George Karypis, Joseph Konstan, and John Riedl, Recommender Systems
for Large-scale E-Commerce: Scalable Neighborhood Formation Using Clustering
[10] [ Ungar and Foster] L. H. Ungar and D. P. Foster. Clustering Methods for Collaborative Filtering. In Proc.
Workshop on Recommendation Systems at the 15th National Conf. on Artificial Intelligence. Menlo Park,
CA: AAAI Press.
[11] [Yang et al, 2004] Wujian Yang, Zebing Wang and Mingyu You, An Improved Collaborative Filtering
Method for Recommendations Generation, In Proceedings of the IEEE International Conference on Systems,
Man and Cybernetics, 2004
[12] [Zhang and Chang, 2006] Feng Zhang, Hui-you Chang, A Collaborative Filtering Algorithm Employing
Genetic Clustering to Ameliorate the Scalability Issue, In Proceedings of IEEE International Conference on
e-Business Engineering (ICEBE'06), 2006
[13] [Zhang et al, 2008] Liang Zhang, Bo Xiao, Jun Guo, Chen Zhu, A Scalable Collaborative Filtering
Algorithm Based On Localized Preference, Proceedings of the Seventh International Conference on
Machine Learning and Cybernetics, Kunming, 2008

Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Digital Video Broadcasting in an Urban
Environment an Experimental Study

S. Vijaya Bhaskara Rao, Sri Venkateswara University, Tirupati 517 502, drsvbr@rediffmail.com
K.S. Ravi, K.L. College of Engg., Vijayawada, sreenivasaravik@yahoo.co.in
N.V.K. Ramesh, K.L. College of Engg., Vijayawada
J.T. Ong, Nanyang Technological University, Singapore
G. Shanmugam, Nanyang Technological University, Singapore
Yan Hong, Nanyang Technological University, Singapore

Abstract

Singapore is the first country to have island wide DVB-T Single Frequency
Network (SFN). Plans are being made to extend DTV services for portable
and fixed reception. However for the planning of future fixed and portable
services, the reception of signals and other QoS (Quality of Service)
parameters have to be defined. An initial measurement campaign was
undertaken in 47 sectors to study digital TV coverage in Singapore. The
measurements set-up consists of different devices like a spectrum analyzer and
EFA receiver interfaced to a Laptop computer with lab-view program. A GPS
receiver on board is used to locate the measurement point. Using Geographic
Information System (GIS) data-base 100mX100m pixels are identified along
all the routes and the average field strength is estimated over that area.
Measurement values are compared with proprietary prediction software. The
detailed clutter data-base has been developed and used with the software. Field
strengths at different percentages of probability were estimated to establish
good and acceptable coverage. In general, it is found that the signals in the
majority of sectors exhibit a log-normal behavior with a standard deviation of
about 6 dB.
1 Introduction
Singapore is the first country to have island wide DVB-T Single Frequency Network (SFN).
Plans are being made to extend DTV services for portable and fixed reception. However for
the planning of future fixed and portable services, the reception of signals and other QoS
(Quality of Service) parameters have to be defined. Hence a series of measurements were
conducted with this in mind. The initial results of the analyses for mobile DTV reception are
presented in this paper.
The main objective of this experimental campaign is to characterize the behavior of fixed
DTV signals in different environments in Singapore. Initial measurements are made with
antennas at 2m above the ground because it is logistically more difficult to conduct
measurements at 10m; also more measurements could be made quickly in a moving vehicle.
These measurements are made to tune the prediction model and also to improve the
measurement procedures and analytical techniques. Earlier studies have reported that
analogue TV propagation models assume log-normal signal spatial variation. The standard
deviation parameter for the distributions varies from 9 to 15 dB [ITU-recommendation
PN.370-5]. Broadband signals have been measured in countries like UK, Sweden and France
under the [VALIDATE, 1998] program and the shape of their statistical variation has been
reported to behave log-normally. The standard deviations observed for such signals are small
- typically 2.5 to 3 dB depending on the environment surrounding the receiver location. But
all these measurements are based on reception from a single transmitter. ITU-R P.370 gives a
standard deviation for wide band signals of 5.5 dB. These previous measurements are limited;
the methods as well as the procedures for measuring and estimating the standard deviation of
the measured signals have not been clearly detailed.
2 Experimental Set-up
The measurement campaign was carried out using the equipment set-up shown in Figure 1.
The video camera, GPS receiver antenna and an omni-directional dual-polarized
antenna were fixed on the top of the vehicle. The video camera provides a view of the
surrounding area, i.e. the land-use or clutter information around the measurement location. A
GPS receiver was used to synchronize location with the measured parameters. The omni-
directional antenna was connected to one or more of the instruments for measuring Quality of
Service parameters (spectrum analyzer, R&S EFA TV test receiver, Dibcom DV3000
evaluation kit, etc.). The measuring instruments were interfaced to a laptop computer with
a LabVIEW data-logging program preinstalled. Only measurements using the spectrum
analyzer are discussed in this paper. The notebook was programmed to sweep through the 8
MHz DTV spectrum and to store the data once every second. It sweeps through the 8 MHz
spectrum eight times in one second. The notebook records 401 field strength points for one 8
MHz sweep. The program also computes the minima, maxima and average of the 401 points.
This provides a record of the individual TV channel spectrum. The information on the road
view, location and field strength was monitored in real time and recorded.
Measurements were carried out in 47 sectors covering the entire island for Channel 37
(Mobile TV) at 602 MHz. The DVB-T standard is COFDM-2K and the modulation is QPSK.
The main transmitter is located in Bukit Batok and the transmitting antenna (horizontally
polarized) is at an elevation of 214 meters above MSL. There are 10 repeaters (vertically
polarized) connected in a single frequency network (SFN). In our measurements the field
strength recorded is inclusive of SFN gain. More details about the clutter data-base,
measurement pixels and software predictions are presented in the next section.

Fig. 1: Block diagram of the experimental set-up.
3 Methodology
Details about the sectors are given in Figure 2. The sectors are chosen to provide a good
representation of the different land-use or clutter environments of Singapore. The analysis of
the data has been carried out in two ways. Firstly, the recorded field strength values along
each sector are processed separately to study the log-normal distribution of the slow fading.
The fast fading is removed by averaging the received signals over a period or a
specified window. The window widths are selected as per Lee's method [Lee, 1985]. A log-
normal fit was made for each sector and the coverage at 50, 70, 90, 95 and 99% probability
was obtained. The distributions of the raw measured data with both the slow and fast fading
are also analyzed. Along two sectors, which are approximately radial with reference to the
transmitter, the measured field strength was compared with the Hata [Hata, 1980] and
Walfisch-Bertoni [Walfisch and Bertoni, 1988] models. In the second method the measured
field strengths are compared with the software predictions over pixels of 100mX100m.
Using ArcGIS environment a hydrologically correct Digital Elevation Model (DEM) at 100
m resolution was created from the 10m contour data. The clutter factor grid map was created
with a 100m-grid interval, using the newly modeled PR-Clutter information. Later these
datasets were converted to the prediction software working file for the signal strength
prediction and converted back to grid files with the predicted signal strength values for
further analysis in the GIS environment. Measured field strength is averaged over
100mX100m pixels and then compared to the software predictions. In an SFN environment,
the software predicts and considers the maximum signal strength available over the grid, i.e.
it could be from the main transmitter or from the repeater. It also provides the information
about the same.
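The following minimal sketch illustrates this first processing step, assuming the Lee averaging window has already been translated into a number of consecutive samples (the value 11 below is only an illustrative choice) and that the log-normal fit is performed directly on the dB values; it is a sketch of the procedure, not the exact software used in the campaign.

import numpy as np
from scipy import stats

def slow_fading(field_dbuv, window=11):
    """Running average of the received field strength (dBuV/m) to remove fast fading."""
    kernel = np.ones(window) / window
    return np.convolve(field_dbuv, kernel, mode="same")

def lognormal_fit(field_dbuv):
    """Mean and standard deviation (dB) of the log-normal (normal-in-dB) fit."""
    mu, sigma = stats.norm.fit(field_dbuv)
    return mu, sigma

# Example with synthetic samples along one hypothetical sector
raw = 60 + 6.6 * np.random.randn(590)      # illustrative drive-test record, one sample per second
slow = slow_fading(raw)
mu, sigma = lognormal_fit(slow)
print(f"mean = {mu:.1f} dBuV/m, std dev = {sigma:.2f} dB")

Averaging is done here directly on the dB values for simplicity; averaging in linear units before converting back to dB is an alternative reading of Lee's method.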

Fig. 2: Details of the Sectors Where the Measurements are Carried Out
The field measurements were taken using our measurement vehicle, integrated with GPS.
Upon completion, these log files were plotted in the GIS environment for analysis. Since the
predicted value maps are in 100m grid raster maps, the GPS measurement files were also
grided taking the average value within the 100m grids, as shown in figure 3.
Now that both the predicted signal strength and the measured signal strength maps are in the
same 100m topographical grid files, the difference between the measured and the software
predicted field strengths can be analyzed using GIS. The proprietary prediction software is
merely a shell; terrain and clutter data are required for reliable predictions. To compare the measured field strength with the software prediction we have developed a clutter model for Singapore based on the plot ratios designated by the Urban Redevelopment Authority of Singapore (URA).

Fig. 3: 100 m pixels with the measurement points
Considering the true land use and the plot ratio, we classified the whole land area into 7
categories, which are used in our prediction model. This clutter model is continuously being fine-tuned with additional information such as the openness of the area; the model will be refined to best predict the signal strength in Singapore.
4 Results
As mentioned earlier, the measured field strength values are processed to study the log-
normal distribution along all the sectors. It is found that almost all sectors show the log-
normal distribution. A typical plot of log-normal distribution is shown in figure.4. A running
average method is used to fit the measured values with log-normal distribution. From the
figure it can be seen that the curve best fits a log-normal with a standard deviation of 6.62 dB.
Cumulative Distribution Function (CDF) of the raw data with both fast fading and slow
fading components is also shown in the figure. There is little difference between the two
distributions. This could be due to the sampling rate and the speed of the vehicle. In one second a vehicle traveling at 40 km/h covers approximately 11 m, corresponding to about 22 wavelengths at 600 MHz.

Fig. 4: Log-normal fit for sector-4
The route distance for the sector-4 is 5.9 km. There are two repeater stations near to this
route. The mean value observed is 60.5 dBuV/m and the standard deviation is 6.6 dB. The
standard deviations and log-normal distribution for all the sectors were similarly fitted.
Figure 5 shows the mean and standard deviation observed in all 47 sectors.

Fig. 5: Mean and Standard Deviations Observed for All the Sectors
It can be seen from the figure that the sectors with ID 8, 27 and 38 have high mean and
standard deviation values. The obvious reason for this is that the sector 8 is very close to the
repeater at Alexandra point and sector 27 is very close to the main transmitter. Sector 38
(Clementi Ave 6) is partially open for about 1.2 km and the remaining 800 m runs through the HDB estates. Hence the field strength drops from 80 dBuV/m to about 55 dBuV/m, resulting in a high standard deviation. The variation of the standard deviation from the mean standard deviation for all the sectors (5.36 dB) is shown in figure 6.

Fig. 6: Variation of the standard deviation in each sector from the mean.
It can be seen from the figure that sector IDs 8, 12, 15, 16 and 27 have variations from the mean of about 2.5 dB. However, sector 38 has a large deviation from the mean of 5.4 dB. The
coverage at 95 and 99% of probability is obtained from the log-normal fit and from the
standard deviation.
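The coverage values follow directly from the fitted distribution: the field strength exceeded at a fraction p of locations is the fitted mean plus the standard deviation times the (1 - p) quantile of the standard normal. A small sketch, using the sector-4 values quoted above purely as an example:

from scipy.stats import norm

def field_at_coverage(mean_db, sigma_db, coverage):
    """Field strength (dBuV/m) exceeded at a given fraction of locations."""
    return mean_db + sigma_db * norm.ppf(1.0 - coverage)

for p in (0.50, 0.70, 0.90, 0.95, 0.99):
    print(p, round(field_at_coverage(60.5, 6.6, p), 1))   # sector-4 mean and std dev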
Figure 7 shows the field strength computed at 95 and 99% of probability at the 2 m level along with the minimum recommended value (for fixed reception) according to Chester 97 [CEPT, 1997]. In the measurements it is observed that the minimum field strength required for reception of a good picture is about 40 dBuV/m (fixed location). This threshold value is also shown in the figure. There are a few sectors in which the 99% values are within 2 or 3 dB of the minimum
threshold. In this study it was observed that there is no height gain from 2m to 10 m in a
location where there is no line-of-sight to the transmitter. However when there is fairly open
ground and path clearance to the transmitter, a gain of about 8-9 dB is observed. With
hindsight this is to be expected, hence extrapolation between 2m and 10m measurements
should be carried out with great care.

Fig. 7: Coverage at 95 and 99% of probability observed for all the sectors
5 Summary of Observations
Measurements carried out in all 47 sectors show that the slow fading can be modeled with a
log-normal distribution. The mean standard deviation observed is 5.36 dB. In some sectors
the standard deviation observed is higher, about 10 dB. It is observed that whenever
the vehicle passes through the HDB environment, large variations in signal strength are
observed, resulting in high standard deviations. Large standard deviations of about 14 dB
were obtained inside an HDB (building heights of about 50m) environment. This confirms
the need to characterize the variation of standard deviation with reference to the exact clutter.
A standard deviation of 5.5 dB may correspond to an ideal situation and a medium-city environment. Hence, for further DTV planning in different clutter types, an optimum value of
standard deviation has to be established. The coverage at 95 and 99% of probability shows
the mobile DTV reception in Singapore is good.
Comparison with the empirical models shows large deviations, as these models do not employ local clutter data. The Walfisch-Bertoni model assumes a uniform building height for prediction. In the Singapore HDB environment, where the spacings between the buildings are not uniform, large variations and errors would occur in the predictions. Comparison with the proprietary software predictions is reasonably good.
This initial measurement campaign highlighted the many challenging problems encountered
in the measurement, prediction and analysis of quality of service parameters for the reception of digital TV in built-up areas in Singapore. Therefore future work will
concentrate on very careful detailed smaller scale spatial measurements in HDB areas using
transmissions from one individual transmitter at a time.
References
[1] [CEPT, 1997] The Chester 1997 Multilateral Coordination Agreement relating to Technical Criteria, Coordination Principles and Procedures for the Introduction of Terrestrial Digital Video Broadcasting (DVB-T).
[ITU-recommendation PN.370-5] ITU-R Recommendation PN.370-5.
[2] [Lee, W.C.Y., 1985] Estimation of local average power of a mobile radio signal, IEEE Trans. Veh. Technol., Vol. VT-34, No. 1, pages 2-27.
[3] [M. Hata, 1980] Empirical formula for propagation loss in land mobile radio services, IEEE Trans. Veh. Technol., Vol. VT-29, No. 3, pages 317-325.
[4] [VALIDATE, 1998] Final project report.
[5] [J. Walfisch and H.L. Bertoni, 1988] A theoretical model of UHF propagation in urban environments, IEEE Trans. Antennas and Propagation, Vol. 36, No. 12, pages 1788-1796.
Gray-level Morphological Filters for Image
Segmentation and Sharpening Edges

G. Anjan Babu Santhaiah
Dept of Computer Science Dept of Computer Science
S.V. University, Tirupati ACET, Allagadda, Kurnool

Abstract

The aim of the present study is to propose a new method for tracking the edges of images. The study involves edge detection and morphological operations for sharpening edges. The detection criterion expresses the fact that important edges should not be missed. It is of paramount importance to preserve, uncover or detect the geometric structure of image objects. Thus morphological filters, which are more suitable than linear filters for shape analysis, play a major role in geometry-based enhancement and detection. A new method for image segmentation and edge sharpening based on morphological transformations is proposed. The algorithm uses the morphological transformations dilation and erosion. A gradient-determined gray-level morphological procedure for edge increase and decrease is presented. First, the maximum gradient in the local neighborhood forms the contribution to the erosion of the center pixel of that neighborhood. The gradients of the transformed image are then used as contributions to the subsequent dilation of the eroded image. The edge sharpening algorithm is applied to various sample images. The proposed algorithm segments the image while preserving important edges.
Keywords: Dilation, Erosion, Peak, Valley, Edge, Toggle contrast.
1 Introduction
Mathematical morphology stresses the role of shape in image pre-processing, segmentation and object description. Morphology is usually divided into binary mathematical morphology, which operates on binary images, and gray-level mathematical morphology, which acts on gray-level images. The two fundamental operations are dilation and erosion. Dilation expands the object to the closest pixels of the neighborhood; it combines two sets using vector addition:
X ⊕ B = {p ∈ Z² : p = x + b, x ∈ X, b ∈ B},
where X is the binary image and B is the structuring element.
Erosion shrinks the object. Erosion combines two sets using vector subtraction of set elements and is the dual operation of dilation:
X ⊖ B = {p ∈ Z² : p + b ∈ X for every b ∈ B},
where X is the binary image and B is the structuring element.
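As an illustration of these set definitions, the following minimal sketch implements binary dilation and erosion directly from the translation-based definitions above; representing X as a boolean array and B as a list of (row, col) offsets is an assumption made only for the example, not the authors' implementation.

import numpy as np

def dilate(X, B):
    """X dilated by B: a pixel is set if it equals x + b for some x in X, b in B."""
    out = np.zeros_like(X, dtype=bool)
    rows, cols = X.shape
    for dr, dc in B:
        shifted = np.zeros_like(out)
        shifted[max(0, dr):rows - max(0, -dr), max(0, dc):cols - max(0, -dc)] = \
            X[max(0, -dr):rows - max(0, dr), max(0, -dc):cols - max(0, dc)]
        out |= shifted
    return out

def erode(X, B):
    """X eroded by B: a pixel p is kept only if p + b lies in X for every b in B."""
    out = np.ones_like(X, dtype=bool)
    rows, cols = X.shape
    for dr, dc in B:
        shifted = np.zeros_like(out)
        shifted[max(0, -dr):rows - max(0, dr), max(0, -dc):cols - max(0, dc)] = \
            X[max(0, dr):rows - max(0, -dr), max(0, dc):cols - max(0, -dc)]
        out &= shifted
    return out

B = [(-1, 0), (0, -1), (0, 0), (0, 1), (1, 0)]   # 4-connected cross structuring element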
Extending morphological operators from binary to gray-level images can be done by using
set representations of signals and transforming these input sets by means of morphological set
operators. Thus, consider an image signal f(x) defined on the continuous or discrete plane ID = R² or Z² and assuming values in R̄ = R ∪ {−∞, +∞}. Thresholding f at all amplitude levels v produces an ensemble of binary images represented by the threshold sets
Θ_v(f) = {x ∈ ID : f(x) ≥ v},  −∞ < v < +∞.
The image can be exactly reconstructed from all its threshold sets since
f(x) = sup{ v ∈ R : x ∈ Θ_v(f) },
where sup denotes supremum. Transforming the threshold sets of the input signal f by a set operator Ψ and viewing the transformed sets as threshold sets of a new image creates a flat image operator ψ, whose output signal is
ψ(f)(x) = sup{ v ∈ R : x ∈ Ψ[Θ_v(f)] }.

For example, if Ψ is the set dilation or erosion by B, the above procedure creates the two most elementary morphological image operators, the flat dilation and erosion of f(x) by a set B:
(f ⊕ B)(x) = ⋁_{y∈B} f(x − y),   (f ⊖ B)(x) = ⋀_{y∈B} f(x + y),
where ⋁ denotes supremum (or maximum for finite B) and ⋀ denotes infimum (or minimum for finite B). Flat erosion (dilation) of a function f by a small convex set B reduces (increases) the peaks (valleys) and enlarges the minima (maxima) of the function. The flat opening f ∘ B = (f ⊖ B) ⊕ B of f by B smooths the graph of f from below by cutting down its peaks, whereas the closing f • B = (f ⊕ B) ⊖ B smooths it from above by filling up its valleys. The most general translation-invariant morphological dilation and erosion of a gray-level image signal f(x) by another signal g are:
(f ⊕ g)(x) = ⋁_y [ f(x − y) + g(y) ],   (f ⊖ g)(x) = ⋀_y [ f(x + y) − g(y) ].
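A hedged sketch of the flat gray-level operators using SciPy (the 3x3 footprint standing in for the small set B is an illustrative assumption):

import numpy as np
from scipy import ndimage

f = np.random.randint(0, 256, (64, 64)).astype(float)   # placeholder gray-level image
B = np.ones((3, 3), dtype=bool)                          # flat structuring element

dilated = ndimage.grey_dilation(f, footprint=B)          # local maximum over B
eroded = ndimage.grey_erosion(f, footprint=B)            # local minimum over B
opened = ndimage.grey_dilation(ndimage.grey_erosion(f, footprint=B), footprint=B)   # f o B
closed = ndimage.grey_erosion(ndimage.grey_dilation(f, footprint=B), footprint=B)   # f . B

The morphological gradient discussed in Section 3.1 is then simply the difference between the dilated and the eroded image.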

Note that signal dilation is a nonlinear convolution where the sum of products in the standard
linear convolution is replaced by a max of sums. Dilation or erosions can be combined in
many ways to create more complex morphological operations that can solve a broad variety
of problems in image analysis and nonlinear filtering. Their versatility is further strengthened
by a theory that represents a broad class of nonlinear and linear operators as a
minimal combination of erosions and dilations. Here we summarize the main results of this
theory, restricting our discussion only to discrete 2-D image signals.
Any translation-invariant set operator Ψ is uniquely characterized by its kernel,
ker(Ψ) = {X ⊆ Z² : 0 ∈ Ψ(X)}.
The kernel representation requires an infinite number of erosions or dilations. A more efficient representation uses only a substructure of the kernel, its basis Bas(Ψ), defined as the collection of kernel elements that are minimal with respect to the partial ordering ⊆. If Ψ is also increasing and upper semi-continuous, then Ψ has a nonempty basis and can be represented exactly as a union of erosions by its basis sets:
Ψ(X) = ⋃_{A ∈ Bas(Ψ)} (X ⊖ A).
The morphological basis representation has also been extended to gray-level signal operators that are translation invariant and commute with thresholding.
2 Morphological Peak/Valley Feature Detection
Residuals between openings or closings and the original image offer an intuitively simple and mathematically formal way for peak or valley detection. Specifically, subtracting from an input image f its opening by a compact convex set B yields an output consisting of the image peaks whose support cannot contain B. This is the top-hat transformation,
Peak(f) = f − (f ∘ B),
which has found numerous applications in geometric feature detection. It can detect bright blobs, i.e. regions with significantly brighter intensities relative to their surroundings. The shape of the detected peak's support is controlled by the shape of B, whereas the scale of the peak is controlled by the size of B. Similarly, to detect dark blobs, modeled as image intensity valleys, we can use the valley detector,
Valley(f) = (f • B) − f.

The morphological peak/valley detectors are simple, efficient, and have some advantages
over curvature-based approaches. Their applicability in situations in which the peaks or
valleys are not clearly separated from their surroundings is further strengthened by
generalizing them in the following way. The conventional opening is replaced by a general lattice
opening such as an area opening or opening by reconstruction. This generalization allows
more effective estimations of the image background surroundings around the peak and hence
a better detection of the peak.
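The peak and valley detectors correspond to the white and black top-hat transformations, for which SciPy provides flat implementations; the window size used below is an illustrative assumption rather than a value taken from this study.

import numpy as np
from scipy import ndimage

def peaks(f, size=5):
    """Bright blobs whose support cannot contain the size x size window (white top-hat)."""
    return ndimage.white_tophat(f, size=(size, size))

def valleys(f, size=5):
    """Dark blobs, i.e. intensity valleys, at the scale set by the window (black top-hat)."""
    return ndimage.black_tophat(f, size=(size, size))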
3 Edge or Contrast Enhancement
3.1 Morphological Gradients
Consider the difference between the flat dilation and erosion of an image f by a symmetric disk-like set B containing the origin, whose diameter diam(B) is very small:
Edge(f) = [ (f ⊕ B) − (f ⊖ B) ] / diam(B).
If f is binary, Edge(f) extracts its boundary. If f is gray-level, the above residual enhances its edges by yielding an approximation to ||∇f||, which is obtained in the limit as diam(B) → 0. Further, thresholding this morphological gradient leads to binary edge detection. The symmetric morphological gradient is the average of two asymmetric ones: the erosion gradient f − (f ⊖ B) and the dilation gradient (f ⊕ B) − f. The symmetric or asymmetric
morphological edge-enhancing gradients can be made more robust for edge detection by first
smoothing the input image with a linear blur. These hybrid edge-detection schemes that
largely contain morphological gradients are computationally more efficient and perform
comparably or in some cases better than several conventional schemes based only on linear
filters.
3.2 Toggle Contrast Filter
Consider a gray level image f[x] and small-size symmetric disk like structuring element b
containing the origin. The following discrete nonlinear filter can enhance the local contrast of
f by sharpening its edges:
ψ(f)[x] = (f ⊕ B)[x]   if (f ⊕ B)[x] − f[x] ≤ f[x] − (f ⊖ B)[x],
ψ(f)[x] = (f ⊖ B)[x]   if (f ⊕ B)[x] − f[x] > f[x] − (f ⊖ B)[x].
At each pixel x, the output value of this filter toggles between the value of the dilation of f by B at x and the value of its erosion by B, according to which is closer to the input value f[x]. The toggle filter is usually applied not only once but is iterated. The more iterations, the more contrast enhancement. Further, the iterations converge to a limit (fixed point) reached after a finite number of iterations.
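A minimal sketch of this toggle contrast iteration, assuming a square flat structuring element and a fixed iteration cap (both illustrative choices):

import numpy as np
from scipy import ndimage

def toggle_contrast(f, size=3, iterations=5):
    g = f.astype(float)
    for _ in range(iterations):
        dil = ndimage.grey_dilation(g, size=(size, size))
        ero = ndimage.grey_erosion(g, size=(size, size))
        g_new = np.where(dil - g <= g - ero, dil, ero)   # snap to the closer extreme
        if np.array_equal(g_new, g):                     # fixed point reached
            break
        g = g_new
    return g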
4 Experimental Results
We now turn to several experiments made with the operators introduced above. For all tests an 8-neighborhood system of order 1 is used. Two example images are considered, a 64x64 Lincoln image and a 64x64 Mona Lisa image. The results are shown below.
Toggle contrast enhancement: original Mona Lisa, after erosion, after dilation, after toggle contrast.
Feature detection based on peaks: original Lincoln image, after opening, after closing, after peak detection.
Feature detection based on valleys: original Lincoln image, after closing, after opening, after valley detection.
Edge detection based on the morphological gradient: original Lincoln image, after dilation, after erosion, after gradient.

5 Conclusion
Image processing is a collection of techniques that improve the quality of a given image in some sense, and the techniques developed are mainly problem oriented. In this paper a morphological approach is taken; the edges in the images are thickly marked and are better visible than with the primitive operations. The rank filter algorithm described in the present study has the potential to generate new concepts in the design of constrained filters. In morphology, dilation is performed if the central value of the kernel is less than n, and erosion is performed if it is greater than n. These two are contradictory transformations, and the resultant images require an in-depth study.
A new algorithm for image segmentation and edge sharpening has been implemented using morphological transformations. The edge sharpening operator illustrates that it can be useful to consider edges as a two-dimensional surface, which allows the combination of gradient direction and magnitude information. Edge sharpening is useful for the extraction of phase regions, although it does not have much effect when applied to diagonal edges. Sharp edges have been detected by this algorithm. The algorithm has been tested on various images and the results have been verified.
References
[1] [H.J.A.M. Heijmans] Morphological Image Operators (Academic, Boston, 1994).
[2] [H.P. Kramer and J.B. Buckner] Iterations of a nonlinear transformation for enhancement of digital images, Pattern Recognition, 7, 53-58 (1975).
[3] [P. Maragos and R.W. Schafer] Morphological filters. Part I: Their set-theoretic analysis and relations to linear shift-invariant filters. Part II: Their relations to median, order-statistic and stack filters, IEEE Trans. Acoust. Speech Signal Process., 35, 1153-1184 (1987).
[4] [F. Meyer] Contrast feature extraction, in special issue of Practical Metallography, J.L. Chermant, Ed. (Riederer-Verlag, Stuttgart, 1978), pp. 374-380.
[5] [S. Osher and L.I. Rudin] Feature-oriented image enhancement using shock filters, SIAM J. Numer. Anal., 27, 919-940 (1990).
[6] [P. Salembier] Adaptive rank order based filters, Signal Process., 27, 1-25 (1992).
[7] [J. Serra, Ed.] Image Analysis and Mathematical Morphology (Academic, New York, 1982).
[8] [J. Serra, Ed.] Image Analysis and Mathematical Morphology, Vol. 2: Theoretical Advances (Academic, New York, 1988).
Watermarking for Enhancing Security
of Image Authentication Systems

S. Balaji B. MouleswaraRao N. Praveena
K.L. College of Engineering K.L. College of Engineering K.L. College of Engineering
Green Fields, Vaddeswaram Green Fields, Vaddeswaram Green Fields, Vaddeswaram
Guntur 522502 Guntur 522502 Guntur - 522502
ecmhod_klce@yahoo.com mbpalli@yahoo.com nveena-4u@yahoo.co.in

Abstract

Digital watermarking techniques can be used to embed proprietary
information, such as a company logo, in the host data to protect the intellectual
property rights of that data. They are also used for multimedia data
authentication. Encryption can be applied to biometric templates for increasing
security; the templates (that can reside in either (i) a central database, (ii) a
token such as smart card, (iii) a biometric-enabled device such as a cellular
phone with fingerprint sensor) can be encrypted after enrolment. Then, during
authentication, these encrypted templates can be decrypted and used for
generating the matching result with the biometric data obtained online. As a
result, encrypted templates are secured, since they cannot be utilized or
modified without decrypting them with the correct key, which is typically
secret. However, one problem associated with this system is that encryption
does not provide security once the data is decrypted. Namely, if there is a
possibility that the decrypted data can be intercepted, encryption does not
address the overall security of the biometric data. On the other hand, since
watermarking involves embedding information into the host data itself (e.g.,
no header-type data is involved), it can provide security even after decryption.
The watermark, which resides in the biometric data itself and is not related to
encryption-decryption operations, provides another line of defense against
illegal utilization of the biometric data. For example, it can provide a tracking
mechanism for identifying the origin of the biometric data (e.g., FBI). Also,
searching for the correct decoded watermark information during authentication
can render the modification of the data by a pirate useless, assuming that the
watermark embedding-decoding system is secure. Furthermore, encryption
can be applied to the watermarked data (but the converse operation, namely,
applying watermarking to encrypted data is not logical as encryption destroys
the signal characteristics such as redundancy, that are typically used during
watermarking), combining the advantages of watermarking and encryption
into a single system. In this paper we address all the above issues and explore
the possibility of utilizing watermarking techniques for enhancing security of
image authentication systems.
1 Introduction
While biometric techniques have inherent advantages over traditional personal identification
techniques, the problem of ensuring the security and integrity of the biometric data is critical.
For example, if a person's biometric data (e.g., her fingerprint image) is stolen, it is not
possible to replace it, unlike replacing a stolen credit card, ID, or password. It is pointed out
that a biometrics-based verification system works properly only if the verifier system can
guarantee that the biometric data came from the legitimate person at the time of enrolment.
Furthermore, while biometric data provide uniqueness, they do not provide secrecy. For
example, a person leaves fingerprints on every surface she touches and face images can be
surreptitiously observed anywhere that person looks. Hence, the attacks that can be launched
against biometric systems have the possibility of decreasing the credibility of a biometric
system.
2 Generic Watermarking Systems
Despite the obvious advantages of digital environments for the creation, editing and
distribution of multimedia data such as image, video, and audio, there exist important
disadvantages: the possibility of unlimited and high-fidelity copying of digital content poses a
big threat to media content producers and distributors. Watermarking, which can be defined
as embedding information such as origin, destination, and access levels of multimedia data
into the multimedia data itself, was proposed as a solution for the protection of intellectual
property rights.
The flow chart of a generic watermark encoding and decoding system is given in Fig. 1. In
this system, the watermark signal (W) that is embedded into the host data (X) can be a
function of watermark information (I) and a key (K) as in
W = f0(I, K),
or it may also be related to the host data as in
W = f0(I, K, X).

Fig. 1: Digital watermarking block diagram: (a) watermark encoding, (b) watermark decoding.
The watermark information (I) is the information such as the legitimate owner of the data that
needs to be embedded in the host data. The key is optional (hence shown as a dashed line in
Fig. 1) and it can be utilized to increase the security of the entire system; e.g., it may be used
to generate the locations of altered signal components, or the altered values. The watermark is
embedded into host data to generate watermarked data
Y = f1(X, W).
In watermark decoding, the embedded watermark information or some confidence measure
indicating the probability that a given watermark is present in the test data (the data that is
possibly watermarked) is generated using the original data as
Î = g(X, Y, K),
or without using the original data as
Î = g(Y, K).
Also, it may be desirable to recover the original, non-watermarked data X in some applications, such as reversible image watermarking. In those cases, an estimate X̂ of the original data is also generated.
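As one concrete instance of the generic scheme W = f0(I, K) and Y = f1(X, W), the hedged sketch below embeds a single bit using a key-seeded pseudo-random pattern and decodes it blindly by correlation; the embedding strength alpha and the zero-threshold detector are illustrative assumptions, not the method of any particular system discussed here.

import numpy as np

def encode(X, bit, key, alpha=2.0):
    """Additive embedding: W = f0(I, K) is a +/-1 pattern seeded by K and scaled by alpha."""
    rng = np.random.default_rng(key)
    W = alpha * (1 if bit else -1) * rng.choice([-1.0, 1.0], size=X.shape)
    return X + W                       # Y = f1(X, W)

def decode(Y, key):
    """Blind decoding I = g(Y, K): the sign of the correlation with the key pattern gives the bit."""
    rng = np.random.default_rng(key)
    pattern = rng.choice([-1.0, 1.0], size=Y.shape)
    return int(np.mean(Y * pattern) > 0)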
In watermark embedding, it is desired to keep the effects of watermark signal as
imperceptible as possible in invisible watermarking applications: the end user should not
experience a quality degradation in the signal (e.g., video) due to watermarking. For this
purpose, some form of masking is generally utilized. For example, the frequency masking
properties of the human auditory system (HAS) can be considered in designing audio
watermark signals. Similarly, the masking effect of edges can be utilized in image
watermarking systems. Conversely, in visible watermarking applications it is not necessary to consider such masking, as the actual aim of the application is to robustly mark the data, for example by embedding copyright data in the form of logos for images available over the Internet. An example of visible image watermarking is given in Fig. 2.

Fig. 2: Visible Image Watermark.
Although there exist watermarking methods for almost all types of multimedia data, the
number of image watermarking methods is much larger than the other types of media. In text
document watermarking, generally the appearance of an entity in the document body is
modified to carry watermark data. For example, the words in a sentence can be shifted
slightly to the left or right, the sentences themselves can be shifted horizontally, or the
features of individual characters can be modified (Fig. 3). Although text document
watermarks can be defeated relatively easily by retyping or Optical Character Recognition
(OCR) operations, the ultimate aim of making unauthorized copies of the document more
expensive in terms of effort/time/money than obtaining the legal rights from copyright owner
can still be achieved.

Fig. 3: Text Watermarking Via Word-Shift Coding.
Some authors claim that image watermarking methods can be applied to video, since a video
can be regarded as a sequence of image frames. But the differences that reside in available
signal space (much larger in video) and processing requirements (real time processing may be
necessary for video) require methods specifically designed for video data. Sample methods
modify the motion vectors associated with specific frames or the labeling of frames to embed
data. Audio watermarking techniques are generally based on principles taken from spread-
spectrum communications. Modifying audio samples with a pseudo-randomly generated
noise sequence is a typical example.
In image watermarking, the watermark signal is either embedded into the spatial domain
representation of the image, or one of many transform domain representations such as DCT,
Fourier, and wavelet. It is generally argued that embedding watermarks in transform domains
provides better robustness against attacks and leads to less perceptibility of an embedded
watermark due to the spread of the watermark signal over many spatial frequencies and better
modeling of the human visual system (HVS) when using transform coefficients. An example
of watermarking in the spatial domain is given in Fig. 4(b). Amplitude modulation is applied
to the blue channel pixels to embed the 32-bit watermark data, represented in decimal form as
1234567890. This is a robust watermarking scheme: the watermark data can be retrieved
correctly even after the watermarked image is modified. For example, the embedded data
1234567890 is retrieved after the watermarked image is (i) blurred via filtering the image
pixels with a 5x5 structuring element (Fig. 4(c)), and (ii) compressed (via JPEG algorithm
with a quality factor of 75) and decompressed (Fig. 4(d)). A specific class of watermarks, called fragile watermarks, is typically used for authenticating multimedia data. Unlike robust watermarks (e.g., the one given in Fig. 4), any attack on the image invalidates the fragile watermark present in the image and helps in detecting/identifying any tampering of the image. Hence, a fragile watermarking scheme may need to possess the following features:
(i) detecting tampering with high probability, (ii) being perceptually transparent, (iii) not
requiring the original image at decoding site, and (iv) locating and characterizing
modifications to the image.
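A hedged sketch of blue-channel amplitude modulation in the spirit of the robust scheme described above (it is not the exact method used for Fig. 4): each payload bit is added at key-selected blue-channel pixels and recovered by comparing the marked pixels with a prediction from their four neighbours. The strength q, the 64 positions per bit and the prediction window are illustrative assumptions.

import numpy as np

def embed_blue(img, payload_bits, key, q=6, reps=64):
    """img: HxWx3 uint8 array (RGB order assumed); payload_bits: e.g. 32 bits of 0/1."""
    out = img.astype(float)
    rng = np.random.default_rng(key)
    h, w, _ = img.shape
    for bit in payload_bits:
        rows = rng.integers(2, h - 2, reps)
        cols = rng.integers(2, w - 2, reps)
        out[rows, cols, 2] += q if bit else -q      # channel 2 assumed to be blue
    return np.clip(out, 0, 255).astype(np.uint8)

def extract_blue(img, n_bits, key, reps=64):
    blue = img[:, :, 2].astype(float)
    rng = np.random.default_rng(key)                # replays the same pixel positions
    h, w = blue.shape
    bits = []
    for _ in range(n_bits):
        rows = rng.integers(2, h - 2, reps)
        cols = rng.integers(2, w - 2, reps)
        # predict each marked pixel from its 4-neighbour average and test the mean shift
        pred = (blue[rows - 1, cols] + blue[rows + 1, cols] +
                blue[rows, cols - 1] + blue[rows, cols + 1]) / 4.0
        bits.append(int(np.mean(blue[rows, cols] - pred) > 0))
    return bits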

Fig. 4: Image watermarking: (a) original image (640x480, 24 bpp), (b) watermarked image carrying the data 1234567890, (c) image blurred after watermarking, (d) image JPEG compressed-decompressed after watermarking.
3 Fingerprint Watermarking Systems
There have been only a few published papers on watermarking of fingerprint images. Ratha
et al. proposed a data hiding method, which is applicable to fingerprint images compressed
with the WSQ (Wavelet Scalar Quantization) wavelet-based scheme. The discrete wavelet
transform coefficients are changed during WSQ encoding, by taking into consideration
possible image degradation. Fig. 5 shows an input fingerprint image and the image obtained after the data embedding-compressing-decompressing cycle. The input image was obtained using an optical sensor. The compression ratio was set to 10.7:1 and the embedded data (randomly generated bits) size was nearly 160 bytes. As seen from these images, the image quality does not suffer significantly due to data embedding, even though the data size is considerable.

Fig. 5: Compressed-domain fingerprint watermarking: (a) input fingerprint, (b) data embedded-compressed-decompressed fingerprint.
Pankanti and Yeung proposed a fragile watermarking method for fingerprint image verification. A spatial watermark image is embedded in the spatial domain of a fingerprint image by utilizing a verification key. Their method can localize any region of the image that has been tampered with after it was watermarked; therefore, it can be used to check the integrity of the fingerprints. Fig. 6 shows a sample watermark image comprised of a company logo, and the watermarked image. Pankanti and Yeung used a database comprised of 1,000 fingerprints (4 images each for 250 fingers). They calculated the Receiver Operating Characteristics (ROC) curves before and after the fingerprints were watermarked. These curves are observed to be very close to each other, indicating that the proposed technique does not lead to a significant performance loss in fingerprint verification.

Fig. 6: Fragile fingerprint watermarking: (a) watermark image, (b) fingerprint image carrying the image in (a).
4 Architecture of the Proposed System
Two application scenarios are considered in this study. The basic data hiding method is the
same in both scenarios, but it differs in the characteristics of the embedded data, the host
image carrying that data, and the medium of data transfer. While a fingerprint feature vector or face feature vector is used as the embedded data, other information such as a user name (e.g., "John Doe"), a user identification number (e.g., "12345"), or an authorizing institution (e.g., "FBI") can also be hidden in the images. In this paper we explore the first scenario.

Fig. 7: Fingerprint watermarking results: (a) input fingerprint, (b) fingerprint image watermarked using gradient orientation, (c) fingerprint image watermarked using singular points.
The first scenario involves a steganography-based application (Fig. 8): the biometric data
(fingerprint minutiae) that need to be transmitted (possibly via a non-secure communication
channel) are hidden in a host (also called cover or carrier) image, whose only function is to
carry the data. For example, the fingerprint minutiae may need to be transmitted from a law
enforcement agency to a template database, or vice versa. In this scenario, the security of the
system is based on the secrecy of the communication. The host image is not related to the
hidden data in any way. As a result, the host image can be any image available to the
encoder. In our application, we consider three different types of cover images: a synthetic
fingerprint image, a face image, and an arbitrary image (Fig. 9). The synthetic fingerprint
image (360x280) is obtained after post-processing of an image generated using the algorithm described by Cappelli. Using such a synthetic fingerprint image to carry actual fingerprint minutiae data provides an increased level of security, since a person who intercepts the communication channel and obtains the carrier image is likely to treat this synthetic image itself as a real fingerprint image and not realize that it is in fact carrying the critical data. The face image (384x256) was captured in our Regional Forensic Science Lab, Vijayawada. The Sailboat image (512x512) is taken from the USC-SIPI database.

Fig. 8: Block diagram of application scenario
This application can be used to counter the seventh type of attack (namely, compromising the communication channel between the database and the fingerprint matcher).

(a) (b) (c)
Fig. 9: Sample Cover images: (a) Synthetic Fingerprint, (b) Face, (c) Sailboat.
An attacker will probably not suspect that a cover image is carrying the minutiae information.
Furthermore, the security of the transmission can be further increased by encrypting the host
image before transmission. Here, symmetric or asymmetric key encryption can be utilized,
depending on the requirements of the application such as key management, coding-decoding
time (much higher with asymmetric key cryptography), etc. The position and orientation
attributes of fingerprint minutiae constitute the data to be hidden in the host image.
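A minimal sketch of this steganographic scenario, hiding the minutiae attributes in the least-significant bits of the cover image. The byte packing (a 2-byte count header, then 2+2+1 bytes per minutia with the orientation quantized to one byte) is an illustrative assumption rather than the format used by the authors.

import numpy as np

def hide_minutiae(cover, minutiae):
    """cover: uint8 image array; minutiae: list of (x, y, theta_degrees) tuples."""
    payload = bytearray(len(minutiae).to_bytes(2, "big"))
    for x, y, theta in minutiae:
        payload += int(x).to_bytes(2, "big") + int(y).to_bytes(2, "big")
        payload += (int(theta) % 256).to_bytes(1, "big")   # orientation quantized to one byte
    bits = np.unpackbits(np.frombuffer(bytes(payload), dtype=np.uint8))
    flat = cover.flatten()                                 # flatten() returns a copy
    if bits.size > flat.size:
        raise ValueError("cover image too small for the minutiae payload")
    flat[:bits.size] = (flat[:bits.size] & 0xFE) | bits    # overwrite least-significant bits
    return flat.reshape(cover.shape)

def recover_minutiae(stego):
    flat = stego.flatten()
    n = int.from_bytes(np.packbits(flat[:16] & 1).tobytes(), "big")
    data = np.packbits(flat[16:16 + n * 40] & 1).tobytes()  # 40 LSBs = 5 bytes per minutia
    return [(int.from_bytes(data[5 * i:5 * i + 2], "big"),
             int.from_bytes(data[5 * i + 2:5 * i + 4], "big"),
             data[5 * i + 4]) for i in range(n)]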
5 Conclusion
The ability of biometrics-based personal identification techniques to differentiate between an
authorized person and an impostor who fraudulently acquires the access privilege of an
authorized person is one of the main reasons for their popularity compared to traditional
identification techniques. However, the security and integrity of the biometric data itself raise
important issues, which can be ameliorated using encryption, watermarking, or
steganography. In addition to watermarking, encryption can also be used to further increase the security of biometric data. Our first application is related to increasing the security of biometric data exchange and is based on steganography. The verification accuracy based on decoded watermarked images is very similar to that with the original images.
The proposed system can be coupled with a fragile watermarking scheme to detect
illegitimate modification of the watermarked templates.
References
[1] [D. Maio, D. Maltoni, R. Cappelli, J.L. Wayman and A.K. Jain, FVC2004] Third Fingerprint Verification Competition, in Proc. International Conference on Biometric Authentication (ICBA).
[2] [D. Maio, D. Maltoni, R. Cappelli, J.L. Wayman and A.K. Jain, FVC2002] Second Fingerprint Verification Competition, in Proc. International Conference on Pattern Recognition.
[3] [S. Pankanti and M.M. Yeung] Verification watermarks on fingerprint recognition and retrieval, in Proc. SPIE, Security and Watermarking of Multimedia Contents, Vol. 3657, pages 66-78, 2006.
[4] [N.K. Ratha, J.H. Connell, and R.M. Bolle] Secure data hiding in wavelet compressed fingerprint images, in Proc. ACM Multimedia, pages 127-130, 2007.
[5] [N.K. Ratha, J.H. Connell, and R.M. Bolle] An analysis of minutiae matching strength, in Proc. AVBPA 2001, Third International Conference on Audio- and Video-Based Biometric Person Authentication, pages 223-228, 2006.
[6] [A.K. Jain, S. Prabhakar, and S. Chen] Combining multiple matchers for a high security fingerprint verification system, Pattern Recognition Letters, Vol. 20, pp. 1371-1379, 2005.
Unsupervised Color Image Segmentation
Based on Gaussian Mixture Model
and Uncertainty K-Means

Srinivas Yarramalle Satya Sridevi. P
Department of Information Technology, M.Tech (CST) (CL)
Vignans IIT, Visakhapatnam-46 Acharya Nagarjuna University
yarramalle_s@yahoo.com

Abstract

In this paper we propose a new model of image segmentation based on Finite
Gaussian Mixture Model and UK-Means algorithm. In the Gaussian mixture
model the pixels inside image region follows the Gaussian distribution and the
image is assumed to be a mixture of these Gaussians. The initial components
of the image are estimated by using the UK-means algorithm. This method
does not totally depend on random selection of parameters; hence it is reliable and sustainable and can be used for unsupervised image data. The
performance of this algorithm is demonstrated by color image segmentation.
Keywords: Gaussian Mixture Model, K-Means algorithm, UK-Means, Segmentation
1 Introduction
Image segmentation is a key process of image analysis, with applications to pattern recognition, object detection and medical image analysis. A number of image segmentation algorithms based on histograms [1], models [2], saddle points [3], Markovian approaches [4], etc., have been proposed. Among these, model-based image segmentation has gained importance since the segmentation uses the parameters of each of the pixels. Segmentation algorithms differ from application to application; there exists no algorithm which suits all purposes [5]. A further advantage of image segmentation is that, by compressing some segments, communication can be made possible while saving network resources.
To segment an image one can use models based on the Bayesian classifier, Markov models, graph-cut approaches, etc. Depending on these models, there are three major approaches to segmenting an image: 1) edge based, 2) region based and 3) Gaussian mixture model based. Among these, Gaussian mixture model based image segmentation has gained popularity [4][5][6].
Here we assume that each pixel in the image follows a Normal (Gaussian) distribution with its own mean and variance, so the image as a whole is modeled as a Gaussian mixture. To identify the pixel density and to estimate the mixture densities of the image, a joint entropy algorithm is used. The segmentation process is carried out by clustering each pixel of the image data based on homogeneity. This method is stable. The main disadvantages in image segmentation are that, if the number of components for the Gaussian mixture model is assumed to be known a priori, the segments may not be effective, and that the initialization of parameters may greatly affect the segmentation result. Hence, to estimate the parameters efficiently, the UK-Means algorithm is used in our model.
2 Gaussian Mixture Model
Image segmentation is a process of dividing the image such that homogeneous pixels come together. A pixel is defined as a random variable which varies over a two-dimensional space. To understand and interpret the pattern of the pixels inside an image region, one has to fit a model. Here each pixel is assumed to follow a Gaussian distribution and the entire image is a mixture of these Gaussian variates. The basic methodology for segmenting an image is to find the number of clusters effectively so that the homogeneous pixels come together. If a feature, texture or pattern is known, it is easy to segment the image based on these patterns. However, for realistic data we cannot know the number of clusters in advance, hence the UK-Means algorithm is used to identify the number of clusters inside the image. Once the number of clusters is identified, for each image region we estimate the model parameters μ_i, σ_i and p_i (where μ_i is the mean, σ_i is the standard deviation and p_i is the mixing weight).
2.1 The Probability Density Function of Gaussian Mixture Model
An image is a matrix where each element is a pixel. The value of the pixel is a number that gives the intensity or color of the image. Let X be a random variable that takes these values. For a probability model we suppose a mixture of Gaussian distributions of the following form:
f(x) = Σ_{i=1}^{K} p_i N(x | μ_i, σ_i²)        (1)
where K is the number of regions to be estimated and the p_i > 0 are weights such that Σ_{i=1}^{K} p_i = 1, and
N(x | μ_i, σ_i²) = (1 / √(2π σ_i²)) exp( −(x − μ_i)² / (2σ_i²) )        (2)
where μ_i and σ_i are the mean and standard deviation of region i. The parameters of each region are
θ = (p_1, …, p_K, μ_1, …, μ_K, σ_1², …, σ_K²).
To estimate the number of image regions the UK-Means algorithm is used.
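A small sketch evaluating the mixture density of Eqs. (1)-(2) for given region parameters; the two-region example values are illustrative assumptions only.

import numpy as np

def gaussian_mixture_pdf(x, p, mu, sigma):
    """Evaluate f(x) = sum_i p_i N(x | mu_i, sigma_i^2) for scalar or array x."""
    x = np.asarray(x, dtype=float)[..., None]      # broadcast over the K regions
    comp = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return np.sum(p * comp, axis=-1)

# Example: a hypothetical two-region mixture over gray values 0..255
p, mu, sigma = np.array([0.4, 0.6]), np.array([60.0, 170.0]), np.array([15.0, 20.0])
density = gaussian_mixture_pdf(np.arange(256), p, mu, sigma)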
3 K-Means Clustering
In this section the K-Means algorithm is discussed; the UK-Means algorithm is then presented in Section 3.1.
K-Means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume K clusters) fixed a priori. The main idea is to define K centroids, which should be placed in a cunning way because different locations cause different results; the better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the given data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate K new centroids as centers of the clusters resulting from the previous step. After we have these K new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has thus been generated. As a result of this loop we may notice that the K centroids change their location step by step until no more changes are made; in other words, the centroids do not move any more. Finally, this algorithm aims at minimizing an objective function, in this case a squared error function.
The objective function is
J = Σ_{j=1}^{K} Σ_{i=1}^{n} || x_i^(j) − c_j ||²,
where || x_i^(j) − c_j ||² is a chosen distance measure between a data point x_i^(j) and the cluster centre c_j, and J is an indicator of the distance of the n data points from their respective cluster centres.
The algorithm is composed of the following steps:
1. Place K points into the space represented by the objects that are being clustered.
These points represent initial group centroids
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat steps 2 and 3 until the centroids no longer move. This produces a separation of
the objects into groups from which the metric to be minimized can be calculated.
Although it can be proved that the procedure will always terminate, the K-means algorithm does not necessarily find the optimal configuration corresponding to the global objective function minimum. The algorithm is also significantly sensitive to the initial randomly selected cluster centers. The K-means algorithm can be run multiple times to reduce this effect.
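A minimal sketch of the standard K-Means loop described above (initial centroids drawn from the data, nearest-centroid assignment, centroid recomputation until the centroids stop moving):

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """X: (n, d) array of data points (e.g. pixel colors); returns labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)                       # step 2: assign to nearest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])   # step 3: recompute
        if np.allclose(new_centroids, centroids):               # step 4: stop when stable
            break
        centroids = new_centroids
    return labels, centroids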
3.1 K-Means for Uncertain Data
The clustering algorithm with the goal of minimizing the expected sum of squared errors E(SSE) is called the UK-Means algorithm. The UK-Means algorithm suits best when the data is unsupervised. The UK-Means calculations are presented below:
E(SSE) = Σ_{j=1}^{K} Σ_{X_i ∈ C_j} E( || X_i − c_j ||² ),
where E( || X_i − c_j ||² ) is the expected distance between a data point X_i and the cluster mean c_j. The distance is measured using the Euclidean distance,
|| X − Y || = √( Σ_{i=1}^{k} (x_i − y_i)² ),
and each cluster mean is recomputed as
c_j = (1 / |C_j|) Σ_{X_i ∈ C_j} X_i.


The UK-Means algorithm is as follows:
1. Assign initial values for cluster means c1 to ck
2. repeat
3. for i=1 to n do
4. Assign each data point xi to the cluster Cj for which the expected distance E(||xi − Cj||²) is smallest
5. end for
6. for j=1 to K do
7. Recalculate Cluster mean of cluster Cj
8. end for
9. until convergence
10. return
The main difference between UK-Means and K-Means clustering lies in the computation of distances and clusters: UK-Means computes the expected distances and cluster centroids based on the data uncertainty.
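If each uncertain data point is modelled, for illustration, as having a mean mu_i and covariance Sigma_i, the expected squared Euclidean distance to a centroid c has the closed form E||X_i − c||² = ||mu_i − c||² + tr(Sigma_i). A hedged sketch of the UK-Means assignment step using that identity is given below; this particular uncertainty model is an assumption of the example, not part of this paper.

import numpy as np

def expected_sq_dist(mu_i, cov_i, c):
    """E||X_i - c||^2 for an uncertain point with mean mu_i and covariance cov_i."""
    diff = mu_i - c
    return diff @ diff + np.trace(cov_i)

def uk_means_assign(means, covs, centroids):
    """Assign each uncertain point to the centroid with the smallest expected squared error."""
    labels = []
    for mu_i, cov_i in zip(means, covs):
        d = [expected_sq_dist(mu_i, cov_i, c) for c in centroids]
        labels.append(int(np.argmin(d)))
    return np.array(labels)

Note that with this particular model the trace term is the same for every centroid, so it changes the value of E(SSE) but not the assignment itself; other uncertainty models do change the assignment.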
4 Performance Evaluation and Conclusion
The performance of the above two methods is evaluated on two images, Sheep and Mountains, using image quality metrics such as signal-to-noise ratio (SNR) and mean square error (MSE). The original and segmented images are shown in Fig. 1.

Fig. 1: Segmented image
From the above images it can easily be seen that the image segmentation method based on UK-Means gives the best results: the edges inside the image are clear. The performance evaluation of the two methods is given in Table 1.
Table 1: Performance evaluation of the two segmentation methods.

Image        GMM + K-Means           GMM + UK-Means
             SNR        MSE          SNR        MSE
Sheep        38.2       0.7          47.3       0.4
Mountains    32.7       0.8          41.7       0.6
From Table 1 it can easily be seen that the signal-to-noise ratio for the segmentation algorithm based on UK-Means is higher; a higher signal-to-noise ratio means a lower error, which implies that the output image is very close to the input image.
References
[1] C.A. Glasbey, An analysis of histogram-based thresholding algorithms, CVGIP, Vol. 55, pages 532-537.
[2] Michael Chau et al., Uncertain K-Means algorithm, Proceedings of the Workshop on Sciences of the Artificial, December 7-8, 2005.
[3] M. Brezeail and M. Sonka, Edge based image segmentation, IEEE World Congress on Computational Intelligence, pages 814-819, 1998.
[4] L. Chan et al., Image texture classification based on finite Gaussian mixture models, IEEE Transactions on Image Processing, 2001.
[5] Rahman Farnoosh, Gholamhossein Yari, Image segmentation using Gaussian mixture model, Proceedings of Pattern Recognition, 2001.
[6] S.K. Pal, N.R. Pal, A review of image segmentation techniques, IEEE Transactions on Image Processing, 1993.
[7] Bo Zhao, Zhongxing Zhu, Enrong Mao and Zhenghe Song, Image segmentation based on Ant Colony Optimization and K-means clustering, Proceedings of the IEEE International Conference on Automation and Logistics, 2007.
Recovery of Corrupted Photo
Images Based on Noise Parameters
for Secured Authentication

Pradeep Reddy CH Srinivasulu D Ramesh R
Jagans College of Engg & Tech. Narayana Engg. College Narayana Engg. College
Nellore Nellore Nellore
pradeep1417@gmail.com dsnvas@gmail.com rameshragala@gmail.com

Abstract

Photo-image authentication is an interesting and demanding field in image
processing mainly for reasons of security. In this paper, photo-image
authentication refers to the verification of corrupted facial image of an
identification card, passport or smart card based on its comparison with the
original image stored in database. This paper concentrates on noise parameter
estimation. In the training phase, a list of corrupted images is generated by
adjusting the contrast, brightness and Gaussian noise of the original image
stored in the database and then PCA (Principal Component Analysis) training
is given to generated images. In the testing phase, the Gaussian noise is
removed from the scanned photo image using wiener filter. Then, linear
coefficients are calculated based on LSM (Least Square Method) method and
noise parameters are estimated. Based on these estimated parameters,
corrupted images are synthesized. Finally, comparing the synthesized image
with the scanned photo image using normalized correlation method performs
authentication. The proposed method can be applied to various fields of image
processing such as photo image verification for credit cards and automatic
teller machine (ATM) transactions, smart card identification systems,
biometric passport systems, etc.
1 Introduction
1.1 Background
Photo-image authentication plays an important role in a wide range of applications such as biometric passport systems, smart card authentication, and identification cards that use a photo-image for authentication. This paper mainly concentrates on authenticating corrupted photo-images; the proposed method provides secure authentication against forgery and offers accurate and efficient authentication. Authentication is still a challenge for researchers because corrupted images can easily be forged. A study of the different methodologies and of noise parameter estimation gives an insight into which algorithm should be used to estimate the noise parameters from the original image and the generated corrupted images.
1.2 Related Work
Research on face authentication has been carried out for a long time, and several researchers have analyzed various methods for dealing with corrupted images. In previous studies, modeling the properties of the noise, as opposed to those of the image itself, has been used for the purpose of removing the noise from the image. These methods use local properties to remove noise, but they cannot totally remove noise that is distributed over a wide area. In addition, similar noise properties in different regions of the image affect the noise removal process. Also, these methods do not support the recovery of regions which are damaged due to noise or occlusion. The corruption of photo-images is a commonly occurring phenomenon and a source of serious problems in many practical applications such as face authentication [6]. There are several approaches to solving the noise problem without taking multiple training images per person. Robust feature extraction in corrupted images uses polynomial coefficients; it mainly works on Gaussian and white noise. Reconstruction of the missing parts in partially corrupted images can be done using Principal Component Analysis and Kernel Principal Component Analysis, but this requires multiple images to produce good results and is not efficient for real-time face-image authentication. In face authentication based on virtual views, a single image is used in the training set; this method gives good performance for various kinds of poses, but difficulties are encountered in the case of virtual view generation for occluded regions.
2 Proposed Work
Our intended approach is to estimate the noise parameters using Least Squares Minimization
method. In order to authenticate the corrupted photo-images, the proposed method has a
training phase and a testing phase. In the training phase adjusting the parameters of contrast,
brightness and Gaussian blur of an original photo-image generates corrupted images. Then,
basic vectors for the corrupted images and the noise parameters are obtained. In the testing
phase, first, the Gaussian noise is removed from the test-image by using the Wiener filter.
Then linear coefficients are computed by decomposition of the noise model, and then the
noise parameters are estimated by applying these coefficients to the linear composition of the
noise parameters. Subsequently, a synthesized image is obtained by applying the estimated
parameters to the original image contained in the database. Finally, photo-image authentication is performed by comparing the synthesized photo-image with the corrupted photo-image.
2.1 Noise Model
Digital images are prone to various types of noises like blur, speckles, stains, scratches,
folding lines, salt and pepper. There are several ways in which a noise can be introduced into
an image, depending on how the image is created. There are various types of noises present in
the digital images. They are Gaussian noise, Poisson noise, Speckle noise and Impulse noise.
Generally, noise can be introduced at the time of acquisition and transmission. Gaussian noise is the one most frequently added in the image acquisition process and has a strong effect on the image. This paper mainly concentrates on Gaussian noise. Three noise parameters, namely contrast, brightness and Gaussian noise, are considered in the current study because the noise in the corrupted image can be synthesized by a combined adjustment of the three noise parameters. The contrast and brightness of an image are changed for generating corrupted images in the following manner:
I_cb = c * Image(x, y) + b,
where c is the contrast parameter, b is the brightness parameter and I_cb is the image corrupted by the
changes of contrast and brightness.
Gaussian blur can be generated by
I_G = I_org(x, y) * Gblur(x, y)
where * denotes convolution with a Gaussian kernel Gblur.
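As an illustration of how such corrupted training images might be generated, the following is a minimal
sketch (not the authors' implementation); it assumes NumPy and SciPy are available and that images are
grayscale arrays with values in [0, 255]:

```python
# Illustrative sketch: generating corrupted images by adjusting contrast,
# brightness and Gaussian blur, as in I_cb = c*I + b and I_G = I_org * Gblur.
import numpy as np
from scipy.ndimage import gaussian_filter

def corrupt_image(image, contrast=1.0, brightness=0.0, blur_sigma=0.0):
    """Return a corrupted copy of `image` (float array in [0, 255])."""
    corrupted = contrast * image + brightness                     # I_cb = c * Image(x, y) + b
    if blur_sigma > 0:
        corrupted = gaussian_filter(corrupted, sigma=blur_sigma)  # I_G = I_org * Gblur
    return np.clip(corrupted, 0, 255)

# Training set: vary the three noise parameters around the original photo-image.
# original = ...  # load the registered photo-image as a 2-D array
# training = [corrupt_image(original, c, b, s)
#             for c in (0.8, 1.0, 1.2) for b in (-20, 0, 20) for s in (0, 1, 2)]
```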
The noise model is defined as the combination of the corrupted images I_c and the noise parameters p:
Noise Model = {I_c^i}, i = 1, ..., M
where I_c^i is the i-th corrupted image, M is the number of corrupted images and p is the noise parameter
value. The noise parameters are estimated by applying linear coefficients, calculated using the Least
Squares Minimization method (discussed in Section 2.4) from the trained data set and the Wiener-filtered
image, to the linear composition of the noise parameters. The estimated noise parameters are then applied
to the original image to obtain the synthesized image.
2.3 Principal Component Analysis Algorithms
Principal component analysis is a data-reduction method that finds an alternative set of parameters for a
set of raw data (or features) such that most of the variability in the data is compressed into the first
few parameters. In face recognition it is used to build 2D eigenfaces. During PCA training a set of
eigenfaces is created; new images are then projected onto the eigenfaces and checked for closeness to the
face space.
Step 1: Prepare the data
The faces constituting the training set should be prepared for processing.
Step 2: Subtract the mean
Ψ = (1/M) Σ_{i=1}^{M} Γ_i
where Γ_i denotes the original face images and Ψ the average face; each training face then differs from
the average by Φ_i = Γ_i - Ψ.
Step 3: Calculate the covariance matrix
C = (1/M) Σ_{n=1}^{M} Φ_n Φ_n^T
where C is the covariance matrix.
Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix
The eigenvectors (eigenfaces) u_i and the corresponding eigenvalues λ_i should be calculated.
C = (1/M) Σ_{n=1}^{M} Φ_n Φ_n^T = A A^T, with A = [Φ_1 Φ_2 ... Φ_M]
L = A^T A, where L_{mn} = Φ_m^T Φ_n
u_l = Σ_{k=1}^{M} v_{lk} Φ_k,  l = 1, ..., M
where L is an M x M matrix, v are the M eigenvectors of L and u are the eigenfaces.
Step 5: Select the principal components
Rank the M eigenvectors (eigenfaces) u_i by their eigenvalues in descending order and retain only the M'
eigenfaces with the highest eigenvalues. Eigenfaces with high eigenvalues capture the most characteristic
features of the faces, whereas eigenfaces with low eigenvalues are neglected because they explain only a
very small part of those features. Once the M' eigenfaces u_i are determined, the training phase of the
algorithm is finished.
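The training phase above can be sketched in a few lines of NumPy; this is an assumed implementation of
the standard eigenface procedure, not the authors' code:

```python
# Minimal eigenface-style PCA training sketch (NumPy only).
import numpy as np

def train_eigenfaces(faces, num_components):
    """faces: (M, N) array, one flattened face image per row."""
    mean_face = faces.mean(axis=0)                      # Psi = (1/M) * sum(Gamma_i)
    A = (faces - mean_face).T                           # columns are Phi_i = Gamma_i - Psi
    L = A.T @ A                                         # M x M surrogate of the covariance
    eigvals, V = np.linalg.eigh(L)                      # eigenvectors v of L
    order = np.argsort(eigvals)[::-1][:num_components]  # keep the M' largest eigenvalues
    U = A @ V[:, order]                                 # eigenfaces u_l = sum_k v_lk * Phi_k
    U /= np.linalg.norm(U, axis=0)                      # normalise each eigenface
    return mean_face, U

# Projection of a new image onto face space:
# weights = U.T @ (test_image.ravel() - mean_face)
```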
2.4 Least Square Minimization Method
The Least Squares Minimization method is a mathematical optimization technique that, given a series of
measured data, attempts to find a function that closely approximates the data (a "best fit"). It is a
statistical approach to estimating the expected value or function with the highest probability from
observations containing random errors, and it is commonly applied in two cases: curve fitting and
coordinate transformation. The unknown parameters are determined by minimizing the sum of squared
residuals; in the case of a line y = ax + b, the unknowns a and b are obtained by solving the
overdetermined system x_i a + b = y_i (in matrix form, AX = B) built from the observations (x_i, y_i).
In this paper the Least Squares Minimization method is used to estimate the linear coefficients of the
corrupted photo-image. The linear coefficients are calculated by decomposing the noise models generated
using the principal component analysis technique.
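A minimal sketch of the least-squares step, assuming NumPy; the data values here are hypothetical and
only illustrate the y = ax + b case described above:

```python
# Sketch of estimating linear coefficients by least squares.
import numpy as np

# Fit y = a*x + b from noisy observations (x_i, y_i) (made-up values).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
A = np.column_stack([x, np.ones_like(x)])      # design matrix for AX = B
a, b = np.linalg.lstsq(A, y, rcond=None)[0]    # minimizes the sum of squared residuals

# In the paper's setting, the same machinery would estimate the linear coefficients
# of the Wiener-filtered test image with respect to the PCA basis of the noise model.
```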
2.5 Normalized Correlation Method
The Normalized Correlation method is a frequently used approach for matching two images; here it is used
to authenticate the test image against the synthesized image. The method computes the correlation
coefficient, which measures the similarity between the synthesized image and the test image: the higher
the coefficient, the higher the similarity. The coefficient lies between -1 and 1 and is independent of
scale changes in the amplitude of the images.
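A small sketch of the normalized correlation computation, assuming NumPy arrays of equal size; the
acceptance threshold shown in the comment is hypothetical:

```python
# Minimal normalized-correlation comparison between two images.
import numpy as np

def normalized_correlation(image_a, image_b):
    """Correlation coefficient in [-1, 1] between two equally sized images."""
    a = image_a.astype(float).ravel() - image_a.mean()
    b = image_b.astype(float).ravel() - image_b.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Authentication decision against a chosen acceptance threshold (hypothetical value):
# authentic = normalized_correlation(synthesized_image, test_image) > 0.9
```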
3 Proposed System Architecture

Fig. 1: System Architecture
The Corrupted Image module generates a list of corrupted images by adjusting the three noise parameters
(contrast, brightness and Gaussian blur) of the original image stored in the
database. The PCA algorithm is used for noise parameter estimation and to train the list of corrupted
images; the Gaussian noise in the scanned corrupted photo-image is removed using the Wiener filter. The
Noise Parameter Estimation module computes linear coefficients, from which the noise parameters are
estimated. The Synthesized Image module synthesizes a corrupted photo-image by applying the noise
parameters estimated in the previous module to the original photo-image. The Authentication module
performs photo-image authentication by comparing the test image with the synthesized photo-image.
4 Conclusions and Future Work
A new method of authenticating corrupted photo-images based on noise parameter estimation has been
implemented. In contrast to previous techniques, this method handles corrupted photo-images using only
one image per person for training and only a few relevant principal components. It provides an accurate
estimation of the parameters and improves the performance of photo-image authentication. The experimental
results show that the noise parameter estimation of the proposed method is quite accurate and that the
method can be very useful for authentication. Further research is needed to estimate partial noise
parameters in a local region and to generate various corrupted images for more accurate authentication.
It is expected that the proposed method can be used in practical applications requiring photo-image
authentication, such as biometric passport systems and smart card identification systems.
References
[1] http://www.imageprocessingplace.com/image databases
[2] http://mathworld.wolfram.com/CorrelationCoefficient.html
[3] http://www.wolfram.com.
[4] [L.I. Smith, 2006] Lindsay I. Smith, A Tutorial on Principal Component Analysis,
http://www.csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
[5] [A. Pentland and M. Turk, 1991] A. Pentland, M. Turk, Eigenfaces for recognition, Journal of Cognitive
Neuroscience, Vol. 2 (1), pp. 71-86, 1991.
[6] [S.W. Lee et al., 2006] S.W. Lee, H.C. Jung, B.W. Hwang, Authenticating corrupted photo images based
on noise parameter estimation, Pattern Recognition, Vol. 39 (5), pp. 910-920, May 2006.
[7] Rafael C. Gonzalez, Richard E. Woods, Steven L. Eddins, Digital Image Processing Using MATLAB, 2nd
Edition, Pearson, 2004.
[8] [B.W. Hwang and S.W. Lee, 2003] B.-W. Hwang, S.-W. Lee, Reconstruction of partially damaged face
images based on a morphable face model, IEEE Transactions on Pattern Analysis and Machine Intelligence,
Vol. 25 (3), pp. 365-372, 2003.
[9] [K.K. Paliwal and C. Sanderson, 2003] K.K. Paliwal, C. Sanderson, Fast features for face
authentication under illumination direction changes, Pattern Recognition Letters, Vol. 24 (14), 2003.
[10] C. Sanderson, S. Bengio, Robust features for frontal face authentication in difficult image
conditions, in: Proceedings of the International Conference on Audio- and Video-based Biometric Person
Authentication, Guildford, UK, pp. 495-504, 2003.
[11] [J. Bigun et al., 2002] J. Bigun, W. Gerstner, F. Smeraldi, Support vector features and the role of
dimensionality in face authentication, Lecture Notes in Computer Science, Pattern Recognition with
Support Vector Machines, 2002.
[12] [A.M. Martinez, 2001] A.M. Martinez, Recognizing imprecisely localized, partially occluded, and
expression variant faces from a single sample per class, IEEE Transactions on Pattern Analysis and
Machine Intelligence, Vol. 24 (6), pp. 748-763, 2001.
[13] [A.C. Kak and A.M. Martinez, 2001] A.C. Kak, A.M. Martinez, PCA versus LDA, IEEE Transactions on
Pattern Analysis and Machine Intelligence, Vol. 23 (2), pp. 229-233, 2001.
[14] [P.N. Belhumeur et al., 1997] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs.
Fisherfaces: recognition using class specific linear projection, IEEE Transactions on Pattern Analysis
and Machine Intelligence, Vol. 19 (7), pp. 711-720, 1997.
[15] [K.R. Castleman, 1996] K.R. Castleman, Digital Image Processing, 2nd Edition, Prentice-Hall,
Englewood Cliffs, New Jersey, 1996.
[16] D. Beymer, T. Poggio, Face recognition from one example view, in: Proceedings of the International
Conference on Computer Vision, Massachusetts, USA, pp. 500-507, 1995.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
An Efficient Palmprint Authentication System

K. Hemantha Kumar
CSE Dept, Vignans Engineering College, Vadlamudi-522213

Abstract

A reliable and robust personal verification approach using palmprint features is presented in this paper.
The characteristics of the proposed approach are that no prior knowledge about the objects is necessary
and the parameters can be set automatically. In our work we use the PolyU online palmprint database
provided by The Hong Kong Polytechnic University. In the proposed approach the user can place the palm in
any direction; finger-webs are automatically selected as the datum points to define the region of
interest (ROI) in the palmprint images. A hierarchical decomposition mechanism, which includes
directional decompositions, is applied to extract the principal palmprint features inside the ROI. A
total of 7720 palmprint images collected from 386 persons were used to verify the validity of the
proposed approach. For palmprint verification we use principal component analysis, and the results are
satisfactory with acceptable accuracy (FRR: 0.85% and FAR: 0.75%). Experimental results demonstrate that
our proposed approach is feasible and effective in palmprint verification.
Keywords: Palmprint verification, Finger-web, Template generation, Template matching, Principal component
analysis.
1 Introduction
Due to the explosive growth and popularity of the Internet in recent years, an increasing number of
security access control systems based on personal verification is required. Traditional personal
verification methods rely heavily on passwords, personal identification numbers (PINs), magnetic swipe
cards, keys, smart cards, etc., and offer only limited security. Many biometric verification techniques
dealing with various physiological features, including facial images, hand geometry, palmprint,
fingerprint and retina pattern [1], have been proposed to improve the security of personal verification.
Desirable properties of a biometric verification technique include uniqueness, repeatability, immunity to
forgery, operation with or without controlled lighting, high throughput rate, low false rejection rate
(FRR) and false acceptance rate (FAR), and ease of use. No biometric verification technique yet satisfies
all of these needs. In this paper we present a novel palmprint verification method for personal
identification. In general, palmprints consist of significant textures and many minutiae similar to the
creases, ridges and branches of fingerprints. In addition, palmprint images contain many different
features, such as the geometry, the principal lines, the delta points and wrinkles [9]. Both palmprints
and fingerprints offer stable, unique features for personal identification and have been used for
criminal identification by law enforcement agencies for more than 100 years [9]. However, it is difficult
to extract small unique features
(known as minutiae) from the fingers of elderly people as well as manual laborers [5,6]. Many
verification technologies using biometric features of palms have been developed recently [7-11].
In this paper we propose an approach for personal authentication using palmprint verification. The
overall system is compact and simple, and the user can place the palm in any direction. In the following,
both the principal lines and the wrinkles of palmprints are referred to as principal palmprints. Many
users are unwilling to provide palm images because the images could be misused; to overcome this problem
we generate and store templates for verification, and the original palm images cannot be reconstructed
from these templates. The rest of this paper is organized as follows. Section 2 presents the segmentation
procedure, the determination of the finger-web locations and the location of the region of interest
(ROI). Template construction is presented in Section 3, user verification in Section 4, and concluding
remarks in Section 5.
2 Region of Interest Identification
By carefully analyzing the histograms of palmprint images, we find that they are typically bimodal.
Hence, we adopt a mode method to determine a suitable threshold for binarizing the palmprint images.
Using this threshold, the border of the palm is detected and only the border pixels are set to high
intensity; the resulting border image is shown in Fig. 1(b).

Fig. 1: (a) Palm image; (b) Border of the palm
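The thresholding and border-extraction step might be sketched as follows; this is an assumed
implementation using SciPy, since the paper does not give the exact mode-method details:

```python
# Rough sketch of bimodal thresholding and border extraction (assumed details).
import numpy as np
from scipy import ndimage

def palm_border(gray):
    """gray: 2-D array. Returns a binary image with only the palm border set."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    # Mode method: place the threshold in the valley between the two histogram peaks.
    # Here we simply take the minimum of a smoothed histogram as that valley.
    smooth = ndimage.uniform_filter1d(hist.astype(float), size=15)
    threshold = np.argmin(smooth[50:200]) + 50       # search away from the extremes
    binary = gray > threshold                        # palm as foreground (assumed brighter)
    eroded = ndimage.binary_erosion(binary)
    return binary & ~eroded                          # border = region minus its erosion
```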
After identifying the border, the image must be rotated to the proper orientation based on the direction
of the fingers, and the Region of Interest is then identified. Figure 2 shows palm images of the same
person placed in different directions. To increase verification accuracy and reliability, we compare the
principal palmprint features extracted from the same region in different palmprint images. The region to
be extracted is known as the ROI and its size is 180x180 pixels. It is therefore important to fix the ROI
at the same position in different palmprint images to ensure the stability of the extracted principal
palmprint features. The ROIs of the above images are given in Figure 3.



Fig. 2: Palm images of the same person in different directions
Fig. 3: ROIs of the images in Figure 2, in the same order
3 Principal Palmprint Features Extraction and Template Generation
The line segment is a significant feature for representing principal palmprints; it carries various
attributes such as end-point locations, direction, width, length and intensity. To extract these features
we iteratively apply an edge function and morphological functions. After applying these functions, the
images are subsampled to generate the templates. The template of the ROI is shown in Figure 4.

Fig. 4: Template of the above ROI
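A rough sketch of the template-generation pipeline (edge detection, morphology, subsampling to 22x22) is
given below; the specific operators, threshold and structuring element are assumptions, as the paper does
not specify them:

```python
# Hedged sketch of template generation: edge detection + morphology + subsampling.
import numpy as np
from scipy import ndimage

def make_template(roi, template_size=22):
    """roi: 180x180 grayscale ROI -> binary 22x22 template."""
    grad = np.hypot(ndimage.sobel(roi.astype(float), axis=0),
                    ndimage.sobel(roi.astype(float), axis=1))
    edges = grad > 2 * grad.mean()                             # crude edge map
    lines = ndimage.binary_closing(edges, structure=np.ones((3, 3)))  # join line segments
    step = roi.shape[0] // template_size                       # 180 // 22 = 8
    sub = lines[: step * template_size, : step * template_size]
    blocks = sub.reshape(template_size, step, template_size, step)
    return blocks.any(axis=(1, 3)).astype(np.uint8)            # subsample to 22x22
```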
4 Verification
We use Principal Component Analysis to verify whether two palmprints are captured from the same person.
The total number of palmprint images used in our experiment was 7720, collected from 386 persons with 20
palmprint images captured per person. The size of each palmprint image was 284x384 pixels with 100 dpi
resolution and 256 gray levels. The first 10 acquired images of each person were used as the template
image set and the 10 images acquired afterwards were taken as the test set. A total of 3860 palmprint
images were used to construct 386 templates for the template library, each of size 22x22. The other 3860
palmprint images were used as test images to verify the validity of the proposed approach. We adopted the
statistical pair of False Rejection Rate (FRR) and False Acceptance Rate (FAR) to evaluate the
performance. The results are satisfactory with acceptable accuracy (FRR: 0.85% and FAR: 0.75%),
demonstrating that our proposed approach is feasible and effective in palmprint verification.
5 Conclusion
In this paper we have presented an efficient palmprint authentication system. Our approach has two main
advantages. First, the user may place the palm in any direction; our algorithm automatically rotates the
palm image and obtains the ROI. Second, templates rather than palm images are stored for verification, so
users need not worry about their palm images being used for other purposes. The algorithm for automatic
rotation and finger-web detection has been tested on 7720 palmprint images captured from 386 different
persons. The results show that our technique conforms to the results of manual estimation. We also
demonstrate that using the finger-webs as datum points to define ROIs is reliable and reproducible: under
normal conditions, the ROIs cover almost the same region in different palmprint images. Within the ROI,
principal palmprint features are extracted by template generation, which consists of edge detectors and
sequential morphological operators. New palmprint features are matched against the template library using
PCA to verify the identity of the person. Experimental results demonstrate that the proposed approach
obtains acceptable verification accuracy, so it can be applied in access control systems. In high-security
applications, a very low FAR (even zero) together with an acceptable FRR is mandatory, and reducing both
FAR and FRR with the same biometric features is conflicting. To reduce FAR without increasing FRR, our
techniques can be combined with those using palm geometric shapes, finger creases and other biometric
features in future research.
References
[1] A.K. Jain, R. Bolle, S. Pankanti, Biometrics: Personal Identification in Networked Society, Kluwer
Academic Publishers, Massachusetts, 1999.
[2] Y. Yoshitomi, T. Miyaura, S. Tomita, S. Kimura, Face identification using thermal image processing,
Proceedings of the 6th IEEE International Workshop on Robot and Human Communication, RO-MAN 97, Sendai.
[3] J.M. Cross, C.L. Smith, Thermographic imaging of the subcutaneous vascular network of the back of the
hand for biometric identification, Institute of Electrical and Electronics Engineers 29th Annual 1995
International Carnahan Conference, 1995, pp. 20-35.
[4] Chih-Lung Lin, Thomas C. Chuang, Kuo-Chin Fan, Palmprint verification using hierarchical
decomposition, Pattern Recognition 38 (2005) 2639-2652.
[5] A. Jain, L. Hong, R. Bolle, On-line fingerprint verification, IEEE Trans. Pattern Anal. Mach. Intell.
19 (1997) 302-313.
[6] L. Coetzee, E.C. Botha, Fingerprint recognition in low quality images, Pattern Recogn. 26 (1993)
1441-1460.
[7] C.C. Han, P.C. Chang, C.C. Hsu, Personal identification using hand geometry and palm-print, Fourth
Asian Conference on Computer Vision (ACCV), 2000, pp. 747-752.
[8] H.J. Lin, H.H. Guo, F.W. Yang, C.L. Chen, Handprint identification using fuzzy inference, The 13th
IPPR Conference on Computer Vision, Graphics and Image Processing, 2000, pp. 164-168.
[9] D. Zhang, W. Shu, Two novel characteristics in palmprint verification: datum point invariance and
line feature matching, Pattern Recogn. 32 (1999) 691-702.
[10] J. Chen, C. Zhang, G. Rong, Palmprint recognition using crease, International Conference on Image
Processing, vol. 3, 2001, pp. 234-237.
[11] W.K. Kong, D. Zhang, Palmprint texture analysis based on low-resolution images for personal
authentication, 16th International Conference on Pattern Recognition, vol. 3, 2002, pp. 807-810.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Speaker Adaptation Techniques

D. Shakina Deiv, ABV-IIITM, Gwalior
Pradip K. Das, Department of CSE, IIT, Guwahati
M. Bhattacharya, Department of ICT, ABV-IIITM, Gwalior

Abstract

Speaker adaptation techniques are used to reduce speaker variability in Automatic Speech Recognition. A
speaker dependent acoustic model is obtained by adapting the speaker independent acoustic model to a
specific speaker, using only a small amount of speaker specific data. The maximum likelihood
transformation based approach is certainly one of the most effective speaker adaptation methods known so
far. Some researchers have proposed constraints on the transformation matrices for model adaptation,
based on knowledge gained from the Vocal Tract Length Normalization (VTLN) domain. It has been proved
that VTLN can be expressed as a linear transformation of cepstral coefficients. The cepstral domain
linear transformations were used to adapt the Gaussian distributions of the HMM models in a simple and
straightforward manner, as an alternative to normalizing the acoustic feature vectors. VTLN constrained
model adaptation merits exploration as its performance does not vary significantly with the amount of
adaptation data.
1 Introduction
In Automatic Speech Recognition (ASR) systems, speaker variation is one of the major
causes for performance degradation. The error rate of a well trained speaker dependent
speech recognition system is three times less than that of a speaker independent speech
recognition system [Huang and Lee, 1991]. Speaker normalization and speaker adaptation are
the two commonly used techniques to alleviate the effects of speaker variation.
In speaker normalization, a transformation is applied to the acoustic features of a given speaker's
speech so as to better match them to a speaker independent model. Cepstral mean removal, Vocal Tract
Length Normalization (VTLN), feature space normalization based on mixture density Hidden Markov Models
(HMMs) and signal bias removal estimated by Maximum Likelihood Estimation (MLE) are some techniques
employed for speaker normalization. Most state-of-the-art speech recognition systems use the Hidden
Markov Model (HMM) as a convenient statistical representation of speech. Another way to account for the
effects of speaker variability is speaker adaptation, achieved by modifying the acoustic model. Model
transformations attempt to map the output distributions of the HMMs to a new set of distributions so as
to make them a better statistical representation of a new speaker. Mapping the output distributions of
HMMs is very flexible and can compensate not only for speaker variability but also for environmental
conditions.
2 Speaker Adaptation Techniques
Speaker adaptation is typically undertaken to improve the recognition accuracy of a Large
Vocabulary Conversational Speech Recognition (LVCSR) System. In this approach, a
speaker dependent acoustic model is obtained by adapting the speaker independent acoustic
model to a specific speaker, using only a small amount of speaker specific data, thus
enhancing the recognition accuracy close to that of a speaker dependent model.
The following are two well-established methods of speaker adaptation:
2.1 Bayesian or MAP Approach
Maximum a Posteriori (MAP) adaptation is a general probability distribution estimation approach in which
prior knowledge is used in the estimation process. The parameters of the speaker independent acoustic
models form the prior knowledge in this case. This approach requires a large amount of adaptation data
and is slow, though optimal.
2.2 Maximum Likelihood Linear Regression
Maximum Likelihood Linear Regression (MLLR) is a widely used transformation based speaker adaptation
method. The parameters of the general acoustic models are adapted to a speaker's voice using a linear
regression model estimated by maximum likelihood from the adaptation data. However, this method too
requires a fairly large amount of adaptation data to be effective.
3 Extension of Standard Adaptation Techniques
The above techniques increase the recognition rate but are computationally intensive. Efforts are
therefore being made to reduce the number of parameters to be computed by exploiting special structure or
constraints on the transformation matrices, so that adaptation can be performed with less data.
3.1 Extended MAP
The extended MAP (EMAP) adaptation makes use of information about correlations among
parameters [Lasry and Stern, 1984]. Though the adaptation equation makes appropriate use of
correlations among adaptation parameters, solution of the equation depends on the inversion
of a large matrix, making it computationally intensive.
3.2 Adaptation by Correlation
An adaptation algorithm that used the correlation between speech units, named Adaptation by
Correlation (ABC) was introduced [Chen and DeSouza, 1997]. The estimates are derived
using least squares theory. It is reported that ABC is more stable than MLLR when the
amount of adaptation data is very small.
3.3 Regression Based Model Prediction
Linear regression was applied to estimate parametric relationships among the model
parameters and update those parameters for which there was insufficient adaptation data
[Ahadi and Woodland, 1997].
3.4 Structured MAP
Structured MAP (SMAP) adaptation was proposed [Shinoda and Lee, 1998], in which the
transformation parameters were estimated in a hierarchical structure. The MAP approach
helps to achieve a better interpolation of the parameters at each level. Parameters at a given
level of the hierarchical structure are used as the priors for the next lower child level. The
resulting transformation parameters are a combination of the transformation parameters at all
levels. The weights for the combinations are changed according to the amount of adaptation
data present. The main benefit of the SMAP adaptation is that automatic control is obtained
over the effective cluster size in a fashion that depends on the amount of adaptation data.
3.5 Constraints on Transformation Based Adaptation Techniques
The transformation matrix was constrained to a block diagonal structure, with feature
components assumed to have correlation only within a block [Gales et al., 1996]. This
reduces the number of parameters to be estimated. However, the block diagonal matrices did
not provide better recognition accuracy.
Principal component analysis (PCA) reduces the dimensionality of the data. [Nouza, 1996]
used PCA for feature selection in a speech recognition system. [Hu,1999] applied PCA to
describe the correlation between phoneme classes for speaker normalization.
The speaker cluster based adaptation approach explicitly uses the characteristics of an HMM
set for a particular speaker. [Kuhn et al., 1998] introduced eigenvoices to represent the
prior knowledge of speaker variation. [Gales et al., 1996] proposed Cluster adaptation
training (CAT). The major difference between CAT and Eigenvoice approaches is how the
cluster models are estimated.
4 VTLN constrained Model Adaptation
Some researchers have proposed constraints on the transformation matrices for model
adaptation, based on the knowledge gained from VTLN domain.
Vocal tract length normalization (VTLN) is one of the most popular methods for reducing the inter-speaker
variability that arises from physiological differences in vocal tracts. It is especially useful in gender
independent systems, since on average the vocal tract is 2-3 cm shorter for females than for males,
causing female formant frequencies to be about 15% higher. VTLN is usually performed by warping the
frequency axis of the spectra of speakers or clusters by an appropriate warp factor prior to the
extraction of cepstral features. The most common method for finding warp factors invokes the maximum
likelihood (ML) criterion to choose the warp factor that gives a speaker's warped observation vectors the
highest probability.
However, as the VTLN transformation is typically non-linear, exact calculation of the Jacobian is highly
complex and is normally approximated. Moreover, cepstral features are the predominant features used in
ASR. This has led many research groups to explore the possibility of substituting the frequency-warping
operation with a linear transformation in the cepstral domain.
[McDonough et al., 1998] proposed a class of transforms which achieve a remapping of the frequency axis
similar to conventional VTLN. The bilinear transform (BLT) is a conformal map expressed as
Q(z) = (z - α) / (1 - αz)
where α is real and |α| < 1. The use of the BLT and a generalization of the BLT known as All Pass
Transforms were explored for the purpose of speaker normalization. The BLT was found to
approximate to a reasonable degree the frequency domain transformations often used in
VTLN. The cepstral domain linearity of APT makes speaker normalization easy to
implement and produced substantial improvements in recognition performance of LVCSR.
The work was extended to develop a speaker adaptation scheme based on APT [McDonough
and Byrne, 1999]. Its performance was compared to BLT and the MLLR scheme. Using test
and training material obtained from Switchboard corpus, they have shown that the
performance of the APT based speaker adaptation was comparable or better than that of
MLLR when 2.5 min. of unsupervised data was used for parameter estimation. It is shown
that the APT scheme outperformed MLLR when the enrollment data was reduced to 30 sec.
[Claes et al., 1998] devised a method to transform the HMM based acoustic models trained
for a particular group of speakers, (say adult male speakers) to be used on another group of
speakers (children). The transformations are generated for spectral characteristics of the
features from a specific speaker. The warping factors are estimated based on the average third
formant. As MFCC involves additional non-linear mapping, linear approximation for the
exact mapping was computed by locally linearizing it. With reasonably small warping data,
the linear approximation was accurate.
Uebel and Woodland studied the effect of non-linearity on normalization. The linear approximation to the
transformation matrix K between the warped and the unwarped cepstra was estimated using the Moore-Penrose
pseudo-inverse as
K = (C^T C)^{-1} C^T Ĉ
where Ĉ and C are the column-wise arranged warped and unwarped cepstral feature matrices respectively.
They inferred that the linear-approximation based and the exact transformation based VTLN approaches
provide similar performance.
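A toy illustration of this least-squares estimate using NumPy (the cepstral matrices here are random
stand-ins, with feature vectors arranged as rows for convenience):

```python
# Toy sketch: estimating the linear cepstral transformation K by least squares.
import numpy as np

rng = np.random.default_rng(0)
C = rng.standard_normal((200, 13))            # unwarped cepstral vectors (one per row)
C_hat = C @ rng.standard_normal((13, 13))     # corresponding warped cepstra (toy relation)

K = np.linalg.pinv(C) @ C_hat                 # equivalent to K = (C^T C)^{-1} C^T C_hat
```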
[Pitz et al., 2001] concluded that vocal tract normalization can always be expressed as a
linear transformation of the cepstral vector for arbitrary invertible warping functions. They
derived the analytic solution for the transformation matrix for the case of piece-wise linear
warping function.
It was shown [Pitz and Ney, 2003] that vocal tract normalization can also be expressed as
linear transformation of Mel Frequency Cepstral coefficients (MFCC). An improved signal
analysis in which Mel frequency warping and VTN are integrated into DCT is
mathematically simple and gives comparable recognition performance. Using special
properties of typical warping functions it is shown that the transformation matrix can be
approximated by a tridiagonal matrix for MFCC. The computation of transformation matrix
for VTN helps to properly normalize the probability distribution with Jacobian determinant of
the transformation. Importantly, they infer that VTN amounts to a special case of MLLR
explaining the experimental results that improvement in Speech recognition obtained by VTN
and MLLR are not additive.
In the above cases [McDonough et al., 1998; Pitz and Ney, 2005], the derivation is done for continuous
speech spectra and requires the transformation to be analytically calculated for each warping function.
Direct use of the transformation for discrete samples will result in aliasing error.
The above-mentioned relationships are investigated in the discrete frequency space [Cui and
Alwan, 2005]. It is shown that, for MFCCs computed with Mel-scaled triangular filter banks,
a linear relationship can be obtained if certain approximations are made. Utilizing that
relationship as a special case of MLLR, an adaptation approach based on formant-like peak
alignment is proposed where the transformation of the means is performed deterministically
based on the linearization of VTLN. Biases and adaptation of the variances are estimated
statistically by the EM algorithm.
The formant-like peak alignment algorithm is used to adapt adult acoustic models to children's speech.
Performance improvements are reported compared to traditional MLLR and VTLN.
In the APT and piece-wise linear warping based VTLN approaches discussed above, the warping is expressed
as a linear transformation only in linear cepstra. In the case of Mel cepstra, the warping function
cannot be expressed as a linear transformation unless linear approximations are used [Claes et al., 1998;
Pitz and Ney, 2005].
In the shift based speaker normalization approach [Sinha and Umesh, 2007], warping is
effected through shifting of the warped spectra, and therefore the warping function can easily
be expressed as exact linear transformation in feature (Mel cepstral) domain.
The derivation of the transformation matrices [Sinha, 2007], relating the cepstral features in
shift based speaker normalization method is straightforward and much simpler than those
suggested in the above said methods.
The cepstral domain linear transformations were used to adapt the Gaussian distributions of the HMM
models in a simple and straightforward manner, as an alternative to normalizing the acoustic feature
vectors. This approach differs from other methods in that the transformation matrix is not estimated from
the adaptation data but is instead selected from a set of matrices that are known a priori and fully
determined. This leads to a highly constrained adaptation of the model.
If only the means of the model are adapted, this can be considered a highly constrained version of
standard MLLR. In principle this approach allows different VTLN transformations to be applied to
different groups of models, as in the regression class approach; the constraint of choosing a single
warping function for all speech models is thus relaxed, unlike in VTLN. Better normalization performance
of the model adaptation based speaker compensation is reported for children's speech compared to the
conventional feature-transformation approach. The experiments conducted by [Sinha, 2007] show that the
performance of the VTLN constrained model adaptation approach does not vary significantly with the amount
of adaptation data. Hence the strengths of this approach compared to conventional MLLR merit further
exploration.
5 Conclusion
The review of various speaker adaptation techniques has led to the following observations.
The cepstral domain linearity of the APT makes speaker normalization easy to implement and has produced
substantial improvements in the recognition performance of LVCSR.
Vocal tract normalization can be expressed as a linear transformation of the cepstral
vector for arbitrary invertible warping functions.
Vocal tract normalization can also be expressed as linear transformation of Mel
Frequency Cepstral coefficients (MFCC) with the help of some approximations. The
transformations are to be analytically calculated for each warping function.
The shift in the speech scale required for speaker normalization can be equivalently
performed through a suitable linear transformation in the cepstral domain. The
derivation of the Transformation matrix is simple.
VTLN amounts to a special case of MLLR explaining the experimental results that
improvement in Speech recognition obtained by VTLN and MLLR are not additive.
The relative strengths of VTLN constrained model adaptation compared to MLLR merit further exploration,
as its performance does not vary significantly with the amount of adaptation data.
References
[1] [Huang and Lee, 1991] Huang, X. and Lee, K. Speaker Independent, Speaker Dependent and Speaker
Adaptive Speech Recognition. Proc. of IEEE International Conference on Acoustics, Speech and Signal
Processing, 1991, pp. 877-880.
[2] [McDonough and Byrne 1999] McDonough, J. and Byrne, W. Speaker Adaptation with All-Pass
Transforms. Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing. 1999.
[3] [Pitz and Ney, 2003] Pitz, M. and Ney, H. Vocal Tract Normalization as Linear Transformation of MFCC.
Proc. of EUROSPEECH. 2003.
[4] [Cui and Alwan, 2005] Cui, X. and Alwan, A. MLLR-like Speaker Adaptation based on Linearization of
VTLN with MFCC features. Proc. of EUROSPEECH.2005.
[5] [Sinha and Umesh, 2007] Sinha, R. and Umesh, S. A Shift based Approach to Speaker Normalization using
Non-linear Frequency Scaling Model. Speech Communication. 2007, doi:10.1016/j.specom.2007.08.002.
[6] [Sinha, 2004] Sinha, R. Front-End Signal Processing for Speaker Normalization for Speech Recognition.
Ph.D. thesis, I.I.T. Kanpur, 2004.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Text Clustering Based on WordNet and LSI

Nadeem Akhtar Nesar Ahmad
Aligarh Muslim University, Aligarh- 202002 Aligarh Muslim University, Aligarh- 202002
nadeemalakhtar@gmail.com nesar.ahmad@gmail.com

Abstract

Text clustering plays an important role in the retrieval and navigation of documents in several web
applications. Text documents are clustered on the basis of the statistical and semantic information they
share. This paper presents experiments on text clustering based on semantic word similarity. Semantic
similarities among documents are found using both WordNet and Latent Semantic Analysis (LSI). WordNet is
a lexical database which provides semantic relationships such as synonymy and hypernymy among words; two
words are taken as semantically similar if they share at least one synset in WordNet. LSI is a technique
which brings out the latent statistical semantics in a collection of documents, using higher order term
co-occurrence to find the semantic similarity among words and documents. The proposed clustering
technique uses WordNet and LSI to find the semantic relationships among the words in the document
collection. Based on these relationships, sets of strongly related words are selected as keyword sets,
which are then used to form the document clusters.
1 Introduction
Clustering [Graepel, 1998] is used in a wide range of topics and areas; clustering techniques are applied
in pattern recognition and pattern matching, artificial intelligence, web technology, and learning.
Clustering improves the accuracy of search and retrieval of web documents in search engine results.
Several algorithms and methods such as suffix tree, fuzzy C-means and hierarchical clustering [Zamir et
al, 1998], [Bezdek, 1981], [Fasulo, 1999] have been proposed for text clustering. In most of them a
document is represented using the Vector Space Model (VSM) as a vector in n-dimensional space, with
different words weighted according to criteria such as term frequency-inverse document frequency
(tf-idf). These methods treat the document as a bag of words, where single words are used as independent
features for representing the documents, and they ignore the semantic relationships among the words.
Moreover, the documents lie in a very high-dimensional space. Semantic information can be incorporated by
employing ontologies like WordNet [Wang et al]. In this paper, to cluster a given collection of
documents, the semantic relationships among all the words present in the document collection are found
using WordNet and Latent Semantic Analysis (LSI).
Semantic information among words is found using the WordNet [Miller et al., 1990] dictionary. WordNet
contains words organized into synonym sets called synsets. The semantic similarity
between two words is found by considering their associated synsets: if at least one synset is common, the
two words are considered semantically the same.
Latent Semantic Analysis (LSI) [Berry, et al 1996] is a technique which finds latent semantic
relationships among words in the document collection by exploiting higher order word co-occurrence, so
two words may be related even if they do not occur in the same document. LSI also reduces the
dimensionality, resulting in a richer word relationship structure that reveals the latent semantics
hidden in the document collection.
To find semantic relationships between words, we adopt two different approaches: in one, the LSI
algorithm is run first to obtain word-word relationships and the WordNet dictionary is then used to find
semantically similar words; in the other, semantically similar words are found first and LSI is then used
to obtain the word-word relationships. Using the relationships among words, sets of strongly related
words are selected, and these are then used to form the document clusters. Finally, both approaches are
evaluated against the same document collection and the results are compared. The paper is organized in
five sections. After the introductory section, Section 2 describes how WordNet and LSI are used
separately to find semantic word similarity. Section 3 covers the combined use of LSI and WordNet.
Section 4 presents results and discussion, and Section 5 presents conclusions and future work.
2 Word Semantic Relationship
The proposed clustering method combines both statistical and semantic similarity to cluster the document
set. The sets of strongly related words in the document collection are identified using LSI and WordNet.
LSI derives the semantic relationships among words from their statistical co-occurrence and is thus based
on knowledge specific to the document collection, whereas WordNet provides general domain knowledge about
words. These word sets are used to identify the document clusters. There are n documents d_1, d_2, ...,
d_n and m distinct terms t_1, t_2, ..., t_m in the document collection.
2.1 Preprocessing
Stop words (both natural, like "the", "it", "where", and domain dependent) are removed from all the
documents. Words that are too frequent or too rare do not help the clustering process, so they are also
removed by deleting all words whose frequency of occurrence in the documents is outside a predefined
frequency range (f_min - f_max). Each document d_i is represented as a vector of term frequency-inverse
document frequency (tf-idf) weights of the m terms, d_i = (w_1i, w_2i, ..., w_mi), where w_ji is the
weight of term j in document i. We first find the relationships among all the words in the document
collection. The relationship between two words is based on the co-occurrence of those words in the
document collection, found using LSI, and on the semantic relationship derived from the WordNet
dictionary. This involves the construction of a term correlation matrix R over all the words included in
the document vectors, which captures the relationships among the different words in the documents.
2.2 WordNet
WordNet is a large lexical database of English, developed under the direction of George A.
Miller at Princeton University. In WordNet Nouns, verbs, adjectives and adverbs are grouped
into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are
interlinked by means of conceptual-semantic and lexical relations. Despite having several
types of lexical relations, it is heavily grounded on its taxonomic structure that employs the
IS A inheritance (Hyponymy/Hypernymy) relation.
We can view WordNet as a graph where the synsets are vertices and the relations are edges.
Each vertex contains a gloss that expresses its semantics and a list of words that can be used
to refer to it. Noun synsets are the most dominant type of synsets; there are 79689 noun
synsets, which correspond to almost 70% of the synsets in WordNet.
To find the semantic relationship between two words we use the noun and verb synsets. In the
word-similarity calculation using WordNet, similarity is based on common synonym sets (synsets): if two
words share at least one synset, they are considered similar [Chua et al, 2004]. For example, the noun
synsets for the words buff, lover and hater are shown in Table 1.
Table 1: WordNet Synset Similarity
Words   Synsets
Buff    Sense 1: fan, buff, devotee, lover
        Sense 2: buff
        Sense 3: buff
        Sense 4: yellowish brown, raw sienna, buff, caramel, caramel brown
        Sense 5: buff, buffer
Lover   Sense 1: lover
        Sense 2: fan, buff, devotee, lover
        Sense 3: lover
Hater   Sense 1: hater
From Table 1 we can see that there are two identical synsets: sense 1 of buff and sense 2 of lover
contain the same synonyms. The words "buff" and "lover" are therefore considered semantically the same,
whereas there is no synset match for the word pairs buff-hater and lover-hater, so these pairs are not
similar.
Word sense disambiguation (WSD) is not performed in our approach. Although WSD is advantageous for
identifying the correct clusters, we have omitted it to keep the clustering approach simple.
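The synset-overlap rule can be illustrated with NLTK's WordNet interface (an assumed implementation,
requiring the WordNet corpus to be downloaded; the authors' actual tooling is not specified):

```python
# Illustrative check of the WordNet synset-overlap rule.
# Assumes NLTK with the WordNet corpus installed (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def semantically_same(word1, word2):
    """Two words are taken as similar if they share at least one noun or verb synset."""
    synsets1 = set(wn.synsets(word1, pos=wn.NOUN)) | set(wn.synsets(word1, pos=wn.VERB))
    synsets2 = set(wn.synsets(word2, pos=wn.NOUN)) | set(wn.synsets(word2, pos=wn.VERB))
    return len(synsets1 & synsets2) > 0

print(semantically_same("buff", "lover"))   # True  (shared synset: fan/buff/devotee/lover)
print(semantically_same("lover", "hater"))  # False (no common synset)
```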
2.3 Latent Semantic Analysis (LSI)
Latent Semantic Analysis is a well-known learning algorithm which is mainly used in
searching, retrieving and filtering. LSI is based on a mathematical technique called Singular
Value Decomposition (SVD). LSI is used in the reduction of the dimensions of the
documents in the document collection. It removes those less important dimensions which
introduce noise in the clustering process.
In our approach we have used LSI to calculate a term-by-term matrix, which gives the hidden semantic
relationships among the words in the document collection. In LSI, the term-by-document matrix M is
decomposed into three matrices: a term-by-dimension matrix U, a dimension-by-dimension singular matrix Σ
and a document-by-dimension matrix V:
M = U Σ V^T
Σ is a singular matrix whose diagonal entries represent the Eigen values. In the dimensionality
reduction, the highest k Eigen values are retained and the rest are ignored:
M_k = U_k Σ_k V_k^T
The value of the constant k must be chosen carefully for each document set. The term-by-term matrix TT is
calculated as [Kontostathis et al, 2006]:
TT = (U_k Σ_k)(U_k Σ_k)^T
The value at position (i, j) of this matrix represents the relationship between word i and word j in the
document collection. The method exploits transitive relations among words: if word A co-occurs with word
B and word B co-occurs with word C, then words A and C will be related because there is a second order
co-occurrence between them. The method explores all higher order term co-occurrences of two terms, thus
providing relationship values that cover all connectivity paths between them. Words that are used in the
same context are given high relationship values even though they may not occur together in the same
document. LSI also reduces the dimensionality by neglecting the dimensions associated with lower Eigen
values; only the dimensions associated with high Eigen values are kept, and in this way LSI removes the
noisy dimensions.
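A compact sketch of the TT computation using NumPy's SVD; the small count matrix is a made-up example and
k is chosen arbitrarily:

```python
# Sketch of the LSI term-by-term matrix described above (assumed implementation).
import numpy as np

def term_term_matrix(M, k):
    """M: m x n term-by-document matrix; k: number of retained dimensions."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)  # M = U * diag(s) * Vt
    Uk, Sk = U[:, :k], np.diag(s[:k])                 # keep the k largest values
    UkSk = Uk @ Sk
    return UkSk @ UkSk.T                              # TT = (Uk Sk)(Uk Sk)^T

# Example: tiny 4-term x 3-document count matrix, reduced to k = 2 dimensions.
M = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)
TT = term_term_matrix(M, k=2)   # TT[i, j] is the latent relationship of terms i and j
```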
3 Coupling WordNet and LSI
WordNet and LSI both provide semantic relationships among words. WordNet information is purely semantic,
in that synonym sets are used to match the words, whereas LSI semantic information is based on
statistical data: it finds hidden semantic relationships among words by exploiting word co-occurrence in
the document collection.
We have adopted two different approaches to couple the WordNet and LSI word relationships. In the first
approach, WordNet is used before LSI. Every word of a document is compared with every word of the other
documents using the WordNet synset approach; if two words are found similar, each word is added to the
document vector of the other document. For example, suppose document D_1 contains word w_1 and document
D_2 contains word w_2. If w_1 and w_2 share some synset in WordNet, then w_2 is added to the document
vector of D_1 and w_1 is added to the document vector of D_2. In this way the semantic relationship
between the two words is converted into a statistical relationship, because the two words now co-occur in
two different documents. After that we use LSI: the document-by-term matrix is formed from the document
vectors, the term-by-document matrix M is obtained by transposing it, and the term-by-term matrix TT is
computed from M using the method described above. We call this approach WN-LSI.
In the second approach, we use LSI before WordNet. From the document vectors we form the term-by-document
matrix, which is used to obtain the matrix TT by applying LSI. After that, for every pair of words, the
semantic relationship from WordNet is found; if a match occurs, the corresponding entry in TT is updated.
We call this approach LSI-WN.
Next, sets of strongly related words are found using a depth-first-search graph traversal. The
term-by-term matrix TT is viewed as a graph with m nodes, where entry (i, j) of TT is the label of the
edge between nodes i and j. During the traversal only those edges are followed whose label is greater
than a predefined inter-word similarity threshold. The independent components of the graph identify the
different keyword sets, and by setting this threshold we can control the number of keyword sets c. Next,
each document is compared with each keyword set: if the document contains more than a given percentage of
the words in a keyword set (the document-cluster similarity threshold), the document is assigned to the
cluster associated with that keyword set. In this way each cluster contains the documents in which the
words of the associated keyword set are frequent. Documents generally cover several topics with different
strengths, so documents in distinct clusters may overlap. To avoid nearly identical clusters, similar
clusters are merged. For this purpose the similarity between clusters is calculated as:
S_{c1,c2} = |N_c1 ∩ N_c2| / max(N_c1, N_c2)
where S_{c1,c2} is the similarity between clusters c1 and c2, N_c1 is the number of documents in cluster
c1, N_c2 is the number of documents in cluster c2, and N_c1 ∩ N_c2 is the number of documents common to
clusters c1 and c2. If S_{c1,c2} is greater than the inter-cluster similarity threshold, clusters c1 and
c2 are merged.
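The keyword-set extraction and cluster-merging steps might look like the following sketch (an assumed
implementation; the threshold names mirror the experimental settings listed in Section 4):

```python
# Sketch of keyword-set extraction (DFS on the thresholded term graph) and cluster merging.
import numpy as np

def keyword_sets(TT, inter_word_sim=0.60):
    """Connected components of the thresholded term-term graph via depth-first search."""
    m = TT.shape[0]
    visited, sets = [False] * m, []
    for start in range(m):
        if visited[start]:
            continue
        stack, component = [start], []
        visited[start] = True
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in range(m):
                if not visited[nxt] and TT[node, nxt] > inter_word_sim:
                    visited[nxt] = True
                    stack.append(nxt)
        sets.append(set(component))
    return sets

def merge_similar_clusters(clusters, inter_cluster_sim=0.20):
    """Merge clusters whose document overlap exceeds the inter-cluster threshold."""
    merged, changed = [set(c) for c in clusters], True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                overlap = len(merged[i] & merged[j]) / max(len(merged[i]), len(merged[j]))
                if overlap > inter_cluster_sim:     # S_{c1,c2} > threshold
                    merged[i] |= merged[j]
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged
```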
4 Experiments and Results
For evaluation purposes, we performed experiments using the mini 20NewsGroup document collection, a
subset of the 20NewsGroup collection containing 2000 documents categorized into 20 categories. We also
downloaded some documents from the web and performed experiments on them. The parameter values used in
the experiments are:
Inter-word similarity 0.60
Inter-cluster similarity 0.20
Similarity between document and cluster 0.15
For the second experiment, documents are taken from three different categories, two of which are further
divided into subcategories (the sub-categories are listed within brackets in Table 2). For the first
experiment, we chose documents from six different categories of the mini 20NewsGroup data set.
Table 2: Document Sets
Document Set A1                                     Number
Computer                                            8
Food (fish, yoke, loaf, cheese, beverage, water)    25
Automobile (car, bus, roadster, auto)               17

Document Set A2                                     Number
rec.sport.baseball                                  10
sci.electronics                                     10
sci.med                                             20
talk.politics.guns                                  10
talk.religion.misc                                  10
Experiments on these two document sets were performed by running three different programs named LSI,
WN-LSI and LSI-WN. LSI uses only latent semantic analysis; WN-LSI and LSI-WN are as defined in Section 3.
For document set 1, LSI produces 12 clusters, WN-LSI produces 4 clusters and LSI-WN produces 24 clusters.
LSI produces 6 clusters that correspond correctly to subcategories; the remaining 6 clusters mix
documents from different subcategories, but all documents in each cluster belong to one of the three main
categories (computer, food and automobile). WordNet enhances the similarity relationships between word
pairs such as (picture, image), (data, information), (memory, store), (disk, platter) in the computer
category; (car, automobile), (transit, transportation), (travel, journey), (trip, travel) in the
automobile category; and (drink, beverage), (digest, stomach), (meat, substance), (nutrient, food) in the
food category. The first two clusters produced by WN-LSI belong to the food category and the next two to
the automobile category; 13 documents are placed in wrong clusters. For document set 2, LSI produces 9
clusters, WN-LSI produces 5 clusters and LSI-WN produces 30 clusters. The number of clusters produced by
LSI-WN is quite large for both document sets; a possible reason is that WordNet introduced noise into the
word relationships after the singular value decomposition. WN-LSI performed somewhat better: WordNet
enriched the document vectors with common synsets, which helped LSI obtain strong relationships among
relevant words for identifying relevant clusters.
5 Conclusions and Future Work
This paper describes a method of incorporating knowledge from the WordNet dictionary and Latent Semantic
Analysis into the text document clustering task. The method uses WordNet synonyms and LSI to build the
term relationship matrix. The overall results of the experiments performed on the document sets are quite
disappointing: WN-LSI gives somewhat better results than LSI, but LSI-WN fails to perform. The major
weakness of our approach appears to be the keyword-set selection procedure. Some of the keyword sets are
quite relevant and produce good clusters, but others contain words that span multiple categories, and
some documents are not assigned to any keyword set. The clustering results for a document set are highly
dependent on the values of the threshold parameters (inter-word similarity, document-cluster similarity
and inter-cluster similarity); changing these values severely affects the number of initial clusters, so
they must be chosen carefully for each document set. Results are also affected by the polysemy problem:
two words are held similar if at least one synset is the same, but the senses of the words may differ in
the documents. There is considerable scope for improvement in our approach. One area for future work is
the incorporation of other WordNet relations such as hypernymy, and the use of a word sense
disambiguation technique may certainly improve the clustering results.
References
[1] [Berry, et al 1996] Berry, M. et al. SVDPACKC (Version 1.0) User's Guide, University of Tennessee Tech.
Report CS-93-194, 1993 (Revised October 1996).
[2] [Bezdek, 1981] J.C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. New York,
1981.
[3] [Chua et al, 2004] Chua S, Kulathuramaiyer N, Semantic Feature Selection Using WordNet, proceedings
of the IEEE/WIC/ACM International Conference on Web Intelligence (WI04).
[4] [Fasulo, 1999] D. fasulo. An analysis of recent work on clustering algorithms. Technical report # 01-03-02,
1999.
[5] [Graepel, 1998] T. Graepel. Statistical physics of clustering algorithms. Technical Report 171822, FB
Physik, Institut fur Theoretische Physic, 1998.
[6] [Hotho et al, 2003] A. Hotho, S. Staab, and G. Stumme, Wordnet improves Text Document Clustering,
Proc. of the Semantic Web Workshop at SIGIR-2003, 26th Annual International ACM SIGIR Conference,
2003.
[7] [Kontostathis et al, 2006] Kontostathis A, Pottenger W M, A Framework for Understanding Latent
Semantic Indexing (LSI) performance, International journal of Information processing and management 42
(2006) 56-73.
[8] [Miller et al., 1990] Miller et al Introduction to WordNet: An On-line Lexical Database, International
Journal of Lexicography 1990 3(4):235-244.
[9] [Wang et al] Y. Wang, J. Hodges, Document Clustering with Semantic Analysis, Proceedings of the 39th
Hawaii International Conference on System Sciences.
[10] [Zamir et al, 1998] Oren Zamir and Oren Etzioni. Web Document Clustering: A Feasibility Demonstration
SIGIR98, Melbourne, Australia. 1998 ACM 1-58113-015-5 8/98
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Cheating Prevention in Visual Cryptography

Gowriswara Rao G., Dept. of Computer Science, JNTU College of Engg., Anantapur-515002, gowriswar@yahoo.com
C. Shoba Bindu, JNTU College of Engg., Anantapur-515002, shoba_bindu@yahoo.co.in

Abstract

Visual cryptography (VC) is a method of encrypting a secret image into shares
such that stacking a sufficient number of shares reveals the secret image.
Shares are usually presented in transparencies. Each participant holds a
transparency. VC focuses on improving two parameters: pixel expansion and
contrast. In this paper, we studied the cheating problem in VC and extended
VC. We considered the attacks of malicious adversaries who may deviate from
the scheme in any way. We presented three cheating methods and applied
them on attacking existent VC or extended VC schemes. We improved one
cheat-preventing scheme. We proposed a generic method that converts a VCS
to another VCS that has the property of cheating-prevention.
1 Introduction
Visual Cryptography is a cryptographic technique which allows visual information (pictures,
text, etc.) to be encrypted in such a way that the decryption can be performed by the human
visual system, without the aid of computers.
Visual cryptography was pioneered by Moni Naor and Adi Shamir in 1994. They
demonstrated a visual secret sharing scheme, where an image was broken up into n shares so
that only someone with all n shares could decrypt the image, while any n-1 shares revealed
no information about the original image. Each share was printed on a separate transparency,
and decryption was performed by overlaying the shares. When all n shares were overlaid, the
original image would appear.
Using a similar idea, transparencies can be used to implement a one-time pad encryption,
where one transparency is a shared random pad, and another transparency acts as the cipher
text.
2 Visual Cryptography Scheme
The secret image consists of a collection of black and white pixels. To construct n shares of an image for n participants, we need to prepare two collections, C_0 and C_1, which consist of n x m Boolean matrices. A row in a matrix in C_0 or C_1 corresponds to the m subpixels of a pixel, where 0 denotes a white subpixel and 1 denotes a black subpixel. For a white (or black) pixel in the image, we randomly choose a matrix M from C_0 (or C_1, respectively) and assign row i of M to the corresponding position of share S_i, 1 <= i <= n. Each pixel of the original image will thus be encoded into n pixels, one on each share, each consisting of m subpixels.
A matrix in C_0 or C_1 determines only one pixel on each share, so for security the number of matrices in C_0 and C_1 must be huge. For a succinct description and easier realization of the VC construction, we do not construct C_0 and C_1 directly. Instead, we construct two n x m basis matrices S^0 and S^1 and then let C_0 and C_1 be the sets of all matrices obtained by permuting the columns of S^0 and S^1, respectively.
Let OR(B, X) be the vector obtained by the bitwise OR of rows i_1, i_2, ..., i_q of B, where B is an n x m Boolean matrix and X = {P_i1, P_i2, ..., P_iq} is a set of participants. Let w(v) be the Hamming weight of a row vector v. For brevity, we let w(B, X) = w(OR(B, X)). Let P_b(S) = w(v)/m, where v is a black pixel in share S and m is the dimension of v. Similarly, P_w(S) = w(v)/m, where v is a white pixel in share S. Note that all white (or black) pixels in a share have the same Hamming weight. We use S_i + S_j to denote the stacking of shares S_i and S_j. The stacking corresponds to the bitwise-OR operation + on the subpixels of S_i and S_j.
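As an illustration of the construction described in this section, the sketch below encodes a single pixel with the classic (2,2)-VCS. The basis matrices used here are the standard Naor-Shamir choice, assumed only for the example (the paper does not fix particular matrices); C_0 and C_1 are realized by randomly permuting columns, and stacking is the bitwise OR of subpixels.

```python
import random

# Standard (2,2)-VCS basis matrices (one row per share, m = 2 subpixels).
# These specific matrices are an illustrative assumption.
S0 = [[1, 0],   # white secret pixel: both shares get identical subpixels
      [1, 0]]
S1 = [[1, 0],   # black secret pixel: shares get complementary subpixels
      [0, 1]]

def encode_pixel(bit):
    """Return the two shares' subpixel rows for one secret pixel (0=white, 1=black)."""
    basis = S1 if bit else S0
    perm = [0, 1]
    random.shuffle(perm)             # C_0 / C_1 = all column permutations of S0 / S1
    return [[row[j] for j in perm] for row in basis]

def stack(row_a, row_b):
    """Stacking two transparencies corresponds to bitwise OR of subpixels."""
    return [x | y for x, y in zip(row_a, row_b)]

share1, share2 = encode_pixel(1)     # a black secret pixel
print(stack(share1, share2))         # -> [1, 1]: fully black after stacking
share1, share2 = encode_pixel(0)     # a white secret pixel
print(stack(share1, share2))         # -> one black, one white subpixel (gray)
```

A black secret pixel stacks to two black subpixels, while a white one stacks to one black and one white subpixel; this difference in blackness is what provides the contrast of the reconstructed image.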
3 Image Secret Sharing
Unlike visual secret sharing schemes which require the halftoning process to encrypt gray-
scale or color visual data, image secret sharing solutions operate directly on the bit planes of
the digital input. The input image is decomposed into bit-levels which can be viewed as
binary images. Using the {k,n} threshold concept, the image secret sharing procedure
encrypts individual bit-planes into the binary shares which are used to compose the share
images with the representation identical to that of the input image. Depending on the number
of the bits used to represent the secret (input) image, the shares can contain binary, gray-scale
or color random information. Thus, the degree of protection afforded by image secret sharing
methods increases with the number of bits used to represent the secret image.
[Figure: secret gray-scale image = Share 1 + Share 2 = decrypted image]
The decryption operations are performed on decomposed bit-planes of the share images.
Using the contrast properties of the conventional {k,n}-schemes, the decryption procedure
uses shares' bits to recover the original bit representation and compose the secret image. The
decrypted output is readily available in a digital format, and is identical to the input image.
Because of the symmetry constraint imposed on the encryption and decryption processes, image secret sharing solutions have the perfect reconstruction property. This feature, in conjunction with the overall simplicity of the approach, makes it attractive for real-time secret-sharing-based encryption/decryption of natural images.
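A minimal NumPy sketch of the bit-level decomposition described above is shown below. It only illustrates splitting an 8-bit grayscale image into binary planes and recomposing them (the perfect reconstruction property); it does not perform the {k, n} encryption of each plane.

```python
import numpy as np

def to_bit_planes(img):
    """Decompose an 8-bit grayscale image into 8 binary planes (LSB first)."""
    return [((img >> b) & 1).astype(np.uint8) for b in range(8)]

def from_bit_planes(planes):
    """Recompose the image from its bit planes (perfect reconstruction)."""
    img = np.zeros(planes[0].shape, dtype=np.uint16)
    for b, plane in enumerate(planes):
        img += plane.astype(np.uint16) << b
    return img.astype(np.uint8)

img = np.random.randint(0, 256, size=(4, 4), dtype=np.uint8)
planes = to_bit_planes(img)
assert np.array_equal(from_bit_planes(planes), img)  # perfect reconstruction
```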
4 Cheating in Visual Cryptography
For cheating, a cheater presents some fake shares such that the stacking of fake and genuine
shares together reveals a fake image. There are two types of cheaters in Visual Cryptography.
One is a malicious participant (MP) who is also a legitimate participant and the other is a
malicious outsider (MO). A cheating process against a VCS consists of the following two
phases:
a. Fake share construction phase: the cheater generates the fake shares;
b. Image reconstruction phase: the fake image appears on the stacking of genuine shares
and fake shares.
5 Cheating Methods
A VCS is more useful if the shares are meaningful or identifiable to every participant. A VCS with this extended characteristic is called an extended VCS (EVCS). An EVCS is like a VCS except that each share displays a meaningful image, which we call the share image.
We consider three types of cheating methods: the first is initiated by an MP and the second by an MO, and both apply to attacking a VCS; the third is initiated by an MP and applies to attacking an EVCS.
A Cheating a VCS by an MP
The cheating method CA-1, depicted in Fig. 1, applies to attacking any VCS. Without loss of generality, we assume that P_1 is the cheater. Since the cheater is an MP, he uses his genuine share as a template to construct a set of fake shares that are indistinguishable from his genuine share. The stacking of these fake shares and S_1 reveals the fake image in perfect blackness. For any Y = {P_i1, P_i2, ..., P_iq} that does not belong to Q, the stacking of their shares reveals no image; thus the stacking of their shares together with the fake shares reveals the fake image, owing to the perfect blackness of the fake image.
Example: Fig. 2 shows how to cheat the participants in a (4,4)-VCS. There are four shares S_1, S_2, S_3, and S_4 in the (4,4)-VCS. P_1 is assumed to be the MP. By CA-1, one fake share FS_1 is generated. Since Y = (P_1, P_3, P_4) (or (P_1, P_2)) does not belong to Q, we see that S_1 + FS_1 + S_3 + S_4 reveals the fake image FI.

Fig. 1: Cheating method CA-1, initiated by an MP.

Fig. 2: Example of cheating a (4,4)-VCS by an MP.
B Cheating a VCS by an MO
Our second cheating method CA-2, depicted in Fig. 3, demonstrates that an MO can cheat even without any genuine share at hand. The idea is as follows. We use the optimal (2,2)-VCS to construct the fake shares for the fake image. Then we tune the size of the fake shares so that they can be stacked with genuine shares.
Now the only problem is to obtain the right share size for the fake shares. Our solution is to try all possible share sizes; in the case that the MO obtains one genuine share, there is no such problem. It may seem difficult to produce fake shares of the same size as the genuine shares, but this is feasible for the following reason. The shares of a VCS are usually printed on transparencies. We assume that this is done by a standard printer or copier which accepts only a few standard sizes, such as A4, A3, etc. Therefore, the size of genuine shares is a fraction, such as 1/4, of a standard size, and we can simply produce the fake shares in these sizes.
Furthermore, it was suggested to have a solid frame to align shares in order to solve the
alignment problem during the image reconstruction phase. The MO can simply choose the
size of the solid frame for the fake shares. Therefore, it is possible for the MO to have the
right size for the fake shares.
Example: Fig. 4 shows that an MO cheats a (4,4)-VCS. The four genuine shares S_1, S_2, S_3, and S_4 are those in Fig. 2, and the two fake shares are FS_1 and FS_2. For clarity, we put S_1 here to demonstrate that the fake shares are indistinguishable from the genuine shares. We see that the stacking of fewer than four genuine shares together with the two fake shares shows the fake image FI.

Fig. 3: Cheating method CA-2, initiated by an MO.

Fig. 4: Example of cheating a (4, 4)-VCS by an MO.
C Cheating an EVCS by an MP
The definition of VC only requires that the contrast be nonzero. Nevertheless, we observe that if the contrast is too small, it is hard to see the image. Based upon this observation, we demonstrate the third cheating method CA-3, depicted in Fig. 5, against an EVCS. The idea of CA-3 is to use the fake shares to reduce the contrast between the share images and the background. Simultaneously, the fake image in the stacking of the fake shares has enough contrast against the background, since the fake image is recovered in perfect blackness.
Example: Fig. 6 shows the results of cheating a (T, m)-EVCS, where P = {P_1, P_2, P_3} and Q = {(P_1, P_2), (P_2, P_3), (P_1, P_2, P_3)}. In this example, P_1 is the cheater, who constructs a fake share FS_2 with share image B as a substitute for P_2 in order to cheat P_3. S_1 + FS_2 + S_3 reveals the fake image FI.

Fig. 5: Cheating method CA-3 against an EVCS.

Fig. 6: Example of cheating a (T,m)-EVCS.
6 Cheat-Preventing Methods
There are two types of cheat-preventing methods. The first type has a trusted authority (TA) verify the shares of the participants. The second type has each participant verify the shares of the other participants. In this section, we present attacks on, and an improvement of, four existent cheat-preventing methods.
Attack on Yang and Laih's Cheat-Preventing Methods
The first cheat-preventing method of Yang and Laih needs a TA to hold a special verification share for detecting fake shares. The second cheat-preventing method of Yang and Laih is a transformation of a (T, m)-VCS (but not a (2, n)-VCS) into another cheat-preventing (T, m + n(n-1))-VCS.
Attacks on Horng et al.'s Cheat-Preventing Methods
In the first cheat-preventing method of Horng et al., each participant P_i has a verification share V_i. The shares S_i are generated as usual. Each V_i is divided into n-1 regions R_i,j, 1 <= j <= n, j != i. Each region R_i,j of V_i is designated for verifying share S_j: the region R_i,j of V_i + S_j shall reveal the verification image by which P_i verifies the share S_j of P_j. The verification image in R_i,j is constructed by a (2,2)-VCS. Although the method requires that the verification image be confidential, it is still possible to cheat.
Horng et al.'s second cheat-preventing method uses a redundancy approach. It uses a (2, n + l)-VCS to implement a (2, n)-VCS cheat-preventing scheme. The scheme needs no on-line TA for verifying shares. It generates n + l shares by the (2, n + l)-VCS for some integer l > 0, but distributes only n shares to the participants; the remaining shares are destroyed. The authors reason that since the cheater does not know the exact basis matrices even with all of his shares, the cheater cannot succeed.
7 Generic Transformation for Cheating Prevention
By the attacks and improvement in previous sections, we propose that an efficient and robust
cheat-preventing method should have the following properties.
a. It does not rely on the help of an on-line TA. Since VC emphasizes easy decryption with human eyes alone, we should not need a TA to verify the validity of shares.
b. The increase to pixel expansion should be as small as possible.
c. Each participant verifies the shares of other participants. This is somewhat necessary
because each participant is a potential cheater.
d. The verification image of each participant is different and confidential. It spreads over
the whole region of the share. We have shown that this is necessary for avoiding the
described attacks.
e. The contrast of the secret image in the stacking of shares is not reduced significantly
in order to keep the quality of VC.
f. A cheat-preventing method should be applicable to any VCS.
Example: Fig. 8 shows a transformed (T, m + 2)-VCS with cheating prevention, where P = {P_1, P_2, P_3} and Q = {(P_1, P_2), (P_2, P_3), (P_1, P_2, P_3)}. The verification images for participants P_1, P_2, and P_3 are A, B, and C, respectively.

Fig. 7: Generic transformation for VCS with cheating prevention

Fig. 8: Example of a transformed VCS with cheating prevention.
8 Conclusion
The proposed system presented three cheating methods against VCS and EVCS, examined previous cheat-preventing schemes, and found that they are not robust enough and can still be improved. The system presents an improvement on one of these cheat-preventing schemes and finally proposes an efficient transformation of a VCS for cheating prevention. This transformation incurs minimal overhead on contrast and pixel expansion: it adds only two subpixels for each pixel in the image, and the contrast is reduced only slightly.
References
[1] Chih-Ming Hu and Wen-Guey Tzeng, Cheating Prevention in Visual Cryptography, IEEE Trans. Image Processing, Vol. 16, No. 1, 2007.
[2] H. Yan, Z. Gan, and K. Chen, A Cheater Detectable Visual Cryptography Scheme, (in Chinese) J. Shanghai Jiaotong Univ., vol. 38, no. 1, 2004.
[3] G.-B. Horng, T.-G. Chen, and D.-S. Tsai, Cheating in Visual Cryptography, Designs, Codes, Cryptog., vol. 38, no. 2, pp. 219-236, 2006.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Image Steganalysis Using LSB Based
Algorithm for Similarity Measures

Mamta Juneja
Computer Science and Engineering Department
Rayat and Bahara Institute of Engineering and Technology (RBIEBT)
Sahauran(Punjab), India
er_mamta@yahoo.com

Abstract

This paper presents a novel technique for the steganalysis of images that have been subjected to Least Significant Bit (LSB) type steganographic algorithms. The seventh and eighth bit planes of an image are used to compute several binary similarity measures. The basic idea is that the correlation between the bit planes, as well as the binary texture characteristics within the bit planes, will differ between a stego-image and a cover-image. These telltale marks can be used to construct a steganalyzer, that is, a multivariate regression scheme to detect the presence of a steganographic message in an image.
1 Introduction
Steganography refers to the science of invisible communication. Unlike cryptography, where the goal is to secure communications from an eavesdropper, steganographic techniques strive to hide the very presence of the message itself from an observer [G. J. Simmons, 1984]. Given the proliferation of digital images, and given the high degree of redundancy present in a digital representation of an image (despite compression), there has been an increased interest in using digital images for the purpose of steganography. The simplest image steganography techniques essentially embed the message in a subset of the LSB (least significant bit) plane of the image, possibly after encryption [N. F. Johnson; S. Katzenbeisser, 2000]. Popular steganographic tools based on LSB embedding vary in their approach for hiding information. Methods like Steganos and S-Tools use LSB embedding in the spatial domain, while others like Jsteg embed in the frequency domain. Non-LSB steganography techniques include the use of quantization and dithering [N. F. Johnson; S. Katzenbeisser, 2000].
Since the main goal of steganography is to communicate securely in a completely
undetectable manner, an adversary should not be able to distinguish in any sense between
cover-objects (objects not containing any secret message) and stego-objects (objects
containing a secret message). In this context, steganalysis refers to the body of techniques
that are conceived to distinguish between cover-objects and stego-objects.
Recent years have seen many different steganalysis techniques proposed in the literature.
Some of the earliest work in this regard was reported by Johnson and Jajodia [N. F. Johnson;
S. Jajodia, 1998]. They mainly look at palette tables in GIF images and anomalies caused
therein by common stego-tools. A more principled approach to LSB steganalysis was
presented in [A. Westfield; A. Pfitzmann, 1999] by Westfeld and Pfitzmann. They identify
Pairs of Values (PoVs), which consist of pixel values that get mapped to one another on LSB
flipping. Fridrich, Du and Long [J. Fridrich; R. Du, M. Long, 2000] define pixels that are
close in color intensity to be a difference of not more than one count in any of the three color
planes. They then show that the ratio of close colors to the total number of unique colors
increases significantly when a new message of a selected length is embedded in a cover
image as opposed to when the same message is embedded in a stego-image. A more
sophisticated technique that provides remarkable detection accuracy for LSB embedding,
even for short messages, was presented by Fridrich et al. in [J. Fridrich; M. Goljan ; R. Du,
2001]. Avcibas, Memon and Sankur [I. Avcibas; N. Memon; B. Sankur, 2001] present a general technique for steganalysis of images that is applicable to a wide variety of embedding techniques, including but not limited to LSB embedding. They demonstrate that
steganographic schemes leave statistical evidence that can be exploited for detection with the
aid of image quality features and multivariate regression analysis. Chandramouli and Memon
[R. Chandramouli; N. Memon, 2001] do a theoretical analysis of LSB steganography and
derive a closed form expression of the probability of false detection in terms of the number of
bits that are hidden. This leads to the notion of steganographic capacity, that is, the number of
bits one can hide in an image using LSB techniques without causing statistically significant
modifications.
In this paper, a new steganalysis technique for detecting stego-images is presented. The
technique uses binary similarity measures between successive bit planes of an image to
determine the presence of a hidden message. In comparison to previous work, the technique
we present differs as follows:
[N. F. Johnson; S. Jajodia, 1998] present visual techniques and work for palette
images. Our technique is based on statistical analysis and works with any image
format.
[A. Westfield; A. Pfitzmann, 1999], [J. Fridrich; R. Du, M. Long, 2000] and [J.
Fridrich; M. Goljan ; R. Du, 2001] work only with LSB encoding. Our technique aims
to detect messages embedded in other bit planes as well.
[A. Westfield; A. Pfitzmann, 1999], [J. Fridrich; R. Du, M. Long, 2000] and [J.
Fridrich; M. Goljan ; R. Du, 2001] detect messages embedded in the spatial domain.
The proposed technique works with both spatial and transform-domain embedding.
Our technique is more sensitive than [A. Westfield; A. Pfitzmann, 1999], [J. Fridrich;
R. Du, M. Long, 2000] and [J. Fridrich; M. Goljan ; R. Du, 2001]. However, in its
current form it is not as accurate as [J. Fridrich; M. Goljan ; R. Du, 2001] and cannot
estimate the length of the embedded message like [J. Fridrich; M. Goljan ; R. Du,
2001].
Notice that our scheme does not need a reference image for steganalysis. The rest of this paper is organized as follows: In Section 2 we review binary similarity measures. In Section 3 we describe our steganalysis technique. In Section 4 we give simulation results, and we conclude with a brief discussion in Section 5.
2 Binary Similarity Measures
There are various ways to determine similarity between two binary images. Classical
measures are based on the bit-by-bit matching between the corresponding pixels of the two
images. Typically, such measures are obtained from the scores based on a contingency table
(or matrix of agreement) summed over all the pixels in an image. In this study, where we
examine lower order bit-planes of images, for the presence of hidden messages, we have
found that it is more relevant to make a comparison based on binary texture statistics. Let
{ } K k x
k i i
, , 1 , K = =

x and { } K k y
k i i
, , 1 , K = =

y be the sequences of bits representing the 4-


neighborhood pixels, where the index i runs over all the image pixels. Let

= =
= =
= =
= =
=
1 1 4
0 1 3
1 0 2
0 0 1
s r
s r
s r
s r
r
s
x and x if
x and x if
x and x if
x and x if
(1)
Then we can define the agreement variable for the pixel x
i
as: ) , (
1

=
K
k
k i
i
j
i
j , 4 , , 1 K = j , K =
4, where

=
=
n m
n m
n m
, 0
, 1
) , ( . (2)
The accumulated agreements can be defined as:

  a = (1/MN) Σ_i α_i^1,   b = (1/MN) Σ_i α_i^2,   c = (1/MN) Σ_i α_i^3,   d = (1/MN) Σ_i α_i^4.   (3)
These four variables {a, b, c, d} can be interpreted as the one-step co-occurrence values of the binary images. Normalized histograms of the agreement scores for the 7th bit plane can be defined as

  p_j^7 = Σ_i α_i^j / Σ_{j=1}^{4} Σ_i α_i^j.                               (4)

Similarly, one can define p_j^8 for the 8th bit plane. In addition to these we calculate the Ojala texture measure as follows. For each binary image we obtain a 16-bin histogram based on the weighted neighborhood shown in Fig. 1, where the score of a pixel is given by S = Σ_{i=0}^{3} x_i 2^i, weighting the four directional neighbors as in Fig. 1.

Fig. 1: The Weighting of the Neighbors in the Computation of the Ojala Score (N = 1, E = 2, S = 4, W = 8). For example, S = 4 + 8 = 12 when the W and S bits are 1 and the E and N bits are 0.

The resulting Ojala measure is the mutual entropy between the two distributions, that is,

  m_7 = Σ_{n=1}^{N} S_n^7 log (S_n^7 / S_n^8),                             (5)
where N is the total number of bins in the histogram, S_n^7 is the count of the n-th histogram bin in the 7th bit plane, and S_n^8 is the corresponding count in the 8th bit plane.
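Under the reading of Fig. 1 recovered above (neighbor weights N = 1, E = 2, S = 4, W = 8), a small NumPy sketch of the 16-bin Ojala histogram and the entropy comparison of Eq. (5) might look as follows; the epsilon guarding log(0) is an added assumption, not part of the original definition.

```python
import numpy as np

def ojala_histogram(plane):
    """16-bin histogram of S = N*1 + E*2 + S*4 + W*8 over all interior pixels
    of a binary bit plane (weights as in Fig. 1)."""
    north = plane[:-2, 1:-1]
    east  = plane[1:-1, 2:]
    south = plane[2:, 1:-1]
    west  = plane[1:-1, :-2]
    scores = (north * 1 + east * 2 + south * 4 + west * 8).astype(np.int64)
    hist = np.bincount(scores.ravel(), minlength=16).astype(float)
    return hist / hist.sum()

def ojala_mutual_entropy(plane7, plane8, eps=1e-12):
    """Eq. (5): sum_n S_n^7 * log(S_n^7 / S_n^8); eps avoids log(0) (assumption)."""
    s7 = ojala_histogram(plane7) + eps
    s8 = ojala_histogram(plane8) + eps
    return float(np.sum(s7 * np.log(s7 / s8)))

p7 = np.random.randint(0, 2, size=(64, 64))
p8 = np.random.randint(0, 2, size=(64, 64))
print(ojala_mutual_entropy(p7, p8))
```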
Table 1: Binary Similarity Measures

  Similarity Measure                      Description
  Sokal & Sneath Similarity Measure 1     m1 = a/(a+b) + a/(a+c) + d/(b+d) + d/(c+d)
  Sokal & Sneath Similarity Measure 2     m2 = ad / ((a+b)(a+c)(b+d)(c+d))
  Sokal & Sneath Similarity Measure 3     m3 = 2(a+d) / (2(a+d) + b + c)
  Variance Dissimilarity Measure          m4 = (b+c) / (4(a+b+c+d))
  Dispersion Similarity Measure           m5 = (ad - bc) / (a+b+c+d)^2
  Co-occurrence Entropy                   dm6 = Σ_{j=1}^{4} p_j^7 log p_j^8
  Ojala Mutual Entropy                    dm7 = Σ_{n=0}^{15} S_n^7 log (S_n^7 / S_n^8)
Using the above definitions, various binary image similarity measures are defined as shown in Table 1. The measures m_1 to m_5 are obtained for the seventh and eighth bit planes separately by adapting the parameters {a, b, c, d} of (3) to classical binary string similarity measures, such as Sokal & Sneath. Their differences dm_i = m_i^7 - m_i^8, i = 1, ..., 5, are then used as the final measures. The measure dm_6 is defined as the co-occurrence entropy using the 4-bin histograms of the 7th and 8th bit planes. Finally, the measure dm_7 is somewhat different in that we use the neighborhood-weighting mask proposed by Ojala [T. Ojala, M. Pietikainen, D. Harwood]: we obtain a 16-bin histogram for each of the planes and then calculate their mutual entropy.
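A minimal NumPy sketch of how these measures could be computed is given below, under the interpretation of Eqs. (1)-(3) adopted above (pixel-versus-4-neighbor co-occurrences accumulated into a, b, c, d per bit plane) and using the Sokal & Sneath measure m1 as the example. The bit-plane indexing convention and the function names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def cooccurrence_counts(plane):
    """Accumulate the four agreement frequencies a, b, c, d (Eq. 3) between each
    pixel and its 4-neighbors in a binary bit plane (values 0/1)."""
    a = b = c = d = 0
    center = plane[1:-1, 1:-1]            # crop borders for simplicity (assumption)
    for nb in (plane[:-2, 1:-1], plane[2:, 1:-1],
               plane[1:-1, :-2], plane[1:-1, 2:]):
        a += np.sum((center == 0) & (nb == 0))
        b += np.sum((center == 1) & (nb == 0))
        c += np.sum((center == 0) & (nb == 1))
        d += np.sum((center == 1) & (nb == 1))
    n = float(a + b + c + d)
    return a / n, b / n, c / n, d / n

def sokal_sneath_1(a, b, c, d):
    """m1 of Table 1."""
    return a/(a+b) + a/(a+c) + d/(b+d) + d/(c+d)

def dm1(image):
    """Difference of m1 between the 7th and 8th bit planes (dm_1 = m_1^7 - m_1^8)."""
    plane7 = (image >> 1) & 1   # 7th bit plane taken as the second least significant bit (assumption)
    plane8 = image & 1          # 8th bit plane taken as the least significant bit
    m7 = sokal_sneath_1(*cooccurrence_counts(plane7))
    m8 = sokal_sneath_1(*cooccurrence_counts(plane8))
    return m7 - m8

img = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
print(dm1(img))
```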
3 Steganalysis Technique Based on Binary Measures
This approach is based on the fact that embedding a message in an image has a telltale effect
on the nature of correlation between contiguous bit-planes. Hence we hypothesize that binary
similarity measures between bit planes will cluster differently for clean and stego-images.
This is the basis of our steganalyzer that aims to classify images as marked and unmarked.
We conjecture that hiding information in any bit plane decreases the correlation between that
plane and its contiguous neighbors. For example, for LSB steganography, one expects a
decreased similarity between the seventh and the eighth bit planes of the image as compared
to its unmarked version. Hence, similarity measures between these two LSBs should yield
higher scores in a clean image as compared to a stego-image, as the embedding process
destroys the preponderance of bit pair matches.
Since the complex bit pair similarity between bit planes cannot be represented by one
measure only, we decided to use several similarity measures to capture different aspects of bit
plane correlation. The steganalyzer is based on the regression of the seven similarity
measures listed in Table 1:
  y = β_1 m_1 + β_2 m_2 + ... + β_q m_q                                    (6)

where {m_1, m_2, ..., m_q} are the q similarity scores and {β_1, β_2, ..., β_q} are their regression coefficients. In other words, we try to predict the state y, that is, whether the image contains a stego-message (y = 1) or not (y = -1), based on the bit-plane similarity measures. Since we have n observations, we have the set of equations

  y_1 = β_1 m_11 + β_2 m_12 + ... + β_q m_1q + ε_1
  ...
  y_n = β_1 m_n1 + β_2 m_n2 + ... + β_q m_nq + ε_n                         (7)

where m_kr is the r-th similarity measure observed in the k-th test image. The corresponding optimal MMSE linear predictor can be obtained by using the matrix M of similarity measures:

  β = (M^T M)^{-1} M^T y.                                                  (8)

Once the prediction coefficients are obtained in the training phase, they can be used in the testing phase. Given an image in the test phase, the binary measures are computed and, using the prediction coefficients, these scores are regressed to the output value. If the output exceeds the threshold 0 then the decision is that the image is embedded; otherwise the decision is that the image is not embedded. That is, using the prediction

  y = β_1 m_1 + β_2 m_2 + ... + β_q m_q                                    (9)

the condition y >= 0 implies that the image contains a stego-message, and the condition y < 0 signifies that it does not.
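The training and decision steps of Eqs. (6)-(9) amount to an ordinary least-squares fit followed by thresholding at zero. A compact sketch is shown below; it uses numpy's least-squares solver rather than forming the matrix inverse explicitly, and the random features stand in for the seven measures of Table 1 (both are assumptions for illustration).

```python
import numpy as np

def train_steganalyzer(M, y):
    """Solve beta = (M^T M)^{-1} M^T y (Eq. 8) via least squares.
    M: n x q matrix of similarity measures, y: labels in {+1, -1}."""
    beta, *_ = np.linalg.lstsq(M, y, rcond=None)
    return beta

def classify(beta, m):
    """Eq. (9): declare 'stego' (+1) if the regressed output is >= 0, else -1."""
    return 1 if float(m @ beta) >= 0 else -1

# Toy usage with random placeholder features:
rng = np.random.default_rng(0)
M_train = rng.normal(size=(12, 7))
y_train = np.array([1] * 6 + [-1] * 6)
beta = train_steganalyzer(M_train, y_train)
print(classify(beta, M_train[0]))
```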
The above shows how one can design a steganalyzer for the specific case of LSB embedding.
The same procedure generalizes quite easily to detect messages in any other bit plane.
Furthermore, our initial results indicate that we can even build steganalyzer for non-LSB
embedding techniques like the recently designed algorithm F5 [Springer-Verlag Berlin,
2001]. This is because a technique like F5 (and many other robust watermarking techniques
which can be used for steganography in an active warden framework [I. Avcibas; N. Memon;
B. Sankur,2001]) results in the modification of the correlation between bit planes. We note
that LSB techniques randomize the last bit plane. On the other hand Jsteg or F5 introduce
more correlation between the 7th and 8th bit planes, due to compression that filters out the natural noise in a clean image. In other words, whereas spatial-domain techniques decrease correlation, frequency-domain techniques increase it.
4 Simulation Results
The steganalyzer was designed using a training set built with various image steganographic tools. The steganographic tools were Steganos [Steganos II Security Suite], S-Tools [A. Brown, S-Tools Version 4.0] and Jsteg [J. Korejwa, Jsteg Shell 2.0], since these were among the most popular and cited tools in the literature. The image database for the simulations was selected to contain a variety of images such as computer-generated images, images with bright colors, images with reduced and dark colors, images with textures and fine details like lines and edges, and well-known images like Lena, peppers, etc.
In the experiments, 12 images were used for training and 10 images for testing. The embedded message size was 1/10 of the cover image size for Steganos and S-Tools, while it was 1/100 of the cover image size for Jsteg. The 12 training and 10 test images were embedded with the separate algorithms (Steganos, S-Tools and Jsteg) and were compared against their non-embedded versions in the training and test phases.
The performance of the steganalyzers is given in Table II. In this table we compare two
steganalyzers: the one marked Binary is the scheme discussed in this paper. The one marked
as IQM is based on the technique developed in [I. Avcibas; N. Memon; B. Sankur, 2001].
This technique likewise uses regression analysis, but it is based on several image quality
measures (IQM) such as block spectral phase distance, normalized mean square error, angle
mean etc. The quality attributes are calculated between the test image and its low-pass
filtered version. The steganalyzer scheme denoted as IQM [I. Avcibas; N. Memon; B. Sankur,
2001] is more laborious in the computation of the quality measures and preprocessing.
Table 2: Performance of the Steganalyzer

             False Alarm Rate     Miss Rate            Detection Rate
             IQM      BSM         IQM      BSM         IQM      BSM
  Steganos   2/5      1/5         1/5      1/5         7/10     8/10
  Stools     4/10     1/10        1/10     2/10        15/20    17/20
  Jsteg      3/10     2/10        3/10     1/10        14/20    17/20
  F5         -        2/10        -        2/10        -        16/20
Simulation results indicate that the binary measures form a multidimensional feature space whose points cluster well enough to allow classification of marked and non-marked images, in a manner comparable to the previous technique presented in [I. Avcibas; N. Memon; B. Sankur, 2001].
5 Conclusion
In this paper, I have addressed the problem of steganalysis of marked images. I have
developed a technique for discriminating between cover-images and stego-images that have
been subjected to the LSB type steganographic marking. This approach is based on the
hypothesis that steganographic schemes leave telltale evidence between the 7th and 8th bit planes
that can be exploited for detection. The steganalyzer has been instrumented with binary
image similarity measures and multivariate regression. Simulation results with commercially
available steganographic techniques indicate that the new steganalyzer is effective in
classifying marked and non-marked images.
As described above, the proposed technique is not suitable for active warden steganography
(unlike [I. Avcibas; N. Memon; B. Sankur, 2001]) where a message is hidden in higher bit
depths. But initial results have shown that it can easily generalize for the active warden case
by taking deeper bit plane correlations into account. For example, we are able to detect
Digimarc when the measures are computed for 3rd and 4th bit planes.
References
[1] [G. J. Simmons, 1984] The Prisoners' Problem and the Subliminal Channel, CRYPTO 83 - Advances in Cryptology, August 22-24, 1984, pages 51-67.
[2] [N. F. Johnson; S. Katzenbeisser, 2000] A Survey of Steganographic Techniques, in S. Katzenbeisser and F. Petitcolas (Eds.): Information Hiding, pages 43-78, Artech House, Norwood, MA, 2000.
[3] [N. F. Johnson; S. Jajodia, 1998] Steganalysis: The Investigation of Hidden Information, IEEE Information Technology Conference, Syracuse, NY, USA, 1998.
[4] [N. F. Johnson; S. Jajodia, 1998] Steganalysis of Images Created Using Current Steganography Software, in David Aucsmith (Ed.): Information Hiding, LNCS 1525, pages 32-47, Springer-Verlag Berlin Heidelberg, 1998.
[5] [A. Westfield; A. Pfitzmann, 1999] Attacks on Steganographic Systems, in Information Hiding, LNCS 1768, pages 61-76, Springer-Verlag Heidelberg, 1999.
[6] [J. Fridrich; R. Du, M. Long, 2000] Steganalysis of LSB Encoding in Color Images, Proceedings of ICME 2000, July 31-August 2, New York City, NY, USA.
[7] [J. Fridrich; M. Goljan; R. Du, 2001] Reliable Detection of LSB Steganography in Color and Grayscale Images, Proc. of the ACM Workshop on Multimedia and Security, Ottawa, CA, October 5, 2001, pages 27-30.
[8] [I. Avcibas; N. Memon; B. Sankur, 2001] Steganalysis Using Image Quality Metrics, Security and Watermarking of Multimedia Contents, SPIE, San Jose, 2001.
[9] [R. Chandramouli; N. Memon, 2001] Analysis of LSB Based Image Steganography Techniques, Proceedings of the International Conference on Image Processing, Thessalonica, Greece, October 2001.
[10] [C. Rencher, 1995] Methods of Multivariate Analysis, New York, John Wiley, 1995.
[11] [Springer-Verlag Berlin, 2001] F5 - A Steganographic Algorithm: High Capacity Despite Better Steganalysis, Information Hiding Proceedings, LNCS 2137, Springer-Verlag Berlin, 2001.
[12] [Steganos II Security Suite] http://www.steganos.com/english/steganos/download.htm
[13] [A. Brown, S-Tools Version 4.0] Copyright 1996, http://members.tripod.com/steganography/stego/s-tools4
[14] [J. Korejwa, Jsteg Shell 2.0] http://www.tiac.net/users/korejwa/steg.htm
[15] Image database: http://www.cl.cam.ac.uk/~fapages2/watermarking/benchmark/image_database.html
[16] [T. Ojala, M. Pietikainen, D. Harwood] A Comparative Study of Texture Measures with Classification Based on Feature Distributions, Pattern Recognition, vol. 29, pages 51-59.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Content Based Image Retrieval Using Dynamical
Neural Network (DNN)

D. Rajya Lakshmi, GITAM University, Visakhapatnam, India, dlakmi@rediffmail.com
A. Damodaram, JNTU College of Engg., Hyderabad, India, adamodaram@jntuap.ac.in
K. Ravi Kiran, GITAM University, Visakhapatnam, India, ravikiranbtech09@gmail.com
K. Saritha, GITAM University, Visakhapatnam, India, sarithait47@gmail.com

Abstract

In content-based image retrieval (CBIR), the content of an image can be expressed in terms of different features such as color, texture, shape, or text annotations. Retrieval based on these features can vary according to how the feature values are combined. Most existing approaches assume a linear relationship between different features, and the usefulness of such systems has been limited by the difficulty of representing high-level concepts using low-level features. We introduce a Neural Network-based Image Retrieval system, a human-computer interaction approach to CBIR that uses a Dynamical Neural Network so that approximate similarity comparison between images can be supported. The experimental results show that the proposed approach captures the user's perception subjectivity more precisely using the dynamically updated weights.
Keywords: Content Based, Neural Network-based Image Retrieval, Dynamical Neural Network, Binary signature, Region-based retrieval, Daubechies compression, Image segmentation.
1 Introduction
With the rapid development of computing hardware, digital acquisition of information has
become one popular method in recent years. Every day, G-bytes of images are generated by
both military and civilian equipment. Consequently, how to make use of this huge amount of
images effectively becomes a highly challenging problem. The traditional approach relies on
image content manual annotation and Database Management System (DBMS) to accomplish
the image retrieval through keywords. Although simple and straightforward, the traditional
approach has two main limitations. First, the descriptive keywords of an image are inaccurate
and incomplete. Second, manual annotation is time-consuming and subjective. Users with
different backgrounds tend to describe the same object using different descriptive keywords
resulting in the difficulties in image retrieval. To overcome the drawbacks of the traditional
approach, content-based image retrieval (CBIR) was proposed to retrieve visual-similar
images from an image database based on automatically-derived image features, which has
been a very active research area. There have been many projects performed to develop
efficient systems for content-based image retrieval. The best-known CBIR system is probably IBM's QBIC [Niblack, 93]. Other notable systems include MIT's Photobook [Pentland, 94], Virage's VisualSEEK [Smith, 96], etc.
Neural networks are relatively simple systems with general structures that can be directly applied to image analysis and visual pattern recognition problems [Peasso, 95]. They are usually viewed as nonparametric classifiers, although their trained outputs may indirectly produce maximum a posteriori (MAP) classifiers [Peasso, 95]. A central attraction of neural networks is their computationally efficient decisions based on training procedures.
Because of the variations encountered during image matching (the same object may appear under different directions, lighting and orientations, so its features are not always the same), associative memory, which facilitates approximate matching instead of exact matching, finds its relevance here. Our DNN, used as an associative memory with enhanced capacity, overcomes these impediments resourcefully and provides a feasible solution.
The rest of the paper is organized as follows. Section 2 describes the neural network approach, Section 3 the Dynamical Neural Network (DNN) with reuse, and Section 4 the performance evaluation. Section 5 discusses the methodology and experimentation, Section 6 presents the results and their interpretation, and Section 7 contains the discussion and conclusions.
2 Neural Network Approach
Several researchers have explored the application of neural networks [Mighell 89] to content-based image retrieval.
In the present work the Dynamical Neural Network (with reuse) is used as an associative memory to handle content-based image retrieval. It is found to be efficient, with its special features of high capacity, fast learning and exact recall. The experimental results presented emphasize this efficiency.
Many models of neural networks have been proposed to solve problems of classification, vector quantization, self-organization and associative memory. The associative memory concept concerns the retrieval of stored information in response to a given input pattern. High storage capacity and accurate recall are the most desired properties of an associative memory network. Many models have been proposed by different researchers [Hilberg 97], [Smith 96], [Sukhaswami 93], [Kang 93] for improving the storage capacity. Most associative memory models are variations of the Hopfield model [Hopfield 82]. These models have been demonstrated to be very useful in a wide range of applications, but the Hopfield model has certain limitations of its own: its practical storage capacity is only 0.15n, low compared to other models; the stability of the stored patterns degrades as their number approaches the maximum capacity; and the accuracy of recall falls as the number of stored patterns increases. These limitations prompted us to come up with an improved model of neural network for associative memory. The model we proposed, the Dynamical Neural Network [Rao & Pujari 99], has the advantages of fast learning, accurate recall and relative pruning, and has been exploited in important applications such as faster text retrieval and word sense disambiguation [Rao & Pujari 99]. Subsequently, some important amendments have been made to this composite structure. It is often advantageous to have higher capacity with a smaller initial structure; this consideration led to a method that reuses the nodes being pruned, thereby increasing the network capacity. This algorithm makes reusability possible and should prove efficient in many applications. One application which we have studied extensively, and for which we found the DNN to be most supportive, is content-based image retrieval: retrieving information from a very large image database using approximate queries. The DNN with reuse is well suited to the association problem in image retrieval, as it has the properties of an associative memory with 100% perfect recall, large storage capacity, avoidance of spurious states and convergence only to user-specified states. Each image is subjected to multiresolution wavelet analysis for image compression and to capture invariant features of the image.
3 Dynamical Neural Network (DNN) with Reuse
3.1 Dynamical Neural Network (DNN)
A new architecture, the Dynamical Neural Network (DNN), was proposed in [Rao & Pujari 99]. It is called a Dynamical Neural Network in the sense that its architecture gets modified dynamically over time as training progresses. The architecture of the DNN (Figure 1) has a composite structure wherein each node of the network is a Hopfield network by itself. The Hopfield network employs a new learning technique and converges to user-specified stable states without having any spurious states. The capabilities of the new architecture are as follows. The DNN works as an associative memory without spurious stable states, and it also demonstrates a novel idea of order-sensitive learning which gives preference to the chronological order of presentation of the exemplar patterns. The DNN prunes nodes as it progressively carries out associative memory retrieval.
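The specific learning rule of the basic node in [Rao & Pujari 99] is not reproduced in this paper, so the sketch below uses an ordinary Hebbian Hopfield network as a stand-in basic node, merely to illustrate the idea of a node that memorizes p = 2 patterns and recalls the closest one.

```python
import numpy as np

class HopfieldNode:
    """Stand-in basic node (ordinary Hebbian Hopfield network, not the DNN's
    actual learning rule): stores a few bipolar (+1/-1) patterns."""
    def __init__(self, n):
        self.n = n
        self.W = np.zeros((n, n))

    def train(self, patterns):
        for p in patterns:
            self.W += np.outer(p, p)        # Hebbian outer-product learning
        np.fill_diagonal(self.W, 0)

    def recall(self, x, steps=20):
        s = x.copy()
        for _ in range(steps):              # synchronous updates for brevity
            s = np.sign(self.W @ s)
            s[s == 0] = 1
        return s

p1 = np.array([1, -1, 1, -1, 1, -1, 1, -1])
p2 = np.array([1, 1, 1, 1, -1, -1, -1, -1])
node = HopfieldNode(8)
node.train([p1, p2])
noisy = p1.copy(); noisy[0] = -1            # flip one bit of p1
print(np.array_equal(node.recall(noisy), p1))  # the node recovers p1
```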
3.2 Training and Relative Pruning of DNN
The underlying idea behind this network structure is the following. If each basic node memorizes p patterns, then p basic nodes are grouped together. When a test pattern is presented to the DNN, assume that it is presented to all the basic nodes simultaneously and that all basic nodes are activated at the same time to reach their respective stable states. Within a group of basic nodes, one of them is designated as the leader of the group; for simplicity, consider the first node as the leader. After the nodes in a group reach their respective stable states, they transmit their stable states to the leader of that group. The corresponding connections among the nodes are shown in Figure 1(a) for p = 2. At this stage the DNN adopts a relative pruning strategy: it retains only the leader of each group and ignores all other basic nodes within a group. In the next pass the DNN consists of fewer nodes, but the structure is retained. These leader nodes are treated as basic nodes and each of them is trained to memorize the p patterns corresponding to the p stable states of the member nodes of its group. These leader nodes are again grouped together, taking p nodes at a time. This process is repeated until a single node remains. Thus in one cycle the nodes carry out state transitions in parallel, keeping the weights unchanged, and in the next cycle the nodes communicate among themselves to change the weights. At this stage half of the network is pruned and the remaining half is available for the next iteration of two cycles. In this process, the network eventually converges to the closest pattern.
It is clear that in the DNN, if each basic node memorizes p patterns, then each group memorizes p^2 patterns, and the single leader node representing that portion of the DNN memorizes p^2 patterns. When p such leader nodes are grouped together, p^3 patterns can be memorized. If the process runs for i iterations to arrive at a single basic node, then the DNN can memorize p^i patterns. Conversely, if the DNN is required to memorize K patterns, then we must start with K/p basic nodes.

Fig. 1: The Architecture of DNN Fig. 1(a) First Level Grouping and Leader Nodes
3.3 Reuse of Pruned Nodes in DNN
One of the novel features of the DNN is relative pruning, wherein the network releases half of its neurons at every step. In this process the DNN, though it appears to be a composite, massively connected structure, progressively simplifies itself by shedding part of its structure. As a result, the DNN makes intelligent use of its resources.
Since the network should have more capacity to tackle realistic applications efficiently, it is advantageous to obtain higher capacity with a smaller initial structure. In the present work we show that if we do not discard the pruned nodes but reuse them, it is possible to memorize a larger number of patterns even with a smaller initial structure. We emphasize that by pruning we release the nodes to perform some other task, or we can make use of the pruned nodes for the same task in subsequent iterations. The use of pruned nodes for processing a fresh set of exemplar patterns is given as an algorithm in Figure 2.
INPUT: S_1, S_2, ..., S_{2m+2l} exemplar patterns, X test pattern (where m is the number of basic nodes in the network; 2l is the number of leftover exemplar patterns to be memorized after the first iteration; l = 1, 2, ...)
OUTPUT: O output pattern.
PROCEDURE DNN ( S_1, S_2, ..., S_{2m+1}, ..., S_{2m+2l}, X : INPUT, O : OUTPUT )
  j = 1; k = 2; a = 0;
  For i = 1 to m, with increment j
    Constantly present the input pattern X to each node H_i
  End for
  For a = 1 to l - 1, with increment 1
    For i = 1 to m, with increment j
      Train-network ( H_i, S_{2i-1}, S_{2i} )   /* the node H_i is trained with exemplar patterns S_{2i-1} and S_{2i} */
    End for
    For i = 1 to m, with increment j
      Hopfield ( H_i, X, O_i )                  /* each node H_i stabilizes at a stable state O_i */
    End for
    For i = 1 to m, with increment k
      S_{2i-1} <- O_i
      S_{2i} <- O_{i+j}
      S_{2(i+j)-1} <- S_{2(2m+a)-1}
      S_{2(i+j)} <- S_{2(2m+a)}
    End for
  End for
  Repeat until m = 1
  Begin do
    For i = 1 to m, with increment j
      Train-network ( H_i, S_{2i-1}, S_{2i} )
    End for
    For i = 1 to m, with increment j
      Hopfield ( H_i, X, O_i )
    End for
    For i = 1 to m, with increment k
      S_{2i-1} <- O_i
      S_{2i} <- O_{i+j}
      Prune the node H_{i+j}
    End for
    j = j + j; k = k + k; m = m / 2
  End do
END
Fig. 2: The Algorithm for Training and Pruning of the DNN with Reuse.
4 Performance Evaluation
The concept of reusing the pruned nodes in each iteration has a very good advantage when the number of exemplar patterns is much larger than twice the number of basic Hopfield nodes in the DNN, because the number of iterations required to obtain the final output when node reuse is employed is much less than that of the original DNN. The quantitative analysis is presented as follows.
The DNN without reuse of pruned nodes requires n nodes when there are 2n patterns and converges to the final pattern in log_2(2n) iterations. If the pruned nodes are reused, then with the same n nodes we can memorize or store (k + 1 - log_2 n) * n patterns, where k is the number of iterations; this k is log_2(2n) in the case of the DNN without reuse. Now if one can tolerate an extra iteration, i.e., log_2(2n) + 1, then n extra patterns can be stored in the DNN. As an example:
                        Nodes   Capacity      Iterations
  DNN (without reuse)   8       16 patterns   4
  DNN with reuse        8       24 patterns   5
  DNN with reuse        8       32 patterns   6
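As a worked check of the capacity expression stated above, using the numbers in the table: without reuse, n = 8 nodes store 2n = 16 patterns and converge in log_2(2n) = log_2 16 = 4 iterations; with reuse and k = 5 iterations, (k + 1 - log_2 n) * n = (5 + 1 - 3) * 8 = 24 patterns; with k = 6, (6 + 1 - 3) * 8 = 32 patterns, matching the table.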
5 Methodology and Experimentation
CBIR aims at searching image libraries for specific image features like color, shape and texture, and querying is performed by comparing feature vectors (e.g. color histograms) of a query image with the feature vectors of all images in the database. One of the most important challenges when building image-based retrieval systems is the choice and representation of the visual features. Color is the most intuitive and straightforward feature for the user, while shape and texture are also important visual attributes, but there is no standard way to use them for efficient image retrieval compared to color. Many content-based image retrieval systems use color and texture features.
In the present work a method combining both color and texture features of an image is proposed to improve the retrieval performance. Since the DNN operates on binary data, the feature vector is transformed into a binary signature and stored in a signature file. To obtain one signature per image from the database, the following operations are performed (a rough sketch of this pipeline is given below):
1. The image is compressed using the Haar or Daubechies wavelet compression technique.
2. A feature vector (c1, c2, c3, t1, t2, t3) combining color and texture is extracted for each pixel by applying color and texture feature extraction techniques.
3. Image segmentation is a crucial step for a region-based system to increase performance and accuracy during the image similarity distance computation. Images are segmented into objects/regions by grouping pixels with similar descriptions (color and texture): image(c1, c2, c3, t1, t2, t3) = sum_i O_i(c1, c2, c3, t1, t2, t3).
4. The final feature vector for each image is calculated by augmenting the feature vectors of the objects of the image: Feature Vector = FV_O1 || FV_O2 || ... || FV_Oi.
5. The feature vector is converted into a binary signature and stored in a signature file SF. Thus the image database is transformed into signature file storage.
Similarly, a signature is extracted for a query image. In the DNN-based CBIR system, the set of signatures extracted from the image database becomes the training set for the DNN, and the signature of the query image becomes the test signature (input pattern). The DNN with reuse consists of fully connected basic nodes, each of which is a Hopfield node. When a query signature is presented, its binary representation is used as the test pattern. According to the dynamics of the Hopfield model, the DNN with reuse retrieves one of the memorized patterns that is closest to the test pattern.
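A rough end-to-end sketch of steps 1-5 is given below. Several choices in it are assumptions made only for illustration: PyWavelets for the Haar compression step, the LL-band color components as the per-pixel features, scikit-learn k-means for segmentation, and thresholding of region means for the binary signature. The authors' actual feature definitions (c1..c3, t1..t3) and signature encoding are not specified here.

```python
import numpy as np
import pywt
from sklearn.cluster import KMeans

def image_signature(img_rgb, n_regions=4):
    """Steps 1-5 (sketch): wavelet-compress, extract per-pixel features,
    segment with k-means, and threshold region means into a binary signature."""
    # 1. Wavelet "compression": keep only the approximation (LL) band per channel.
    ll = np.stack([pywt.dwt2(img_rgb[..., c], "haar")[0] for c in range(3)], axis=-1)
    # 2. Per-pixel feature vector (here: only the three color components of LL).
    feats = ll.reshape(-1, 3)
    # 3. Segmentation by grouping pixels with similar descriptions.
    labels = KMeans(n_clusters=n_regions, n_init=10, random_state=0).fit_predict(feats)
    # 4. Augment the feature vectors of the regions (mean vector per region).
    region_means = np.vstack([feats[labels == r].mean(axis=0) for r in range(n_regions)])
    # 5. Binary signature: threshold each region mean against the global mean.
    bits = (region_means > feats.mean(axis=0)).astype(np.uint8).ravel()
    return bits   # the DNN training/test pattern for this image

img = np.random.rand(64, 64, 3)
print(image_signature(img))
```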
5.1 Image Compression and Feature Extraction
Generally, in CBIR systems the image must be compressed before extracting features and generating a reference signature for each image. This is the preprocessing phase; it is followed by the feature extraction phase, in which the signature is computed.
5.2 Wavelet Analysis
During the preprocessing phase, the raw file of each image is subjected to multiresolution wavelet analysis. This is required because the storage and manipulation of scanned images is very expensive owing to the large storage space needed; to make widespread use of digital images practical, some form of data compression must be used. The wavelet transform has become a cutting-edge technology in image compression research.
The wavelet representation encodes the average image and the detail images as we transform the image to coarser resolutions. The detail images encode the directional features of the vertical, horizontal and diagonal directions, whereas the average image retains the average features. Thus the average image of a scanned image can retain its salient structure even at coarser resolution. However, not all types of basis functions are able to preserve these features in the present context; we tried a set of scaling functions and noticed that their behavior differs. Another advantage of using the wavelet representation is that the preprocessing stages of contouring, thinning or edge detection are no longer required. We apply the wavelet transform to the gray-scale image and apply the process of finalization to the wavelet coefficients of the average image.
The wavelet representation gives information about the variations in the image at different scales. A high wavelet coefficient at a coarse resolution corresponds to a region with high global variation; the idea is to find relevant points representing this global variation by looking at wavelet coefficients at finer resolutions. A wavelet is an oscillating and attenuated function with zero integral. It is a basis function that has some similarity to both splines and Fourier series. It decomposes the image into different frequency components and analyzes each component with a resolution matching its scale. We study the image at the scales 2^{-j}, j in Z^+. The application of wavelets to compute signatures of images is an area of active research.
The forward wavelet transform can be viewed as a form of sub-band coding with a low-pass filter (H) and a high-pass filter (G) which split a signal's bandwidth in half. The impulse responses of H and G are mirror images, related by g_n = (-1)^{1-n} h_{1-n}. A one-dimensional signal s can be filtered by convolving the filter coefficients c_k with the signal values, s~ = Σ_{k=1}^{M} c_k s_k, where M is the number of coefficients. The one-dimensional forward wavelet transform of a signal s is performed by convolving s with both H and G and downsampling by 2. The image f(x, y) is first filtered along the x-dimension, resulting in a low-pass image f_L(x, y) and a high-pass image f_H(x, y). The downsampling is accomplished by dropping every other filtered value. Both f_L and f_H are then filtered along the y-dimension, resulting in four sub-images: f_LL, f_LH, f_HL and f_HH. Once again, we can downsample the sub-images by 2, this time along the y-dimension. The 2-D filtering decomposes an image into an average signal f_LL and three detail signals which are directionally sensitive: f_LH emphasizes the horizontal image features, f_HL the vertical features, and f_HH the diagonal features. This process reduces a 130x40 image to a set of four 17x5 images, of which we consider only one, namely the average image.
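The 2-D decomposition just described (filtering along x and y with downsampling by 2, yielding f_LL, f_LH, f_HL and f_HH) can be reproduced with PyWavelets; the Haar wavelet below is only an illustrative choice, since the text itself uses a Battle-Lemarie filter.

```python
import numpy as np
import pywt

img = np.random.rand(130, 40)                 # placeholder image of the size quoted above
cA, (cH, cV, cD) = pywt.dwt2(img, "haar")     # cA ~ f_LL (average); cH, cV, cD are the detail bands
print(cA.shape)                               # each sub-band is downsampled by 2 in both directions
```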
The properties of the basis wavelet depend on the proper choice of the filter characteristic H(w). Several filters have been proposed by researchers working in this field, and the specific type of filter to be used depends on the application. The constraints for choosing a filter are good localization in the space and frequency domains on one hand, and smoothness and differentiability on the other. Here we have used the Battle-Lemarie filter coefficients for the wavelet approximation.
5.2.1 Battle-Lemarie Filter [Mallat, 89]
h(n): 0.30683, 0.54173, 0.30683, -0.035498, -0.077807, 0.022684, 0.0297468, -0.0121455, -0.0127154, 0.00614143, 0.0055799, -0.00307863, -0.00274529, 0.00154264, 0.00133087, -0.000780461, 0.000655628, 0.0003955934
g(n): 0.541736, -0.30683, -0.035498, 0.077807, 0.022684, -0.0297468, -0.0121455, 0.0127154, 0.00614143, -0.0055799, -0.00307863, 0.00274529, 0.00154264, -0.00133087, -0.000780461, 0.000655628, 0.0003955934, 0.000655628
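A minimal sketch of the filter-and-downsample step s~ = Σ c_k s_k using the h(n) coefficients listed above; the boundary handling ('same'-mode convolution) and the random test signal are assumptions made only for illustration.

```python
import numpy as np

# Battle-Lemarie low-pass coefficients h(n) as listed above (truncated filter).
h = np.array([0.30683, 0.54173, 0.30683, -0.035498, -0.077807, 0.022684,
              0.0297468, -0.0121455, -0.0127154, 0.00614143, 0.0055799,
              -0.00307863, -0.00274529, 0.00154264, 0.00133087,
              -0.000780461, 0.000655628, 0.0003955934])

def analysis_step(s, filt):
    """One level of sub-band analysis: convolve with the filter, then
    downsample by 2 (drop every other filtered value)."""
    filtered = np.convolve(s, filt, mode="same")   # boundary handling is an assumption
    return filtered[::2]

signal = np.random.rand(128)
approx = analysis_step(signal, h)     # low-pass (H) branch
print(approx.shape)                   # (64,)
```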
5.2.2 Feature Extraction
Since the basic building block of the DNN is the Hopfield network, which learns only from binary information, the training set (images) has to be converted into binary form. For preprocessing and feature extraction in the image retrieval system we used MATLAB image processing and statistics tools, and for clustering we used k-means. We use a general-purpose image database containing 1000 images from COREL. These images are pre-categorized into 10 groups: African people, beach, buildings, buses, dinosaurs, elephants, flowers, horses, mountains and glaciers, and food. All images have a size of 384x256 or 256x384. As explained above, the color and texture features are extracted and fed into the k-means algorithm to obtain the clustered objects. As explained above in step 5, the binary signature is then computed. This process is repeated for all the images in the database, and these signatures are used as the training set for the DNN. The same process is repeated for the query image and the resulting signature is used as the test set.
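The following is only a rough Python sketch of the feature-to-signature pipeline outlined above (the paper's own MATLAB preprocessing and its exact "step 5" binarization are not reproduced here); the choice of k = 4 clusters, the toy feature layout and the median-threshold binarization into a 200-bit bipolar signature are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def binary_signature(features, n_clusters=4, n_bits=200, seed=0):
    """Cluster per-block colour/texture features and binarize the centroids
    into a fixed-length bipolar signature (illustrative only)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(features)
    centroids = km.cluster_centers_.ravel()
    stretched = np.resize(centroids, n_bits)          # stretch/trim to n_bits values
    bits = (stretched > np.median(stretched)).astype(np.int8)
    return 2 * bits - 1                               # Hopfield-style +1/-1 pattern

# Hypothetical features: one row per image block, e.g. mean R, G, B and a texture energy.
features = np.random.rand(96, 4)
signature = binary_signature(features)
print(signature.shape, signature[:10])
```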
5.3 Signature Comparison
The DNN with reuse consists of fully connected basic nodes, each of which is a Hopfield node. The binary signatures of the database images are used as exemplar patterns, and the binary representation of the query image is used as the test pattern. According to the dynamics of the Hopfield model, the DNN with reuse retrieves the memorized pattern that is closest to the test pattern.
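Since the exact DNN-with-reuse dynamics are not spelled out in this section, the sketch below shows only a generic Hopfield-style associative recall of stored binary signatures; Hebbian outer-product storage, asynchronous updates and the signature length are assumptions for illustration.

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian outer-product storage of bipolar (+1/-1) exemplar patterns."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)
    return W / len(patterns)

def recall(W, probe, sweeps=10):
    """Asynchronous updates until the state settles near a stored pattern."""
    s = probe.copy()
    for _ in range(sweeps):
        for i in np.random.permutation(len(s)):
            s[i] = 1 if W[i] @ s >= 0 else -1
    return s

exemplars = np.sign(np.random.randn(10, 200))   # 10 stored 200-bit signatures (+1/-1)
W = train_hopfield(exemplars)
query = exemplars[3].copy()
query[:20] *= -1                                # corrupt part of the query signature
retrieved = recall(W, query)
print(int((retrieved == exemplars[3]).mean() * 100), "% of bits match the stored signature")
```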
6 Results and Interpretation
In this experiment, we use 1000 images from the COREL database. These images are pre-categorized into 10 classes: African people, beach, buildings, buses, dinosaurs, elephants, flowers, horses, mountains and glaciers, and food (see the table below). The first 80 images of each class are used as typical images to train the designed DNN. The feature vector has 200 bits. The next 15 images of each class are used as test images. Fig. 3 shows the retrieved results of the DNN. In order to verify the performance of the proposed system, we also compare the proposed DNN system with other image retrieval systems. Table 1 compares the average precision with the SIMPLicity system and the BPBIR system, and shows that the DNN system retrieves images more efficiently than the other two systems. The reason is that the trained DNN can memorize some prior information about each class.

Fig. 2: Image Categorization
Fig. 3: Retrieved Results of DNN: (a) Flowers, (b) Elephants
Table 1: Comparison of Average Precision of BPBIR, SIMPLicity and the Proposed DNN System
Classes                       BPBIR   SIMPLicity   DNN
African people and villages   32%     48%          56.99%
Beaches                       40%     32%          54.69%
Buildings                     34%     33%          53.46%
Buses                         43%     37%          81.92%
Dinosaurs                     54%     98%          98.46%
Elephants                     49%     40%          52.52%
Flowers                       36%     40%          76.35%
Horses                        54%     71%          81.21%
Mountains                     34%     32%          48.65%
Food                          50%     31%          66.45%
7 Conclusion
In this paper we presented a novel image retrieval system based on the DNN. It builds on the observation that the images a user needs are often similar to a set of images sharing the same concept rather than to a single query image, and on the assumption that there is a nonlinear relationship between different features. Finally, we compared the performance of the proposed system with other image retrieval systems in Table 1. Experimental results show that it is more effective and efficient.
References
[1] [Pender 91] D.A. Pender, "Neural Networks and Handwritten Signature Verification," Ph.D. Thesis, Department of Electrical Engineering, Stanford University, 1991.
[2] [Smith 96] Smith, K., Palaniswami, M. and Krishnamurthy, M., "Hybrid neural approach to combinatorial optimization," Computers & Operations Research, 23, 6, 597-610, 1996.
[3] [Mighell 89] D. A. Mighell, T. S. Wilkinson and J. W. Goodman, "Backpropagation and its application to handwritten signature verification," Advances in Neural Information Processing Systems 1, D. S. Touretzky (ed.), Morgan Kaufmann, pp. 340-347, 1989.
[4] [Pessoa 95] F.C. Pessoa, "Multilayer Perceptrons versus Hidden Markov Models: Comparisons and applications to image analysis and visual pattern recognition," Pre-PhD qualifying report, Georgia Institute of Technology, School of Electrical and Computer Engineering, Aug 10, 1995.
[5] [Hilberg 97] Hilberg, W., "Neural networks in higher levels of abstraction," Biological Cybernetics, 76, 2340, 1997.
[6] [Kang 93] Kang, H., "Multilayer Associative Neural Networks (MANN): Storage capacity vs. noise-free recall," International Joint Conference on Neural Networks (IJCNN 93), 901-907, 1993.
[7] [Rao & Pujari 99] Rao, M.S. and Pujari, A.K., "A new neural network architecture with associative memory, pruning and order-sensitive learning," International Journal of Neural Systems, 9, 4, 351-370, 1999.
[8] [Sukhaswami 93] Sukhaswami, M.B., "Investigations on some applications of artificial neural networks," Ph.D. Thesis, Dept. of CIS, University of Hyderabad, Hyderabad, India, 1993.
[9] [Hopfield 82] Hopfield, J.J., "Neural networks and physical systems with emergent collective computational abilities," Proceedings of the National Academy of Sciences, USA, 79, 2554-2558, 1982.


Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Development of New Artificial Neural Network
Algorithm for Prediction of Thunderstorm Activity

K. Krishna Reddy K. S. Ravi
Y.V.University, Kadapa K.L.College of Engg., Vijayawada
krishna.kkreddy@gmail.com sreenivasaravik@yahoo.co.in
V. Venu Gopalal Reddy Y. Md. Riyazuddiny
JNTU College of Engg., Pulivendula V.I.T.University, Vellore, TamilNadu
vgreddy7@rediffmail.com riyazymd@yahoo.co.in

Abstract

Thunderstorms can cause great damage to human life and property. Hence, prediction of thunderstorms and the associated rainfall is essential for agriculture, household purposes, industry and the construction of buildings. Any weather prediction is extremely complicated, because the associated mathematical models are complicated, involving many simultaneous non-linear hydrodynamic equations. On many occasions such models do not give accurate predictions. Artificial neural networks (ANNs) are known to be good at problems where there are no clear-cut mathematical models, and so ANNs have been tried out to make predictions in our application. ANNs are now being used in many branches of research, including the atmospheric sciences. The main contribution of this paper is the development of an ANN to identify the presence of thunderstorms (and their location) based on Automatic Weather Station data collected at the Semi-arid-zonal Atmospheric Research Centre (SARC) of Yogi Vemana University.
1 Introduction
The thunderstorm is a highly destructive force of nature, and timely tracking of the thundercloud direction is of paramount importance to reduce property damage and human casualties. Annually, it is estimated that thunderstorm-related phenomena cause crores of rupees of damage worldwide through forest fires, shutdowns of electrical plants and industries, property damage, etc. [Singye et al., 2006]. Although there are thunderstorm tracking mechanisms already in place, such systems often deploy complicated radar systems, the cost of which can only be afforded by bigger institutions. Artificial neural networks have been studied since the late nineteen fifties (Rosenblatt, 1958), but their use for forecasting meteorological events appeared only in the last 15 years [Lee et al., 1990 and Marzban, 2002]. The great advantage of using ANNs is their intrinsic non-linearity, which helps in describing complex meteorological events better than linear methods.
2 Artificial Neural Network
An Artificial Neural Network (ANN) is a computational model that is loosely based on the
manner in which the human brain processes information. Specifically, it is a network of highly interconnected processing elements (neurons) operating in parallel (Figure 1). An ANN can
be used to solve problems involving complex relationships between variables. The particular
type of ANN used in this study is a supervised one, wherein an output vector (target) is specified, and the ANN is trained to minimize the error between the network output and the target vector, thus resulting in an optimal solution.

Fig. 1: A 2-layer ANN with Multiple Inputs and Single Hidden and Output Neurons
Today, most ANN research and applications are carried out by simulating ANNs on high performance computers. ANNs with fewer than 150 elements have been successfully used in vehicular control simulation, speech recognition and undersea mine detection. Small ANNs have also been used in airport explosive detection, expert systems, remote sensing, biomedical signal processing, etc. Figure 2 demonstrates a single-layer perceptron that classifies an analog input vector into two classes denoted A and B. This net divides the space spanned by the input into two regions separated by a hyperplane, or a line in two dimensions, as shown at the top right.

Fig. 2: A Single-Layer Perceptron
Figure 3 depicts a three-layer perceptron with N continuous-valued inputs, M outputs and two layers of hidden units. The nonlinearity can be any of those shown in Fig. 3. The decision rule is to select the class corresponding to the output node with the largest output. In the formulas, x_j and x_k are the outputs of nodes in the first and second hidden layers, and theta_j and theta_k are internal offsets in those nodes. w_ij is the connection strength from the input to the first hidden layer, and w_jk and w_kl are the connection strengths between the first and second hidden layers and between the second hidden layer and the output layer, respectively.

Fig. 3: A Three-Layer Perceptron
3 Methodology
Consider Figure 4, where Y is the actual output and D the desired output.

Error: E = (1/2)(Y − D)^2

Gradient-descent weight update: W_i = W_i − η ∂E/∂W_i, where η is the learning rate and W_i is the adjusted weight.

With the sigmoid activation y = 1/(1 + e^(−x)) we have ∂y/∂x = y(1 − y), and since the net input is x = Σ_{i=1..N} X_i W_i, we have ∂x/∂W_i = X_i.

Therefore
∂E/∂W_i = ∂E/∂y · ∂y/∂x · ∂x/∂W_i = (y − D) y(1 − y) X_i.

For the output-layer weights,
W_j^o = W_j^o − η ∂E/∂W_j^o
      = W_j^o − η ∂E/∂y · ∂y/∂W_j^o
      = W_j^o − η (y − D) · ∂y/∂(net^o) · ∂(net^o)/∂W_j^o
      = W_j^o − η (y − D) y(1 − y) I_j,
where net^o is the net input to the output node and I_j is the output of hidden node j.

Fig. 4: ANN without Hidden Layer

Fig. 5: ANN with Hidden Layer
Consider Figure 5. For the sake of simplicity, we have taken a 3/2/1 ANN. Our aim is to determine the set of optimum weights between i) the input and hidden layers and ii) the hidden and output layers.
For the hidden nodes, we have
W_ij^h = W_ij^h − η ∂E/∂W_ij^h,
with
∂E/∂W_ij^h = ∂E/∂y · ∂y/∂W_ij^h
           = (y − D) · ∂y/∂(net^o) · ∂(net^o)/∂W_ij^h
           = (y − D) · ∂y/∂(net^o) · ∂(net^o)/∂I_j · ∂I_j/∂W_ij^h
           = (y − D) · ∂y/∂(net^o) · W_j^o · ∂I_j/∂W_ij^h
           = (y − D) y(1 − y) W_j^o I_j(1 − I_j) X_i.
Together with the output-layer rule
W_j^o = W_j^o − η (y − D) y(1 − y) I_j,
we hence obtain the hidden-layer update
W_ij^h = W_ij^h − η (y − D) y(1 − y) W_j^o I_j(1 − I_j) X_i.
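A compact Python/NumPy sketch of the update rules just derived, for the 3/2/1 network of Figure 5 (the original software was written in FORTRAN and C; the learning rate value, weight initialization and toy training data below are assumptions, not the SARC data).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
Wh = rng.normal(scale=0.5, size=(3, 2))   # input -> hidden weights W_ij^h (3/2/1 network)
Wo = rng.normal(scale=0.5, size=2)        # hidden -> output weights W_j^o
eta = 0.5                                 # learning rate (assumed value)

def train_step(X, D):
    global Wh, Wo
    for x, d in zip(X, D):
        I = sigmoid(x @ Wh)               # hidden outputs I_j
        y = sigmoid(I @ Wo)               # network output y
        delta_o = (y - d) * y * (1 - y)               # (y - D) y(1 - y)
        delta_h = delta_o * Wo * I * (1 - I)          # (y - D) y(1 - y) W_j^o I_j(1 - I_j)
        Wo -= eta * delta_o * I                       # output-layer rule
        Wh -= eta * np.outer(x, delta_h)              # hidden-layer rule (times X_i)

# Toy inputs standing in for the normalized weather parameters (not the SARC data).
X = rng.random((8, 3))
D = (X.sum(axis=1) > 1.5).astype(float)   # 1 = thunderstorm, 0 = no thunderstorm
for _ in range(2000):
    train_step(X, D)
print(np.round(sigmoid(sigmoid(X @ Wh) @ Wo), 2))
```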
4 Software Development
Extensive programming was done to carry out the above ANN calculations. FORTRAN programmes were used for training the data. In addition, a Linux version depicting the training and testing has also been prepared. The programmes are developed with source code written in C, and the executables run under a GNU environment. The programme gives a graphical display of the process of training and testing the ANN, with the results depicted on screen in graphical mode.
5 Meteorological Data
The various input parameters, namely temperature, pressure, relative humidity, wind speed and wind direction, are described below.
5.1 Temperature
Temperature is the manifestation of heat energy. In meteorology, temperature is measured in free air, in shade, at a standard height of 1.2 m above the ground. Measurement is made at standard hours using thermometers. The unit of measurement is the degree Celsius (°C). Ambient temperature is referred to as dry bulb temperature. This is different from wet bulb temperature, which is obtained by keeping the measuring thermometer consistently wet. The wet bulb temperature, or dew point, provides a measure of the moisture content of the atmosphere. We have used dry bulb temperature in our study.
5.2 Pressure
Atmospheric pressure is defined as the force exerted by a vertical column of air of unit cross section at a given level. It is measured by a barometer. The unit of pressure is the millibar (mb). Atmospheric pressure varies with the time of day and latitude, as well as with altitude and weather conditions. Pressure decreases with height, due to the fact that the concentration of constituent gases and the depth of the vertical column decrease as we ascend.
5.3 Wind (Speed and Direction)
The atmosphere reaches equilibrium through winds. Wind is air in horizontal motion. Wind is denoted by the direction from which it blows and is specified by the points of a compass or by degrees from True North (0° to 360°). Wind direction is shown by a wind vane and wind speed is measured by an anemometer. In our study, for computational purposes, the values in Table 1 are assigned to the various directions.
Direction                     Value Assigned (Degrees)
North (N)                     360
North North East (NNE)        22.5
North East (NE)               45
East North East (ENE)         67.5
East (E)                      90
East South East (ESE)         112.5
South East (SE)               135
South South East (SSE)        157.5
South (S)                     180
South South West (SSW)        202.5
South West (SW)               225
West South West (WSW)         247.5
West (W)                      270
North West (NW)               315
5.4 Relative Humidity
The measure of the moisture content of the atmosphere is humidity. Air can hold only a certain amount of water at a given time; when the maximum limit is reached, the air is said to be saturated. The ratio of the amount of water vapor present in the atmosphere to the maximum it can hold at that temperature and pressure, expressed as a percentage, is the relative humidity.
5.5 Data Collection
The great advantage of using ANNs is their intrinsic non-linearity, which helps in describing complex (thunderstorm) meteorological events better than linear methods. This can, however, also turn out to be a drawback, since this intrinsic power allows the ANN to fit the database used to train the model too easily. Unfortunately, there is no guarantee that the good performance obtained by the ANN on the training data will also be confirmed on new data (generalization ability). To avoid this overfitting problem it is crucial to validate the ANN, i.e., to divide the original database into training and validation subsets and choose the ANN which has the best performance on the validation dataset (similar to what was done by Navone and Ceccatto, 1994).
The data related to thunderstorm occurrence were collected from the SARC, Department of Physics, Yogi Vemana University, Kadapa. We collected a total of 100 data sets, of which 45 were used for training and 51 for testing. Each set consists of 5 input parameters and a corresponding output parameter. This output parameter specifies whether a thunderstorm occurred or not.
6 Results
The back propagation approaches described in sections 2 and 3 were tried out to predict thunderstorm occurrences. Initially a straightforward ANN with one hidden layer was tried out. The inputs were normalized and the initial weights were optimized considering the numerical limits of the compiler for data training. The sigmoid function was used as the activation function. There were 5 input nodes, 3 hidden nodes and one output node; this configuration is called a 5/3/1 ANN. The target or desired output was kept at 0 and 1, corresponding to No Thunderstorm and Thunderstorm conditions. If one takes a close look at the sigmoid function, it can be observed that it saturates beyond -5 on the negative X axis and +5 on the positive X axis. Since the sigmoid function reaches 0 and 1 only at minus infinity and plus infinity, it was decided that for practical purposes the target output could correspond to 0.0067 (x = -5) for 0 and to 0.9933 (x = +5) for 1.
In order to attain convergence, error levels were fixed with respect to these values. Initially the error value was fixed at 0.0025. Training was done by taking alternate data sets for thunderstorm and no-thunderstorm conditions so that a better approximation is made by the ANN. But it was observed that even after many iterations, the ANN exhibited oscillatory behavior and finally stagnated at a constant value. Introducing and altering the momentum factor or varying the threshold value and learning rate also did not improve the convergence. The ANN was able to reach error levels of only about 0.25.
In order to further improve the ANN, a slightly different configuration, i.e. 5/4/1, was tried out, with 4 hidden units instead of 3. However, there was a marginal decline in performance.
Table 2: Number of sweeps (iterations) made, error levels reached and efficiency of the ANN for the different configurations attempted
Sl. No.  ANN Type                               5/1        5/3/1      5/4/1
1        Convergence reached                    0.000045   0.000750   0.001
2        Iterations taken for this convergence  43         241        1606
3        Efficiency over learning data (%)      73.469     93.012     92.763
4        Efficiency over testing data (%)       73.009     89.1812    89.14

It can be seen that by using hidden layers, lower error levels and better accuracy could be achieved. However, the convergence rate was slow. Another significant observation was made during the course of the work: optimizing the initial weights of the ANN led to faster convergence and better efficiency in all the above cases, and the results obtained were much better than those obtained with random initial weights. It has been possible to reach a prediction efficiency of up to 90% with good computing facilities but with limited datasets and time constraints. There is a lot of scope for improvement by using more physical parameters and more data sets. With prolonged analysis it should be possible to achieve efficiency exceeding 95%, with the YES and NO cases predicted quite close to their desired values.
7 Conclusion
An artificial neural network without a hidden layer shows limited capability in the prediction of thunderstorms, whereas an ANN with hidden layers gives good prediction results. However, an ANN trained by an algorithm which does not average successive errors does not reach lower error levels. Weight initialization is an important factor: it has been found that proper weight initialization, instead of random initialization, results in better efficiency and faster convergence. Fairly accurate prediction of thunderstorms has been possible in spite of the limited availability of physical input parameters and data sets. Prolonged analysis with more physical input parameters and a larger volume of data sets should yield a prediction efficiency greater than 95%, with the actual ANN outputs conforming closely to the desired outputs.
References
[1] [Lee et al., 1990] A neural network approach to cloud classification, IEEE Transactions on Geoscience and Remote Sensing, 28, pages 846-855.
[2] [Marzban and Stumpf, 1996] A neural network for tornado prediction based on Doppler radar-derived attributes, J. Appl. Meteor., 35, pages 617-626.
[3] [Navone and Ceccatto, 1994] Predicting Indian Monsoon Rainfall: A Neural Network Approach, Climate Dyn., 10, pages 305-312.
[4] [Rosenblatt, 1958] The Perceptron: A probabilistic model for information storage and organization in the brain, Psychological Review, 65, pages 386-408.
[5] [Singye et al., 2006] Thunderstorm tracking system using neural networks and measured electric fields from a few field mills, Journal of Electrical Engineering, 57, pages 87-92.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Visual Similarity Based Image Retrieval
for Gene Expression Studies

Ch. Ratna Jyothi Y. Ramadevi
Chaitanya Bharathi Institute Chaitanya Bharathi Institute
of Technology, Hyderabad of Technology, Hyderabad
chrj_269@yahoo.co.in

Abstract

Content Based Image Retrieval (CBIR) is becoming very popular because of
the high demand for searching image databases of ever-growing size. Since
speed and precision are important, we need to develop a system for retrieving
images that is both efficient and effective.
Content-based image retrieval has been shown to be more and more useful for several application domains, from audiovisual media to security. As content-based retrieval has matured, different scientific applications have been revealed as clients for such methods.
More recently, botanical applications have generated very large image collections that have become very demanding in terms of content-based visual similarity computation [2]. Our implementation describes low-level feature extraction for visual appearance comparison between genetically modified plants for gene expression studies.
1 Introduction
Image database management and retrieval has been an active research area since the 1970s.
With the rapid increase in computer speed and decrease in memory cost, image databases
containing thousands or even millions of images are used in many application areas such as
medicine, satellite imaging, and biometric databases, where it is important to maintain a high
degree of precision. With the growth in the number of images, manual annotation becomes
infeasible both time and cost-wise.
Content-based image retrieval (CBIR) is a powerful tool since it searches the image database
by utilizing visual cues alone. CBIR systems extract features from the raw images themselves
and calculate an association measure (similarity or dissimilarity) between a query image and
database images based on these features. CBIR is becoming very popular because of the high
demand for searching image databases of ever-growing size. Since speed and precision are
important, we need to develop a system for retrieving images that is both efficient and
effective.
Recent approaches to representing images require the image [3] to be segmented into a number of
regions (a group of connected pixels which share some common properties). This is done
with the aim of extracting the objects in the image. However, there is no unsupervised
segmentation algorithm that is always capable of partitioning an image into its constituent
objects, especially when considering a database containing a collection of heterogeneous
images. Therefore, an inaccurate segmentation may result in an inaccurate representation and
hence in poor retrieval performance.
We introduce a contour-based CBIR technique [8]. It uses a new approach to describe the shape of a region, inspired by an idea related to the color descriptor. This new shape descriptor, called the Directional Fragment Histogram (DFH), is computed using the outline of the region. One way of improving its efficiency would be to reduce the number of image comparisons done at query time. This can be achieved by using a metric access structure or a filtering technique.
2 A Typical CBIR System
Content Based Image Retrieval is defined as the retrieval of relevant images from an image database based on automatically derived imagery features. Content-based image retrieval [7] uses the visual contents of an image, such as color, shape, texture and spatial layout, to represent and index the image. In typical content-based image retrieval systems (Figure 1.1), the visual contents of the images in the database are extracted and described by multi-dimensional feature vectors.

Fig. 1.1: Diagram for Content-based Image Retrieval System
The feature vectors of the images in the database form a feature database. To retrieve images, users provide the retrieval system with example images or sketched figures. The system then changes these examples into its internal representation of feature vectors. The similarities/distances between the feature vectors of the query example or sketch and those of the images in the database are then calculated, and retrieval is performed with the aid of an indexing scheme. The indexing scheme provides an efficient way to search the image database. Recent retrieval systems have incorporated users' relevance feedback to modify the retrieval process in order to generate perceptually and semantically more meaningful retrieval results. In this paper, we introduce these fundamental techniques for content-based image retrieval.
3 Fundamental Techniques for CBIR
3.1 Image Content Descriptors
Generally speaking, image content [6] may include both visual and semantic content. Visual content can be very general or domain specific. General visual content includes color, texture, shape, spatial relationships, etc. Domain-specific visual content, like human faces, is application dependent and may involve domain knowledge. Semantic content is obtained either by textual annotation or by complex inference procedures based on visual content.
3.1.1 Color
The most commonly used color descriptors include the color histogram, color coherence vector, color correlogram and so on. The color histogram serves as an effective representation of the color content of an image if the color pattern is unique compared with the rest of the data set. The color histogram [4] is easy to compute and effective in characterizing both the global and local distribution of colors in an image. In addition, it is robust to translation and rotation about the view axis and changes only slowly with scale, occlusion and viewing angle.
In color coherence vectors (CCV), spatial information is incorporated into the color histogram. The color correlogram was proposed to characterize not only the color distributions of pixels, but also the spatial correlation of pairs of colors.
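As a small illustration of the colour histogram descriptor discussed above, the sketch below quantizes an RGB image into a joint histogram and normalizes the counts; the 8 bins per channel and the random test image are illustrative assumptions.

```python
import numpy as np

def color_histogram(rgb_image, bins_per_channel=8):
    """Normalized joint RGB histogram of an HxWx3 uint8 image."""
    q = (rgb_image.astype(np.uint16) * bins_per_channel) // 256     # per-channel bin index
    idx = (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins_per_channel ** 3).astype(float)
    return hist / hist.sum()

img = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)   # stand-in for a database image
h = color_histogram(img)
print(h.shape, round(float(h.sum()), 3))
```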
3.1.2 Texture
Texture is another important property of images. Various texture representations have been
investigated in pattern recognition and computer vision. Basically, texture representation
methods can be classified into two categories: structural and statistical. Structural methods,
including morphological operator and adjacency graph, describe texture by identifying
structural primitives and their placement rules. They tend to be most effective when applied
to textures that are very regular. Statistical methods, including Fourier power spectra, co-
occurrence matrices, shift-invariant principal component analysis (SPCA), Tamura feature,
Wold decomposition, Markov random field, fractal model, and multi-resolution filtering
techniques such as Gabor and wavelet transform, characterize texture by the statistical
distribution of the image intensity.
3.1.3 Shape

Shape features of objects or regions have been used in many content-based image retrieval
systems. Compared with color and texture features, shape features are usually described after
images have been segmented into regions or objects. Since robust and accurate image
segmentation is difficult to achieve, the use of shape features for image retrieval has been
limited to special applications where objects or regions are readily available. The state-of-the-art
methods for shape description can be categorized into either boundary-based (rectilinear
shapes, polygonal approximation, finite element models, and Fourier-based shape
descriptors) or region-based methods (statistical moments). A good shape representation
feature for an object should be invariant to translation, rotation and scaling.
3.1.4 Spatial Information
Regions or objects with similar color and texture properties can be easily distinguished by
imposing spatial constraints. For instance, regions of blue sky and ocean may have similar
color histograms, but their spatial locations in images are different. Therefore, the spatial
location of regions (or objects) or the spatial relationship between multiple regions (or
objects) in an image is very useful for searching images. The most widely used representation
of spatial relationship is the 2D strings proposed by Chang et al.. It is constructed by
projecting images along the x and y directions. Two sets of symbols, V and A, are defined on
the projection. Each symbol in V represents an object in the image. Each symbol in A
represents a type of spatial relationship between objects.In addition to the 2D string, spatial
quad-tree, and symbolic image are also used for spatial information representation.
4 Similarity Measures and Indexing Schemes
4.1 Similarity Measures
Different similarity/distance measures will affect the retrieval performance of an image retrieval system significantly [6]. In this section, we introduce some commonly used similarity measures. We denote by D(I, J) the distance measure between the query image I and an image J in the database, and by f_i(I) the number of pixels in the i-th bin of image I. In the following sections we briefly introduce some of the commonly used distance measures.

1 Minkowski-Form Distance
If each dimension of the image feature vector is independent of the others and of equal importance, the Minkowski-form distance L_p is appropriate for calculating the distance between two images. This distance is defined as
D(I, J) = ( Σ_i |f_i(I) − f_i(J)|^p )^(1/p).
When p = 1, 2 and ∞, D(I, J) is the L_1, L_2 (also called Euclidean) and L_∞ distance, respectively. The Minkowski-form distance is the most widely used metric for image retrieval. For instance, the MARS system used the Euclidean distance to compute the similarity between texture features; Netra used the Euclidean distance for color and shape features and the L_1 distance for texture features; Blobworld used the Euclidean distance for texture and shape features. In addition, Voorhees and Poggio used the L_∞ distance to compute the similarity between texture images.

2 Quadratic Form (QF) Distance
The Minkowski distance treats all bins of the feature histogram entirely independently and does not account for the fact that certain pairs of bins correspond to features which are perceptually more similar than other pairs. To solve this problem, the quadratic form distance is introduced:
D(I, J) = sqrt( (F_I − F_J)^T A (F_I − F_J) ),
where A = [a_ij] is a similarity matrix, a_ij denotes the similarity between bins i and j, and F_I and F_J are vectors that list all the entries in f_i(I) and f_i(J). The quadratic form distance has been used in many retrieval systems for color-histogram-based image retrieval. It has been shown that the quadratic form distance can lead to perceptually more desirable results than the Euclidean distance and the histogram intersection method, as it considers the cross similarity between colors.
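A short sketch of the two distance measures just described, written in Python with the f_i(I) histogram notation used above; the example bin-similarity matrix A, built from bin-index proximity, is purely illustrative and not taken from any of the cited systems.

```python
import numpy as np

def minkowski_distance(fI, fJ, p=2):
    """L_p (Minkowski-form) distance between two feature histograms; p = 1, 2 or np.inf."""
    diff = np.abs(np.asarray(fI, float) - np.asarray(fJ, float))
    return float(diff.max()) if np.isinf(p) else float((diff ** p).sum() ** (1.0 / p))

def quadratic_form_distance(fI, fJ, A):
    """D(I,J) = sqrt((F_I - F_J)^T A (F_I - F_J)) with bin-similarity matrix A = [a_ij]."""
    d = np.asarray(fI, float) - np.asarray(fJ, float)
    return float(np.sqrt(d @ A @ d))

n = 16
fI, fJ = np.random.rand(n), np.random.rand(n)
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
A = 1.0 - np.abs(i - j) / (n - 1)          # illustrative cross-bin similarity (1 on the diagonal)
print(minkowski_distance(fI, fJ, 1), minkowski_distance(fI, fJ, np.inf))
print(quadratic_form_distance(fI, fJ, A))
```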
4.2 Indexing Scheme
After dimension reduction, the multi-dimensional data are indexed. A number of approaches have been proposed for this purpose, including the R-tree (particularly the R*-tree), linear quad-trees, the K-d-B tree and grid files. Most of these multi-dimensional indexing methods have reasonable performance for a small number of dimensions (up to 20), but explode exponentially as the dimensionality increases, eventually reducing to sequential searching. Furthermore, these indexing schemes assume that the underlying feature comparison is based on the Euclidean distance, which is not necessarily true for many image retrieval applications.
4.3 User Interaction
For content-based image retrieval, user interaction with the retrieval system is crucial since
flexible formation and modification of queries can only be obtained by involving the user in
the retrieval procedure. User interfaces in image retrieval systems typically consist of a query
formulation part and a result presentation part.
4.3.1 Query Specification
Specifying what kind of images a user wishes to retrieve from the database can be done in
many ways. Commonly used query formations are: category browsing, query by concept,
query by sketch, and query by example. Category browsing is to browse through the database
according to the category of the image. For this purpose, images in the database are classified
into different categories according to their semantic or visual content. Query by concept is to
retrieve images according to the conceptual description associated with each image in the
database. Query by sketch and query by example is to draw a sketch or provide an example
image from which images with similar visual features will be extracted from the database.
4.4 Relevance Feedback
Relevance feedback is a supervised active learning technique used to improve the
effectiveness of information systems. The main idea is to use positive and negative examples
from the user to improve system performance. For a given query, the system first retrieves a
list of ranked images according to a predefined similarity metric. Then, the user marks the retrieved images as relevant (positive examples) to the query or not relevant (negative examples). The system will refine the retrieval results based on the feedback and present a
new list of images to the user. Hence, the key issue in relevance feedback is how to
incorporate positive and negative examples to refine the query and/or to adjust the similarity
measure.
4.5 Performance Evaluation
To evaluate the performance of a retrieval system, two measurements, namely recall and precision [8, 7], are borrowed from traditional information retrieval. For a query q, the set of images in the database that are relevant to the query q is denoted R(q), and the retrieval result of the query q is denoted Q(q). The precision of the retrieval is defined as the fraction of the retrieved images that are indeed relevant to the query:
precision = |Q(q) ∩ R(q)| / |Q(q)|.
The recall is the fraction of relevant images that is returned by the query:
recall = |Q(q) ∩ R(q)| / |R(q)|.
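A small sketch of the precision and recall computations defined above, treating R(q) and Q(q) as sets of image identifiers (the example numbers are invented).

```python
def precision_recall(retrieved, relevant):
    """precision = |Q(q) ∩ R(q)| / |Q(q)|,  recall = |Q(q) ∩ R(q)| / |R(q)|."""
    Q, R = set(retrieved), set(relevant)
    hits = len(Q & R)
    precision = hits / len(Q) if Q else 0.0
    recall = hits / len(R) if R else 0.0
    return precision, recall

# 6 of the 10 returned images are truly relevant, out of 20 relevant images overall.
print(precision_recall(retrieved=range(10), relevant=range(4, 24)))   # (0.6, 0.3)
```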

4.6 Practical Applications of CBIR
There is a wide range of possible applications for CBIR technology:
1. Crime prevention
2. Military
3. Intellectual property
4. Architecture and engineering design
5. Fashion and interior design
6. Journalism and advertising
7. Medical diagnosis
8. Geographical information and remote sensing systems
9. Education and training.
10. Home entertainment
5 Our Approach
5.1 Introduction
As content-based image retrieval (CBIR) methods have matured, they have become potentially useful tools in many fields, including scientific investigation in the life sciences. As an example, to fully exploit the many large image collections now available in botany, scientists need automatic methods to assist them in the study of the visual content. To apply CBIR to these image databases, one must first develop description methods that are adapted both to the specific content and to the objectives of the botanists. In this work, we are interested in issues that are specific to the study of the function of genes in plants [2]. By selectively blocking individual genes, biologists can obtain rather diverse plant phenotypes. They first need a qualitative and quantitative characterization of each phenotype, reflecting the expression of a specific gene. Then, they must find which phenotypes are visually similar; indeed, visual resemblances between phenotypes reflect similarities in the roles of the genes whose expression was blocked when obtaining these phenotypes.
5.2 Overview of the System
For small databases such manipulations can be performed manually. But very large databases
obtained as a result of large-scale genetic experiments require robust automatic procedures
for characterizing the visual content and for identifying visual similarities. This will be our
focus in the following. We use here an image database containing all classes of plants taken
in several places in the world, at different periods of the year, under various conditions. All
these plants had undergone genetic modifications. In order to satisfy the requirements of the
application, we defined a task.
In the retrieval task, the user chooses an image as a query; this image is employed by our
system to find all the plant images that are visually similar to the query plant image.
5.3 Feature Extraction
5.3.1 Plant Mask Computation
For the retrieval task, we need to perform plant mask extraction together with its shape and color description. In this study, the plant collection contains images with a homogeneous background (synthetic) as well as a heterogeneous background (earth). To eliminate the strong influence of the background on the retrieval process, we decided to separate the plant from the background and use only the salient region corresponding to the plant to perform partial queries. In order to have a single mask per plant, even if it contains leaves with color alterations, we perform a coarse segmentation. Each pixel is represented by a local histogram of color distributions calculated around the pixel, in a quantized color space.

Fig. 1: Coarse segmentation and mask construction
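A rough sketch of the coarse segmentation idea just described: each pixel is represented by a local histogram of quantized colours computed in a small window, and the pixels are grouped into two clusters (plant versus background). The window size, the number of colour bins, the two-cluster assumption and the "greenest cluster is the plant" rule are illustrative choices, not the authors' parameters.

```python
import numpy as np
from sklearn.cluster import KMeans

def plant_mask(rgb, win=5, bins=4):
    """Cluster per-pixel local colour histograms into plant / background (coarse mask)."""
    h, w, _ = rgb.shape
    q = (rgb.astype(np.uint16) * bins) // 256                      # quantized colour index
    labels = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    padded = np.pad(labels, win // 2, mode="edge")
    feats = np.zeros((h * w, bins ** 3))
    for y in range(h):
        for x in range(w):
            window = padded[y:y + win, x:x + win].ravel()          # local colour histogram
            feats[y * w + x] = np.bincount(window, minlength=bins ** 3)
    cls = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats).reshape(h, w)
    # Call the cluster with the greener average colour the plant (an assumption).
    green = [rgb[..., 1][cls == c].mean() for c in (0, 1)]
    return (cls == int(np.argmax(green))).astype(np.uint8)

img = (np.random.rand(32, 32, 3) * 255).astype(np.uint8)           # stand-in plant image
print(plant_mask(img).sum(), "pixels labelled as plant")
```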
5.3.2 Shape Descriptor for Plant Masks
Once every image is segmented into a set of regions, we find the various connected components, neglect the smallest ones, and use a border detection algorithm to obtain the contours of the salient regions. The small neglected regions are represented as hatched areas and the contours of the salient regions as white curves.

Fig. 2: Detection of external plant contours
Directional Fragment Histograms
We introduce a new approach to describe the shape of a region, inspired by an idea related to
the color descriptor. This new shape descriptor, called Directional Fragment Histogram
(DFH), is computed using the outline of the region. We consider that each element of the
contour has a relative orientation with respect to its neighbors. We slide a segment over the
contour of the shape and we identify groups of elements having the same direction
(orientation) within the segment. Such groups are called directional fragments.
The DFH codes the frequency distribution and relative length of these groups of elements. The length of the segment defines the scale s of the DFH. Assume that the direction of an element of the contour can take N different values d_0, d_1, ..., d_{N-1}.
A fragment histogram at scale s is a two-dimensional array of values. Each direction corresponds to a set of bins, and the value of each bin DFH_s(i, j) is the number of positions at which the segment contains a certain percentage of contour elements with the orientation d_i. Supposing that the percentage axis (0% - 100%) of the fragment histogram is partitioned into J percentage ranges p_0, p_1, ..., p_{J-1}, the fragment histogram contains N x J bins.
The fragment histogram is computed by visiting each position in the contour, retrieving the directions of all the elements contained in the segment S starting at this position, computing the percentage of elements having each direction, and incrementing the histogram bins DFH_s(i, j) corresponding to the percentage of elements with a given orientation.
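A simplified Python sketch of the DFH computation just described, for a closed contour given as a list of per-element direction codes 0..N-1; wrapping the sliding segment around the contour and the N = 8 directions / J = 4 percentage ranges of the example in Section 5.3.3 are the assumptions used here.

```python
import numpy as np

def directional_fragment_histogram(directions, seg_len, n_dirs=8, n_ranges=4):
    """DFH_s(i, j): for every position of the sliding segment, count the fraction of its
    elements having direction d_i and increment the bin of the range that fraction falls in."""
    directions = np.asarray(directions)
    L = len(directions)
    dfh = np.zeros((n_dirs, n_ranges))
    edges = np.linspace(0.0, 1.0, n_ranges + 1)      # e.g. [0, 0.25, 0.5, 0.75, 1.0]
    for start in range(L):                           # slide over the closed contour
        seg = directions[np.arange(start, start + seg_len) % L]
        counts = np.bincount(seg, minlength=n_dirs)
        for i, c in enumerate(counts):
            if c == 0:
                continue
            frac = c / seg_len
            j = min(np.searchsorted(edges, frac, side="right") - 1, n_ranges - 1)
            dfh[i, j] += 1
    return dfh / L                                   # normalize by the number of segments

# The example of Section 5.3.3: a segment containing 20 d0, 60 d2 and 120 d7 elements.
contour = np.array([0] * 20 + [2] * 60 + [7] * 120)
print(directional_fragment_histogram(contour, seg_len=200))
```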
5.3.3 Illustration of Extraction Procedure
Suppose that we have 8 different directions and 4 different fraction ranges as in Fig 3. The
segment used is composed of 200 contour elements. Assume that, at a certain position in the
contour, the segment contains
20 elements with direction d0 (20/200 = 10% [0-25]),
60 elements with direction d2 (60/200=30% [25-50]),
120 elements with direction d7 (120/200 =60% [50-75]).
Then, the first bin in row d0, the second bin in row d2 and the third bin in row d7 will be incremented, respectively. So, in this case, the segment is counted three times, once for each direction present in the segment, and each time it represents a group of elements of a different size. The fragment histogram DFH_s(i, j) can be normalized by the number of all possible segments at the end of the procedure.

Fig. 3: Extraction of the Directional Fragment Histogram
5.3.4 Quantization of Leaf Color Alterations
Each pixel is represented by its color space components; the RGB color space was tested. This segmentation, as shown in Fig. 1, allows a quantitative study of the color alterations that are an expression of genetic modifications. For example, we can distinguish several parts of the plant that have undergone color alterations engendered by genetic modifications, compared to the whole plant. The results of this fine segmentation are used to perform quantitative measures of the relative area of the altered parts of the leaves. These measures will provide automatic textual annotation.

Fig. 4: Two Examples of Area Quantization of Color Alterations, Based on a Fine Segmentation of Plant Images
References
For the purposes of designing and developing this work, the following technical papers and websites were referred to:
[1] H. Frigui and R. Krishnapuram, Clustering by competitive agglomeration, Pattern Recognition, 30(7): 1109-1119, 1997.
[2] Jie Zou and George Nagy, Evaluation of Model-Based Interactive Flower Recognition, ICPR04, Cambridge, United Kingdom, 2004.
[3] N. Boujemaa, On competitive unsupervised clustering, International Conference on Pattern Recognition (ICPR00), Barcelona, Spain, 2000.
[4] R.J. Qian, P.L.J. van Beek and M.I. Sezan, Image retrieval using blob histograms, in IEEE Proc. Intl. Conf. on Multimedia and Expo, New York City, July 2000.
[5] Peter Belhumeur et al., An Electronic Field Guide: Plant Exploration and Discovery in the 21st Century, http://www.cfar.umd.edu/~gaaga/leaf/leaf.html
[6] M. L. Kherfi and D. Ziou, Image Retrieval From the World Wide Web: Issues, Techniques and Systems.
[7] Fuhui Long, Hongjiang Zhang and David Dagan Feng, Fundamentals of Content-Based Image Retrieval.
[8] Sia Ka Cheung, Issues on Content-Based Image Retrieval.
[9] Nick Efford, Image Processing in Java.
[10] Donn Le Vie, Jr., Writing Software Requirements Specifications, available at http://www.raycomm.com/techwhirl/softwarerequirementspecs.html
[11] R. Pressman, Software Engineering: A Practitioner's Approach, 5th edition.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Review of Analysis of Watermarking Algorithms
for Images in the Presence of Lossy Compression

N. Venkatram L.S.S. Reddy
KL College of Engineering KL College of Engineering
Vaddeswaram Vaddeswaram
venkat_ram_ecm@klce.ac.in principal@klce.ac.in

Abstract

In this paper, an analytical study related to the performance of important
digital watermarking approaches is presented. Correlation between the
embedded watermark and extracted watermark is found to identify the optimal
watermarking domain that can help to maximize the data hiding in spread
spectrum and Quantization watermarking.
1 Introduction
Information communication continues to attract researchers toward innovative methods in digitization, image processing, compression techniques and data security. The problems associated with self-healing of data, broadcast monitoring and signal tagging can be successfully overcome by digital watermarking. In all these applications, the robustness of watermarking is limited by compression, which introduces distortion. This paper deals with the performance analysis of watermark embedding strategies under perceptual coding. Perceptual coding is lossy compression of multimedia based on human perceptual models. The basis for perceptual coding is that minor modifications of the signal representation are not noticeable in the displayed content. Compression uses these modifications to reduce the number of bits required for storage, and watermarking also uses these modifications to embed and detect the watermark. A compromise between perceptual coding and watermarking needs to be found so that both processes can achieve their tasks.
2 Literature survey
Wolfgang et al. [1] investigated color image compression techniques using the discrete cosine transform (DCT) and the discrete wavelet transform (DWT), together with DCT- and DWT-based spread spectrum watermarking. Their assertion is that matching the watermarking and coding transforms improves performance, but there is no theoretical basis for this assertion. Kundur and Hatzinakos [2] argue, using both analytical and simulation results, that the use of the same transform for both compression and watermarking results in suboptimal performance for repetition-code based quantization watermarking. Ramkumar and Akansu [3], [4] conclude that transforms which have poor energy compaction and are not suitable for compression are useful at high capacities of spread spectrum data hiding. With these inconsistencies in the literature, a question arises as to what is the best embedding transform for robustness against lossy compression, and whether spread spectrum or quantization embedding is superior.
Eggers and Girod [5], [6] provided a detailed analysis of quantization effects on spread spectrum watermarking schemes in the DCT domain. Wu and Yu [7], [8] presented an idea for combining two different watermark embedding strategies for embedding information in the 8x8 block DCT coefficients of a host video: quantization watermarking [7] is used for embedding in the low frequencies and spread spectrum watermarking in the high frequencies. C. Fei et al. [8], [9] proposed a model to incorporate the quantization due to compression into spread spectrum watermarking. Chen and Wornell [10] and Eggers and Girod [11] developed some robust watermarking schemes for lossy compression.
3 Quantization effects on Watermarks
Eggers and Girod [6] have analyzed the quantization effects on additive watermarking schemes. Their analysis is based on the computation of statistical dependencies between the quantized watermarked signal and the watermark, which is derived by extending the theory of dithered quantizers. They obtained expressions for calculating the correlation terms E{eu}, E{ev} and E{e^2}, where u and v are independent, zero-mean random variables and e is the quantization error, defined as
e = q - u - v                                              (i)
Based on these expressions, Kundur et al. [12] proposed a method for finding the expected correlation between the quantized signal q and the signal u itself as follows:
E{uq} = E{u^2} + E{eu}                                     (ii)
E{q^2} = E{u^2} + E{v^2} + E{e^2} + 2E{eu} + 2E{ev}        (iii)
Based on the above, the watermark correlation and the variance of the extracted watermark for spread spectrum watermarking are found by Kundur et al. [12].
The model of Eggers and Girod [6] shows that the probability density function of the host data to be watermarked has a significant influence on the correlation values between the watermark and the quantized coefficients in spread spectrum watermarking. Their simulations also show that the generalized Gaussian model for DCT coefficients agrees closely with the experimental results. Using these results, Chuhong Fei et al. [12] calculated the theoretical correlation coefficients.
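A small Monte Carlo sketch (an illustration only, not the analysis of [6] or [12]) that checks relations (i)-(iii) numerically: a Gaussian host u and an independent zero-mean watermark v are added, the sum is uniformly quantized, and the sample moments are compared; the distribution parameters and quantizer step are assumed values.

```python
import numpy as np

rng = np.random.default_rng(1)
N, step = 200_000, 2.0                        # sample count and quantizer step (assumed)
u = rng.normal(0.0, 4.0, N)                   # host signal coefficients
v = rng.choice([-0.5, 0.5], N)                # independent zero-mean watermark

q = step * np.round((u + v) / step)           # uniform quantization of the watermarked signal
e = q - u - v                                 # quantization error, eq. (i)

E = lambda x: float(np.mean(x))
print(round(E(u * q), 3), "~", round(E(u * u) + E(e * u), 3))                    # eq. (ii)
print(round(E(q * q), 3), "~",
      round(E(u*u) + E(v*v) + E(e*e) + 2*E(e*u) + 2*E(e*v), 3))                  # eq. (iii)
```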
4 Discussion
For two different images, both fully dependent and independent watermark sequences are tested using the following techniques.
4.1 Simulation Results Using Expected Average Correlation Coefficient Measure
It is found that when Joint Photographic Experts Group (JPEG) compression occurs with a quality factor less than 92, the Hadamard transform is much superior to the others. The wavelet transform is a little better than the KLT and DCT, and the KLT and DCT are slightly better than the Slant transform. In all these techniques the performance of the pixel domain remains roughly constant, so that it exceeds that of the wavelet, KLT, DCT and Slant transforms at very low compression quality factors.
4.2 Simulation Results Using Watermark Detection Error Probability Measure
When JPEG compression occurs with a quality factor less than 92, the Hadamard transform has the smallest error probability and is much superior to the others. The wavelet, KLT, DCT and Slant transforms are close in behavior; the performance of the pixel domain remains constant at high quality factors but is superior at quality factors below 60.
In the case of quantization-based watermarking, it is again shown that at high quality factors (greater than 90) the DWT is better than the Slant and Hadamard transforms, while the DCT and KLT perform worst. The pixel domain is not bad at very high quality factors but deteriorates to become the worst at low quality factors.
Although the quantization-based algorithm can extract the original watermark perfectly when the watermarked signal is transmitted without distortion, the watermark is severely damaged under high levels of compression; thus the quantization method is not very robust to JPEG compression.
5 Conclusion
This paper reviewed various analytical techniques and their appropriateness to practical values in the case of watermarking algorithms with improved resistance to compression. The findings show that spread spectrum watermarking with a repetition code and quantization-based embedding perform well when the watermarking is applied in a domain complementary to that of compression. Spread spectrum watermarking using independent watermark elements works well when the same domain is employed. For improved robustness to JPEG compression, a hybrid watermarking scheme that takes the predicted advantages of spread spectrum and quantization-based watermarking will give superior performance.
6 Acknowledgements
This paper has benefited from the inspiration and review by Prof. P.Thrimurthy.
References
[1] [R.B. Wolfgang et al., 1998] R.B. Wolfgang, C.I. Podilchuk, and E.J. Delp, The effect of matching watermark and compression transform in compressed color images, in Proc. IEEE Int. Conf. Image Processing, Vol. 1, Oct. 1998, pp. 440-455.
[2] [D. Kundur and D. Hatzinakos, 1999] D. Kundur and D. Hatzinakos, Mismatching perceptual models for effective watermarking in the presence of compression, in Proc. SPIE, Multimedia Systems and Applications II, Vol. 3845, A.G. Tescher, Ed., Sept. 1999, pp. 29-42.
[3] [M. Ramkumar and A.N. Akansu, 1998] M. Ramkumar and A.N. Akansu, Theoretical capacity measures for data hiding in compressed images, in Proc. SPIE Voice, Video and Data Communications, Vol. 3528, Nov. 1998, pp. 482-492.
[4] [M. Ramkumar, 1999] M. Ramkumar, A.N. Akansu, and A. Alatan, On the choice of transforms for data hiding in compressed video, in IEEE ICASSP, Vol. VI, Phoenix, AZ, Mar. 1999, pp. 3049-3052.
[5] [J.J. Eggers and B. Girod, 1999] J.J. Eggers and B. Girod, Watermark detection after quantization attacks, in Proc. 3rd Workshop on Information Hiding, Dresden, Germany, 1999.
[6] [J.J. Eggers and B. Girod, 2001] J.J. Eggers and B. Girod, Quantization effects on digital watermarks, Signal Processing, Vol. 81, no. 2, pp. 239-263, Feb. 2001.
[7] [M. Wu and H. Yu, 2000] M. Wu and H. Yu, Video access control via multi-level data hiding, in IEEE Int. Conf. Multimedia and Expo (ICME 00), New York, 2000.
[8] [C. Fei et al., 2001] C. Fei, D. Kundur, and R. Kwong, The choice of watermark domain in the presence of compression, in Proc. IEEE Int. Conf. on Information Technology: Coding and Computing, Las Vegas, NV, Apr. 2001, pp. 79-84.
[9] [C. Fei et al., 2001] C. Fei, D. Kundur, and R. Kwong, Transform-based hybrid data hiding for improved robustness in the presence of perceptual coding, in Proc. SPIE Mathematics of Data/Image Coding, Compression and Encryption IV, Vol. 4475, San Diego, CA, July 2001, pp. 203-212.
[10] [B. Chen and G.W. Wornell, 2001] B. Chen and G.W. Wornell, Quantization index modulation: a class of provably good methods for digital watermarking and information embedding, IEEE Trans. Inform. Theory, Vol. 47, pp. 1423-1433, May 2001.
[11] [J. Eggers and B. Girod, 2002] J. Eggers and B. Girod, Informed Watermarking, Norwell, MA: Kluwer, 2002.
[12] [Chuhong Fei et al., 2004] Chuhong Fei, Deepa Kundur, and Raymond H. Kwong, Analysis and design of watermarking algorithms for improved resistance to compression, IEEE Transactions on Image Processing, Vol. 13, Feb. 2004, pp. 126-144.






Software Engineering
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Evaluation Metrics for Autonomic Systems

K. Thirupathi Rao B. Thirumala Rao
Koneru Lakshmaiah College of Engineering Koneru Lakshmaiah College of Engineering
ktr.klce@gmail.com thirumail123@gmail.com
L.S.S. Reddy V. Krishna Reddy
Koneru Lakshmaiah College of Engineering Lakkireddy BaliReddy College of Engineering
principal@klce.ac.in krishna4474@gmail.com
P. Saikiran
Srinidhi Institute of Science and Technology
psaikiran@gmail.com

Abstract

Most computer systems are becoming increasingly large and complex, thereby compounding many reliability problems; too often computer systems fail, become compromised, or perform poorly. One of the most interesting methods to improve system reliability is Autonomic Management, which offers a potential solution to these challenging research problems. It is inspired by nature and biological systems, such as the autonomic nervous system, which have evolved to cope with the challenges of scale, complexity, heterogeneity and unpredictability by being decentralized, context aware, adaptive and resilient. This complexity makes autonomic systems more difficult to evaluate, so to measure the performance of and to compare autonomic systems we need to derive metrics and benchmarks. This is a highly important and interesting area. This paper gives an important direction for evaluating autonomic systems. We also attempt to give the reader a feel for the nature of autonomic computing systems; to that end, a review of autonomic computing systems, their properties, general architecture and importance is presented.
1 Introduction
With modern computing consisting of new paradigms such as planetary-wide computing and pervasive and ubiquitous computing, systems are more complex than before. Interestingly, when chip design became more complex we employed computers to design chips; today we are at the point where humans have limited input to chip design. With systems becoming more complex, it is a natural progression to have the system not only automatically generate code but also build systems and carry out the day-to-day running and configuration of the live system. Autonomic computing has therefore become inevitable and will become more prevalent. Dealing with the growing complexity of computing systems requires autonomic computing, which is inspired by biological systems such as the human autonomic nervous system [1, 2] and enables the development of self-managing computing systems and applications. Such systems and applications use autonomic strategies and algorithms to handle complexity and uncertainty with minimum human intervention. An autonomic application or system is a collection of autonomic elements, which implement intelligent control loops to monitor, analyze, plan and execute using knowledge of the environment. A fundamental principle of autonomic computing is to increase the intelligence of individual computer components so that they become self-managing, i.e., actively monitoring their state and taking corrective actions in accordance with overall system-management objectives. The autonomic nervous system of the human body controls bodily functions such as heart rate, breathing and blood pressure without any conscious attention on our part. The parallel notion, when applied to autonomic computing, is to have systems that manage themselves without active human intervention. The ultimate goal is to create autonomic, self-managing computer systems that are more powerful; users and administrators will get more benefit from computers because they can concentrate on their work with little conscious intervention. The paper is organized as follows: Section 2 deals with the characteristics of autonomic computing systems, Section 3 with the architecture for autonomic computing, and Section 4 with the evaluation metrics; we conclude in Section 5, followed by the references.
2 Characteristics of Autonomic Computing System
The new era of computing is driven by the convergence of biological and digital computing
systems. To build tomorrow's autonomic computing systems we must understand how
autonomic systems work and exploit their characteristics. Autonomic systems and applications
exhibit the following characteristics, some of which are discussed in [3, 4].
Self Awareness: An autonomic system or application knows itself and is aware of its state
and its behaviors.
Self Configuring: An autonomic system or application should be able to configure and
reconfigure itself under varying and unpredictable conditions without any detailed human
intervention in the form of configuration files or installation dialogs.
Self Optimizing: An autonomic system or application should be able to detect suboptimal
behaviors and optimize itself to improve its execution.
Self-Healing: An autonomic system or application should be able to detect and recover from
potential problems and continue to function smoothly.
Self Protecting: An autonomic system or application should be capable of detecting and
protecting its resources from both internal and external attack and maintaining overall system
security and integrity.
Context Aware: An autonomic system or application should be aware of its execution
environment and be able to react to changes in the environment.
Open: An autonomic system or application must function in a heterogeneous world and
should be portable across multiple hardware and software architectures. Consequently it must
be built on standard and open protocols and interfaces.
Anticipatory: An autonomic system or application should be able to anticipate, to the extent
possible, its needs and behaviors and those of its context, and be able to manage itself
proactively.
Dynamic: Systems are becoming more and more dynamic in a number of aspects, such as
dynamics from the environment, structural dynamics, huge interaction dynamics and, from a
software engineering perspective, rapidly changing requirements for the system. Machine
failures and upgrades force the system to adapt to these changes. In such a situation, the
system needs to be very flexible and dynamic.
Distribution: Systems become more and more distributed. This includes physical distribution,
due to the invasion of networks in every system, and logical distribution, because there is
more and more interaction between applications on a single system and between entities
inside a single application.
Situatedness: Systems become more and more situated: there is an explicit notion of the
environment in which the system and entities of the system exist and execute, environmental
characteristics affect their execution, and they often explicitly interact with that environment.
Such an (execution) environment becomes a primary abstraction that can have its own
dynamics, independent of the intrinsic dynamics of the system and its entities. As a
consequence, we must be able to cope with uncertainty and unpredictability when building
systems that interact with their environment. This situatedness often implies that only local
information is available for the entities in the system or the system itself as part of a group of
systems.
Locality in control: When computing systems and components live and interact in an open
world, the concept of a global flow of control becomes meaningless, so independent
computing systems have their own autonomous flows of control, and their mutual
interactions do not imply any join of these flows. This trend is made stronger by the fact that
not only do independent systems have their own flow of control, but also different entities in
a system have their own flow of control.
Locality in interaction: Physical laws automatically enforce locality of interaction in a
physical environment. In a logical environment, if we want to minimize the conceptual and
management complexity, we must also favor modeling the system in local terms and limiting
the effect of a single entity on the environment. Locality in interaction is a strong requirement
when the number of entities in a system increases, or as the dimension of the distribution
scale increases. Otherwise, tracking and controlling concurrent and autonomously initiated
interactions is much more difficult than in object-oriented and component-based applications,
because autonomously initiated interactions imply that we cannot know what kind of
interaction is performed and have no clue about when a (specific) interaction is initiated.
Need for global autonomy: The characteristics described so far make it difficult to
understand and control the global behavior of the system or a group of systems. Still, there is
a need for a coherent global behavior. Some functional and non-functional requirements that
computer systems have to satisfy are so complex that a single entity cannot provide them. We
need systems consisting of multiple, relatively simple entities whose global behavior provides
the functionality for the complex task.
3 Architecture for Autonomic Computing
Autonomic systems are composed of autonomic elements and are capable of carrying out
administrative functions, managing their behaviors and their relationships with other systems
and applications with reduced human intervention, in accordance with high-level policies.
Autonomic computing systems can make decisions and manage themselves in three scopes.
These scopes are discussed in detail in [6].
Resource Element Scope: In resource element scope, individual components such as servers
and databases manage themselves.
Group of Resource Elements Scope: In the group-of-resource-elements scope, pools of
grouped resources that work together perform self-management. For example, a pool of
servers can adjust workload to achieve high performance.
Business Scope: In the business scope, the overall business context can be self-managing. It is
clear that increasing the maturity level of autonomic computing will affect the level at which
decisions are made.
3.1 Autonomic Element
Autonomic Elements (AEs) are the basic building blocks of autonomic systems and their
interactions produce self managing behavior. Each AE has two parts: Managed Element
(ME) and Autonomic Manager (AM), as shown in Figure 1. Sensors retrieve information about
the current state of the environment of the ME, which is then compared with expectations
held in the knowledge base of the AE. The required action is executed by effectors. Therefore,
sensors and effectors are linked together and create a control loop.

Fig. 1: An autonomic element (Monitor, Analyze, Plan and Execute around a shared Knowledge base, with Sensors and Effectors connecting the Autonomic Manager to the Managed Element). The components of Figure 1 are described as follows:
Managed Element: It is a component of the system. It can be hardware, application software,
or an entire system.
Autonomic Manager: It executes according to the administrator's policies and implements
self-management. An AM uses a manageability interface to monitor and control the ME. It
has four parts: monitor, analyze, plan, and execute.
Monitor: Monitoring Module provides different mechanisms to collect, aggregate, filter,
monitor and manage information collected by its sensors from the environment of a ME.
Analyze: The Analyze Module performs the diagnosis of the monitoring results and detects
any disruptions in the network or system resources. This information is then transformed into
events. It helps the AM to predict future states.
Plan: The Planning Module defines the set of elementary actions to perform according to
these events. Plan uses policy information and what has been analyzed to achieve its goals. Policies can
be a set of administrator ideas that are stored as knowledge to guide the AM. Plan assigns tasks
and resources based on the policies, and can add, modify, and delete policies. AMs can
change resource allocation to optimize performance according to the policies.
Execute: It controls the execution of a plan and dispatches recommended actions into ME.
These four parts provide control loop functionality.
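To make this control loop concrete, here is a minimal Python sketch of an autonomic element. The class names, the cpu_load sensor reading and the simple threshold policy are illustrative assumptions, not part of any reference architecture; they are chosen only to show how monitor, analyze, plan and execute cooperate around a shared knowledge base.

```python
# Minimal sketch of an Autonomic Element (AE): an Autonomic Manager (AM)
# wrapping a Managed Element (ME) with a monitor-analyze-plan-execute loop.
# All names and the simple threshold policy are illustrative assumptions.

class ManagedElement:
    """The resource being managed; exposes a sensor and an effector."""
    def __init__(self):
        self.cpu_load = 0.95   # hypothetical sensed state
        self.capacity = 1      # hypothetical knob the effector can change

    def sense(self):           # sensor: expose current state
        return {"cpu_load": self.cpu_load}

    def effect(self, action):  # effector: apply a corrective action
        if action == "add_capacity":
            self.capacity += 1
            self.cpu_load /= 2.0

class AutonomicManager:
    def __init__(self, element, knowledge):
        self.element = element
        self.knowledge = knowledge          # policies set by the administrator

    def monitor(self):
        return self.element.sense()

    def analyze(self, state):
        # compare sensed state with expectations held in the knowledge base
        return state["cpu_load"] > self.knowledge["max_load"]

    def plan(self, violated):
        return "add_capacity" if violated else None

    def execute(self, action):
        if action:
            self.element.effect(action)

    def control_loop(self):
        state = self.monitor()
        action = self.plan(self.analyze(state))
        self.execute(action)

if __name__ == "__main__":
    me = ManagedElement()
    am = AutonomicManager(me, knowledge={"max_load": 0.8})
    am.control_loop()
    print(me.sense())   # cpu_load halved after the corrective action
```

In a real autonomic element each part would be far richer, but the shape of the loop stays the same.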
3.2 AC Toolkit
IBM assigns autonomic computing maturity levels to its solutions. There are five levels in
total, and they progressively work toward full automation [5].
Basic Level: At this level, each system element is managed by IT professionals. Configuring,
optimizing, healing, and protecting IT components are performed manually.
Managed Level: At this level, system management technologies can be used to collect
information from different systems. It helps administrators to collect and analyze
information. Most analysis is done by IT professionals. This is the starting point of
automation of IT tasks.
Predictive Level: At this level, individual components monitor themselves, analyze changes,
and offer advice. Therefore, dependency on people is reduced and decision making is
improved.
Adaptive Level: At this level, IT components can, individually or as a group, monitor and
analyze operations and offer advice with minimal human intervention.
Autonomic Level: At this level, system operations are managed by business policies
established by the administrator. In fact, business policy drives overall IT management,
whereas at the adaptive level there is still interaction between human and system.
4 Evaluation Metrics
Advances in computing, communication, and software technologies and the evolution of the
Internet have resulted in explosive growth of information services and their underlying
infrastructures. Operational environments become more and more complex and
unmanageable; hence, as complexity increases, the evaluation of autonomic systems becomes
increasingly important. The evaluation metrics can be classified into two categories:
component-level metrics, which measure each unit's ability to meet its goal, and global-level
metrics, which measure the overall autonomic system performance [7]. This section lists sets
of metrics and means by which we can compare such systems. Each of these metrics falls into
one of the two categories, and sometimes into both.
a Scalability: Providing rich services to the users requires the computing systems to be
scalable.
b Heterogeneity: A computing system should have the ability to run on heterogeneous
operating systems.
c Survivability: A computing system should be aware of its operating environment. The
future operating environment is unpredictable, and the computing system should be
able to survive even under extreme conditions.
d Reliability: New services are always built on top of existing components, so the
reliability of system components becomes more important.
e Adaptability: Computing systems, services and applications require the systems and
software architectures to be adaptive in all their attributes and functionalities. We
separate the act of adaptation from the monitoring and intelligence that causes the
system to adapt. Some systems are designed to continue execution whilst
reconfiguring, while others cannot. Furthermore, the location of the affected
components impacts the performance of the adaptation process: a component that is
currently local to the system, versus a component (such as a printer driver, for
example) that has to be retrieved over the Internet, will show significantly different
performance. Perhaps more future systems will have the equivalent of a pre-fetch,
where components that are likely to be of use are preloaded to speed up the
reconfiguration process.
f Quality of Service (QoS): This is a highly important metric in autonomic systems as
they are typically designed to improve some aspect of a service, such as speed,
efficiency or performance. QoS reflects the degree to which the system is reaching its
primary goal. Some systems aim to improve the user's experience with the system, for
example through self-adaptive or personalized GUI design for disabled people. This
metric is tightly coupled to the application area or service that is expected of the
system. It can be measured as a global-level or component-level goal metric.
g Cost: Due to the dynamic computing environment, the number of connected computing
devices in the network grows every year. As a result, manually managing and
controlling these complex computing systems becomes difficult, and the cost of the
human operators required is exceeding the cost of the equipment itself [8].
Autonomicity also costs, and the degree of this cost and its measurement is not
clear-cut. For many commercial systems the aim is to improve the cost of running an
infrastructure, which primarily includes people costs in terms of systems administrators
and maintenance. This means that the reduction in cost for such systems cannot be
measured immediately but over time, as the system becomes more and more
self-managing. Cost comparison is further complicated by the fact that adding
autonomicity means adding intelligence, monitors and adaptation mechanisms, and this
itself has a cost. A class of application very fitting to autonomic computing is
ubiquitous computing, which typically consists of networks of sensors working
together to create intelligent homes or to monitor the environment. This sort of
application relies on self-reliance, distributed self-configuration, intelligence and
monitoring. However, many of the nodes in such a system are limited in resources and
may be wireless, which means that the cost of autonomous computing also involves
resource consumption such as battery power.
h Abstraction: Computing Systems hide their complexity from end users, leveraging the
resources to achieve business or personal goals, without involving the user in any
implementation details.
i Granularity: The granularity of autonomicity is an important issue when comparing
autonomic systems. Fine-grained components with specific adaptation rules will be
highly flexible and may adapt to situations better; however, this may cause more
overhead in terms of the global system. That is, if we assume that each finer-grained
component requires environmental data and provides some form of feedback on its
performance, then potentially there is more monitoring data, or at least more
environmental information, flowing around the global system. Of course, this may not
be the case in systems where the intelligence is more centralized. Many current
commercial autonomic endeavors are at the thicker-grained service level. Granularity
matters where unbinding, loading and rebinding a component takes a few seconds.
These few seconds are tolerable in a thick-grained component-based architecture where
the overheads can be hidden in the system's overall operation and change is potentially
not that frequent. However, in finer-grained architectures, such as an operating system
or ubiquitous computing where change is either more frequent or the components
smaller, the hot-swap time is potentially too long.
j Robustness: Typically, many autonomic systems are designed to avoid failure at some
level. Many are designed to cope with hardware failure, such as a node failing in a
cluster system or a component that is no longer responding; some avoid failure by
retrieving a missing component. Either way, the predictability of failure is an aspect in
comparing such systems. To measure this, the nature of the failure and how predictable
that failure is need to be varied, and the system's ability to cope measured.
k Degree of Autonomy: Related to failure avoidance, we can compare how autonomous
a system is. This relates primarily to AI and agent-based autonomic systems, as their
autonomic process is usually there to provide an autonomous activity. For example, the
NASA Pathfinder must cope with unpredicted problems and learn to overcome them
without external help. Decreasing the degree of predictability in the environment and
seeing how the system copes could measure this; lower predictability could even reach
the point where the system has to cope with things it was not designed for. A degree of
proactivity could also be used to compare these features.
l Reaction Time: Related to cost and sensitivity, these are measurements concerned
with system reconfiguration and adaptation. The time to adapt is the measurement of
the time a system takes to adapt to a change in the environment, that is, the time taken
between the identification that a change is required and the point at which the change
has been effected safely and the system resumes normal operation. Reaction time can
be seen to partly envelop the adaptation time: it is the time between when an
environmental element has changed and when the system recognizes that change,
decides what reconfiguration is necessary to react to the environmental change, and
gets itself ready to adapt. Further, the reaction time affects the sensitivity of the
autonomic system to its environment (a small sketch of how these intervals might be
instrumented is given after this list).
m Sensitivity: This is a measurement of how well the self-adaptive system fits the
environment it sits in. At one extreme, a highly sensitive system will notice a subtle
change as it happens and adapt to improve itself based on that change. In reality,
however, depending on the nature of the activity, there is usually some delay in the
feedback that some part of the environment has changed before it effects a change in
the autonomic system, and the changeover itself takes time. Therefore, if a system is
highly sensitive to its environment, it can potentially end up constantly changing its
configuration rather than getting on with the job itself.
n Stabilization: Another metric related to sensitivity is stabilization, that is, the time
taken for the system to learn its environment and stabilize its operation. This is
particularly interesting for open adaptive systems that learn how best to reconfigure
the system. For closed autonomic systems, the sensitivity would be a product of the
static rule/constraint base and the stability of the underlying environment the system
must adapt to.
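The following minimal Python sketch, referenced under the reaction time metric above, shows one way the intervals discussed there might be instrumented. The three timestamps (environment change, system ready to adapt, adaptation complete) and the sample values are assumptions made purely for illustration, not measurements prescribed here.

```python
# Illustrative sketch: deriving reaction time and adaptation time from three
# hypothetical timestamps (seconds) recorded by an instrumented system.

def reaction_time(t_env_change, t_ready_to_adapt):
    """Time from the environmental change until the system has recognized it
    and decided what reconfiguration is needed."""
    return t_ready_to_adapt - t_env_change

def adaptation_time(t_ready_to_adapt, t_adapted):
    """Time from the start of reconfiguration until the change is safely in
    effect and the system resumes normal operation."""
    return t_adapted - t_ready_to_adapt

def total_time_to_adapt(t_env_change, t_adapted):
    return t_adapted - t_env_change

if __name__ == "__main__":
    t_change, t_ready, t_done = 10.0, 12.5, 14.0   # assumed sample values
    print("reaction time  :", reaction_time(t_change, t_ready))      # 2.5
    print("adaptation time:", adaptation_time(t_ready, t_done))      # 1.5
    print("total          :", total_time_to_adapt(t_change, t_done)) # 4.0
```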
5 Conclusion
In this paper, we have presented the essence of autonomic computing and the development of
such systems. It gives the reader a feel for the nature of these types of systems, and we have
presented some typical examples to illustrate the complexities in trying to measure the
performance of such systems and to compare them.
This paper lists a set of metrics and means for measuring overall autonomic system
performance at the global level as well as at the component level.
Finally, these metrics together form a kind of benchmarking tool against which new autonomic
systems can be derived, or existing autonomic systems can be augmented by incorporating
these metrics, which measure various autonomic characteristics.
References
[1] S. Hariri and M. Parashar. Autonomic Computing: An Overview. Springer-Verlag Berlin Heidelberg,
pages 247-259, July 2005.
[2] Kephart J.O., Chess D.M. The Vision of Autonomic Computing. IEEE Computer, Volume 36, Issue 1,
January 2003, pages 41-50.
[3] Sterritt R., Bustard D. Towards an Autonomic Computing Environment. University of Ulster, Northern
Ireland.
[4] Bantz D.F. et al. Autonomic personal computing. IBM Systems Journal, Vol. 42, No. 1, 2003.
[5] Bigus J.P. et al. ABLE: A toolkit for building multiagent autonomic systems. IBM Systems Journal, Vol.
41, No. 3, 2002.
[6] IBM. An architectural blueprint for autonomic computing, April 2003.
[7] Bletsas E.N., McCann J.A. AEOLUS: An Extensible Webserver Benchmarking Tool. Submitted to the 13th
IW3C2 and ACM World Wide Web Conference (WWW04), New York City, 17-22 May 2004.
[8] McCann J.A., Crane J.S. Kendra: Internet Distribution & Delivery System - an introductory paper. Proc.
SCS EuroMedia Conference, Leicester, UK, Ed. Verbraeck A., Al-Akaidi M., Society for Computer
Simulation International, January 1998, pp. 134-140.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Feature Selection for High Dimensional Data:
Empirical Study on the Usability of Correlation
& Coefficient of Dispersion Measures

Babu Reddy M. Thrimurthy P. Chandrasekharam R.
LBR College of Engineering Acharya Nagarjuna University LBR College of Engineering
Mylavaram521 230 Nagarjuna Nagar522 510 Mylavaram521 230
m_babureddy@yahoo.com profpt@rediffmail.com rcs1948@yahoo.com

Abstract

Databases in general are modified to suit new requirements in serving users.
In that process, the dimensionality of the data also increases, and each time the
dimensionality increases, severe damage is caused to the database in terms of
redundancy. This paper addresses the usefulness of eliminating highly
correlated and redundant attributes in increasing classification performance.
An attempt has been made to demonstrate the usefulness of dimensionality
reduction by applying the LVQ (Learning Vector Quantization) method on two
benchmark datasets of lung cancer patients and diabetic patients. We adopt
feature selection methods used for machine learning tasks to facilitate reducing
dimensionality, removing inappropriate data, increasing learning accuracy, and
improving comprehensibility.
1 Introduction
Feature selection is one of the prominent preprocessing steps to machine learning. It is a
process of choosing a subset of original features so that the feature space is condensed
according to a certain evaluation criterion. Feature selection has been a significant field of
research and development since the 1970s and has proved very useful in removing irrelevant and
redundant features, increasing the learning efficiency, improving learning performance like
predictive accuracy, and enhancing comprehensibility of learned results [John & Kohavi,
1997; Liu & Dash, 1997; Blum & Langley, 1997]. In present-day applications such as genome
projects [Xing et al., 2001], image retrieval [Rui et al., 1998-99], customer relationship
management [Liu & Liu, 2000], and text categorization [Pederson & Yang, 1997], the size of
databases has become exponentially large. This immensity may cause serious problems to
many machine learning algorithms in terms of efficiency and learning performance. For
example, high dimensional data can contain high degree of redundant and irrelevant
information which may greatly influence the performance of learning algorithms. Therefore,
while dealing with high dimensional data, feature selection becomes highly necessary. Some
of the recent research efforts in feature selection have been focused on these challenges [Liu
et al., 2002; Das, 2001; Xing et al., 2001]. In the following, the basic models of feature
selection are reviewed and a supporting justification is given for choosing the filter solution
as a suitable method for high dimensional data.
Feature selection algorithms can be divided into two broader categories, namely, the filter
model and the wrapper model [Das, 2001; John & Kohavi, 1997]. The filter model relies on
general characteristics of the training data to select some features without involving any
learning algorithm. The wrapper model relies on a predetermined learning algorithm and uses
its performance to evaluate and select the features. For each new subset of features, the
wrapper model needs to learn a classifier. It tends to give superior performance as it finds
tailor-made features which are better suited to the predetermined learning algorithm, but it
also tends to be more computationally expensive [Langley, 1994]. For data with a large number
of features, the filter model is usually the method of choice due to its computational efficiency.
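As a rough illustration of the distinction between the two models, the sketch below scores a single feature with a crude data-only statistic (filter) and scores a feature subset with the leave-one-out accuracy of a 1-nearest-neighbour learner (wrapper). The scoring functions and the toy data are assumptions chosen for brevity, not the measures used in the works cited above.

```python
# Contrasting sketch of the two feature-selection models (illustrative only):
# a filter scores features from data alone, while a wrapper scores a feature
# subset by the accuracy of a learning algorithm trained on that subset.

def filter_score(X, y, f):
    """Filter model: score feature f without any learner, here by the absolute
    difference of its class-conditional means (a crude relevance proxy)."""
    vals0 = [row[f] for row, c in zip(X, y) if c == 0]
    vals1 = [row[f] for row, c in zip(X, y) if c == 1]
    return abs(sum(vals0) / len(vals0) - sum(vals1) / len(vals1))

def wrapper_score(X, y, subset):
    """Wrapper model: leave-one-out accuracy of a 1-nearest-neighbour learner
    restricted to the chosen subset of features."""
    def dist(a, b):
        return sum((a[f] - b[f]) ** 2 for f in subset)
    correct = 0
    for i in range(len(X)):
        j = min((k for k in range(len(X)) if k != i), key=lambda k: dist(X[i], X[k]))
        correct += (y[j] == y[i])
    return correct / len(X)

if __name__ == "__main__":
    X = [[0.1, 5.0], [0.2, 4.8], [0.9, 5.1], [0.8, 4.9]]
    y = [0, 0, 1, 1]
    print([filter_score(X, y, f) for f in range(2)])   # feature 0 ranks higher
    print(wrapper_score(X, y, [0]), wrapper_score(X, y, [1]))
```

The wrapper gives a subset-level judgement tailored to the learner, at the price of training and testing the learner once per candidate subset.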
Different feature selection algorithms under filter model can be further classified into two
groups, namely subset search algorithms and feature weighting algorithms. Feature weighting
algorithms allocate weights to features individually and grade them based on their relevance
to the objective; a feature is selected if its relevance weight is greater than a chosen threshold
value. Relief
[Kira & Rendell, 1992] is a well known algorithm that relies on relevance evaluation. The
key idea of Relief is to estimate the relevance of features according to their classification
capability, i.e. how well their values differentiate between the instances of the same and
different classes. Relief randomly samples a number p of instances from the training set and
updates the relevance estimation of each feature based on the difference between the selected
instance and the two nearest instances of the same and opposite classes. Time complexity of
Relief for a data set with M instances and N features is O(pMN). By assuming p as a
constant, the time complexity becomes O(MN), which makes it very scalable to high
dimensional data sets. But, Relief does not help in removing redundant features. As long as
features are relevant to the class concept, they will all be selected even though many of them
are highly correlated with each other [Kira & Rendell, 1992]. Empirical evidence from the
feature selection literature shows that both irrelevant and redundant features affect the
efficiency of learning algorithms and thus should both be eliminated [Hall, 2000; John &
Kohavi, 1997].
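A much-simplified sketch of the Relief idea (not the exact algorithm of Kira & Rendell) is given below for two-class data: p instances are sampled, the nearest hit and nearest miss are found with a range-normalized distance, and each feature's weight is updated by how well it separates them. The toy data and the distance choice are assumptions for illustration.

```python
# Simplified sketch of the Relief idea for two-class data: sample p instances,
# find the nearest instance of the same class (near hit) and of the other class
# (near miss), and update each feature's relevance weight accordingly.
import random

def relief(X, y, p=20, seed=0):
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    # per-feature value ranges, used to normalize differences to [0, 1]
    ranges = [max(row[f] for row in X) - min(row[f] for row in X) or 1.0
              for f in range(n)]
    diff = lambda f, a, b: abs(a[f] - b[f]) / ranges[f]
    dist = lambda a, b: sum(diff(f, a, b) for f in range(n))

    w = [0.0] * n
    for _ in range(p):
        i = rng.randrange(m)
        hits = [j for j in range(m) if j != i and y[j] == y[i]]
        misses = [j for j in range(m) if y[j] != y[i]]
        near_hit = min(hits, key=lambda j: dist(X[i], X[j]))
        near_miss = min(misses, key=lambda j: dist(X[i], X[j]))
        for f in range(n):
            w[f] += (diff(f, X[i], X[near_miss]) -
                     diff(f, X[i], X[near_hit])) / p
    return w

if __name__ == "__main__":
    # toy data: feature 0 is informative, feature 1 is noise
    X = [[0.1, 0.7], [0.2, 0.1], [0.9, 0.8], [0.8, 0.2], [0.15, 0.5], [0.85, 0.4]]
    y = [0, 0, 1, 1, 0, 1]
    print(relief(X, y, p=6))   # weight of feature 0 should dominate
```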
Subset search algorithms recognize subsets directed by an evaluation measure/goodness
measure [Liu & Motoda, 1998] which captures the goodness of each subset.
Other evaluation measures for removing both redundant and irrelevant features include the
correlation measure [Hall, 1999; Hall, 2000] and the consistency measure [Dash et al., 2000]. In
[Hall 2000], a correlation measure is applied to evaluate the goodness of feature subsets
based on the hypothesis that a good feature subset is one that contains features highly
correlated with the class, yet uncorrelated with each other. Consistency measure attempts to
find an optimum number of features that can separate classes as consistently as the complete
feature set can. In [Dash et al, 2000], different search strategies, like heuristic, exhaustive and
random search, are combined with this evaluation measure to form hybrid algorithms. The
time complexity is exponential in terms of data dimensionality for exhaustive search and
quadratic for heuristic search. The complexity can be linear to the number of iterations in a
random search, but experiments show that in order to find an optimum feature subset, the
number of iterations required is mostly at least quadratic to the number of features [Dash et
al., 2000]. Section 2 discusses the required mathematical preliminaries. Section 3 describes
the procedure that has been adopted. Section 4 presents the simulation results of an empirical
study. Section 5 concludes this work with key findings and future directions.
2 Preliminaries
We adopt the following from the literature:
2.1 Correlation-Based Measures
In general, a feature is good if it is highly correlated with the class but not with any of the
other features. To measure the correlation between two random variables, broadly two
approaches can be followed. One is based on classical linear correlation and the other is
based on information theory. Under the first approach, the most familiar measure is linear
correlation coefficient. For a pair of variables (X, Y), the linear correlation coefficient r is
given by the formula
r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}   (1)

where \bar{x} is the mean of X and \bar{y} is the mean of Y. The value of the correlation coefficient r
lies between -1 and 1, inclusive. If X and Y are completely correlated, r takes the value of 1
or -1; if X and Y are totally independent, r is zero. Correlation measure is a symmetrical
measure for two variables. There are several benefits of choosing linear correlation as a
feature goodness measure for classification. First, it helps to identify and remove features
with near zero linear correlation to the class. Second, it helps to reduce redundancy among
selected features. It is known that if data is linearly separable in the original representation, it
is still linearly separable if all but one of a group of linearly dependent features are removed
[Das, 1971]. But, it is not safe to always assume linear correlation between features in the
real world. Linear correlation measures may not be able to capture correlations that are not
linear in nature. Another limitation is that the calculation requires all features to contain
numerical values.
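As a quick illustration of equation (1), the following sketch computes the linear correlation coefficient for two numeric vectors; the sample values are invented for the example.

```python
# Minimal sketch: linear (Pearson) correlation coefficient of equation (1).
import math

def linear_correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x) *
                    sum((yi - my) ** 2 for yi in y))
    return num / den if den else 0.0

if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    y = [2.1, 3.9, 6.2, 8.1]          # roughly 2*x, so r is close to 1
    print(round(linear_correlation(x, y), 4))
```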
To overcome these problems, a correlation measure can be chosen based on entropy, a
measure of the uncertainty of a random variable. The entropy of a variable X is defined as

E(X) = -\sum_i P(x_i) \log_2 P(x_i)   (2)

and the entropy of X after observing values of another variable Y is defined as

E(X|Y) = -\sum_j P(y_j) \sum_i P(x_i | y_j) \log_2 P(x_i | y_j)   (3)

where P(x_i) is the prior probability of each value of X, and P(x_i | y_j) is the posterior
probability of X given the values of Y. The amount by which the entropy of X decreases
reflects additional information about X provided by Y and is called information gain
(Quinlan, 1993), given by

IG(X|Y) = E(X) - E(X|Y)   (4)
According to this measure, a feature Y is more correlated to feature X than to feature Z, if
IG (X|Y ) > IG(Z|Y).
Information gain is symmetrical for two random variables X and Y, and symmetry is a desired
property for a measure of correlation between features. The problem with the information
gain measure is that it is biased in favor of features with more values. The symmetrical
uncertainty measure compensates for this bias and normalizes the values to [0, 1], where 1
indicates that the value of either feature completely predicts the value of the other and 0
indicates that the two features X and Y are independent.
SU(X, Y) = 2 [IG(X|Y) / (E(X) + E(Y))]   (5)
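To make equations (2)-(5) concrete, the sketch below computes entropy, conditional entropy, information gain and symmetrical uncertainty for small discrete-valued features; the toy feature values are invented for illustration.

```python
# Sketch of equations (2)-(5): entropy, conditional entropy, information gain
# and symmetrical uncertainty for discrete-valued features.
from collections import Counter
from math import log2

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    n = len(xs)
    result = 0.0
    for y in set(ys):
        xs_given_y = [x for x, yy in zip(xs, ys) if yy == y]
        result += (len(xs_given_y) / n) * entropy(xs_given_y)
    return result

def information_gain(xs, ys):                 # IG(X|Y) = E(X) - E(X|Y)
    return entropy(xs) - conditional_entropy(xs, ys)

def symmetrical_uncertainty(xs, ys):          # SU in [0, 1]
    denom = entropy(xs) + entropy(ys)
    return 2.0 * information_gain(xs, ys) / denom if denom else 0.0

if __name__ == "__main__":
    X = ['a', 'a', 'b', 'b', 'a', 'b']
    Y = ['p', 'p', 'q', 'q', 'p', 'q']     # perfectly predicts X -> SU = 1
    Z = ['u', 'v', 'u', 'v', 'v', 'u']     # unrelated to X      -> SU near 0
    print(symmetrical_uncertainty(X, Y), symmetrical_uncertainty(X, Z))
```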
2.2 CFS: Correlation-Based Feature Selection
The key idea of CFS algorithm is a heuristic evaluation of the merit of a subset of features.
This heuristic takes into account the usefulness of individual features for predicting the class
label along with the level of inter-correlation among themselves.
If there are n possible features, then there are 2^n possible subsets. To find the optimal subset,
all 2^n possible subsets would have to be tried. This process may not be feasible.
Various heuristic search strategies, like hill climbing and best first search [Rich and Knight,
1991], are often used. CFS starts with an empty set of features and uses a best first forward
search (BFFS), terminating when it encounters several consecutive non-improving subsets.
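The following is an assumption-laden sketch of a CFS-style greedy forward search, using Hall's subset-merit heuristic with pre-computed feature-class and feature-feature correlations supplied as plain lists; it is an outline of the idea, not Hall's implementation, and the example correlation values are invented.

```python
# Sketch of a CFS-style greedy forward search: merit(S) rewards subsets whose
# features correlate strongly with the class and weakly with each other.
from math import sqrt

def merit(subset, class_corr, feat_corr):
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(class_corr[f] for f in subset) / k
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = (sum(feat_corr[a][b] for a, b in pairs) / len(pairs)) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

def cfs_forward_search(n_features, class_corr, feat_corr, patience=3):
    selected, best, stall = [], 0.0, 0
    remaining = list(range(n_features))
    while remaining and stall < patience:
        f = max(remaining, key=lambda g: merit(selected + [g], class_corr, feat_corr))
        m = merit(selected + [f], class_corr, feat_corr)
        if m > best:
            best, stall = m, 0
            selected.append(f)
            remaining.remove(f)
        else:
            stall += 1
            remaining.remove(f)     # discard the non-improving candidate
    return selected

if __name__ == "__main__":
    # assumed pre-computed correlations: features 0 and 2 are relevant,
    # feature 1 is redundant with feature 0
    class_corr = [0.8, 0.75, 0.6]
    feat_corr = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]]
    print(cfs_forward_search(3, class_corr, feat_corr))   # selects [0, 2]
```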
3 Process Description
a) In this paper, the usefulness of correlation and variance measures in identifying and
removing irrelevant and redundant attributes has been studied by applying the
Learning Vector Quantization (LVQ) method on two benchmark micro-array datasets
of lung cancer patients and Pima-Indian diabetic patients. The benchmark data sets
considered also include the class label as one of the attributes. The performance of the
LVQ method in supervised classification has been studied with the original data set
and with a reduced dataset in which a few irrelevant and redundant attributes have
been eliminated.
b) On the lung cancer data set, features whose coefficient of dispersion is very low have
been discarded from further processing and the results are compared.
Let
F = { F_11  F_21  F_31  ...  F_N1
      F_12  F_22  F_32  ...  F_N2
      F_13  F_23  F_33  ...  F_N3
      ...
      F_1M  F_2M  F_3M  ...  F_NM }
where the feature set contains N features (attributes) and M instances (records), and F_ij
denotes the value of feature i in instance j.
Coefficient of Dispersion: CD_{F_i} = \sigma_{F_i} / \bar{F}_i, where \bar{F}_i is the arithmetic
average of feature i over its M instance values and \sigma_{F_i} = \sqrt{\frac{1}{M}\sum_{j=1}^{M} (F_{ij} - \bar{F}_i)^2}
is its dispersion (standard deviation).
If CD_{F_i} is below a chosen threshold, feature F_i can be eliminated from further processing.
This requires only linear time complexity, O(M), whereas other methods like FCBF or CBF
with modified pairwise selection require quadratic time complexity, i.e. O(MN).
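A minimal sketch of the coefficient-of-dispersion filter described above follows; the threshold value and the toy feature columns are assumptions for illustration, and the dispersion is computed as the standard deviation over the M instance values.

```python
# Sketch of the coefficient-of-dispersion filter: features whose values barely
# vary across the M instances are dropped.  The threshold is an assumed
# parameter, not a value prescribed by the paper.
from math import sqrt

def coefficient_of_dispersion(values):
    m = len(values)
    mean = sum(values) / m
    sd = sqrt(sum((v - mean) ** 2 for v in values) / m)
    return sd / mean if mean else 0.0

def cd_filter(columns, threshold):
    """columns: dict mapping feature name -> list of its M values."""
    return [name for name, vals in columns.items()
            if coefficient_of_dispersion(vals) >= threshold]

if __name__ == "__main__":
    data = {"gene_1": [5.0, 5.1, 4.9, 5.0],     # nearly constant -> dropped
            "gene_2": [1.0, 9.0, 2.0, 8.0]}     # highly dispersed -> kept
    print(cd_filter(data, threshold=0.1))       # ['gene_2']
```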
4 Simulation Results
LVQ has great significance in feature selection and classification tasks. The LVQ method
has been applied on the benchmark datasets of diabetic patients [8] and lung cancer patients
[4], and an attempt has been made to identify some of the insignificant or redundant attributes
by means of class correlation (C-correlation), inter-feature correlation (F-correlation) and the
coefficient of dispersion over all the instances of a given attribute. This may help us towards
better performance in terms of classification efficiency in a supervised learning environment.
Classification efficiency has been compared between the original dataset and the
corresponding reduced dataset with fewer attributes. Better performance has been observed
after eliminating the unnecessary or insignificant attributes.
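Since LVQ is the classifier used throughout this study, the following is a minimal sketch of the basic LVQ1 update rule with one prototype per class and a fixed learning rate of 0.1; it is an illustrative outline under those assumptions, not the exact configuration that produced the results below.

```python
# Minimal LVQ1 sketch: the winning prototype is attracted to a training sample
# of its own class and repelled from a sample of a different class.
def train_lvq1(X, y, prototypes, proto_labels, lr=0.1, epochs=30):
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    for _ in range(epochs):
        for x, label in zip(X, y):
            w = min(range(len(prototypes)), key=lambda k: dist2(x, prototypes[k]))
            sign = 1.0 if proto_labels[w] == label else -1.0   # attract or repel
            prototypes[w] = [pi + sign * lr * (xi - pi)
                             for pi, xi in zip(prototypes[w], x)]
    return prototypes

def classify(x, prototypes, proto_labels):
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return proto_labels[min(range(len(prototypes)),
                            key=lambda k: dist2(x, prototypes[k]))]

if __name__ == "__main__":
    X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
    y = [0, 0, 1, 1]
    protos = train_lvq1(X, y, [[0.3, 0.3], [0.7, 0.7]], [0, 1])
    print(classify([0.15, 0.15], protos, [0, 1]))   # expected: 0
```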
a) Pima-Indian Diabetic Data Set - Correlation Coefficient as a measure for Dimensionality Reduction
Database Size: 768    Learning Rate: 0.1    No. of classes: 2 (0 & 1)
No. of attributes: 8    No. of iterations performed: 30
S. No | Training Inputs (%) | Testing Inputs (%) | Recognized items (Original Data Set) | Recognized items (Reduced Data Set, corr) | Efficiency (Original Data Set) | Efficiency (Reduced Data Set, corr) | Ex. Time in secs (Original Data Set) | Ex. Time in secs (Reduced Data Set, corr)
1 | 10 | 90 | 458 | 688 | 66.2808 | 99.5658 | 6.516 | 6.609
2 | 20 | 80 | 459 | 689 | 74.7557 | 112.215 | 7.219 | 6.328
3 | 30 | 70 | 460 | 689 | 85.5019 | 128.0669 | 7.172 | 6.812
4 | 40 | 60 | 461 | 689 | 100.00 | 149.4577 | 7.891 | 7.812
5 | 50 | 50 | 472 | 689 | 122.9167 | 179.4271 | 5.859 | 6.968
6 | 60 | 40 | 460 | 689 | 149.8371 | 224.43 | 7.672 | 7.219
7 | 70 | 30 | 470 | 689 | 204.3478 | 299.5652 | 6.14 | 5.719
8 | 80 | 20 | 470 | 689 | 305.1948 | 447.4026 | 6.687 | 7.218
9 | 90 | 10 | 473 | 689 | 614.2857 | 894.8052 | 5.25 | 5.843
b) Lung Cancer Data Set - Correlation Coefficient as a measure for Dimensionality Reduction
Database Size: 73 instances    Learning Rate: 0.1    No. of classes: 3
No. of attributes: 326 (class attribute included)    No. of iterations performed: 30
S. No | Training Inputs (%) | Testing Inputs (%) | Recognized items (Original Data Set) | Recognized items (Reduced Data Set, corr. with class label) | Efficiency (Original Data Set) | Efficiency (Reduced Data Set, corr. with class label) | Ex. Time in secs (Original Data Set) | Ex. Time in secs (Reduced Data Set, corr. with class label)
1 | 10 | 90 | 14 | 22 | 1.2121 | 33.3333 | 9.109 | 14.58
2 | 20 | 80 | 14 | 19 | 24.1379 | 32.7586 | 8.109 | 26.03
3 | 30 | 70 | 14 | 25 | 27.451 | 49.0196 | 6.812 | 46.08
4 | 40 | 60 | 14 | 23 | 31.8182 | 52.2727 | 6.328 | 46.55
5 | 50 | 50 | 14 | 26 | 37.8378 | 70.273 | 5.171 | 13.71
6 | 60 | 40 | 13 | 32 | 44.8276 | 110.3448 | 6.328 | 23.23
7 | 70 | 30 | 14 | 27 | 63.6364 | 122.7273 | 8.906 | 32.14
8 | 80 | 20 | 14 | 27 | 93.333 | 180.00 | 212.672 | 53.90
9 | 90 | 10 | 14 | 32 | 200.00 | 457.1429 | 6.0 | 39.25
c) Lung Cancer Data Set - Coefficient of Dispersion as a measure for Dimensionality Reduction
S. No | Training Inputs (%) | Testing Inputs (%) | Recognized items (Original Data Set) | Recognized items (Reduced Data Set, variance) | Efficiency (Original Data Set) | Efficiency (Reduced Data Set, variance) | Ex. Time in secs (Original Data Set) | Ex. Time in secs (Reduced Data Set, variance)
1 | 10 | 90 | 14 | 22 | 21.2121 | 33.333 | 9.109 | 5.359
2 | 20 | 80 | 14 | 19 | 24.1379 | 32.7586 | 8.109 | 7.89
3 | 30 | 70 | 14 | 25 | 27.451 | 49.0196 | 6.812 | 8.016
4 | 40 | 60 | 14 | 23 | 31.8182 | 52.2727 | 6.328 | 7.937
5 | 50 | 50 | 14 | 26 | 37.8378 | 70.2703 | 5.171 | 5.203
6 | 60 | 40 | 13 | 19 | 44.8276 | 65.5172 | 6.328 | 7.281
7 | 70 | 30 | 14 | 24 | 63.6364 | 109.0909 | 8.906 | 8.953
8 | 80 | 20 | 14 | 26 | 93.333 | 173.333 | 212.672 | 8.313
9 | 90 | 10 | 14 | 33 | 200.00 | 471.4286 | 6.0 | 6.515
The following graphs show the advantage of the dimensionality reduction method used on the
two benchmark sets in terms of Efficiency of Classification and Execution Time.

[Plots: Efficiency of Classification and Execution Time versus % of Training Inputs (90 down to 10), each comparing the Original Data Set with the Reduced Data Set.]
Fig. A1 (Correlation Measure): Efficiency of Classification, Diabetic Dataset.  Fig. A2 (Correlation Measure): Execution Time, Diabetic Dataset.
Fig. B1 and B2 (Correlation Measure).
Fig. C1 and C2 (Coefficient of Dispersion).
It has been clearly observed that the efficiency of classification improves encouragingly after
reducing the dimensionality of the data sets. Because of the dynamic load on the processor at
the time of running the program, a few peaks have been observed in the execution time; these
could be eliminated by running the program in an ideal standalone environment.
5 Conclusion and Future Directions
Improvement in the efficiency of classification has been observed by using correlation and
variance as measures to reduce the dimensionality. The existing correlation-based feature
selection methods work around features with an acceptable level of correlation among
themselves, but so far little emphasis has been given to independent feature integration and its
effect on C-correlation. Useful models can be identified to study the goodness of the
combined feature weight of statistically independent attributes, and pair-wise correlations can
also be considered for further complexity reduction. The impact of the learning rate and the
threshold value on classification performance can also be studied.
References
[1] Langley P & Sage S(1997): Scaling to domains with many irrelevant features- In R. Greiner(Ed),
Computational Learning Theory and Natural Learning systems(Vol:4), Cambridge, MA:MIT Press.
[2] Blake C and Merz C(2006): UCI repository of Machine Learning Databases Available at:
http://ics.uci.edu/~mlearn/MLRepository.html
[3] M. Dash and H Liu; Feature Selection for classification and Intelligent data analysis: An International
Journal, 1(3), pages: 131-157, 1997.
[4] Huan Liu and Lei Yu, Feature Selection for High Dimensional Data: A Fast Correlation-Based Filter
Solution, Proceedings of the 20th International Conference on Machine Learning (ICML-2003),
Washington DC, 2003.
[5] Vincent Sigillito: UCI Machine Learning Repository: Pima Indian Diabetes Data Available at:
http://archives.uci.edu/ml/datasets/Pima+Indians+Diabetes.
[6] Huan Liu and Lei yu, Redundancy based feature Selection for micro-array data- In proceedings of
KDD04, pages: 737-742, Seattle, WA, USA, 2004.
[7] Imola K. Fodor A survey of dimension reduction techniques. Centre for Applied Scientific Computing,
Lawrence Livermore National Laboratories.
[8] Jan O. Pederson & Yang. Y A comparative study on feature selection in text categorization: Morgan
Kaufmann Publishers, pages: 412-420, 1997.
[9] Ron Kohavi & George H John AIJ Special issue on relevance wrappers for feature subset selection: 2001
[10] M. Dash & H. Liu Feature selection for classification: Journal of Intelligent Data Analysis: Pages: 131-
156(1997).
[11] Langley. P Selection of relevant features in machine learning Proceedings of the AAAI Fall
Symposium on relevance 1994; pages: 140-144.
[12] SN Sivanandam, S. Sumathi and SN Deepa Introduction to Neural Networks using Matlab-6.0; TMH-
2006.
[13] L.A. Rendell and K. Kira - A practical approach to feature selection- International conference on machine
learning; pages: 249-256(1992).
[14] Fundamentals of Mathematical Statistics-S.C.Gupta & V.K. Kapoor (Sulthan Chand & Sons)
[15] Mark A. Hall Correlation based feature selection for discrete & numeric class machine learning; pages:
359-366(2000); Publisher: Morgan Kaufman.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Extreme Programming: A Rapidly Used Method
in Agile Software Process Model

V. Phani Krishna K. Rajasekhara Rao
S.V. Engg. College for Women K. L. College of Engineering
Bhimavaram 534204 Vijayawada
phanik16@yahoo.co.in rajasekhar.kurra@klce.ac.in

Abstract

Extreme Programming is a discipline of software development based on
values of simplicity, communication, feedback, and courage. It works by
bringing the whole team together in the presence of simple practices, with
enough feedback to enable the team to see where they are and to tune the
practices to their unique situation. This paper discusses how tight coupling,
redundancy and interconnectedness became a strong foe to the software
development process.
1 Introduction
Pragmatic Dave Thomas stated that if a user produced a piece of software that was as tightly
coupled as Extreme Programming, the user would be fired. In spite of the several advantages of
Extreme Programming, these striking characteristics of XP, redundancy and interconnectedness,
are decried in software design.
In the present paper we discuss how that tight coupling could have a strong negative
impact on a software process. For any process, Extreme or not, to be really useful and
successful in a variety of situations for different teams, we have to understand how to tailor it.
Every project team inevitably tailors its process.
The problem is that most of the time we do this tailoring blindly. We may have an idea of what
problem we're trying to solve by adding some new practice, or some reason that we don't
need a particular artifact. But process elements don't exist in isolation from one another.
Typically, each provides input, support, or validation for one or more other process elements,
and may in turn depend on other elements for similar reasons.
Is this internal coupling as bad for software processes as it is for software? This is an
important question not just for Extreme Programming, but for all software processes. Until
we understand how process elements depend upon and reinforce one another, process design
and tailoring will continue to be a hit-or-miss black art.
Extreme Programming is an excellent subject for studying internal process dependencies.
One reason is that it acknowledges those dependencies and tries to enumerate them [Beck,
99]. Additionally, XP is unusual in covering not just the management of the project, but day-
to-day coding practices as well. It provides an unusually broad picture of the software
development process.
2 Tightly Coupled
The published literature about Extreme Programming is incomplete in several ways. If we
follow discussions of how successful teams actually apply XP, we'll see that there are many
implicit practices, including the physical layout of the team workspace and fixed-length
iterations. Likewise, since relationships between practices are more difficult to see than the
practices themselves, it's probable that there are unidentified relationships between the
practices, perhaps even strong, primary dependencies.
However, just diagramming the twelve explicit XP practices and the relationships
documented in Extreme Programming Explained shows the high degree of
interconnectedness, as seen in Figure 1.
Rather than add additional complications to the problem right from the start, it will be better
to focus on the relationships Beck described. The change we made from the beginning was to
split the testing practice into unit testing and acceptance testing. They are different
activities, and the XP literature emphasizes the differences in their purpose, timing, and
practice, so it seemed appropriate to treat them as distinct practices. Therefore, instead of the
original twelve practices of Extreme Programming, this analysis deals with the thirteen
shown in Figure 2.

Fig. 1: The Original 12 Practices And Their Dependencies.
Once the complex web of dependencies is shown so clearly, it's easy to understand Dave
Thomas' point and the challenge implicit in it. Can a chosen software process be customized
in an XP context? If one of the XP practices has to be modified or omitted, how can we
understand what we're really losing? If we notice a problem on our project that XP isn't
adequately addressing, how can we fit a new practice into this web? That would be our goal:
understanding these dependencies well enough to permit informed adjustment. The point is
not to decouple Extreme Programming.
Many processes try to deal with the problem of redundancy by strengthening the practices.
But such measures are costly in terms of time and effort, and they probably also harm team
morale and cohesion. A strength of the XP approach is that the practices play multiple roles.
In most cases when an XP practice serves to compensate for the flaws of another practice, the
redundant compensation is merely a secondary role of the practice. This helps keep the
number of practices to a minimum, and has the added benefit of using core team members in
enforcement roles without making them seem like enforcers.

Fig. 2: The Thirteen Practices.
Without some coupling, even in software designs, nothing will ever get done. The trick is to
build relationships between components when they are appropriate and helpful, and avoid
them otherwise. The coupling within XP is only harmful if it makes the process difficult to
change.
3 Teasing Out the Tangles
Are there strongly connected subcomponents that have weaker connections between them?
To answer this question, we have to explore the dependency graph and move the nodes
around in search of some hint. This process is similar to the metallurgical process of
annealing, where a metal is heated and then slowly cooled to strengthen it and reduce
brittleness.
The process allows the molecules of the metal, as it cools, to assume a tighter, more nearly
regular structure. Some automated graph-drawing algorithms employ a process of simulated
annealing, jostling the nodes of the graph randomly and adjusting position to reach an
equilibrium state that minimizes the total length of the arcs in the graph [Kirkpatrick, 83].
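As an illustration of the kind of layout search described above, the toy sketch below applies a simulated-annealing pass to a circular ordering, accepting swaps that shorten the total arc length of the dependency edges (and occasionally longer ones early on). The abbreviated practice list and dependency pairs are assumptions for the example, not Beck's full set.

```python
# Toy simulated-annealing sketch of the "circular topological sort" idea:
# randomly swap positions on a circle and keep swaps that shorten the total
# arc length of the dependency edges (occasionally accepting worse layouts).
import math, random

practices = ["pair programming", "unit testing", "refactoring",
             "simple design", "planning game", "short releases"]
deps = [(0, 1), (1, 2), (2, 3), (0, 3), (4, 5), (1, 4)]   # assumed subset

def arc_length(order, n):
    pos = {p: i for i, p in enumerate(order)}
    total = 0
    for a, b in deps:
        d = abs(pos[a] - pos[b])
        total += min(d, n - d)          # shorter way around the circle
    return total

def anneal(n, steps=2000, t0=2.0, seed=1):
    rng = random.Random(seed)
    order = list(range(n))
    current = arc_length(order, n)
    for step in range(steps):
        t = t0 * (1 - step / steps) + 1e-9
        i, j = rng.sample(range(n), 2)
        order[i], order[j] = order[j], order[i]
        cost = arc_length(order, n)
        if cost <= current or rng.random() < math.exp((current - cost) / t):
            current = cost                            # accept the swap
        else:
            order[i], order[j] = order[j], order[i]   # undo the swap
    return order, current

if __name__ == "__main__":
    order, cost = anneal(len(practices))
    print([practices[i] for i in order], "total arc length:", cost)
```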
Simply rearranging Figure 2 this way did not give any hint, so instead we can try to
visualize clusters of dependencies by arranging the practices in a circle and changing the
order to bring closely related practices together. What do practices that are close to each
other on the circle have in common? What distinguishes practices on opposite sides of the
circle?


Fig. 3: Before and after a "circular topological sort."
The low-level programming practices depend on each other more than they depend on the
product-scale practices like the planning game and short releases. There are nine practices
that seem to operate at particular scales, as illustrated in Figure 4.
Each of these practices seems to provide feedback about particular kinds of decisions, from
the very small to the large, sweeping decisions. Of course, that leaves four other practices out,
which is a problem when we're trying to understand all of the practices and how they relate.
Not all of the dependencies within XP are of the same kind.
For example, consider the bidirectional dependency between pair programming and unit
testing. How does pair programming help unit testing? It strengthens unit testing by
suggesting good tests, and by encouraging the unit-testing discipline. It also helps to ensure
that the unit-testing process is dealing with well-designed code, making the testing process
itself more efficient and productive.
Now turn it around. How does unit testing support pair programming? It guides the
programmers by helping them structure their work, setting short-term goals on which to
focus. It guides their design work as well; unit testing has well known benefits as a design
technique. It also defends against shortcomings of pair programming (even two minds don't
write perfect code) by catching errors.

Fig. 4: Scale-defined practices
Do the relationships at larger scales look similar? Another bidirectional dependency on a
larger scale exists between on-site customer and acceptance testing. The relationship between
the two is clearly different in details from the one we just explored between pair
programming and unit testing, but it seems to me to be similar in terms of the respective roles
of the two practices.
Having an on-site customer strengthens acceptance testing by guiding the development of
tests, and by helping maintain correspondence between stories and tests. In the opposite
direction, acceptance testing guides feature development (again by providing goals) and
defends against the weaknesses of on-site customer, providing a concrete, executable record
of key decisions the customer made that might otherwise be undocumented. It also provides a
test bed for the consistency of customer decisions.
Smaller-scale practices strengthen larger-scale practices by providing high-quality input. In
other words, smaller-scale practices take care of most of the small details so that the larger-
scale practices can effectively deal with appropriately scaled issues. In the reverse direction,
larger-scale practices guide smaller-scale activities, and also defend against the mistakes that
might slip through.
Refactoring, forty-hour weeks, simple design, and coding standards all seem to have a
strengthening role. One way of looking at the strengthening dependencies is to see them as
noise filters. The noise refers to the accidental complexity: the extra complexity in our
systems over and above the essential complexity that is inherent in the problem being solved.
In a software system, that noise can take many forms: unused methods, duplicate code,
misplaced responsibility, inappropriate coupling, overly complex algorithms, and so on. Such
noise obscures the essential aspects of the system, making it more difficult to understand,
test, and change.
The four practices that operate independent of scale seem to be aimed at reducing noise,
improving the overall quality of the system in ways that allow the other practices to be more
effective. Refactoring is an active practice that seeks to filter chaotic code from the system
whenever it is found. Simple design and coding standards are yardsticks against which the
system's quality can be measured, and they help guide the other practices to produce a
high-quality system. Finally, the forty-hour week helps eliminate mistakes by reducing physical and
mental fatigue in the team members. The four noise-filtering practices, along with their
interdependencies, are shown in Figure 5.

Fig. 5: Noise Filters.
Those four noise-filtering practices help many of the other practices to operate more
effectively by maximizing clarity and reducing complexity in the code. They help minimize
the accidental complexity in the system in favor of the essential complexity.
4 A Feedback Engine
The nine practices are characterized not only by the scale of entity they work with;
additionally, they function primarily within a certain span of time. Not surprisingly, the
practices that operate on small-scale things also operate very quickly. The correspondence
between practices and time scales is shown in Figure 6.
The nesting of XP's feedback loops is the fundamental structural characteristic of Extreme
Programming. All of the explicit dependencies between individual practices that have been
identified by Beck and others are natural consequences of this overall structure.

Fig. 6: Practices and time scales.
5 Cost of Feedback
Boehm's observations of projects led him to conclude that, as projects advance through
their lifecycles, the cost of making necessary changes to the software increases
exponentially [Bohem, 81]. This observation led to a generation of processes that were
designed to make all changes, and all decisions, as early in the process as possible, when
changes are cheaper.
Many in the agile community have observed that Boehm's study [Bohem, 81] dealt primarily
with projects using a waterfall-style process, where decisions were made very early in the
project. Those decisions were often carefully scrutinized to identify mistakes, but the only
true test of software is to run it. In classic waterfall projects, such empirical verification
typically didn't happen until near the end of the project, when everything was integrated and
tested. Agile, iterative processes seem to enjoy a shallower cost-of-change curve, suggesting
that perhaps Boehm's study was actually showing how the cost of change increases as a
function of the length of the feedback loop, rather than merely the point in the project lifecycle.
The analysis of the cost-of-change curve is not new to the agile process community, but
understanding XP's structure sheds new light on how the process manages that curve. With
its time- and scale-sensitive practices and dependencies, XP is an efficient feedback engine,
and it provides that feedback in a very cost-effective way. In the case of smaller decisions, XP
projects get feedback continuously, minute by minute, through interactions within
programming pairs and through unit testing.
Larger decisions, such as the selection of features to help solve a business problem and the
best way to spend the project budget, are quite costly to validate. Therefore, XP projects
validate those decisions somewhat more slowly: through day-to-day interaction with
customers, by giving the customer control over each iteration's feature choice, and by
providing a release every few weeks at most. At every scale, XP's practices provide feedback
in a way that balances timeliness and economy [cockburn, 02].
6 Defense in Depth
Another traditional view of the purpose and function of a software process, closely related
to managing the cost of change, is that it is defensive, guarding against the introduction of
defects into the product.
Our model of XP's inner structure also makes sense when measured against this view. In fact,
it resembles the timeworn security strategy of defense in depth. Extreme Programming can be
seen as a gauntlet of checks through which every line of code must pass before it is ultimately
accepted for inclusion in the final product.
At each stage, it is likely that most defects will be eliminated, but for those that slip through,
the next stage is waiting. Furthermore, the iterative nature of XP means that in most cases
code will be revisited, and run through the gauntlet again, during later iterations.
7 Conclusion
Extreme Programming has some tight coupling between its practices. But the redundant,
organic interconnectedness of XP is the source of a lot of its robustness and speed. All
those dependencies between practices have a structure that is actually fairly simple. That
structure, once identified, provides crucial guidance for those who need to tailor and adjust
the software process.
The feedback engine, with its nested feedback loops, is an excellent model for a process
designed to manage the cost of change and respond efficiently to changing requirements. This
is the essence of agility: letting go of the slow, deliberate decision-making process in favor of
quick decisions, quickly and repeatedly tested. The feedback loops are optimized to validate
decisions as soon as possible while still keeping cost to a minimum.
References
[1] [Beck, 99] Beck, K. Extreme Programming Explained: Embrace Change. Addison-Wesley, Reading, MA, 1999.
[2] [Boehm, 81] Boehm, B. Software Engineering Economics. Prentice Hall, Englewood Cliffs, NJ, 1981.
[3] [Cockburn, 02] Cockburn, A. Agile Software Development. Addison-Wesley, Boston, 2002.
[4] [Kirkpatrick, 83] Kirkpatrick, S., Gelatt Jr., C.D., and Vecchi, M.P. Optimization by Simulated Annealing. Science, 4598 (13 May 1983), 671-680.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Data Discovery in Data Grid Using Graph Based
Semantic Indexing Technique

R. Renuga, Coimbatore Institute of Technology, Coimbatore, renugacit@yahoo.co.in
Sudha Sadasivam, PSG College of Technology, Coimbatore, sudhasadasivam@yahoo.com
S. Anitha, Coimbatore Institute of Technology, Coimbatore
N.U. Harinee, Coimbatore Institute of Technology, Coimbatore
R. Sowmya, Coimbatore Institute of Technology, Coimbatore
B. Sriranjani, Coimbatore Institute of Technology, Coimbatore
Abstract

A data grid is a grid computing system that deals with data: the controlled sharing and management of large amounts of distributed data. The process of data discovery aids in the retrieval of requested and relevant data from the data source. The quality of the search is improved when semantically related data is retrieved from a grid. The proposed model of data discovery in the data grid, using a graph based semantic indexing technique, aims at providing an efficient discovery of data based on time and other retrieval parameters. Since there often exists some semantic correlation among the specified keywords, this paper proposes a model for more effective discovery of data from a data grid by utilizing the semantic correlation to narrow the scope of the search. The indexing phase makes use of two data structures. One is a hash-based index that maps concepts to their associated operations, allowing efficient evaluation of query concepts. The other is a graph-based index that represents the structural summary of the semantic network of concepts, and is used to answer the queries. The grid environment is established using the GridSim simulator.
Keywords: Semantic search, context classes, graph based indexer, gridsim.
1 Introduction
Grid computing [Kesselman and Kauffmann, 1999] is applying the resources of many
computers in a network to a single problem at the same time - usually to a scientific or
technical problem that requires a great number of computer processing cycles or access to
large amounts of data. The usage of grids has been predominant in the science and engineering research arena for the past decade. It concentrates on providing collaborative problem-solving and resource sharing strategies. There are two basic types of grids: the computational grid and the data grid. This paper focuses on the data grid and the discovery of data from this grid environment. Data discovery is the process by which requested data is retrieved from the grid.
The traditional keyword based search does not produce complete results for a requested query, whereas semantic search aims at providing exhaustive search results. Keyword search employs an index based mechanism which has a number of disadvantages: keyword indices suffer because they equate the semantic meaning of web pages with their actual lexical or syntactic content. Hence there has been an inclination towards semantic search methodology in recent years [Makela, 2005]; [Guha et al., 2003]. Semantic search attempts to augment and improve traditional search results by making efficient use of the ontology [Ontology, 1996].
The search for particular data in a data pool is both one of the most crucial applications on the grid and an application where there is always significant room for improvement. The addition of explicit semantics can improve the search. The semantic approach proposed in this paper exploits and augments the semantic correlation among the query keywords and employs an approximate data retrieval technique to discover the requested data. The search relies on graph based indexing (a data concept network) in which semantic relations can be approximately composed, while the graph distance represents the relevance. The data ranking is done based on the certainty of matching a query.
There are a number of approaches for semantic retrieval of data. The tree based approach, for instance, takes into account only the hierarchical relationships between the data when calculating the semantic correspondence, thereby introducing a number of disadvantages. The graph based approach overcomes these disadvantages because it considers both hierarchical and non-hierarchical relationships [Maguitman et al., 2005].
The contributions of this proposed model are
The model allows the user to query data using a simple and extensible query
language in an interactive way.
The model provides approximate compositions by deriving semantic relations
between the data concepts based on ontology.
The model can also be extended to use clusters in order to supply a compact
representation of the index.
2 System Overview
2.1 Introduction
The search is mainly grounded on the textual matter given by the user. The query is analyzed to extract its meaning, and the information in the documents is explored to find the user's needs. The documents are ranked according to their relevance to the query.
The design criteria are based upon the following:
Usability
Robustness
Predictability
Scalability
Transparency
The user query is converted into a formal query by matching the keywords with concepts, where the concepts are the nodes of a network constructed by referring to the ontology.
The mapping is done by adding connectors and annotating the terms in the user query. An abstract query, which is the mathematical representation of the formal query, is then constructed. The semantic relationship between the concepts is captured by the notion of context classes.
C-Mapping is a hash based index which is used for efficient evaluation of queries, and the C-Network is constructed using C-Mapping. Using this network the related documents are retrieved from the grid and ranked. Knowledge inference is used, which allows the ontology to be updated dynamically.
Thus, whenever a query is given, the documents are ranked according to their relevance and returned to the user. The architecture of the data discovery system is shown in Figure 1.

Fig.1: The Data Discovery Architecture
3 Data Discovery Methodology
A semantic search methodology using a graph based technique is designed in this paper for data discovery. The user query forms the basis of the search. The proposed technique processes the query and indexes the documents based on their relevance. The basis for searching is the context classes, the C-Mapping and the C-Network.
3.1 Query Interface
Users communicate with the search engine through a query interface. The keywords are extracted from the query and the concepts are built by referring to the ontology [Tran et al., 2007]. The user queries are then transformed into formal queries by the query interface, which automatically maps keywords to concepts. In order to formalize a simple query into a query expression, each of the keywords is mapped to a concept term by using content matching techniques. If more than one concept is matched with the keyword, the concept with the highest matching score is used in the query evaluation. Queries entered via the interface undergo two additional processing steps.
Step 1: Query terms are connected automatically using a conjunctive connector.
Step 2: Concept terms are annotated with a property category (input, output or operation) defining which property will be matched with the term.
The result is a virtual data concept associated with a certainty, which is determined according to the semantic correspondence between the node and operation concepts.
3.2 Context Classes
The proposed method for analyzing relations between concepts depends on the notion of context classes, which form groups of concepts that allow the investigation of relations between them. For any given concept, a set of context classes is defined, each of which defines a subset of the concepts in the ontology according to their relation to the concept. Given a query keyword associated with a concept c, we define a set of concepts Exact(c) as c itself together with the concepts which are equivalent to c. The Exact class may contain concepts that have identical semantic meaning. The other context classes contain concepts with related meanings. For each concept c in an ontology O the following sets of classes are defined.
For example, consider the query "A diamond ring". The keywords in this query are "diamond" and "ring". These keywords are referred to the ontology and concepts like jewelry, occasions, wedding, gift, gold, Kohinoor, etc. are abstracted. The keywords and concepts are then matched and the formal query "A ring made up of diamond" is formed, as shown in Figure 2.

Fig. 2: Formal Query
The semantic relationship between two concepts is based on the semantic distance between them. Given an anchor concept c and some arbitrary concept c', the semantic correspondence function D(c, c') [Toch et al., 2007] is defined as:
D(c, c') = 1, where c' belongs to Exact(c)
D(c, c') = 1 / 2^(log_α(n) · log_β(1+ε)), where c' belongs to General(c), ClassesOf(c) or Properties(c)
D(c, c') = 1 / 2^(n · log_β(1+ε)), where c' belongs to Specific(c), Instances(c) or InvertProperties(c)
D(c, c') = 1 / 2^(log_α(n1+n2) · log_β(1+ε)), where c' belongs to Siblings(c)
D(c, c') = 0, where c' belongs to Unrelated(c)
where n is the length of the shortest path between c and c', ε is the difference between the average depth of the ontology and the depth of the upper concept, and the log bases α and β are used as parameters in order to set the magnitude of the descent function. Thus the similarity between two concepts is set to 1 when the concepts have the highest similarity, and if the concepts are unrelated the similarity is set to 0.
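The following is a minimal sketch, in Python, of the correspondence function as stated above; the context-class label passed in, the default values chosen for the parameters α and β, and the descent() helper are illustrative assumptions rather than part of the original system.

import math

# Sketch of the semantic correspondence function D(c, c') as given above.
# "cls" is the context class of c' with respect to c; n is the shortest-path
# length (n >= 1); eps is the depth difference described in the text.
# alpha and beta (the log bases) are illustrative default values.

def descent(exponent):
    return 1.0 / (2.0 ** exponent)

def correspondence(cls, n=1, eps=1.0, n1=1, n2=1, alpha=2.0, beta=2.0):
    log_eps = math.log(1 + eps, beta)
    if cls == "Exact":
        return 1.0
    if cls in ("General", "ClassesOf", "Properties"):
        return descent(math.log(n, alpha) * log_eps)
    if cls in ("Specific", "Instances", "InvertProperties"):
        return descent(n * log_eps)
    if cls == "Siblings":
        return descent(math.log(n1 + n2, alpha) * log_eps)
    return 0.0          # Unrelated(c)

print(correspondence("Specific", n=2, eps=1.0))   # 0.25 for these parameters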
3.3 Indexing and Query Evaluation
The objective of the index is to enable efficient evaluation of queries with respect to
processing time and storage space. The index is composed of two data structures:
a) the C-Mapping, and
b) the C-Network.
3.3.1 C-Mapping
C-Mapping is a hash-based index that maps concepts to their associated operations, allowing
efficient evaluation of query concepts. Each mapping is associated with a certainty function
in [0, 1] reflecting the semantic affinity between the concept and the concepts of the
operation. Context classes are used in order to construct the key set of C-Mapping, and to
assign the operations associated with each concept.
3.3.2 C-Network
C-Network is a graph-based index that represents the structural summary of the data concept
network, and is used to answer queries that require several atomic operations. C-Mapping is
expanded with additional concepts whose mapping certainty is higher than a given threshold,
in order to retrieve approximate data. C-Network represents the structural summary of the
data concept network using a directed graph. Given two operations, the objective of C-
Network is to efficiently answer whether a composite concept, starting with the first
operation and ending with the second, can be constructed, and to calculate the certainty of the
composition. The design of C-Network is based on principles taken from semantic routing in
peer-to-peer networks.
Algorithm:
1. Get the query from the user.
2. Extract the keywords from the query and build the concepts by referring to the ontology.
3. Convert the general query into a formal query by matching the keywords in the query with the concepts, using a content matching technique:
   For each keyword in the query
      Formal query = Match(keyword, concepts);
4. Construct the context classes and calculate the semantic correlation between the concepts.
5. For each concept c in the formal query
   {
      Associate a certainty value by referring to the C-Mapping table;
      If c is linked with a concept c' in the C-Network then add c' to the final set of related concepts;
   }
6. Rank the final set of related concepts;
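A minimal, self-contained sketch of the evaluation flow above, assuming the C-Mapping is held as a dictionary from concepts to operations with certainties and the C-Network as a directed adjacency list with edge certainties; all names, data values and the threshold below are illustrative.

# Minimal sketch of the query-evaluation flow described above, assuming a
# C-Mapping dictionary (concept -> {operation: certainty}) and a C-Network
# given as a directed adjacency list with edge certainties.

C_MAPPING = {
    "diamond": {"op_gem_lookup": 0.9},
    "ring":    {"op_jewelry_lookup": 0.8},
}
C_NETWORK = {
    "diamond": {"ring": 0.7},
    "ring":    {"jewelry": 0.6},
}
THRESHOLD = 0.5

def related_concepts(query_concepts):
    results = {}
    for c in query_concepts:
        for op, certainty in C_MAPPING.get(c, {}).items():
            if certainty >= THRESHOLD:
                results[op] = max(results.get(op, 0.0), certainty)
        # Follow C-Network links to approximate additional concepts.
        for c2, edge in C_NETWORK.get(c, {}).items():
            for op, certainty in C_MAPPING.get(c2, {}).items():
                combined = edge * certainty
                if combined >= THRESHOLD:
                    results[op] = max(results.get(op, 0.0), combined)
    # Rank the final set of related operations by certainty.
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)

print(related_concepts(["diamond", "ring"]))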
4 Conclusion
This paper has presented a method for the discovery of data in a grid using a semantic search method. The search technique is implemented through the following steps:
Splitting the given query into keywords and extracting the concepts
Forming a formal query
Constructing the context classes, the C-Mapping and the C-Network
Ranking the documents
The proposed model is under implementation.
5 Acknowledgement
The authors would like to thank Dr. R. Prabhakar, Principal, Coimbatore Institute of Technology, Dr. Rudra Moorthy, Principal, PSG College of Technology and Mr. Chidambaram Kollengode, YAHOO Software Development (India) Ltd, Bangalore for providing us the required facilities for the project. This project is carried out as part of YAHOO's University Relations Programme.
References
[1] [Kesselman and Kauffmann, 1999] Foster, I. and Kesselman, C. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco, 1999.
[2] [Makela, 2005] Eetu Makela, Semantic Computing Research Group, Helsinki Institute for Information Technology (HIIT). Survey of Semantic Search Research, 2005.
[3] [Guha et al., 2003] R. Guha, Rob McCool, Eric Miller. Semantic Search. WWW2003, May 20-24, 2003, Budapest, Hungary. ACM 1-58113-680-3/03/0005.
[4] [Maguitman et al., 2005] Ana G. Maguitman, Filippo Menczer, Heather Roinestad and Alessandro Vespignani. Algorithmic Detection of Semantic Similarity, 2005.
[5] [Tran et al., 2007] Thanh Tran, Philipp Cimiano, Sebastian Rudolph and Rudi Studer. Ontology-based Interpretation of Keywords for Semantic Search, 2007.
[6] [Toch et al., 2007] Eran Toch, Avigdor Gal, Iris Reinhartz-Berger, Dov Dori. A Semantic Approach to Approximate Service Retrieval. ACM Trans. Intern. Tech. 8, 1, Article 2, November 2007, pages 1-31.
[7] [Ontology, 1996] Ontology-Based Knowledge Discovery on the World-Wide Web. In: Proceedings of the Workshop on Internet-based Information Systems, AAAI-96 (Portland, Oregon), 1996.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Design of Devnagari Spell Checker for Printed
Document: A Hybrid Approach

Shaikh Phiroj Chhaware, G.H. Raisoni College of Engineering, Nagpur, India, firoj466@yahoo.com
Latesh G. Mallik, G.H. Raisoni College of Engineering, Nagpur, India, lateshmallik@yahoo.com

Abstract


Natural language processing plays an important role in the accurate analysis of language related issues. Nowadays, with the advances in Information Technology in India, where the majority of people speak Hindi, an accurate Devnagari spell checker is required for word processing documents in the Hindi language. One of the challenging problems is how to implement a spell checker for the Hindi language that performs spell checking in printed documents, as we generally do for English-like languages in Microsoft Word. This paper aims to develop a system for spelling check of Devnagari text. The proposed approach consists of the development of an encrypted, font specific word database and a spell check engine which matches each printed word against the available database of words; for a non-word, it presents a list of the most appropriate suggestions based on n-gram distance calculation methods. The spell checker can be used as a stand-alone application capable of operating on a block of text, or as part of a larger application, such as a word processor, email client, electronic dictionary, or search engine.
Keywords: Devnagari script, Devnagari font, word database, N-Gram distance calculation
method, spell checker.
1 Introduction
The most common mode of interaction with a computer is through the keyboard. A spell checker system has a variety of commercial and practical applications in correctly writing documents, reading forms and manuscripts, and archiving them. Standard Hindi text is written in the Devnagari script. Text which is either printed or handwritten needs some sort of proofreading to avoid errors. There are widely available commercial OCR systems in the market which can recognize printed or handwritten documents in the English language. The need arises for local languages and for languages which are specific to a particular religion, locality, society or other grouping. An example analogous to the familiar English spell checker is presented here for the Devnagari spell checker.
2 Spell Checking Issues
The earliest writing style programs checked for wordy, trite, clichéd or misused phrases in a text. This process was based on simple pattern matching. The heart of the program was a list of many hundreds or thousands of phrases that are considered poor writing by many experts. The list of suspect phrases included alternate wording for each phrase. The checking program would simply break text into sentences, check for any matches in the phrase dictionary, flag suspect phrases and show an alternative.
These programs could also perform some mechanical checks. For example, they would
typically flag doubled words, doubled punctuation, some capitalization errors, and other
simple mechanical mistakes.
True grammar checking is a much more difficult problem. While a computer programming
language has a very specific syntax and grammar, this is not so for natural languages. Though
it is possible to write a somewhat complete formal grammar for a natural language, there are
usually so many exceptions in real usage that a formal grammar is of minimal help in writing
a grammar checker. One of the most important parts of a natural language grammar checker
is a dictionary of all words in the language.
A grammar checker will find each sentence in a text, look up each word in the dictionary, and
then attempt to parse the sentence into a form that matches a grammar. Using various rules,
the program can then detect various errors, such as agreement in tense, number, word order,
and so on.
3 How Does The Spell Checker Work?
Initially the Spell Checker reads extracted words from the document, one at a time.
Dictionary examines the extracted words. If the word is present in the Dictionary, it is
interpreted as a valid word and it seeks the next word.
If a word is not present in dictionary, it is forwarded to the Error correcting process. The spell
checker comprises three phases namely text parsing, spelling verification and correction, and
generation of suggestion list. To aid in these phases, the spell checker makes use of the
following.
i Morphological analyzer for analyzing the given word
ii Morphological generator for generating the suggestions.
In this context, the spell checker for Hindi needs to tackle the rich morphological structure of
Hindi. After tokenizing the document into a list of words, each word is passed to the
morphological analyzer. The morphological analyzer first tries to split the suffix. It is designed in such a way that it can analyze only correct words. When it is unable to split the suffix due to a mistake, it passes the word to the spelling verification and correction phase to correct the mistake.
4 Spelling Verification and Correction
a. Correcting Similar Sounding Letters
Similar sounding letter can cause incorrect spelling of words. For example consider the word
Thaalam. Here the letter La may be misspelled as la. Suggestions are generated by
examining the entire possible similar sounding letters for the erroneous word.
b. Checking the Noun
Tasks in noun correction include Case marker correction, plural marker checking,
postposition checking, adjective checking and root word correction.
c. Checking the Verb
Verb checking tasks include Person, Number & Tense marker checking and root word
checking.
d. Correcting the Adjacent Key Errors
A user can mistype one letter in place of another, so we have to consider all the possible adjacent keys of that particular letter. If any adjacent key of the mistyped letter matches the intended letter, that letter replaces the mistyped one and the dictionary is checked (see the sketch after this paragraph).
5 Proposed Plan of Work
The errors in the input are made either due to human mistakes or limitations of the software
systems. Many spelling checking programs are available for detecting these errors. There are
two approaches for judging the correctness of the spelling of a word
1. Estimates the likelihood of a spelling by its frequency of occurrence which is derived
from the transition probabilities between characters. This requires a priori statistical
knowledge of the language.
2. The correctness is judged by consulting the dictionary.
The spelling correction programs also offer suggestions for correct words which are based on
the similarity with the input word using the word dictionary. Here a mechanism is required to
limit the search space. A number of strategies have been suggested for partitioning the
dictionary based on length of the word, envelop and selected characters. Here the following
work is considered:
1. Word based approach for correction of spelling.
2. Correctness judged by consulting the dictionary
3. For substitution of the correct word, a list of alternative words should be displayed by the spell check program.
4. Then the user will select the word from the available list or he may update the
dictionary with a new word.
5. The correctness of the newly inserted word will be judged by the user himself.
6 Research Methodology to be Employed
The correction method presented here uses a partitioned Hindi word dictionary. The
partitioning scheme has been designed keeping special problems in mind which Devanagari
script poses. An input word is searched in the selected partitions of the dictionary. An exact
match stops further search. However while looking for an exact match, the best choices are
gathered. The ranking of the words is based on their distances from the input word. If the best
match is within a preset threshold distance, further search is terminated. However, for short
words, no search terminating threshold is used. Instead, we try various aliases which are
formed from the output of classification process. The output of the character classification
process is of three kinds:
A character is classified to the true class - correct recognition.
A character is classified such that the true class is not the top choice - a substitution error.
The character is not classified to a known class - reject error.
A general system design for the Devnagari spell checker is depicted below.
Operation
Simple spell checkers operate on individual words by comparing each of them against the contents of a dictionary, possibly performing stemming on the word. If the word is not found, it is considered to be an error, and an attempt may be made to suggest a word that was likely to have been intended. One such suggestion algorithm is to list those words in the dictionary having a small Levenshtein distance from the original word.
When a word which is not within the dictionary is encountered, most spell checkers provide an option to add that word to a list of known exceptions that should not be flagged.
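A short sketch of this Levenshtein-based suggestion step is given below; the word list is an illustrative stand-in for the (partitioned) dictionary.

# Sketch of the Levenshtein-distance suggestion step described above.
# The word list is illustrative; a real checker would consult the
# (partitioned) dictionary.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def suggest(word, dictionary, max_distance=2):
    scored = [(levenshtein(word, w), w) for w in dictionary]
    return [w for d, w in sorted(scored) if d <= max_distance]

print(suggest("spel", ["spell", "spill", "speak", "pellet"]))
# -> ['spell', 'speak', 'spill']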
Design
A spell checker customarily consists of two parts:
1. A set of routines for scanning text and extracting words, and
2. An algorithm for comparing the extracted words against a known list of correctly
spelled words (ie., the dictionary).
The scanning routines sometimes include language-dependent algorithms for handling
morphology. Even for a lightly inflected language like English, word extraction routines will
need to handle such phenomena as contractions and possessives. It is unclear whether
morphological analysis provides a significant benefit.
The word list might contain just a list of words, or it might also contain additional
information, such as hyphenation points or lexical and grammatical attributes. As an adjunct
to these two components, the program's user interface will allow users to approve
replacements and modify the program's operation. One exception to the above paradigm is spell checkers which rely solely on statistical information, for instance using n-grams. This approach usually requires a lot of effort to obtain sufficient statistical information and may require a lot more runtime storage. These methods are not currently in general use. In some cases spell checkers use a fixed list of misspellings and suggestions for those misspellings; this less flexible approach is often used in paper-based correction methods, such as the "see also" entries of encyclopedias.
References
[1] R.C. Angell, G.E. Freund and P. Willet (1983) "Automatic spelling correction using a trigram similarity measure", Information Processing and Management. 19: 255-261.
[2] V. Cherkassky and N. Vassilas (1989) "Back-propagation networks for spelling correction". Neural
Network. 1(3): 166-173.
[3] K.W. Church and W.A. Gale (1991) "Probability scoring for spelling correction". Statistical Computing.
1(1): 93-103.
[4] F.J. Damerau (1964) "A technique for computer detection and correction of spelling errors". Commun.
ACM. 7(3): 171-176.
[5] R.E. Gorin (1971) "SPELL: A spelling checking and correction program", Online documentation for the
DEC-10 computer.
[6] S. Kahan, T. Pavlidis and H.S. Baird (1987) "On the recognition of characters of any font size", IEEE
Trans. Patt. Anal. Machine Intell. PAMI-9. 9: 174-287.
[7] K. Kukich (1992) "Techniques for automatically correcting words in text". ACM Computing Surveys. 24(4):
377-439.
[8] V.I. Levenshtein (1966) "Binary codes capable of correcting deletions, insertions and reversals". Sov. Phys.
Dokl., 10: 707-710.
[9] U. Pal and B.B. Chaudhuri (1995) "Computer recognition of printed Bangla script" Int. J. of System
Science. 26(11): 2107-2123.
[10] J.J. Pollock and A. Zamora (1984) "Automatic spelling correction in scientific and scholarly text".
Commun. ACM-27. 4: 358-368.
[11] P. Sengupta and B.B. Chaudhuri (1993) "A morpho-syntactic analysis based lexical subsystem". Int. J. of
Pattern Recog. and Artificial Intell. 7(3): 595-619.
[12] P. Sengupta and B.B. Chaudhuri (1995) "Projection of multi-worded lexical entities in an inflectional
language". Int. J. of Pattern Recog. and Artificial Intell. 9(6): 1015-1028.
[13] R. Singhal and G.T. Toussaint (1979) "Experiments in text recognition with the modified Viterbi
algorithm". IEEE Trans. Pattern Analysis Machine Intelligence. PAMI-1 4: 184-193.
[14] E.J. Yannakoudakis and D. Fawthrop (1983) "An Intelligent spelling corrector". Information Processing
and Management. 19(12): 101-108.
[15] P. Kundu and B.B. Chaudhuri (1999) "Error Pattern in Bangla Text". International Journal of Dravidian
Linguistics. 28(2): 49-88.
[16] Naushad UzZaman and Mumit Khan, A Bangla Phonetic Encoding for Better Spelling Suggestions, Proc.
7th International Conference on Computer and Information Technology (ICCIT 2004), Dhaka, Bangladesh,
December 2004.
[17] Naushad UzZaman and Mumit Khan, A Double Metaphone Encoding for Bangla and its Application in
Spelling Checker, Proc. 2005 IEEE International Conference on Natural Language Processing and
Knowledge Engineering, pp. 705-710, Wuhan, China, October 30 - November 1, 2005.
[18] Naushad UzZaman and Mumit Khan, A Comprehensive Bangla Spelling Checker, Proc. International
Conference on Computer Processing on Bangla (ICCPB-2006), Dhaka, Bangladesh, 17 February, 2006.
[19] Naushad UzZaman, Phonetic Encoding for Bangla and its Application to Spelling Checker, Name
Searching, Transliteration and Cross Language Information Retrieval, Undergraduate Thesis (Computer
Science), BRAC University, May 2005.
[20] Munshi Asadullah, Md. Zahurul Islam, and Mumit Khan, Error-tolerant Finite-state Recognizer and String
Pattern Similarity Based Spell-Checker for Bengali, to appear in the Proc. of International Conference on
Natural Language Processing, ICON 2007, January 2007.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Remote Administrative Suite for Unix-Based Servers

G. Rama Koteswara Rao, Dept. of CS, P.G. Centre, P.B.S. College, koti_g@yahoo.com
G. Siva Nageswara Rao, P.G. Dept., P.B. Siddhartha College, Vijayawada, sivanags@india.com
K. Ram Chand, P.G. Centre, ASN College, Tenali, ramkolasani@yahoo.com
Abstract
This paper deals with the methodologies that help in enhancing the capabilities
of the server. An attempt is made to develop software that eases the burden of
routine administrative functions. This results in increasing the overall
throughput of the server.
1 Introduction
In this paper, we deal with client-server technology. We develop methods to enhance the capabilities of a client in accessing a server for static and dynamic administrative services. Generally, a server administrator has the privilege of capturing everything that is happening on the server side.
This paper discusses two processes, one running at the server and another at a selected client. The client side process sends an IP packet with a request for the desired service. The process running on the server side acts like a gateway and examines the incoming packet. This gateway process then processes the request.
2 Client Side Software
Features incorporated in developing the client side software include the following, among several others:
User and Group Management
Remote Script Execution with Feedback
File System Monitoring
Monitoring Paging and Swap Space
Monitoring System Load
Process Management
File Locking
Device Drivers
Database Administration
3 Roles of Clients
A main feature of the client is to provide a convenient user interface, hiding the details of how the server 'talks' to the user. The client first needs to establish a connection with the server, given its address. After the connection is established, the client needs to be able to do two things:
Receive commands from the user, translate them to the server's language (protocol) and send them to the server.
Receive messages from the server, translate them into human-readable form, and show them to the user. Some of the messages will be dealt with by the client automatically and hidden from the user; this is based on the client designer's choice.
4 Algorithm Developed for Client side Software Functions
1.1 get the server's address from a working address that can be used to talk over the
Internet.
1.2 connect to the server
1.3 while (not finished) do:
1.3.1 wait until there is information either from the server, or from the user.
1.3.2 If (information from server) do
1.3.2.1 parse information, show to user, update local state information, etc.
1.3.3 else {we've got a user command}
1.3.3.1 parse command, send to server, or deal with locally.
1.4 done
5 Roles of Servers
A server's main feature is to accept requests from clients, handle them, and send the results back to the clients. The server side process checks the 8-bit unused field of the IP packet to confirm that the request is from a valid client. We discuss two kinds of servers: a single-client server and a multi-client server.
5.1 Single Client Servers
A single client server responds to only one client at a given time. It acts as follows:
1 Accept connection requests from a Client.
2 Receive requests from the Client and return results.
3 Close the connection when done, or clear it if it is broken for some reason.
Following is the basic algorithm a Single-Client Server performs:
1.1 bind a port on the computer, so Clients will be able to connect
1.2 forever do:
1.2.1 listen on the port for connection requests.
1.2.2 accept an incoming connection request
1.2.3 if (this is an authorized Client)
1.2.3.1 while (connection still alive) do:
1.2.3.2 receive request from client
1.2.3.3 handle request
1.2.3.4 send results of request, or error messages.
1.2.3.5 done
1.2.4 else
1.2.4.1 abort the connection
1.2.5 done
5.2 Multi Client Servers
A multi-client server responds to several clients at a given time. It acts as follows:
1. Accept new connection requests from Clients.
2. Receive requests from any Client and return results.
3. Close any connection that the client wants to end.
Following is the basic algorithm a Multi-Client Server performs:
1.1 bind a port on the computer, so Clients will be able to connect
1.2 listen on the port for connection requests.
1.3 forever do:
1.3.1 wait for either new connection requests, or requests from existing Clients.
1.3.2 if (this is a new connection request)
1.3.2.1 accept connection
1.3.2.2 if (this is an un-authorized Client)
1.3.2.2.1 close the connection
1.3.2.3 end if
1.3.3 else if (this is a connection close request)
1.3.3.1 close the connection
1.3.4 else { this is a request from an existing Client connection }
1.3.4.1 receive request from client
1.3.4.2 handle request
1.3.4.3 send results of request, or error messages
1.3.5 end if
1.4 done
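A compact sketch of this multi-client loop is shown below, written here in Python with select() for brevity; the port number and the is_authorized() check are placeholders rather than part of the described system.

# Sketch of the multi-client loop above using select(); the port number and
# the is_authorized() check are placeholders.

import select
import socket

def is_authorized(addr):
    return True                              # placeholder authorization check

def serve(port=5013):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("", port))
    listener.listen(10)
    clients = []
    while True:
        readable, _, _ = select.select([listener] + clients, [], [])
        for sock in readable:
            if sock is listener:             # new connection request
                conn, addr = listener.accept()
                if is_authorized(addr):
                    clients.append(conn)
                else:
                    conn.close()
            else:                            # request from an existing client
                data = sock.recv(1024)
                if not data:                 # connection close request
                    clients.remove(sock)
                    sock.close()
                else:
                    sock.sendall(b"handled: " + data)

if __name__ == "__main__":
    serve()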
6 File System Monitoring
Monitoring complete file systems is the most common monitoring task. On different flavors of Unix the monitoring techniques are the same, but the command syntax and the output columns vary slightly depending on the flavour of the Unix system being used.
We have developed a software script for monitoring file system usage. The outcome of our software, which is developed using several methods, is as follows (a sketch of this kind of check is given after the list of methods below):
6.1 Percentage of used space method.
Example:
/dev/hda2 mounted on /boot is 11%
6.2 Megabytes of Free Space Method
Example:
Full FileSystem on pbscpg55046.pbscpg
/dev/hda3 mounted on / only has 9295 MB Free Space
/dev/hda2 mounted on /boot only has 79 MB Free Space
6.3 Combining Percentage Used (6.1) and Megabytes of Free Space (6.2).
6.4 Enabling the Combined Script to Execute on AIX, HP-UX, Linux and Solaris.
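An illustrative sketch of such a check, written here in Python using os.statvfs(); the mount points and thresholds are assumptions for the example.

# Illustrative check combining the percent-used and megabytes-of-free-space
# methods via os.statvfs(); mount points and thresholds are example values.

import os

def check_filesystem(mount, max_used_pct=85, min_free_mb=100):
    try:
        st = os.statvfs(mount)
    except OSError:
        return                                  # mount point not present
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize
    used_pct = 100 * (total - free) / total if total else 0
    free_mb = free // (1024 * 1024)
    if used_pct > max_used_pct or free_mb < min_free_mb:
        print(f"{mount}: {used_pct:.0f}% used, {free_mb} MB free")

for mount in ("/", "/boot", "/home"):
    check_filesystem(mount)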
7 Monitoring Paging and Swap Space
Every Systems Administrator attaches more importance to paging and swap space because
they are supposed to be the key parameters to fix a system that does not have enough
memory. This misconception is thought to be true by many people, at various levels, in a lot
of organizations. The fact is that if the system does not have enough real memory to run the
applications, adding more paging and swap space is not going to help. Depending on the
applications running on the system, swap space should start at least 1.5 times physical
memory. Many high-performance applications require 4 to 6 times real memory so the actual
amount of paging and swap space is variable, but 1.5 times is a good place to start.
A page fault happens when a memory segment, or page, is needed in memory but is not currently resident in memory. When a page fault occurs, the system attempts to load the needed data into memory; this is called paging or swapping, depending on the Unix system
being used. When the system is doing a lot of paging in and out of memory, this activity
needs monitoring. If the system runs out of paging space or is in a state of continuous
swapping, such that as soon as a segment is paged out of memory it is immediately needed
again, the system is thrashing. If this thrashing condition continues for very long, there is a
possible risk of the system crashing. One of the goals of the developed software is to
minimise the page faults.
Each of the four Unix flavors, AIX, HP-UX, Linux, and Solaris, uses different commands to list the swap space usage, and the output of each command also varies. The goal of this paper is to create an all-in-one shell script that will run on any of our four Unix flavors. A sample output of the script is presented below, followed by a brief sketch of the idea.
Paging Space Report for GRKRAO
Thu Oct 25 14:48:16 EDT 2007
Total MB of Paging Space : 336MB
Total MB of Paging Space Used : 33MB
Total MB of Paging Space Free : 303MB
Percent of Paging Space Used : 10%
Percent of Paging Space Free : 90%
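An illustrative, Linux-only sketch that produces a report in the format of the sample above by reading SwapTotal and SwapFree from /proc/meminfo; the other Unix flavors would need their own commands, as noted.

# Linux-only sketch producing a paging-space report like the sample above,
# by reading SwapTotal/SwapFree from /proc/meminfo. AIX, HP-UX and Solaris
# would need their own commands, as discussed in the text.

def swap_report(meminfo_path="/proc/meminfo"):
    values = {}
    with open(meminfo_path) as fh:
        for line in fh:
            key, rest = line.split(":", 1)
            values[key] = int(rest.split()[0])       # value in kB
    total_mb = values["SwapTotal"] // 1024
    free_mb = values["SwapFree"] // 1024
    used_mb = total_mb - free_mb
    used_pct = 100 * used_mb // total_mb if total_mb else 0
    print(f"Total MB of Paging Space : {total_mb}MB")
    print(f"Total MB of Paging Space Used : {used_mb}MB")
    print(f"Total MB of Paging Space Free : {free_mb}MB")
    print(f"Percent of Paging Space Used : {used_pct}%")
    print(f"Percent of Paging Space Free : {100 - used_pct}%")

swap_report()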
8 Monitoring System Load
There are three basic things to look at when monitoring the load on the system:
1. The first is to look at the load statistics produced.
2. The second is to look at the percentages of CPU usage for the system/kernel, user/applications, I/O wait state and idle time.
3. The final step in monitoring the CPU load is to find CPU hogs.
Most systems have a top-like monitoring tool that shows the CPUs, processes and users in descending order of CPU usage.
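A small sketch of the first step follows, comparing the load averages against the number of CPUs; the per-CPU threshold is an illustrative assumption.

# Sketch of the first step above: check the load averages against the number
# of CPUs. The warning threshold is illustrative.

import os

def check_load(threshold_per_cpu=1.0):
    load1, load5, load15 = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(f"load averages: {load1:.2f} {load5:.2f} {load15:.2f} on {cpus} CPUs")
    if load1 > threshold_per_cpu * cpus:
        print("WARNING: 1-minute load exceeds the per-CPU threshold")

check_load()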
9 File Locking
File locking allows multiple programs to cooperate in their access to data. This paper looks at the following two schemes of file locking (a sketch of the first scheme is given after the list):
1. A simple binary semaphore scheme
2. A more complex file locking scheme that locks different parts of a file for either shared or exclusive access
10 Device Drivers
Device Drivers are needed to control any peripherals connected to a server. This paper
focuses on the following aspects of device drivers where an authorized client can control
devices connected to the server.
1. Registering the device
2. Reading from a device and Writing to a device
3. Getting memory in device driver
11 Database Administration
C language is used to access MySQL. In this paper, the following database administrative features are implemented to be run from an authorized client:
1. Create a new database
2. Delete a database
3. Change a password
4. Reload the grant tables that control permissions
5. Provide the status of the database server
6. Repair any data tables
7. Create users with permissions
12 Using the algorithm described in Section 4 above, we developed the following C program code:
Sample Client Program

/* Client: sends a file name to the server and prints the file attributes
   returned by the server, one field at a time. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <strings.h>     /* bzero() */
#include <unistd.h>      /* read(), write(), close() */
#include <stdlib.h>      /* exit() */
#include <stdio.h>

int main(int argc, char **argv)
{
    int sockfd, len;
    char buf[10240];
    struct sockaddr_in servaddr;

    if (argc != 2) { fprintf(stderr, "usage: %s <server IP>\n", argv[0]); exit(1); }
    if ((sockfd = socket(AF_INET, SOCK_STREAM, 0)) < 0) { perror("socket error"); exit(1); }

    bzero(&servaddr, sizeof(servaddr));
    servaddr.sin_family = AF_INET;
    servaddr.sin_port = htons(13);
    if (inet_pton(AF_INET, argv[1], &servaddr.sin_addr) <= 0) { perror("SERVER ADDR"); exit(1); }
    if (connect(sockfd, (struct sockaddr *)&servaddr, sizeof(servaddr)) < 0) { perror("connect error"); exit(1); }

    buf[0] = '\0';
    printf("Enter the File name \n");
    scanf("%s", buf);
    if (write(sockfd, buf, 100) < 0) { printf("write error "); exit(1); }

    /* The server replies with one 100-byte field per attribute. */
    if ((len = read(sockfd, buf, 100)) < 0) { printf("read error \n"); exit(1); }
    else { printf(" Inode Number = %s\n", buf); }
    if ((len = read(sockfd, buf, 100)) < 0) { printf("read error \n"); exit(1); }
    else { printf(" No of links = %s\n", buf); }
    if ((len = read(sockfd, buf, 100)) < 0) { printf("read error \n"); exit(1); }
    else { printf("Size of file in bytes = %s\n", buf); }
    if ((len = read(sockfd, buf, 100)) < 0) { printf("read error \n"); exit(1); }
    else { printf("UID = %s\n", buf); }
    if ((len = read(sockfd, buf, 100)) < 0) { printf("read error \n"); exit(1); }
    else { printf("GID = %s\n", buf); }
    if ((len = read(sockfd, buf, 100)) < 0) { printf("read error \n"); exit(1); }
    else { printf("Type and Permissions = %s\n", buf); }
    if ((len = read(sockfd, buf, 100)) < 0) { printf("read error \n"); exit(1); }
    else { printf("Last Modification Time = %s\n", buf); }
    if ((len = read(sockfd, buf, 100)) < 0) { printf("read error \n"); exit(1); }
    else { printf("Last Access Time = %s\n", buf); }

    close(sockfd);
    return 0;
}
13 Using the algorithm described in Section 5 above, we developed the following C program code:
Sample Server Program

/* Server: accepts a connection, reads a path name from the client, and
   returns the attributes obtained from lstat(), one field at a time. */
#include <sys/socket.h>
#include <sys/stat.h>    /* lstat(), struct stat */
#include <netinet/in.h>
#include <arpa/inet.h>
#include <strings.h>     /* bzero() */
#include <unistd.h>      /* read(), write(), close() */
#include <stdlib.h>      /* exit() */
#include <time.h>        /* ctime() */
#include <stdio.h>
#define MAXLINE 10024
#define LISTENQ 10

int main(int argc, char **argv)
{
    int listenfd, connfd, len;
    struct sockaddr_in servaddr;
    struct stat statbuf;
    char buff[MAXLINE];

    listenfd = socket(AF_INET, SOCK_STREAM, 0);
    bzero(&servaddr, sizeof(servaddr));
    servaddr.sin_family = AF_INET;
    servaddr.sin_addr.s_addr = htonl(INADDR_ANY);
    servaddr.sin_port = htons(13);
    bind(listenfd, (struct sockaddr *)&servaddr, sizeof(servaddr));
    listen(listenfd, LISTENQ);

    connfd = accept(listenfd, (struct sockaddr *)NULL, NULL);
    if ((len = read(connfd, buff, 100)) < 0) { printf("read error \n"); exit(1); }
    else printf("%s\n", buff);

    /* Collect the attributes of the requested path and send them back,
       one 100-byte field per attribute. */
    lstat(buff, &statbuf);
    sprintf(buff, "%lu", (unsigned long)statbuf.st_ino);
    if (write(connfd, buff, 100) < 0) { printf("write error "); exit(1); }
    sprintf(buff, "%lu", (unsigned long)statbuf.st_nlink);
    if (write(connfd, buff, 100) < 0) { printf("write error "); exit(1); }
    sprintf(buff, "%ld", (long)statbuf.st_size);
    if (write(connfd, buff, 100) < 0) { printf("write error "); exit(1); }
    sprintf(buff, "%u", (unsigned)statbuf.st_uid);
    if (write(connfd, buff, 100) < 0) { printf("write error "); exit(1); }
    sprintf(buff, "%u", (unsigned)statbuf.st_gid);
    if (write(connfd, buff, 100) < 0) { printf("write error "); exit(1); }
    sprintf(buff, "%o", (unsigned)statbuf.st_mode);
    if (write(connfd, buff, 100) < 0) { printf("write error "); exit(1); }
    sprintf(buff, "%s", ctime(&statbuf.st_mtime));
    if (write(connfd, buff, 100) < 0) { printf("write error "); exit(1); }
    sprintf(buff, "%s", ctime(&statbuf.st_atime));
    if (write(connfd, buff, 100) < 0) { printf("write error "); exit(1); }

    close(connfd);
    close(listenfd);
    return 0;
}
Sample Outputs
[root@grkrao 01Oct]#./a.out
Message From Client : /etc/passwd

[grkrao@grkraoclient 01Oct]#./a.out 123.0.57.44
Enter the File name : /etc/passwd

Message From Server :

Inode Number = 1798355
No of links = 1
Size of file in bytes = 3263
UID = 0
GID = 0
Type and Permissions = 100644
Last Modification Time = Fri Apr 20 16:12:06 2007
Last Access Time = Tue Apr 24 11:46:33 2007
14 Extensions
Adding authentication to individual client requests
Restricting clients to make specific requests
Making a selected client work as a proxy server for administration
Embedding both the server and client side software
References
[1] W. Richard Stevens, Advanced Programming in the Unix Environment, Pearson Education, pp. 91-136
[2] W. Richard Stevens, Unix Network Programming, Vol. 1: Networking APIs: Sockets and XTI, Pearson Education, pp. 3-140
[3] Uresh Vahalia, Unix Internals: The New Frontiers, Pearson Education, pp. 43-50
[4] W. Richard Stevens, Unix Network Programming, PHI, pp. 25-83

Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Development of GUI Based Software Tool
for Propagation Impairment Predictions
in Ku and Ka Band - TRAPS

Sarat Kumar K., Advanced Centre for Atmospheric Sciences (ISRO Project), Sri Venkateswara University, Tirupati-522502, India
Vijaya Bhaskara Rao S., Advanced Centre for Atmospheric Sciences (ISRO Project), Sri Venkateswara University, Tirupati-522502, India
D. Narayana Rao, HYARC, Nagoya University, Japan

Abstract

The presence of the atmosphere and weather conditions may have a significant detrimental effect on the transmission/reception performance of earth-satellite links operating in the millimeter wavelength range. To attain fine frequency planning of operating or future satellite communication systems, the predictive analysis of propagation effects should be performed with the payload in the suitable orbit slot and, for the earth segment, with the antenna characteristics and meteorological parameters specific to the station site. In line with this methodology, the Tropical Rain Attenuation Predictions and Simulations (TRAPS) tool, a MATLAB based GUI software, was developed for the real time processing of input propagation parameters supplied during execution to a comprehensive suite of prediction models. It also contains a database for assessing atmospheric propagation impairments for different locations in India, stored for offline viewing, which helps the engineer in developing reliable communication systems operating in higher frequency bands. The concentration of this paper is more on the software tool rather than on the propagation impairments. For information on propagation impairments refer to [Crane R K, 1996; W L Stutzman, 1993; R L Olsen, 1978].
Keywords: Ku and Ka band, Rain Attenuation, Propagation Impairments GUI Based
Software Tool, TRAPS.
1 Introduction
The prediction of radio propagation factors for radio systems planning is invariably
undertaken with the aid of models representing a mathematical simplification of physical
reality. In some cases the models may be simply formulae derived empirically as a result of
experimental link measurements, but in many cases the models are derived from analysis of
the propagation phenomena in media with time-varying physical structure. In all cases the
models require input data (usually meteorological data) so as to be able to make a prediction.
Usually there are important constraints on situations in which various models may be
applied and also the input data will have to meet certain conditions in order to sustain a
certain accuracy of prediction. [ITU-R models P 618-7, 837-3, 839-2].
The paper describes an initial design study for an intelligent (computer-based) model
management system for radio propagation prediction on earth-space links. The system
includes "deep knowledge" about the relationships between the fundamental concepts used by
radio propagation experts. It manages application of the models in particular regions of the
world and is able to provide intelligent database support. It is intended for use both by the
propagation expert as an aid to developing and testing models and also by the radio system
planner as a means of obtaining routine propagation predictions using state-of-the-art
knowledge on appropriate models and data.
In predicting propagation factors for radio system design we are concerned with the following
actions:
Definition of link geometry
Evaluation of antenna performance on link budget, including footprint calculation and
polarisation properties
Selection of propagation models appropriate to particular frequencies and locations
Evaluation of propagation factors for specific system requirements (outage
probabilities)
Conventionally these activities are carried out with the aid of a computer, usually using a
series of separate programs linked together or written by the user. In the case of propagation
models we may be concerned with the application of straightforward empirical or semi-
empirical formulae. For frequencies above 10 GHz, empirical or semi-empirical formulae are used for propagation prediction, covering the following factors: boresight error, attenuation and fading (including scintillation), cross-polarisation, delay spread, antenna noise and interference. The formulae for these factors are generally simple to program and evaluate, but the knowledge associated with the conditions under which each step may be applied and with the input data requirements is often complex, the more so because many of the formulae are empirical and only apply within strict limits, for example on frequency, elevation or type of climate.
What is required as a tool for the system designer and propagation expert is an intelligent
kind of data management system able to store and retrieve the formulae and associated
conditions for propagation prediction, to select those models appropriate to the particular
system requirements and local conditions, to select the best available data for use with the
models and then to calculate the specific propagation factors.
Conventionally a propagation expert or system designer may retain computer source code or
compiled code on his chosen machine, for a range of propagation problems, including the
antenna and link geometry calculations mentioned earlier. As this collection of formulae becomes larger, the task of maintaining the code, of linking elements together, and of remembering the required data formats and output display possibilities grows multiplicatively. If this type of system provides an in-house consultancy type of service, with many experts contributing to the pool of formulae and data over a period of time, then the situation can get out of hand.
What we are proposing is a system of managing these mundane tasks, enhanced with vital
knowledge on the conditions for application of formulae and selection of data.
What we require is an intelligent system capable of linking together concepts (e.g. models
and data requirements) and applying rule based reasoning. These requirements lead us to
consider the latest generation of intelligent knowledge based system tools, based on an
object-oriented approach with a reasoning toolkit.
A software system which allows us to define the key concepts in a particular field in terms of
equations, conditions or rules, plus descriptions or explanations, allows us to define various
types of relationships between these concepts, to associate properties via a particular type of
relationship and to perform goal oriented or data driven reasoning, should prove to be a
powerful tool in addressing our specialised and well contained propagation factor prediction
problem. [Pressman RS, 1987]

Fig. 1: Schematic of the principal objects and their dependencies

Fig. 2: Proposed system architecture using all the available model information
The principal objects in a propagation prediction system and their dependencies are illustrated in Figure 1. Each block represents an object (or class of objects) and consists of three parts: the object's name, its operations or methods, and its data items. We note that objects are defined for the (satellite) system model, for the propagation factors, for antenna footprints and for three databases (for radio, site and meteorological data). The object-oriented environment forms an inner shell, interfacing to the computer operating system and other utilities, as shown in Figure 2.
2 Implementation Using Models
Processing a propagation prediction task is a lot like assembling models mutually into a
rather complex tree graph where each model may rely on one or many other lower-level
models, then executing the models by successive layers of node-points till end results are
produced as outputs of the main calculation schema. A model embedded in this structure,
although being called in different radio-propagation contexts, may frequently receive
unchanged values for some parameters of its argument list. Thus, it seemed beneficial to
develop a model implementation that applies to different parameter types for the same
general action.
In adopting MATLAB as the programming language for implementing propagation models, MATLAB's inherent ability to perform matrix computations has been exploited at all calculation levels by overloading model functions through interfaces, each interface defining a particular combination of input parameter types. Every in-house subroutine quoted inside a
model algorithm is adequately defined in terms of the type of input parameters it supports.
The model in turn is built in such a way that the function output can be processed in a
consistent and error-free manner, and each variable returned by the model has the expected
MATLAB object format. The intended outcome is that model functions are safely executed
with different combinations of scalars, vectors, matrices or multi-dimensional arrays as
arguments. [P. Marchand et al., 2002]
Furthermore, the interface mechanism allows the same function to operate on a common data abstraction but with multiple operand types, and relieves the programmer of the complexity
of assigning specific names to functions that perform the same model for different legitimate
use cases of parameter sets.
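As an illustration of this idea (TRAPS itself relies on MATLAB function overloading), the short Python sketch below shows one model function that transparently accepts either a scalar or a nested list and returns a result of matching shape; the power-law form and the coefficient values are placeholders, not an actual ITU-R model.

# Illustrative sketch of type-transparent model evaluation: the same function
# handles a scalar, a vector or a matrix of rain rates. The coefficients k and
# alpha below are placeholders, not values from an actual prediction model.

def specific_attenuation(rain_rate, k=0.02, alpha=1.2):
    if isinstance(rain_rate, (list, tuple)):
        # Recurse element-wise so vectors and matrices keep their shape.
        return [specific_attenuation(r, k, alpha) for r in rain_rate]
    return k * rain_rate ** alpha           # scalar case: gamma = k * R**alpha

print(specific_attenuation(25.0))                           # scalar
print(specific_attenuation([10.0, 25.0, 50.0]))             # vector
print(specific_attenuation([[10.0, 25.0], [50.0, 75.0]]))   # matrix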
3 Functional Description
TRAPS consists of an interface supporting the collection of input parameters for a prediction task, the execution of this task and the visualization of its results. Numerous models have been implemented from scratch in MATLAB in such a way that they can be integrated into the so-called TRAPS schema, using the integrated features of MATLAB, which supports the full range of prediction tasks for geostationary satellites. The application then provides a task-oriented abstraction on top of this integrated MATLAB schema, supporting the selection of a specific usage scenario (interface) of the TRAPS schema and the collection of its expected input parameters in the proper format [C. Bachiller et al., 2003]. The user interacts with the system through a sequence of screens supporting the selection of satellite(s), site(s) or region, the input of statistical parameters, the choice of effects and models to be calculated, the triggering of the calculation and the visualization of results.
From the end user's viewpoint, TRAPS is an application that is accessible from the MATLAB platform and supports three modes of operation: Satellite mode, Location mode and Contour mode. The user may select the location mode to consider the influence of link geometry and radio transmission parameters on signal degradations, with tight control on the statistical parameters relating to the activity of atmospheric factors in the surroundings of the ground station. The single-site mode interface offers a set of handy link parameter tables through which any conceivable pattern of slant-path connections from a common earth-station site can be defined. The link-parameters interface page includes a user-held list of parameters, defined during previous sessions and stored there for subsequent reuse [P Daniel et al., 2004].
In the satellite calculation mode, the TRAPS interface asks for the selection of the required satellite
for analysis of propagation impairments. Alternatively, the selection of a single satellite can
be combined with an arbitrary number of earth stations. Site data and a topographic database
are available for support. For broadcasting applications specifically, the propagation
capability of the downlink satellite channel can be examined over any geographical region
within the overall satellite coverage area. After the calculations have been performed, the
user is able to store the input parameters of the calculation and the numerical and graphical
results into the system's database, where they can be accessed at any later point in time.
The windows structure is illustrated in figure 3. The location mode, satellites mode, and
contour mode calculations are the three main courses of action in the interface, and they make
use of a set of common pages whose contents are dynamically adapted to the context of the calculation in progress.
After opening the TRAPS, the user is presented with a welcome page that proposes a choice
between the satellites mode, location mode, and contour mode.
In general, performing each step of a calculation will enable the next icon in the sequence of
steps that should be followed. It is possible at any point in time to go back to a previous step
by clicking on the icon associated with this step. Modifying the choices made at a certain step
may necessitate that the user goes through the following steps again. Figures 4-10 show the
windows seen during execution when obtaining the attenuation parameters.

Fig. 3: Start Window of TRAPS

Fig. 4: Satellite Mode Window of TRAPS


Fig. 5: Satellite Mode Window for choosing the location: TRAPS

Fig. 6: Window for choosing the parameter for calculation: TRAPS

Fig. 7: Window for viewing the required parameter plots: TRAPS



Fig. 8: Window showing the plot after choosing the parameter: TRAPS

Fig. 9: Location Mode Main Window TRAPS

Fig. 10: Contour Mode Results Window TRAPS
4 Technical Overview - TRAPS
TRAPS supports a fully GUI-based architecture. TRAPS was developed using the
features of MATLAB 2007b. The calculation models are run using built-in functions, not
functions external to the application. If the user has a local copy of MATLAB, he may download
results in MAT file format too. However, this is unnecessary if the user is solely interested in
viewing results in graphical form, since all the most relevant graphs are generated
automatically without extra user intervention upon completion of the calculation.
Whenever a user initiates a calculation request, the application generates the parameters and
invokes the execution of the compiled schema through the MATLAB compiler. When it is
scheduled for execution, the compiled schema reads its parameter MAT file, analyzes its
contents in order to determine its course of action (which effects should be calculated, with
which models and data sets), performs a set of calculations and then stores the numerical
results in another MAT file. It also generally produces a set of graph files for results
visualization.
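The parameter hand-off described above can be pictured with a short sketch. TRAPS itself is implemented entirely in MATLAB; the Python fragment below (using scipy.io) is only an illustration of the write-parameters / run / read-results cycle, with hypothetical field names, file names and a stub standing in for the compiled calculation schema.

# Illustrative sketch of the MAT-file hand-off described above (not actual TRAPS code).
# All field names, file names and the placeholder calculation are hypothetical.
from scipy.io import savemat, loadmat

# 1. The GUI collects the inputs and writes them to a parameter MAT file.
params = {
    "satellite": "EXAMPLE-SAT",           # hypothetical satellite identifier
    "frequency_GHz": 20.0,                # Ka-band downlink frequency
    "effects": "rain cloud scintillation",
}
savemat("traps_params.mat", params)

# 2. The compiled schema (a MATLAB executable in TRAPS) would read this file,
#    decide which effects and models to run, and write its results. A stub plays
#    that role here so the round trip can be executed end to end.
inputs = loadmat("traps_params.mat")
attenuation_db = 3.2 * inputs["frequency_GHz"].item()   # placeholder calculation
savemat("traps_results.mat", {"rain_attenuation_dB": attenuation_db})

# 3. The GUI reads the results MAT file back for display and plotting.
results = loadmat("traps_results.mat")
print(results["rain_attenuation_dB"])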
The development of TRAPS was based entirely on the software package MATLAB and
specifically, on the GUI (Graphical User Interface) environment offered. No knowledge of
this software is required from the user. The inputs of the tool are also briefly described. This
graphic tool can be used either for the design of Satellite Communications or for research
purposes.
5 Summary and Conclusions
The performance of complex wave propagation prediction tasks requires propagation
engineers to use a significant number of mathematical models and data sets for estimating the
various effects of interest to them, such as attenuation by rain, by clouds, scintillation, etc.
Although the models are formally described in the literature and standardized e.g. by the
ITU-R, there is no easy-to-use, integrated and fully tested implementation of all the relevant
models, on a common platform. On the contrary, the models have usually been implemented
using different tools and languages such as MATLAB, IDL, PV-Wave, C/C++ and
FORTRAN. It is often necessary to have a good understanding of a model's implementation
in order to use it correctly; otherwise it may produce errors or, worse, wrong results when
supplied with parameters outside of an expected validity range. Some models only support
scalar values as inputs while others accept vectors or matrices of values, e.g. for performing a
calculation over a whole region as opposed to a discrete point. These issues worsen whenever
the engineer wishes to combine several models, which is necessary for most prediction tasks.
In addition, assessing the validity of results produced by the combination of multiple models
is also a complex issue, especially when their implementation originates from various parties.
Finally, as no common user interface is supported, the combination of models requires
tedious manipulations and transformations of model inputs and outputs.
The paper has outlined the design features and specifications of TRAPS, Tropical Rain
Attenuation Predictions and Simulations, used for propagation prediction on slant paths. The
software tool achieves the integration of propagation models and radio meteorological
datasets and provides support for the analysis of tropospheric effects on a large variety of
earth-space link scenarios. The actual value of the TRAPS software application can be
judged by the functionality of its GUI, its efficiency in performing advanced
model calculations and the content of the results returned. Additional capabilities of
TRAPS include offline viewing of results already stored in the database and run-time processing
of user-supplied inputs to produce the required output. A web-based version of the tool is under
development, which will enable propagation engineers to predict the attenuation online with
the specified input parameters of a model.
6 Acknowledgement
The authors would like to thank the Advanced Centre for Atmospheric Sciences project
supported by Indian Space Research Organisation (ISRO), and Department of Physics, Sri
Venkateswara University, Tirupati.
References
[1] [C. Bachiller, H. Esteban, S. Cogollos, A. San Blas, and V. E. Boria] Teaching of Wave Propagation
Phenomena using MATLAB GUIs at the Universidad Politecnica of Valencia, IEEE Antennas and
Propagation Magazine, 45, 1, February 2003, pp. 140-143.
[2] [ITU-R] Characteristics of Precipitation for Propagation Modeling, Propagation in Non-Ionized Media,
Rec. P.837-3, Geneva, 2001.
[3] [ITU-R] Propagation Data and Prediction Methods Required for the Design of Earth-Space
Telecommunication Systems, Propagation in Non-Ionized Media, Rec. P.618-7, Geneva, 2001.
[4] [ITU-R] Rain Height Model for Prediction Methods, Propagation in Non-Ionized Media, Rec. P.839-3,
Geneva, 2001.
[5] [P. Marchand and O. T. Holland] Graphics and GUIs with MATLAB, Third Edition, Boca Raton, CRC
Press, 2002.
[6] [Pantelis-Daniel M. Arapoglou, Athanasios D. Panagopoulos, George E. Chatzarakis, John D.
Kanellopoulos, and Panayotis G. Cottis] Diversity Techniques for Satellite Communications: An
Educational Graphical Tool, IEEE Antennas and Propagation Magazine, Vol. 46, No. 3, June 2004.
[7] [Pressman R S] Software Engineering, TMH, 2nd Edition, 1987.
[8] [R. K. Crane] Electromagnetic Wave Propagation through Rain, New York, Wiley, 1996.
[9] [R. L. Olsen, D. V. Rogers, and D. B. Hodge] The aR^b relation in the calculation of rain attenuation, IEEE
Trans. on Antennas and Propagation, Vol. 26, No. 2, pp. 318-329, 1978.
[10] [W. L. Stutzman] The special section on propagation effects on satellite communication links,
Proceedings of the IEEE, Vol. 81, No. 6, 1993, pp. 850-855.
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Semantic Explanation of Biomedical
Text Using Google

B.V. Subba Rao, Department of IT, P.V.P. Siddhartha Institute of Technology, Vijayawada-520007, bvsrau@gmail.com
K.V. Sambasiva Rao, M.V.R. College of Engineering, Vijayawada, Krishna Dt., A.P., kvsambasivarao@rediffmail.com

Abstract

With the rapidly increasing quantity of biomedical text, there is a need for
automatic extraction of information to support biomedical researchers, and
hence a need for effective Natural Language Processing tools to assist in
organizing and retrieving this information. Because biomedical information
databases are incomplete, the extraction is not straightforward using
dictionaries, and several approaches using contextual rules and machine
learning have previously been proposed. Our work is inspired by these
approaches, but is novel in that it uses Google for semantic explanation of
biomedical words. The semantic explanation (annotation) accuracy of 52%,
obtained on words not found in the Brown Corpus, Swiss-Prot or LocusLink
(accessed using Gsearch.org), justifies further work in this direction.
Keywords: Biomedical text, Google, Data Mining, Semantic explanation.
1 Introduction
With the increasing importance of accurate and up-to-date databases for biomedical research,
there is a need to extract information from biomedical research literature, e.g. those indexed
in MEDLINE [8]. Examples of information databases are LocusLink, UniGene and Swiss-
Prot [3]. Due to the rapidly growing amounts of biomedical literature, the information
extraction process needs to be automated. So far, the extraction approaches have provided
promising results, but they are not sufficiently accurate and scalable.
Methodologically, all the suggested approaches belong to the information extraction field, and
in the biomedical domain they range from simple automatic methods to more sophisticated,
but manual, methods. Good examples are: learning relationships between proteins/genes based
on co-occurrences in MEDLINE abstracts [10], manually developed information extraction
rules (e.g. for protein names) [2], classifiers trained on manually annotated training corpora
(e.g. [4]), and our previous work on classifiers trained on automatically annotated training
corpora.
Examples of biological named entities in a textual context are i) "duodenum, a peptone meal
in the" and ii) "subtilisin plus leucine amino-peptidase plus prolidase followed". An important
part of information extraction is to know what the information is, e.g. knowing that the term
gastrin is a protein or that Tylenol is a medication. Obtaining and adding this knowledge to
given terms and phrases is called semantic tagging or semantic annotation.
1.1 Research Hypothesis

Fig. 1: Google is among the biggest known information haystacks.
Google is probably the world's largest available source of heterogeneous electronically
represented information. Can it be used for semantic tagging of textual entities in biomedical
literature, and if so, how? The rest of this paper is organized as follows. Section 2 describes
the materials used, section 3 presents our method, section 4 presents empirical results, section
5 describes related work, and section 6 presents conclusions and future work.
2 Materials
The materials used included biomedical (sample of MEDLINE abstract) and general English
(Brown) textual corpora, as well as protein databases. See below for a detailed overview.
2.1 Medline Abstracts-Gastrin-Selection
The US National Institutes of Health grants a free academic license for PubMed/MEDLINE
[9, 10]. It includes a local copy of 6.7 million abstracts, out of the 12.6 million entries that are
available on their web interface. As subject for the expert validation experiments we used the
collection of 12,238 gastrin-related MEDLINE abstracts that were available in October 2005.
2.2 Biomedical Information Databases
As a source for finding already known protein names we used a web search system called
Gsearch, developed at Department of Cancer Research and Molecular Medicine at NTNU. It
integrates common online protein databases, e.g. Swiss-Prot, LocusLink and UniGene.
2.3 The Brown Corpus
The Brown repository (corpus) is an excellent resource for training a Part Of Speech (POS)
tagger. It consists of 1,014,312 words of running text of edited English prose printed in the
United States during the calendar year 1961. All the tokens are manually tagged using an
extended Brown Corpus Tagset, containing 135 tags. The Brown corpus is included in the
Python NLTK data-package, found at Sourceforge.
3 Our Method
We have taken a modular approach where every sub module can easily be replaced by other
similar modules in order to improve the general performance of the system. There are five
modules connected to the data gathering phase, namely data selection, tokenization, POS-
tagging, stemming and Gsearch. The sixth and last module does a Google search for
each extracted term; see Figure 2.
3.1 Data Selection
The data selection module uses the PubMed Entrez online system to return a set of PubMed IDs
(PMIDs) for a given protein, in our case gastrin (symbol GAS). The PMIDs are matched
against our local copy of MEDLINE to extract the specific abstracts.
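As a rough illustration only, the PMID lookup step could be reproduced today with the Biopython Entrez client; this is a stand-in sketch, not the tool used in the paper, and the query term simply follows the gastrin example above.

# Hedged illustration of the PMID lookup, using the Biopython Entrez client
# (the paper queried the PubMed Entrez online system directly; this is a stand-in).
from Bio import Entrez

Entrez.email = "you@example.org"          # NCBI requires a contact address

# Search PubMed for gastrin-related abstracts and collect their PubMed IDs (PMIDs).
handle = Entrez.esearch(db="pubmed", term="gastrin", retmax=20)
record = Entrez.read(handle)
handle.close()

pmids = record["IdList"]
print(len(pmids), "PMIDs returned, e.g.:", pmids[:5])
# In the paper, these PMIDs are then matched against a local MEDLINE copy
# to extract the corresponding abstracts.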
3.2 Tokenization
The text is tokenized to split it into meaningful tokens, or words. We use the Whitespace
Tokenizer from NLTK with some extra processing to adapt to the Brown Corpus, where
every special character (such as '(', ')', '-', ',' and '.') is treated as a separate token. Words in
parentheses are clustered together and tagged as a single token with the special tag Paren.
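A minimal sketch of this tokenization step is shown below, assuming NLTK is installed; the parenthesis clustering is a simplified version of what the text describes, and the sample sentence is invented.

# Minimal sketch of the tokenization step (NLTK assumed); the parenthesis handling
# below is a simplified version of the clustering described in the text.
import re
from nltk.tokenize import WhitespaceTokenizer

text = "Gastrin (a peptide hormone) stimulates secretion of gastric acid."

# Split special characters off as separate tokens, as in the Brown Corpus convention.
prepared = re.sub(r"([()\-,.])", r" \1 ", text)
tokens = WhitespaceTokenizer().tokenize(prepared)

# Cluster a parenthesised group back into a single token tagged "Paren".
clustered, buffer = [], None
for tok in tokens:
    if tok == "(":
        buffer = []
    elif tok == ")" and buffer is not None:
        clustered.append(("(" + " ".join(buffer) + ")", "Paren"))
        buffer = None
    elif buffer is not None:
        buffer.append(tok)
    else:
        clustered.append((tok, None))
print(clustered)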
3.3 POS Tagging
Next, the text is tagged with Part-of-Speech (POS) tags using a Brill tagger trained on the
Brown Corpus. This module acts as an advanced stop-word-list, excluding all the everyday
common American English words from our protein search. Later, the actually given POS tags
are used also as context features for the neighboring words.
3.4 Porter-Stemming
We use the Porter Stemming Algorithm to remove even more everyday words from the
possibly biological term candidate list. If the stem of a word can be tagged by the Brill
tagger, then the word itself is given the special tag STEM, and thereby transferred to the
common word list.
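The stop-word role of the tagger and the stemming check of sections 3.3 and 3.4 can be sketched as follows. NLTK is assumed, the Brown corpus data must be downloaded, and a unigram tagger stands in for the Brill tagger actually used in the paper, so this is only an approximation of the filtering behaviour.

# Sketch of the POS-based filtering and Porter-stem check (NLTK and the Brown
# corpus data assumed). A unigram tagger trained on Brown stands in for the
# Brill tagger used in the paper.
import nltk
from nltk.corpus import brown
from nltk.stem import PorterStemmer

tagger = nltk.UnigramTagger(brown.tagged_sents())
stemmer = PorterStemmer()

def classify(word):
    """Return 'COMMON', 'STEM' or 'CANDIDATE' for a single token."""
    if tagger.tag([word])[0][1] is not None:
        return "COMMON"      # known everyday word, acts as a stop word
    if tagger.tag([stemmer.stem(word)])[0][1] is not None:
        return "STEM"        # stem is a known word, moved to the common word list
    return "CANDIDATE"       # possibly a biological term, passed on to Gsearch

for w in ["secretion", "gastrin", "receptors"]:
    print(w, "->", classify(w))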


Fig. 2: Overview of Our Methodology (named Biogoogle)
3.5 Gsearch
The Gsearch module identifies and removes already known entities from the search. After the
lookup in Gsearch there are still some unknown words that are not yet stored in our dictionaries
or databases, so in order to do any reasoning about these words it is important to know which
class they belong to. Therefore, in the next phase they are subjected to some advanced
Google searching in order to determine this.
3.6 Google Class Selections
We have a network of 275 nouns, arranged in a semantic network of the form "X is a kind of
Y". These nouns represent the classes that we want to annotate each word with. The input to
this phase is a list of hitherto unknown words. From each word a query of the form shown in
the example below is formed (query syntax: "Word is (a|an)"). These queries are then
fed to the PyGoogle module, which allows 1000 queries to be run against the Google search
engine every day with a personal key. In order to maximize the use of this quota,
the results of every query are cached locally, so that each given query will be executed only
once.
If a solution to the classification problem is not present among the first 10 results returned,
the result set can be expanded by 10 at a time, at the cost of one of the thousand quota-queries
every time.

Each returned hit from Google contains a snippet with the given query phrase and
approximately 10 words on each side of it. We use some simple regular grammars to match
the phrase and the words following it. If the next word is a noun it is returned. Otherwise,
adjectives are skipped until a noun is encountered, or a miss is returned.
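The query construction, local caching and snippet parsing described above can be sketched as follows. PyGoogle and its daily quota are not reproduced here; search_snippets() is a hypothetical stand-in returning canned snippets, and the adjective list is a toy replacement for the POS-based filtering.

# Sketch of the class-selection step. search_snippets() is a hypothetical stand-in
# for a (cached) PyGoogle query; the canned snippet and adjective list are toys.
import re

CACHE = {}

def search_snippets(query):
    """Hypothetical search call returning result snippets for a query."""
    canned = {'"gastrin is a"': ["gastrin is a linear peptide hormone produced by G cells"]}
    return canned.get(query, [])

def cached_search(query):
    if query not in CACHE:                 # each distinct query hits the engine only once
        CACHE[query] = search_snippets(query)
    return CACHE[query]

ADJ = {"linear", "small", "large", "important"}    # toy adjective list (a POS tagger in practice)

def classify_word(word):
    for article in ("a", "an"):
        for snippet in cached_search('"%s is %s"' % (word, article)):
            tail = re.search(r"is %s (.+)" % article, snippet)
            if not tail:
                continue
            for candidate in tail.group(1).split():
                if candidate.lower() not in ADJ:   # skip adjectives until a noun-like word
                    return candidate
    return None                                    # a miss

print(classify_word("gastrin"))                    # -> "peptide" (with the toy data)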
4 Empirical Results
Table 1: Semantic classification of untagged words

Classifier    TP/TN    FP/FN    Precision/Recall    F-Score    CA
Biogoogle     24/80    31/65    43.6/27.0           33.3       52.0
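The figures in Table 1 are mutually consistent; the short check below recomputes precision, recall, F-score and classification accuracy (CA) from the raw counts reported in the table.

# Recompute the Table 1 metrics from the raw counts (TP=24, TN=80, FP=31, FN=65).
TP, TN, FP, FN = 24, 80, 31, 65
N = TP + TN + FP + FN                      # 200 expert-classified words

precision = TP / (TP + FP)                 # 24/55  -> 43.6 %
recall    = TP / (TP + FN)                 # 24/89  -> 27.0 %
f_score   = 2 * precision * recall / (precision + recall)   # -> 33.3
ca        = (TP + TN) / N                  # 104/200 -> 52.0 % classification accuracy

print("Precision %.1f%%  Recall %.1f%%  F-score %.1f  CA %.1f%%"
      % (100*precision, 100*recall, 100*f_score, 100*ca))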
5 Related Work
Our specific approach was to use Google for direct semantic annotation (searching
for is-a relations) of tokens (words) in biomedical corpora. We have not been able to find other
work that does this, but Dingare et al. use the number of Google hits as input features
for a maximum entropy classifier used to detect protein and gene names [1]. Our work differs
since we use Google to directly determine the semantic class of a word, searching for is-a
relationships and parsing the text (filtering adjectives) that follows the phrase "Word is (a|an)",
as opposed to Dingare et al.'s indirect use of Google search as a feature for the
information extraction classifier. A second difference between the approaches is that we
search for explicit semantic annotation (e.g. "word is a protein") as opposed to their search
for hints (e.g. "word protein"). The third important difference is that our approach does
automatic annotation of corpora, whereas they require pre-tagged (manually created)
corpora in their approach. Other related work includes extracting protein names from
biomedical literature and some work on semantic tagging using the web. Below, a brief overview of
related work is given.
5.1 Semantic Annotation of Biomedical Literature
Other approaches for (semantic) annotation (mainly of protein and gene names) in
biomedical literature include: a) rule-based discovery of names (e.g. of proteins and genes);
b) methods for discovering relationships of proteins and genes [2];
c) classifier approaches (machine learning) with textual context as features [4, 5]; and d) other
approaches, including generating probabilistic rules for detecting variants of biomedical terms.
The paper by Cimiano and Staab [6] shows that a system similar to ours works, and can be
taken as a proof that automatic extraction using Google is a useful approach. Our systems
differ in that we have 275 different semantic tags, while they only use 59 concepts in their
ontology. They also have a table explaining how the number of concepts in a system
influences the recall and precision in several other semantic annotation systems.
6 Conclusion and Future Work
This paper presents a novel approach - Biogoogle - using Google for semantic annotation of
entities (words) in biomedical literature.
We got empirically promising results - 52% semantic annotation accuracy ((TP+TN)/N,
TP=24, TN=80, N=200) in the answers provided by Biogoogle compared to expert
classification performed by a molecular biologist. This encourages further work, possibly
in combination with other approaches (e.g. rule- and classification-based information
extraction methods), in order to improve the overall accuracy (both with respect to
precision and recall). Disambiguation is another issue that needs to be further investigated.
Other opportunities for future work include:
Improve tokenization: just splitting on whitespace and punctuation characters is not good
enough; in biomedical texts non-alphabetic characters such as brackets and dashes need to be
handled better.
Improve stemming: the Porter algorithm for the English language gives mediocre results on
biomedical terms (e.g. protein names).
Do spell-checking before a query is sent to Google, e.g. allowing minor variations of words
(using the Levenshtein distance).
Search for other semantic tags using Google, e.g. "is a kind of" and "resembles", as well as
negations ("is not a").
Investigate whether the Google ranking is correlated with the accuracy of the proposed
semantic tag: are highly ranked pages better sources than lower ranked ones?
Test our approach on larger datasets, e.g. all available MEDLINE abstracts.
Combine this approach with more advanced natural language parsing techniques in order to
improve the accuracy.
In order to find multiword tokens, one could extend the search query ("X is (a|an)") to also
include neighboring words of X, and then see how this affects the number of hits returned by
Google. If there is no reduction in the number of hits, the words are always
printed together and are likely constituents of a multiword token. If there is only one actual
hit to begin with, the certainty of this statement is of course very weak, but with an
increasing number of hits, the confidence also grows.
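A small sketch of this multiword heuristic is given below. count_hits() is a hypothetical stand-in for a search-engine hit count and the numbers are invented; the threshold is likewise an assumption used only to make the idea concrete.

# Sketch of the multiword-token heuristic described above. count_hits() is a
# hypothetical stand-in for a search-engine hit count; the figures are invented.
def count_hits(query):
    canned = {'"peptide is a"': 850, '"releasing peptide is a"': 840}
    return canned.get(query, 0)

def likely_multiword(word, previous_word, threshold=0.9):
    """Extend 'X is a' with the preceding word; little or no drop in hit count
    suggests the two words form a single multiword term."""
    base = count_hits('"%s is a"' % word)
    extended = count_hits('"%s %s is a"' % (previous_word, word))
    if base == 0:
        return False                      # no evidence either way
    return extended / base > threshold    # confidence grows with the number of hits

print(likely_multiword("peptide", "releasing"))   # -> True with the toy counts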
References
[1] [Steffen Bickel, Ulf Brefeld, 2004] A Support Vector Machine classifier for gene name recognition. In
Proceedings of the EMBO Workshop: A Critical Assessment of Text Mining Methods in
Molecular Biology.
[2] [C. Blaschke, C. Ouzounis, 1999] Automatic Extraction of biological information from scientific text:
Protein-protein interactions. In Proceedings of the International Conference on Intelligent Systems for
Molecular Biology, pages 60-67. AAAI.
[3] [B. Boeckmann, Estreicher, Gasteiger, M.J. Martin, K. Michoud, I. Phan, S. Pilbout, and M. Schneider, 2003]
The SWISS-PROT protein knowledgebase and its supplement. Nucleic Acids Research, pages 365-370,
January 2003.







Embedded Systems
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
Smart Image Viewer Using Nios II Soft-Core
Embedded Processor Based on FPGA Platform

Swapnili A. Dumbre, GH Raisoni College, Nagpur, swapnili_s1@yahoo.com
Pravin Y. Karmore, Alagappa University, Karaikudi, pravinkarmore@yahoo.com
R.W. Jasutkar, GH Raisoni College, Nagpur, r_jasutkar@yahoo.com

Abstract

This paper proposes the working of an image viewer which uses advanced
technologies and automation methods for hardware and
software design.
There are two goals for the project. The basic goal is to read the contents of
SD card, using the SD Card reader on the DE2[1] board, decode and display
all the JPEG images in it on the screen one after the other as a slideshow using
the onboard VGA DAC.
The next and more aggressive goal once this is achieved is to have effects in
the slide show like fading, bouncing etc.
The main function of the software is to initialize and control the peripherals
and also to decode the JPEG image once it is read from the SD card[9]. The
top level idea is to have two memory locations: one where the program sits
(mostly SRAM) and the other (mostly SDRAM) where the image buffer is
kept so that the video peripheral can read from it. At the top level, the C
program reads the JPEG image from the SD Card, decompresses it, asks
the video peripheral to stop reading from the SDRAM [2], and starts
writing the newly decoded image to the SDRAM. After it is done, it informs the
video peripheral to go ahead and read from the SDRAM again, and it starts to
fetch the next image from the SD Card and begins to decompress it [6][7].
Keywords: An SD Card reader is used, through which the card memory is read using a
file-transfer software protocol. The proposed hardware is built with the
Quartus II tool, and SOPC Builder is used to integrate the IP core. On the same system, the
software is developed for the Nios II [8] soft-core embedded processor in embedded C and
C++.
1 Proposed Plan of Work
The basic idea is to have two peripherals, 1) to control the onboard SD card reader and 2) to
control the VGA DAC.
There are two goals for the project. The basic goal is to read the contents of SD card, using
the SD Card reader on the DE2 board, decode and display all the JPEG images in it on the
screen one after the other as a slideshow using the onboard VGA DAC. The next and more
aggressive goal once this is achieved is to have effects in the slide show like fading, bouncing
etc.
The main objective is the development of a fast, accurate and time-, cost- and effort-efficient project using
advanced technology.
Altera's powerful development tools provide the following facilities:
Create custom systems on a programmable chip, making FPGAs the platform of choice.
Increase productivity
Whether you are a hardware designer or a software developer, the tools provide
unprecedented time and cost savings.
Protect your software investment from processor obsolescence
Altera's embedded solutions protect the most expensive and time-consuming part of your
embedded design: the software.
Scale system performance
Increase performance at any phase of the design cycle by adding processors, custom
instructions and hardware accelerators, and by leveraging the inherent parallelism of FPGAs.
Reduce cost
Reduce system costs through system-level integration, design productivity, and a
migration path to high-volume HardCopy ASICs.
Establish a competitive advantage with flexible hardware
Choose the exact processor and peripherals for your application. Deploy your products
quickly, and feature-fill
2 Proposed Hardware for System

3 Research Methodology to be Employed
For this project we are using an FPGA and the NIOS II [8] soft-core embedded processor
from Altera for development of the system. We are using the DE-2 development platform for
physical verification of the proposed application.

4 Conclusion
Traditional snapshot viewers have many disadvantages, such as poor picture quality, poor
performance and platform dependencies, and they are unable to keep up with changing
technologies. The advanced technologies and methodologies used here protect the software
investment from processor obsolescence and increase the productivity, performance and
efficiency of the software. The approach reduces project cost, removes the problems that occur
with traditional image viewers, and adds the facility of slide-show effects such as
fading and bouncing.
References
[1] Using the SDRAM Memory on Altera's DE2 Board.
[2] 256K x 16 High Speed Asynchronous CMOS Static RAM With 3.3V Supply: Reference Manual.
[3] Avalon Memory-Mapped Interface Specification.
[4] FreeDOS-32: FAT file system driver project page from Source Forge.
[5] J. Jones, JPEG Decoder Design, Sr. Design Document EE175WS00-11, Electrical Engineering Dept.,
University of California, Riverside, CA, 2000.
[6] Jun Li, Interfacing a MultiMediaCard to the LH79520 System-On-Chip.
[7] Engineer-to-Engineer Note Interfacing MultiMediaCard with ADSP-2126x SHARC Processors.
[8] www.altera.com/literature/hb/qts/ qts_qii54007.pdf
[9] www.radioshack.com/sm-digital-concepts-sd-card-reader
[10] http://focus.ti.com/lit/ds/symlink/ pci7620.pdf.
[11] ieeexplore.ieee.org/iel5/30/31480/ 01467967.pdf
[12] www.ams-tech.com.cn/Memory-card
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
SMS Based Remote Monitoring and
Controlling of Electronic Devices

Mahendra A. Sheti, G.H. Raisoni College of Engg., Nagpur (M.S.) 440016, India, mahindrra@rediffmail.com
N.G. Bawane, G.H. Raisoni College of Engg., Nagpur (M.S.) 440016, India, narenbawane@rediffmail.com

Abstract

In today's world the mobile phone has become the most popular communication
device, as it offers effective methods of communication to its users. The most
common service provided by all network service providers is the Short Message
Service (SMS). As the Short Message Service is a cost-effective way of conveying
data, researchers are trying to apply this technology in areas that are not
explored by network service providers. One such area is the use of the Short Message
Service as a remote monitoring and controlling technology. By
sending specific SMS messages one can not only monitor and control
different electrical/electronic devices from any place in the world, but also
get alerts regarding catastrophic events.
A stand-alone embedded system (here called the Embedded Controller) can
be developed to monitor and control electrical/electronic devices through
specific SMS messages. The same Embedded Controller can detect catastrophic
events such as fire, earthquake and burglary. Implementation of such a
system is possible by using a programmed microcontroller, relays, sensors
such as a PIR sensor, vibration sensor and fire sensor, and a GSM modem which can be
used to send and receive SMS. The programmed Embedded Controller
acts as a mediator between the mobile phone and the electrical/electronic devices and
performs the monitoring and controlling functions as well as catastrophic
event detection and notification. In this paper we present a design of an
embedded system that will monitor and control electric/electronic devices,
notify catastrophic events (such as fire and burglary) by means of
SMS, and provide light when the user arrives home at night.
1 Introduction
In this paper we explore the combination of embedded systems and mobile
communication, which can make human life much easier. Imagine that you are driving from
your office to your home and, while driving, you realize that you forgot to switch off the Air
Conditioner. In this case you either have to go back to the office or, if somebody is in the
office, you have to call him and ask him to switch off the Air Conditioner. If neither
option is possible, then what? Does any option exist that enables you to get
the status of the Air Conditioner installed in your office and control it from the location
where you are? Yes! In this situation remote monitoring and controlling comes to mind [1,
2].
The purpose behind developing the Embedded Controller is to remotely monitor and control
electrical/electronic devices through SMS, which has proven to be a cost-effective method of
data communication in recent years [3]. Such a system can be helpful not only in remotely
switching devices ON/OFF but also in security or safety in industries, to detect
catastrophic conditions and alert the user through an SMS message [5, 8, 4].
Before discussing the actual system, let us briefly review existing trends for remote monitoring
and controlling.
2 Recent Trends for Remote Monitoring & Controlling
Most electrical/electronic devices are provided with their own remote controllers by
the manufacturer, but the limitation is the distance. As we are interested in controlling
devices from a long distance, we will discuss only the technologies which enable
long-distance monitoring and controlling [1], [9].
After the introduction of the Internet in the 1990s, it became a popular medium for remote
communication. When researchers realized that it could become an effective medium for
remote monitoring and controlling, the concept of the Embedded Web Server emerged [6].
The Embedded Web Server enabled its users to remotely monitor and control devices over the
Internet [6].
But this technology is costlier, as it requires an always-on connection for the Embedded Web
Server, and the user has to pay for the Internet access. Though the Embedded Web Server can be
useful for complex operations, it is really costly for simple controlling and monitoring
applications [6]. Apart from the cost, the factor that limits the use of this technology
is the accessibility of the Internet: unless Internet access is available, the user is not able
to access the system.
An alternative to Internet-based monitoring is making use of GSM technology, which is
available in almost all parts of the world, even in remote locations such as hill areas. GSM service is
available in four frequency bands, 450 MHz, 900 MHz, 1800 MHz, and 1900 MHz,
throughout the world. One of the unique benefits of GSM service is its capability for
international roaming because of the roaming agreements established between the various
GSM operators worldwide [7, 12]. The Short Message Service is one of the unique features of
GSM technology and can be effectively used to transmit data from one mobile phone to
another; it was first defined as a part of the GSM standard in 1985 [10].
3 Design of an Embedded Controller
As SMS technology is a cheap, convenient and flexible way of conveying data compared to
Internet technology, it can be used as a cost-effective and more flexible way of remote
monitoring and controlling [7, 10]. Hence a stand-alone embedded system, the Embedded
Controller, can be designed and developed to monitor and control electrical/electronic
devices through SMS with the following features.
1. Remotely switch ON and OFF any electrical/electronic device by sending a specific
SMS message to the Embedded Controller.
2. Monitor the status of the device, whether ON or OFF. For this purpose the Embedded
Controller will generate a reply message to the user's request, including the status of the
device, and will send it to the requesting mobile phone.
3. The Embedded Controller will send alerts to the user regarding the status of the
device. This is essential in cases where the device needs to be switched ON/OFF after a
certain time.
4. The Embedded Controller will notify the user of power-cut and power-on conditions.
5. Provide security lighting to deter burglars, or provide light when the user comes
home late at night; a PIR sensor is used for this purpose.
6. Provide security alerts in case of catastrophic events such as fire and burglary.
7. For security purposes, only the already stored mobile phone numbers are allowed to use the
system.
8. Any mobile phone with the SMS feature can be used with the system.
9. The status of each device is displayed at the Embedded Controller through LEDs for local
monitoring.
10. Atmel's AT89C52 microcontroller is used to filter the information and perform the
required functions.
3.1 System Architecture

Fig. 1: System Architecture
Figure 1 shows the system architecture of the remote monitoring and controlling system. The
Embedded Controller is the heart of the system; it performs all the system functions. As
shown in figure 1, a user who is authorized to use the system is allowed to send a specific
SMS message to the GSM modem. The SMS travels through the GSM network to the GSM
modem as in [11]. The Embedded Controller periodically reads the first location of the SIM
(Subscriber Identity Module) which is present inside the GSM modem as in [11]; as soon as the
Embedded Controller finds the SMS message, it starts to process the SMS message, takes the
necessary action and gives a reply back to the user as per the software program
incorporated in the ROM of the microcontroller.
3.2 Hardware Design
The figure below shows the block diagram of the Embedded Controller, which consists of a
microcontroller, GSM modem, relays, the devices that are to be controlled, PIR sensors, an ADC, a
temperature sensor, a buzzer, an LCD, LEDs etc.

Fig. 2: Hardware Design of Embedded Controller
Here in our design we are using the following main hardware components:
The AT 89C52 micro-controller:
Used for processing the commands and controlling the different external devices
connected as per the SMS received.
ANALOGIC 900/1800 GSM modem:
This GSM/GPRS terminal equipment is a powerful, compact and self-contained unit
with standard connector interfaces and has an integral SIM card reader. It is used for
receiving the SMS from the mobile device and then to transmit to the AT 89C52. It is
also used to send SMS Reply back to user.
A MAX232 chip:
This converter chip is needed to convert TTL logic from the microcontroller (TxD and
RxD pins) to the standard serial interface of the GSM modem (RS232).
ULN 2003A:
The IC ULN 2003A is used for inductive load driving.
RELAY:
Used to achieve ON/OFF switching.
PIR Sensor: A PIR Sensor is a motion detector, which detects the heat emitted
naturally by humans and animals.
ADC (Analog to Digital Converter):
The ADC0808, data acquisition component is a monolithic CMOS device with an 8-
bit analog-to-digital converter, 8-channel multiplexer and microprocessor compatible
control logic.
Temperature Sensor (LM 35):
The LM35 series are precision integrated-circuit temperature sensors, whose output
voltage is linearly proportional to the Celsius (Centigrade) temperature.
Fire Sensor, Vibration Sensors etc can also be used.
BUZZER:
A buzzer is connected to one of the I/O ports of the Microcontroller. As soon as the
signal about successful search is received, a logical level from the Microcontroller
instructs the buzzer to go high, according to the programming, alerting the operator.
LCD (Liquid Crystal Display):
Used to display the various responses for cross-checking purposes.
Power supply:
Used to provide power to the various hardware components as per the requirement.
LEDs:
Used as status indicators
Mobile Phone:
Any mobile phone with SMS feature can be used for sending the commands (SMS).
Control Equipment/device:
Control Equipment is the device that can be controlled and monitored, e.g. Tube light.
3.3 Software Design
The two software modules, excluding the hardware module, are as follows.
3.3.1 Communication Module
This module will be responsible for the communication between GSM modem and the AT
89C52 microcontroller. The major functionalities implemented in this module include
detecting the connection between the GSM modem and the microcontroller, receiving data from
the modem, sending data to the modem, etc. Additionally, the status of the modem, the
receiving/sending process and the status of the controlled device can be displayed on the LCD.
3.3.2 Controlling Module
This module will take care of all the controlling functions. For example after extracting the
particular command like SWITCH ON FAN, the module will activate the corresponding
port of the microcontroller so that the desired output can be achieved. It will also be
responsible for providing feedback to the user. If catastrophic conditions are detected by the
sensors, it will alert the user and turn off the devices to avoid further danger.
4 Internal Operation of the System
The program written to achieve the desired functionality is incorporated in the ROM of the AT
89C52 microcontroller. For communication with the GSM modem and for reading, deleting and
sending SMS messages we use GSM AT commands [4], [11]. AT stands for
Attention. GSM AT commands are the instructions that are used to control the modem
functions. AT commands are of two types: Basic AT Commands and Extended AT Commands.
Standard GSM AT commands used with this system are as follows:
AT+CMGR Read Message.
AT+CMGS Send Message
AT+CMGD Delete Message
AT+CMGF Select PDU mode or Text Mode
At startup or on reset of the system, the microcontroller first detects whether the
connection with the GSM modem is established or not by sending the command ATE0V0 to the
GSM modem. After a successful connection with the GSM modem, the AT 89C52
microcontroller reads the first location in the GSM modem's SIM (subscriber identity module)
card by sending the AT+CMGR=1 command, checking for an incoming SMS every 2 seconds. The SIM
is present inside the GSM modem.
The controlling of electrical/electronic devices can be accomplished by decoding the SMS
received, comparing it with already stored strings in the microcontroller and accordingly
providing an output on the ports of a microcontroller. This output is used to switch on/off a
given electrical/electronic device.
To get the status of a device, the associated pin of the microcontroller is checked for an active
high or low signal, and based on this signal the status of the device is provided to the user of the
system by sending an SMS.
The program, or software, is loaded into the AT 89C52, and then the circuit is connected to the
modem. Initially the SMS received at the GSM modem is transferred to the AT 89C52 with the
help of a MAX 232 chip. The microcontroller periodically reads the 1st memory location of
the GSM modem to check whether an SMS has been received or not (programmed for every two
seconds). Before implementing the control action, the microcontroller extracts the sender's
number from the SMS and verifies whether this number has access to control the device or not.
If the message comes from an invalid number, it deletes the message and does not take any
action. If the message comes from an authorized number, it takes the necessary action.
Generally, sending and receiving of SMS occurs in two modes, Text mode and Protocol Data
Unit (PDU) mode [4], [11]. Here we are using the Text mode. In Text mode the message is
displayed as plain text. In PDU mode the entire message is given as a string of hexa-
decimal numbers. The AT 89C52 microcontroller performs all the functions of the system. It
reads and extracts control commands from the SMS message and processes them according to
the request.
The main functions implemented using GSM AT commands are as shown below:
1. Connection to GSM Modem
To detect whether the GSM modem is connected or not, the command ATE0V0 is sent to the
GSM modem. In response, the GSM modem sends a reply of 0 or 1, which indicates that the
connection is established if 1, or a failure to connect if 0.
2. Deleting SMS from SIM Memory
To delete the SMS from SIM memory, the command AT+CMGD=1,0 is used, which deletes
the SMS messages present in the SIM inbox. The same command is used to delete the
received SMS messages after processing.
3. Setting SMS Mode
To select the PDU/Text mode, the command
AT+CMGF=<0/1> is used:
0 indicates PDU mode
1 indicates Text mode
AT+CMGF=1 // selects text mode
4. Reading the SMS
To read the SMS message from the SIM, we send the read command shown below to the
GSM modem periodically (every 2 seconds):
AT+CMGR=1
The response from the modem is as follows:
+CMGR: "REC UNREAD","9989028959","98/10/01,18:22:11+00", This is the message
5. Processing the SMS Message
After reading the SMS message, the message is processed by the microcontroller to extract the
user's number and command as per the program written. The microcontroller does not take
any action if the message received is not from a valid user or if it does not contain any command;
each command is predefined in the program. Only a matching command is accepted, and the
microcontroller takes the action defined for that command.
6. Sending SMS
Command syntax in text mode:
AT+CMGS=<da> <CR> Message is typed here <ctrl-Z / ESC>
For example, to send an SMS to the mobile number 09989028959, we have to send the
command in this format: AT+CMGS="09989028959"<CR> Please call me soon <ctrl-Z>. The
fields <CR> and <ctrl-Z> are programmed as strings and used while generating the SMS
message. On successful transmission the GSM modem responds with OK, otherwise with
ERROR if the message is not transmitted.
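The same AT command flow can be exercised from a PC for testing. The sketch below assumes the pyserial package and a GSM modem on a hypothetical serial port; it mirrors the sequence described above (connect, set text mode, read, check the command, reply, delete) but is only an illustration, not the AT 89C52 firmware.

# Host-side illustration of the AT command sequence described above, assuming the
# pyserial package and a GSM modem on a hypothetical serial port. This mirrors the
# flow of the Embedded Controller firmware but is not the 8051 code itself.
import serial

def send_at(port, command, delay_s=1.0):
    """Send one AT command and return the raw modem response."""
    port.write((command + "\r").encode("ascii"))
    port.timeout = delay_s
    return port.read(256).decode("ascii", errors="replace")

modem = serial.Serial("/dev/ttyUSB0", baudrate=9600)   # port name is an assumption

print(send_at(modem, "ATE0V0"))        # 1. check/initialise the connection
print(send_at(modem, "AT+CMGF=1"))     # 3. select text mode
reply = send_at(modem, "AT+CMGR=1")    # 4. read the SMS in the first SIM location
print(reply)

# 5. a minimal version of the command check done by the microcontroller
if "SWITCH ON FAN" in reply.upper():
    print("command recognised - the corresponding output port would be driven high")

# 6. send a reply/status SMS; the modem answers with a '>' prompt, then Ctrl-Z ends the body
send_at(modem, 'AT+CMGS="09989028959"')
modem.write(b"Device status: ON\x1a")
print(send_at(modem, "AT+CMGD=1,0"))   # 2. delete the processed message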
Detection of Catastrophic Events
With the Embedded Controller one can use various sensors to detect catastrophic events or
some special events, as shown in figure 2. Here we are using PIR sensors, which enable the
security feature. A Passive Infra Red (PIR) sensor is an electronic device which is
commonly used to provide light and to detect motion. Whenever a suitably large (and
therefore probably human) warm body moves in the field of view of the sensor,
a floodlight is switched on automatically and left on for a fixed period of time, typically 30-
90 seconds [13]. This can be used to deter burglars as well as to provide lighting when you
arrive home at night [13].
The first PIR sensor is located outside the door; it provides light to the user when he
comes home late at night, which in turn helps deter burglars by indicating
somebody's presence. Even if a burglar manages to get inside the home/office, another PIR
sensor detects the burglar and sends an alert to the user as well as to the nearest police
personnel for further action.
The temperature sensor is set to detect a certain level of temperature. If the temperature
sensor detects a temperature greater than the set point, the Embedded Controller sends an
alert message to the user as an indication of fire, and to the fire station so that the necessary
further action can be taken.
We can call it an intelligent embedded system because the microcontroller is programmed
in such a way that whenever a sensor detects a catastrophic event or a burglary possibility,
it sends a notification not only to the user but also to the preventive service
providers (the fire station in case of fire detection, or the police station in case of a burglary
possibility), along with the location where the event is happening. Again, the PIR sensor
provides light to the user when he comes home late at night and stands in front of the door.
5 Conclusion
An Embedded Controller can be designed and developed to remotely control and monitor
electrical/electronic devices using the AT 89C52 microcontroller through specific SMS
messages. The Embedded Controller also notifies the user of catastrophic events such as fire
and burglary by means of SMS and provides light when the user arrives home at night.
Such an Embedded Controller can be used in a variety of applications [8], [14]. The Embedded
Controller will be useful for the following users or systems.
A. General Public (Home users)
B. Agriculturists (Agricultural users)
C. Industries (Industrial users)
D. Electricity Board Administrators
E. Electricity Board Officers
F. Municipalities/ Municipal Corporations/ panchayats.
References
[1] Dr. Mikael Sjodin, Remote Monitoring and Control Using Mobile Phones, Newline Information White
Paper, www.newlineinfo.se
[2] S.R.D. Kalingamudali, J.C. Harambearachchi, L.S.R. Kumara, J.H.S.R. De Silva, R.M.C.R.K. Rathnayaka,
G. Piyasiri, W.A.N. Indika, M.M.A.S. Gunarathne, H.A.D.P.S.S. Kumara, M.R.D.B. Fernando, Remote
Controlling and Monitoring System to Control Electric Circuitry through SMS using a
Microcontroller, Department of Physics, University of Kelaniya, Kelaniya 11600, Sri Lanka; Sri Lanka
Telecom, 5th Floor, Headquarters Building, Lotus Road, Colombo 1, Sri Lanka. kalinga@kln.ac.lk,
jana@eng.slt.lk
[3] Guillaume Peersman and Srba Cvetkovic, The Global System for Mobile Communications Short Message
Service, The University of Sheffield; Paul Griffiths and Hugh Spear, Dialogue Communications Ltd.
[4] G. Peersman, P. Griffiths, H. Spear, S. Cvetkovic and C. Smythe, A tutorial overview of the short message
service within GSM, Computing & Control Engineering Journal, April 2000.
[5] Rahul Pandhi, Mayank Kapur, Sweta Bansal, Asok Bhatacharya, A Novel Approach to Remote Sensing
and Control, Delhi University, Delhi, Proceedings of the 6th WSEAS Int. Conf. on Electronics, Hardware,
Wireless and Optical Communications, Corfu Island, Greece, February 16-19, 2007.
[6] Eka Suwartadi, Candra Gunawan, First Step Towards Internet Based Embedded Control System,
Laboratory for control and computer systems, Department of Electrical Engineering Bandung Institute of
Technology, Indonesia.
[7] Cisco Mobile Exchange Solution Guide.
[8] http://www.cisco.com/univercd/cc/td/doc/product/wireless/moblwrls/cmx/mmg_sg/cmxsolgd.pdf
[9] Daniel J.S. Lim, Vishy Karri, Remote Monitoring and Control for Hydrogen Safety via SMS, School of
Engineering, University of Tasmania, Hobart, Australia. jsdlim@utas.edu.au & Vishy.Karri@utas.edu.au
[10] http://en.wikipedia.org/wiki/Remote_control
[11] http://en.wikipedia.org/wiki/Short_message_service
[12] http://www.developershome.com/sms/atCommandsIntro.asp
[13] http://en.wikipedia.org/wiki/Global_System_for_Mobile_ Communications
[14] http://www.reuk.co.uk/PIR-Sensor-Circuits.htm
[15] Dr. Nizar Zarka, Jyad Al-Houshi, Mohanad Akhkobek, Temperature Control Via SMS, Communication
department, Higher Institute for Applied Sciences and Technology (HIAST) P.O. Box. 31983, Damascus-
Syria, Phone. +963 94954925, Fax. +963 11 2237710 e-mail. nzarka@scs-net.org
Proceedings of the International Conference on Web Sciences
ICWS-2009 January 10th and 11th, 2009
Koneru Lakshmaiah College of Engineering, Vaddeswaram, AP, INDIA
An Embedded System Design for Wireless
Data Acquisition and Control

K.S. Ravi, K.L. College of Engg., Vijayawada, sreenivasaravik@yahoo.co.in
S. Balaji, K.L. College of Engg., Vijayawada
Y. Rama Krishna, KITE Womens College of Professional Engineering Sciences, Sahbad, Ranga Reddy

Abstract

One of the major problems in industrial automation is monitoring and
controlling of parameters in remote and hard-to-reach areas, as it is difficult
for an operator to go there or even to implement and maintain wired systems.
In this scenario the Wireless Data Acquisition and Control (DAQC) systems
are very much useful, because the monitoring and controlling will be done
remotely through a PC. A small scale Embedded system is designed for
wireless data acquisition and control. It acquires temperature data from a
sensor and send the data to a desktop PC in wireless format continuously at an
interval of one minute and user can start and control the speed of a dc motor
whenever required from this PC using wireless RF communication. Hardware
for the Data Acquisition and Control (DAQC) is designed using integrated
programming and development board and software is developed in embedded
C using CCS_PICC IDE.
1 Introduction
Small scale embedded systems are designed with single 8-bit or 16-bit microcontrollers. They
have little hardware and software complexities and involve board-level design. The
embedded software is usually developed in embedded C [Rajkamal, 2007]. Data acquisition
is a term that encompasses a wide range of measurement applications, which requires some
form of characterization, monitoring or control. All data acquisition systems either measure a
physical parameter (Temperature, pressure, flow etc) or take a specific action (sounding an
alarm, turning ON a light, controlling actuators etc) based on the data received. [Heintz,
2002]. A small scale embedded system is designed for wireless data acquisition and control
using PIC microcontroller, which is a high performance RISC processor.
Implementation of proper communication protocol is very important in a DAQC System, for
transferring data to / from DAQC hardware, Microcontroller and PC. Data communication is
generally classified as parallel communication, serial communication and wireless
communication. However most of the microcontroller based DAQC uses serial
communication using wired or wireless technologies. The popular wired protocols are RS-
232C, I2C, SPI, CAN, FireWire and USB. But wireless communication eliminates the need
for devices to be physically connected in order to communicate. The physical layer used in
wireless communication is typically either an infrared channel or a radio frequency channel
and typical wireless protocols are RFID, IrDA, Bluetooth and the IEEE802.11. The present
DAQC system uses RF wireless communication because it is a widespread technology and
has the advantages of immunity to electrical noise interference and no need for line-of-sight
propagation [Dimitrov, 2006 and Vemishetty, 2005].
2 Hardware Design and System Components
2.1 Block Diagram of DAQC system
Hardware design is the most important part in the development of the data acquisition
systems. The present DAQC system transmits data from an external device or sensor to PC
and from PC to DAQC in wireless format using RF modem. Figure 1 shows the block
diagram connecting various components of the DAQC system and a remote PC placed in two
different labs. The sensor measures physical activity, converts it to analog electrical signal.
The microcontroller acquires analog data from the sensor and converts it into digital format.
The digital data is transferred to the wireless RF modem, which modulates digital data to
wireless signal and transmits it. At the receiving end the receiver receives the wireless data,
demodulates the wireless data into the digital data and transfers it to the PC through the serial
port. The starting and speed control of the DC motor connected to the microcontroller port is
controlled from PC remotely.

Fig. 1: Block diagram of the DAQC System.
Experimental arrangements of the DAQC system in Lab 1 and Lab 2 are shown in figures
2 and 3.

Fig. 2: Experimental Arrangement at Lab1. Fig. 3: Experimental Arrangement at Lab 2.
2.2 LM35 - Precision Centigrade Temperature Sensor
The DAQC system monitors the temperature using the LM35, a precision integrated-circuit
temperature sensor whose output voltage is linearly proportional to the Celsius (Centigrade)
temperature. It does not require any external calibration or trimming to provide typical
accuracies of ±1/4 °C at room temperature and ±3/4 °C over a full -55 °C to +150 °C temperature
range. Low cost is assured by trimming and calibration at the wafer level. The LM35's low
output impedance, linear output, and precise inherent calibration make interfacing to readout
or control circuitry especially easy. It can be used with a single power supply, or with dual
power supplies. It draws only 60 µA from its supply and has very low self-heating, less than
0.1 °C in still air [www.national.com/pf/LM/LM35.html].
2.3 PIC 16F877A Flash Microcontroller
The present DAQC system uses the PIC 16F877A flash microcontroller because of its various on-
chip peripherals and RISC architecture. Apart from the flash memory there is a data
EEPROM. Power consumption is very low, typically less than 2 mA at 5 V and 4 MHz, and 20 µA
at 3 V and 32 kHz. There are three timers. Timer 0 is an 8-bit timer/counter with a prescaler.
Timer 1 is 16 bits wide, also with a prescaler. The 10-bit ADC is an interesting feature. A PWM output
along with an RC low-pass filter allows an analog output with a maximum resolution of 10
bits, which is sufficient in many applications. There is a synchronous serial port (SSP) with
SPI master mode and I2C master/slave mode. Further, a universal synchronous/asynchronous
receiver transmitter (USART/SCI) is supported. A parallel 8-bit slave port (PSP) is supported with
control signals RD, WR and CS in the 40/44-pin version only
[www.pages.drexel.edu/~cy56/PIC.htm].
2.4 Low Power Radio Modem
The Low Power Radio Modem is an ultra low power transceiver, mainly intended for 315,
433, 868 and 915 MHz frequency bands. In the present work 915MHz low power radio
modem is used. It has been specifically designed to comply with the most stringent
requirements of the low power short distance control and data communication applications.
The UHF transceiver is designed for very low power consumption and low-voltage operation in
energy meter reading applications. The product is unique, with features such as compactness,
versatility, low cost, short range, intelligent data communication, etc. The product also has 2 or
3 isolated digital inputs and outputs. The necessary command sequences are supplied to
operate these telecommands from the user host. The modem supports maximum data rates up
to 19.2 kbps [www.analogicgroup.com]
2.5 Target System Design
The present DAQC system is designed using an integrated programmer and development board
as the target system, which supports the PIC 16F877A (40-pin DIP). The target system can be run in
two modes, Program mode and Run mode, selected by a slide switch. In Program
mode, the flash memory can be programmed with the developed code in hex format from the
PC to the target device. In Run mode, the code that was just downloaded can be executed
from reset as a stand-alone embedded system. The I/O port pins of the target
device are accessible at connectors for interfacing to external devices. Figure 4 shows the PIC
16F877A development board.

Fig. 4: PIC 16F877A Development Board.
The target system includes two RS232C ports. One is controlled by an 89C2051 and used for
programming the flash program memory of the PIC 16F877A; the second is controlled by the on-
chip USART of the PIC 16F877A and used for interfacing the RF transceiver for wireless data
transmission with the PC. The on-board PIC 16F877A includes 256x8 bytes of EEPROM memory,
8Kx14 words of in-system reprogrammable downloadable flash memory and 368x8 bits of RAM;
an external crystal of 6 MHz provides the system clock for the PIC16F877A. The board also provides
the 10-bit ADC, 3 timers and a programmable watchdog timer, 2 capture/compare/PWM modules,
5 I/O-controlled LEDs, on-board power regulation with an LED power indication, and a
termination for a 5 V DC output at 250 mA.
3 Functioning of the DAQC System
The DAQC system continuously monitors the temperature through the sensor LM35. The sensor
output is digitized using the on-chip ADC of the 16F877A. The control registers ADCON0 and
ADCON1 of the A/D converter determine which pins of the A/D port are analog inputs, which
are digital, and which are used for Vref- and Vref+. PCFG3:PCFG0 are the configuration bits
in ADCON1; these bits determine which of the A/D port pins are analog inputs. The most
common configuration is 0x00, which sets all 8 A/D port pins as analog inputs and uses Vdd
and Vss as the reference voltages. In the present system, the clock for the A/D module is
derived from the internal RC oscillator and channel 0 is selected, by writing the control
word 0xC1 into ADCON0. The digitized value is transferred to the remote desktop for display,
using the USART of the PIC microcontroller and the RF module connected to one of the serial
ports for transmission.
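As a register-level sketch of this configuration (the project itself uses the CCS built-in functions shown in Figure 5; the SFR addresses below are taken from the 16F877A data sheet and the helper name is ours):

#byte ADCON0 = 0x1F          // A/D control register 0 (bank 0)
#byte ADCON1 = 0x9F          // A/D control register 1 (bank 1)

void adc_init(void)
{
   ADCON1 = 0x00;   // PCFG3:PCFG0 = 0000 -> A/D port pins analog, Vdd/Vss as Vref+/Vref-
   ADCON0 = 0xC1;   // ADCS = 11 (internal RC clock), CHS = 000 (channel 0), ADON = 1
}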
The timer-based functions provide the time base for the PWM function and the baud-rate clock
for the serial port. Each time the Timer 2 count matches the value in PR2, Timer 2 is
automatically reset, and the reset/match signal is used to provide a baud-rate clock for the
serial module. The frequency of the PWM signal is generated from the Timer 2 output by loading
PR2 with the value 127. Timer 2 counts up to the value that matches PR2, is reset, the PWM
output bit (port C, bit 2) is set, and the process restarts; the PR2 value therefore fixes the
period, and hence the OFF time, of the PWM signal. As Timer 2 counts up again, it will
eventually match the value placed into CCPR1L (the low 8 bits of CCPR1), at which time the PWM
output is cleared to zero. Thus the CCPR1L value controls the ON period of the PWM signal.
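As a quick check of the numbers used here, assuming the standard mid-range PIC relation PWM period = (PR2 + 1) x 4 x Tosc x (Timer 2 prescale): with the 6 MHz crystal (Tosc = 1/6,000,000 s), PR2 = 127 and a prescale of 1, the PWM period is 128 x 4 / 6,000,000, approximately 85.3 microseconds, i.e. a PWM frequency of roughly 11.7 kHz; the duty fraction is then approximately CCPR1L / (PR2 + 1).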
During this process, if the user wants to start the DC motor, he or she hits a key on the
keyboard, which starts the motor. To increase the speed of the motor the user presses 'U' on
the keyboard, whereas to decrease the speed the user presses 'D'.
Accordingly, an appropriate control signal is sent to the microcontroller over RF; this adjusts
the PWM signal generated by the microcontroller and thereby controls the speed of the DC motor.
When the user wants to quit motor control and return to temperature monitoring, he or she
presses the key 'Q'.
4 Software Development of the DAQC System
Microcontrollers can be programmed in assembly language or in a high-level language such as
C or BASIC. In the present system, the software is developed in embedded C using the CCS PIC C
IDE. The IDE allows the user to build projects, add source code files to the projects, set
compiler options and compile the projects into executable program files. The executable files
are then loaded into the target microcontroller. The CCS PIC C compiler uses preprocessor
commands and a database of microcontroller information to guide the generation of the software
associated with the on-chip peripherals [Barnett and Thomson]. The source code for the DAQC
system is given in Figure 5.
#include <16F877A.H>
#use delay(clock=6000000)                        // 6 MHz crystal on the target board
#use rs232(baud=9600, xmit=PIN_C6, rcv=PIN_C7)   // on-chip USART connected to the RF modem

unsigned char adcval, serval, pwmval = 0;

void main()
{
   setup_adc_ports(RA0_RA1_RA3_ANALOG);   // RA0, RA1, RA3 configured as analog inputs
   setup_adc(ADC_CLOCK_INTERNAL);         // A/D clock from the internal RC oscillator
   setup_ccp1(CCP_PWM);                   // CCP1 module in PWM mode
   setup_timer_2(T2_DIV_BY_1, 127, 1);    // PR2 = 127 fixes the PWM period
   set_adc_channel(0);                    // channel 0: LM35 temperature sensor
   printf("\r\n welcome to sfs \r\n");
   delay_ms(500);
   while(1)
   {
      adcval = read_adc();                                  // digitize the sensor output
      printf("\r\n TEMPERATURE VALUE IS %d ", (adcval*2));  // send the reading over the RF link
      delay_ms(1000);
      if(kbhit())                                           // a command arrived from the PC
      {
         do
         {
            serval = getc();
            if(serval == 'U')              // increase motor speed
            {
               pwmval++;
               set_pwm1_duty(pwmval);
            }
            else if(serval == 'D')         // decrease motor speed
            {
               pwmval--;
               set_pwm1_duty(pwmval);
            }
            else
               printf("\r\n PRESS 'U' / 'D' TO INCREASE OR DECREASE SPEED OR 'Q' TO QUIT \r\n");
            printf("\r\n CURRENT PWM VALUE IS %f ", (pwmval*.03));
         } while(serval != 'Q');           // 'Q' returns to temperature monitoring
      }
   }
}

Fig. 5: Source Code for the DAQC System

5 Results and Conclusion
Once the application code has been downloaded into the flash program memory of the PIC 16F877A
using the device programmer of the CCS PIC C IDE, the system works independently of the PC. It
monitors the temperature continuously and sends the temperature data to the PC's HyperTerminal
using wireless RF communication. The temperature data is displayed on the PC as shown in
Figure 6. DC motor control operations such as starting the motor and increasing or decreasing
its speed can be performed from the PC in interrupt mode. Appropriate control signals for
these operations are sent to the target system from the PC through wireless RF communication.
The PWM signal is changed as per the control signal and is used to either increase or decrease
the speed of the DC motor. The value of the PWM signal is also displayed on the PC's
HyperTerminal.

Fig. 6: PC's HyperTerminal (Result Window)
The microcontroller-based wireless data acquisition system described in this project is
designed to monitor and control, through wireless technology, devices that are located in
remote areas where it is difficult for the user to go and collect the data. An added advantage
of wireless technology is that it greatly reduces wiring and also eliminates a source of
electrical noise. Among the various general-purpose microcontrollers available in the market,
this project is implemented with the PIC 16F877A because its many on-chip peripherals greatly
reduce the additional circuitry required. The present project acquires data from only two
physical devices; it can be extended to various other devices as well. This project is
implemented with an RF link; the same approach can be implemented with IR, Bluetooth, etc.
Wireless technology is somewhat limited in bandwidth and range, which sometimes offsets its
inherent benefits.
References
[1] Barnett, Cox and O'Cull, Embedded C Programming and the Microchip PIC, Thomson Delmar Learning.
[2] David Hentz (2002), Essential Components of Data Acquisition Systems, Application Note 1386, Agilent Technologies.
[3] Smilen Dimitrov (2006), A Simple Practical Approach to Wireless Distributed Data Acquisition.
[4] Kalyanramu Vemishetty (2005), Embedded Wireless Data Acquisition System, Ph.D. Thesis.
[5] Raj Kamal (2007), Embedded Systems: Architecture, Programming and Design, McGraw Hill Education, 2nd edition.
[6] LM35 data sheet, www.national.com/pf/LM/LM35.html
[7] PIC 16F877A Microcontroller Tutorial revA, www.pages.drexel.edu/~cy56/PIC.htm
[8] RF module specification, www.analogicgroup.com
Bluetooth Security

M. Suman, Dept of ECM, K.L.C.E, suman_maloji_ecm@klce.ac.in
P. Sai Anusha, Dept of IST, K.L.C.E, y6it280@klce.ac.in
M. Pujitha, Dept of ECM, K.L.C.E, y6em261@klce.ac.in
R. Lakshmi Bhargavi, Dept of ECM, K.L.C.E., y6em231@klce.ac.in

Abstract
Bluetooth is emerging as a pervasive technology that can support wireless
communication in various contexts of everyday life. By installing a Bluetooth
network in an office, one can do away with the complex and tedious task of
wiring the computing devices together, yet still have the power of connected
devices. Enhancing Bluetooth security has therefore become a pressing
necessity. To improve the security of the PIN, and to curb the possibility of
an attacker generating a hypothesis of the initialization key (K_init) in less
than a second, this paper proposes the inclusion of two algorithms: one in the
SAFER+ algorithm and the other in the random number generation.
1 Introduction
Bluetooth is a new technology for wireless communication. The design target is to
connect different devices together wirelessly in a small environment, such as an office or a
home. The environment is restricted by the Bluetooth range, which at present is about 10
meters. Before accepting the technology, a close look at its security functions has to be
taken. Especially in an office, the information broadcast over a Bluetooth piconet can be
sensitive and requires good security. Bluetooth employs several layers of data encryption and
user authentication measures. Bluetooth devices use a combination of the Personal
Identification Number (PIN) and a Bluetooth address to identify other Bluetooth devices. Data
encryption can be used to further enhance the degree of Bluetooth security. Establishment of a
channel between two Bluetooth devices proceeds through several stages. It involves the E22,
E21 and E1 algorithms, which produce the initialization, link and authentication keys and are
based on the SAFER+ algorithm. The new logic introduced is explained in Part I and Part II.
2 Bluetooth Security
2.1 Part-I
First consider the random number generation algorithm; an enhancement of that logic is
proposed here. The random number is the only plaintext sent by the master device to the slave
device. The cracker first gets hold of it by sniffing, makes an assumption about the PIN and
thereby generates a hypothesis for K_init. The attacker can then use a brute-force algorithm to
find the PIN used: the attacker enumerates all possible values of the PIN. Knowing IN_RAND and
the BD_ADDR, the attacker runs E22 with those inputs and the guessed PIN, and obtains a
hypothesis for K_init. The attacker can now use this hypothesis of the initialization key to
decode messages 2 and 3. Messages 2 and 3, described in Figure 1, contain enough information to
perform the calculation of the link key Kab, giving the attacker a hypothesis of Kab. The
attacker then uses the data in the last four messages to test the hypothesis: using Kab and the
transmitted AU_RANDA (message 4), the attacker calculates SRES and compares it to the data of
message 5. If necessary, the attacker can use the values of messages 6 and 7 to re-verify the
hypothesis of Kab, until the correct PIN is found.

Fig. 1
The basic method of K_init generation can be outlined as follows. During the initial step the
master generates a 128-bit random number, IN_RAND, which is broadcast to the slave. Both the
master and the slave use it, together with the Bluetooth device address and the PIN, to
generate K_init (the initialization key) using the E22 algorithm. Whenever a cracker
eavesdrops and listens to IN_RAND, he can form a hypothesis of K_init by assuming all possible
PIN values. K_init is given as input to the E21 algorithm along with LK_RAND, which results in
the generation of Kab. Kab, along with a random number and the Bluetooth address, is given as
input to the E1 algorithm, and consequently SRES is produced in both the master and the slave
devices. These are checked for equality; if they are equal, the particular PIN guessed by the
PIN cracker is taken to be correct. This whole process of cracking the Bluetooth PIN can be
completed easily with the help of algebraic optimizations. So, a slight change in broadcasting
the random number is proposed in this paper.
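The loop below is a rough C sketch of this brute-force enumeration, following the simplified key flow described in the text; E22(), E21() and E1() appear only as hypothetical placeholder prototypes, since the real primitives are defined in the Bluetooth specification and are not implemented here.

#include <stdint.h>
#include <string.h>

/* assumed prototypes for the Bluetooth key-generation/authentication primitives */
void E22(const uint8_t *pin, int pin_len, const uint8_t *bd_addr,
         const uint8_t *in_rand, uint8_t kinit[16]);
void E21(const uint8_t *kinit, const uint8_t *lk_rand, const uint8_t *bd_addr,
         uint8_t kab[16]);
void E1(const uint8_t *kab, const uint8_t *au_rand, const uint8_t *bd_addr,
        uint8_t sres[4]);

/* Returns the guessed 4-digit PIN, or -1 if no candidate reproduces the sniffed SRES. */
int crack_pin(const uint8_t *bd_addr, const uint8_t *in_rand,
              const uint8_t *lk_rand, const uint8_t *au_rand,
              const uint8_t *sniffed_sres)
{
    uint8_t pin[4], kinit[16], kab[16], sres[4];
    int guess;
    for (guess = 0; guess <= 9999; guess++) {
        pin[0] = (uint8_t)('0' + guess / 1000);
        pin[1] = (uint8_t)('0' + (guess / 100) % 10);
        pin[2] = (uint8_t)('0' + (guess / 10) % 10);
        pin[3] = (uint8_t)('0' + guess % 10);
        E22(pin, 4, bd_addr, in_rand, kinit);   /* hypothesis for K_init           */
        E21(kinit, lk_rand, bd_addr, kab);      /* hypothesis for the link key Kab */
        E1(kab, au_rand, bd_addr, sres);        /* expected SRES for this guess    */
        if (memcmp(sres, sniffed_sres, 4) == 0)
            return guess;                       /* PIN hypothesis verified         */
    }
    return -1;
}

This is why a short numeric PIN can be recovered almost instantly once IN_RAND and the pairing messages have been sniffed, which motivates the modification proposed below.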
The scheme can be modified as follows: n random numbers will be generated, where n is agreed
upon in common by the master and the slave. The n random numbers are generated at intervals of
100 milliseconds and stored in a table with two fields, one giving the index and the other the
corresponding random number. The 100-millisecond interval is chosen to limit the transfer
traffic, and the corresponding data is also stored on the slave device. This set of random
numbers is fed into a logic block, referred to as Logic-1, which is described later in the
paper. The output of Logic-1, FIN_RAND, is given as input to the E22 algorithm, and the rest
of the process remains unchanged.
2.2 Security Modes
Bluetooth has three different security modes built into it, as follows:
Security Mode 1: a device does not initiate any security procedures; a non-secure mode [12].
Security Mode 2: a device does not initiate security procedures before channel establishment
at the L2CAP level. This mode allows different and flexible access policies for applications,
especially for running applications with different security requirements in parallel; a
service-level enforced security mode [12].
Security Mode 3: a device initiates security procedures before the link set-up at the LMP
level is completed; a link-level enforced security mode [12].

2.3 Part-II
A small modification of the SAFER+ algorithm is also proposed so that its security level is
tightened. SAFER+ is the basic algorithm underlying E22, E21 and E1, which are used in the
initialization key, authentication key and encryption key generation techniques.
Outline of the SAFER+ (K-64) algorithm:
The enciphering algorithm consists of r rounds of identical transformations that are applied
in sequence to the plaintext, followed by an output transformation, to produce the final
ciphertext. The recommendation is to use r = 6 for most applications, but up to 10 rounds can
be used if desired. Each round is controlled by two 8-byte subkeys, and the output
transformation is controlled by one 8-byte subkey. These 2r + 1 subkeys are all derived from
the 8-byte user-selected key K1. The output transformation of SAFER K-64 consists of the
bit-by-bit XOR ("exclusive or", or modulo-2 sum) of bytes 1, 4, 5 and 8 of the last subkey,
K2r+1, with the corresponding bytes of the output of the r-th round, together with the
byte-by-byte addition (modulo-256) of bytes 2, 3, 6 and 7 of the last subkey, K2r+1, to the
corresponding bytes of the output of the r-th round. After this, the Armenian shuffle and the
substitution boxes are applied, whose output is given to the layer that performs the
bit-by-bit addition/XOR operation. The problem with this is that a simple algebraic matrix
makes it possible to trace the encryption faster with the help of look-up tables and some
optimization techniques. The proposed change is to include a Logic-2 layer, followed by a
layer that performs solitaire encryption, whose output is then given unchanged to the
pseudo-Hadamard transformation layers. Logic-2 is explained in Algorithm 2 and, for better
understanding, is illustrated in Figure 3. By encrypting the key itself, working backwards to
decrypt the key and thereby cracking the Bluetooth PIN becomes extremely difficult for a PIN
cracker.

Fig. 3:
The PHT (pseudo-Hadamard transform) used above is a reversible transformation of a bit
string that provides cryptographic diffusion. The bit string must be of even length, so that
it can be split into two bit strings a and b of equal length, each of n bits. Treating a and b
as n-bit integers, the transformed values a' and b' are computed from the equations:

a' = a + b (mod 2^n)
b' = a + 2b (mod 2^n)
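A minimal C illustration of the transform on two 8-bit halves (n = 8), with the inverse included to show reversibility; the function names are ours and not part of SAFER+:

#include <stdint.h>
#include <stdio.h>

void pht(uint8_t *a, uint8_t *b)      /* a' = a + b, b' = a + 2b (mod 256) */
{
    uint8_t a1 = (uint8_t)(*a + *b);
    uint8_t b1 = (uint8_t)(*a + 2 * *b);
    *a = a1;
    *b = b1;
}

void ipht(uint8_t *a, uint8_t *b)     /* inverse: b = b' - a', a = a' - b  */
{
    uint8_t b0 = (uint8_t)(*b - *a);
    uint8_t a0 = (uint8_t)(*a - b0);
    *a = a0;
    *b = b0;
}

int main(void)
{
    uint8_t a = 0x3C, b = 0xA5;
    pht(&a, &b);                      /* diffuse the two halves     */
    printf("after PHT : %02X %02X\n", a, b);
    ipht(&a, &b);                     /* recover the original pair  */
    printf("after iPHT: %02X %02X\n", a, b);
    return 0;
}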
3 Algorithms
The algorithms for Logic-1 and Logic-2 are given below.
3.1 Algorithm-1 for Logic-1
Algorithm for random key generation
Input: n random numbers, each generated by the RAND function (each 128 bits), [a_i] (i = 1 to n)
Output: encrypted RAND
// the algorithm starts here
{
   for i = 1 to 128
      (b_i)_1 = 0
   for i = 1 to n
      store a_i in the RAND table
   for i = 1 to n
   {
      for r = 1 to 128
      {
         if ((b_r)_i != 0)
            (b_r)_(i+1) = (b_r)_i + (-1)^n (a_r)_i
         else
            (b_r)_(i+1) = (b_r)_i + (a_r)_i
      }
   }
}
3.2 Algorithm-2 for Logic-2
Algorithm: modified SAFER+ algorithm
Input: 8 bytes after the bit-addition/XOR operation
Output: encrypted 8-byte number
// the algorithm starts here
{
   get 64 bits (8 bytes) [p64 p63 p62 ... p1]
   take a 65th bit p65 = 0 and make it the MSB
   group the 65 bits into groups of 5 bits each
   name the groups a13 a12 a11 ... a1; consider a14 = 00000, a15 = 00000
   for i = 1 to 15
   do
      b_i = a_i mod 26
      c_i = convert b_i to the corresponding alphabet letter
   done
   for i = 1 to 15
   do
      mask_i = solitaire_key_generation(c_i)
      addn_i = mask_i + b_i
      d_i = (addn_i mod 26) or (addn_i mod 36)
   done
   obtain 75 bits from d15 d14 ... d1 as k75 k74 k73 ... k1
   perform 11 levels of XORing and reduce them to 64 bits
   return (encrypted 8 bytes)
}
4 Conclusion
This work acts as a firewall against blue-PIN crackers who use advanced algebraic
optimizations for the attack. Re-pairing attacks will also be minimized, as there will be no
need to broadcast all the random numbers again. Bluetooth security is not complete, but it
seems it was not meant to be that way; more security can be accomplished easily with
additional software that is already available. Further work on Bluetooth security will be
reported in other papers.
References
[1] Specification of the Bluetooth System, Volume 1B, December 1st, 1999.
[2] Knowledge base for Bluetooth information, http://www.infotooth.com/
[3] General information on Bluetooth, http://www.mobileinfo.com/bluetooth/
[4] Thomas Muller, Bluetooth White Paper: Bluetooth Security Architecture, Version 1.0, 15 July 1999.
[5] Annikka Aalto, Bluetooth, http://www.tml.hut.fi/Studies/Tik110.300/1999/Essays/
[6] Bluetooth information, http://www.bluetoothcentral.com/
[7] Oraskari, Jyrki, Bluetooth, 2000, http://www.hut.fi/~joraskur/bluetooth.html
[8] How Stuff Works, information on BT, http://www.howstuffworks.com/bluetooth3.htm
[9] Information on Bluetooth (official homepage), http://www.bluetooth.com/
[10] Bluetooth Baseband, http://www.infotooth.com/tutorial/BASEBAND.htm
[11] Bluetooth Glossary, http://www.infotooth.com/glossary.htm#authentication
[12] Frederik Armknecht, A Linearization Attack on the Bluetooth Key Stream Generator, Cryptology ePrint Archive, Report 2002/191, available from http://eprint.iacr.org/2002/191/, 2002.
[13] Yaniv Shaked and Avishai Wool, Cracking the Bluetooth PIN, MobiSys '05: The Third International Conference on Mobile Systems, Applications, and Services, USENIX Association.
Managing Next Generation Challenges and Services
through Web Mining Techniques

Rajesh K. Shukla, CIST, Bhopal, rkumar_dmh@rediffmail.com
P.K. Chande, IIM Indore
G.P. Basal, CSE, SATI, Vidisha
Abstract

Web mining technology differs from pure database-based mining because of Web
data's semi-structured and heterogeneous (mixed media) character. With the
large size and dynamic nature of the Web and the rapidly growing number of
WWW users, hidden information becomes ever more valuable. As a consequence of
this phenomenon, the need for continuous support and updating of Web-based
information retrieval systems, mining Web data and analyzing on-line users'
behavior and their on-line traversal patterns have emerged as a new area of
research. Web mining lies at the crossing point of database research,
information retrieval and artificial intelligence, and it is also related to
many other research areas, such as machine learning, natural language
processing and statistics. Data mining for Web intelligence is an important
research thrust in Web technology, one that makes it possible to fully use the
immense information available on the Web.
This paper presents a complete framework for Web mining. We present a broad
overview rather than an in-depth analysis of Web mining, covering the
taxonomy and function of Web mining research issues, techniques and
development efforts, as well as emerging work in Semantic Web mining.
Keywords: WWW, Knowledge Discovery, Web Mining, WCM, WSM, WUM, Semantic Web
Mining.
1 Introduction
In the database community, data mining is used to identify valid, novel, potentially useful
and ultimately understandable patterns from data collections. Data mining is an emerging
research field that builds on various kinds of research, such as machine learning, inductive
learning, knowledge representation, statistics and information visualization, while taking
into account the characteristic features of databases. The World Wide Web is also an
extensive source of information, one that is growing rapidly and is at the same time extremely
distributed. Web search is one of the most universal and influential applications on the
Internet, but searching the Web exhaustively is inefficient in terms of time. A particular
type of data, such as authors' lists, may be scattered across thousands of independent
information sources in many different formats. Determining the size of the World Wide Web is
extremely difficult; in 1999 it was estimated to contain over 350 million pages, with growth
at the rate of about 1 million pages a day. The Web can therefore be viewed as the largest
data source available, and it presents a challenging task for effective design and access.
One of the main challenges for large corporations adopting World Wide Web sites is to
discover and rediscover useful information from very rich but also diversified sources in the
Web environment.
To help people utilize those resources, researchers have developed many search engines, which
have brought great convenience. At the same time, the search results often cannot satisfy the
demands of users perfectly, because the Web is structureless and dynamic and Web pages are
more complex than plain text documents. Using Web mining, which connects traditional mining
techniques with the Web, is one way to solve those problems.
Data mining technology normally adopts data integration methods to generate a data warehouse,
on which relation rules and cluster characteristics are mined and useful model prediction and
knowledge evaluation are obtained. Web mining can be viewed as the use of data mining
techniques to automatically retrieve, extract and evaluate information for knowledge
discovery from Web documents and services; the application of data mining techniques to the
World Wide Web is thus referred to as Web mining. With the rapid increase of information on
the WWW, Web mining has gradually become more and more important within data mining.
Web mining is a new research issue which draws great interest from many communities. It has
been the focus of several recent research projects and papers, because people hope to gain
knowledge patterns through searching and mining on the Web, and these useful knowledge
patterns can help us in many ways, e.g. to build efficient Web sites that serve people better.
Web mining is therefore a technique that seeks to extract knowledge from Web data, and it
combines two prominent research areas: data mining and the World Wide Web (WWW). Web mining
can be divided into three classes: Web content mining, Web structure mining and Web usage
mining.
2 Process of Web Mining
Web mining is the process of studying and discovering Web user behavior from Web log data.
Usually the Web log data collection is done over a long period of time (one day, one month,
one year, etc). Later, three steps, namely preprocessing, pattern discovery and pattern
analysis, as shown in Figure 1, are carried out. Pre-processing is the process of transforming
the raw data into a usable data model. The pattern discovery step uses several data mining
algorithms to extract the user patterns. Finally, pattern analysis reveals useful and
interesting user patterns and trends. Pattern analysis is the final stage of the whole Web
usage mining process; the goal of this stage is to eliminate the irrelevant rules or patterns
and to extract the interesting rules or patterns from the output of the pattern discovery
process. The techniques most commonly used in the pattern discovery step are clustering,
association rule mining and sequence analysis. These steps are normally executed after the Web
log data is collected.

Fig. 1: General Process of Web Mining
The objects of Web mining include server logs, Web pages, Web hyperlink structures, on-line
market data and other information. When people browse a Web server, the server produces three
kinds of log documents: server logs, error logs and cookie logs. By analyzing these log
documents we can mine access information.
3 Taxonomy of Web Mining
Web mining can use data mining techniques to automatically discover and extract information
from Web documents and services, which can help people extract knowledge, improve Web site
design and develop e-commerce better. Web mining is thus the application of data mining or
other information processing techniques to the WWW to find useful patterns, and people can
take advantage of these patterns to access the WWW more efficiently. Like other data mining
applications, it can profit from given structure on data (as in database tables), but it can
also be applied to semi-structured or unstructured data like free-form text.
Most existing Web mining methods work on HTML Web pages, which are all connected by
hyperlinks that carry very important mining information. Web hyperlinks are therefore very
authoritative resources, and user registrations can also help to mine better. Web documents
contain semi-structured data including audio, image and text, which makes Web data
multi-dimensional and heterogeneous. Web mining research can be classified into three major
categories according to the kind of information mined and the goals that the particular
categories set: Web content mining (WCM), Web structure mining (WSM) and Web usage mining
(WUM), as shown in Figure 2.

Fig. 2: Taxonomy of Web Mining
3.1 Web Content Mining
Web content mining is the process of information discovery from sources across the World Wide
Web. A well-known problem related to Web content mining is experienced by any Web user trying
to find, from the huge number of available pages, all and only the Web pages that interest
him. Web content mining therefore refers to the discovery of useful information from Web
contents, including text, images, multimedia, etc.; above all, among these data types, texts
and hyperlinks are quite useful and information-rich attributes. It focuses on the discovery
of knowledge from the content of Web pages, and research in Web content mining therefore
encompasses resource discovery from the Web, document categorization and clustering, and
information extraction from Web pages. Agents search the Web for relevant information using
domain characteristics and user profiles to organize and interpret the discovered
information. Agents may be used for intelligent search, for classification of Web pages, and
for personalized search by learning user preferences and discovering Web sources that meet
these preferences.
Web content mining can take advantage of the semi-structured nature of Web page text and can
be used to detect co-occurrences of terms in texts. For example, trends over time may be
discovered, indicating a surge or decline in interest in certain topics such as the
programming language "Java". Another application area is event detection: the identification
of stories in continuous news streams that correspond to new or previously unidentified
events. However, there are some problems with Web content mining:
1. Current search tools suffer from low precision due to irrelevant results.
2. Search engines aren't able to index all pages, resulting in imprecise and incomplete
searches due to information overload. The overload problem is very difficult to cope with,
as the information on the Web is immense and grows dynamically, raising scalability issues.
3. Moreover, a myriad of text and multimedia data is available on the Web, prompting the
need for intelligent agents for automatic mining.

3.2 Web Usage Mining
An important area in Web mining is Web usage mining, the discovery of patterns in the
browsing and navigation data of Web users. Web usage mining is the application of data mining
techniques to large Web data repositories in order to produce results that can be used in
design tasks. It is the process of mining for user browsing and access patterns, and it has
been an important technology for understanding users' behaviors on the Web. Its objects
include server logs, Web pages, Web hyperlink structures, on-line market data and other
information. Currently, most Web usage mining research has been focusing on the Web server
side.
The main purpose of research in Web usage mining is to improve a Web site's service and the
server's performance. Some of the data mining algorithms commonly used in Web usage mining
are association rule generation, sequential pattern generation and clustering. Association
rule mining techniques discover unordered correlations between items found in a database of
transactions; in the context of Web usage mining, a transaction is a group of Web page
accesses, with an item being a single page access.
A Web usage mining system can determine temporal relationships among data items. Web usage
mining focuses on analyzing visiting information from logged data in order to extract usage
patterns, which can be classified into three categories: similar user groups, relevant page
groups and frequent access paths. These usage patterns can be used to improve Web server
system performance and enhance the quality of service to the end users.
A Web server usually registers a Web log entry for every access to a Web page. There are many
types of Web logs, owing to different servers and different parameter settings, but all Web
log files share the same basic information. A Web log is usually saved as a text (.txt) file.
Due to the large amount of irrelevant information in the Web log, the original log cannot be
used directly in the Web log mining procedure. Through data cleaning, user identification,
session identification and path completion, the information in the Web log can be turned into
a transaction database for the mining procedure.
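For illustration, a single raw entry in a common Apache combined-format access log looks like the following (all values hypothetical); the client IP, timestamp, requested URL, status code, referrer and user agent are the fields that data cleaning and session identification work on:

192.168.10.24 - - [10/Jan/2009:09:41:17 +0530] "GET /courses/webmining/index.html HTTP/1.1" 200 10452 "http://www.example.com/courses/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"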
The Web site's topological structure is also used in session identification and path
completion. Web usage mining focuses on techniques that can predict the behavior of users
while they are interacting with the WWW, and it collects data from Web log records to discover
user access patterns of Web pages.
There are several available research projects and commercial products that analyze those
patterns for different purposes. The applications generated from this analysis can be classified
as personalization, system improvement, site modification, business intelligence and usage
characterization. Web usage mining has several applications in e-business, including
personalization, traffic analysis, and targeted advertising. The development of graphical
analysis tools such as Webviz popularized Web usage mining of Web transactions. The main
areas of research in this domain are Web log data preprocessing and identification of useful
patterns from this preprocessed data using mining techniques.
3.3 Web Structure Mining
Using data mining methods, Web pages can be automatically classified into a usable Web page
classification system organized by hyperlink structure. Web structure mining deals with the
connectivity of Web sites and the extraction of knowledge from their hyperlinks; it therefore
studies the Web's hyperlink structure. It usually involves analysis of the in-links and
out-links of a Web page, and it has been used for search engine result ranking. Search engines
can use such automatic classification of documents to index a mass of otherwise disordered
data on the Web. Initially, Web mining was classified into Web content mining and Web usage
mining by Cooley; later, Kosala and Blockeel added Web structure mining.
Web structure mining is an approach based on directory structures and web graph structures
of hyperlinks. Web structure mining is closely related to analyzing hyperlinks and link
structure on the web for information retrieval and knowledge discovery. Web structure
mining can be used by search engines to rank the relevancy between websites classifying
them according to their similarity and relationship between them. Personalization and
recommendation systems based on hyperlinks are also studied in web structure mining. Web
structure mining is used for identifying authorities, which are web pages that are pointed to
by a large set of other web pages that make them candidates of good sources of information.
Web structure mining is also used for discovering community networks by extracting
knowledge from similarity links.
Web structure mining is a research field focused on using the analysis of the link structure of
the web, and one of its purposes is to identify more preferable documents. Web structure
mining exploits the additional information that is (often implicitly) contained in the structure
of hypertext. Therefore, an important application area is the identification of the relative
relevance of different pages that appear equally pertinent when analyzed with respect to their
content in isolation.
Domain applications of Web structure mining that are of social interest include criminal
investigations and security on the Web, and digital libraries, where authoring, citations and
cross-references form the community of academics and their publications. With the growing
interest in Web mining, research on structure analysis has increased, and these efforts have
resulted in a newly emerging research area called link mining, which is located at the
intersection of work in link analysis, hypertext and Web mining, relational learning and
inductive logic programming, and graph mining. Recently, Getoor and Diehl introduced the term
link mining to put special emphasis on links as the main data for analysis and provided an
extended survey of the work related to link mining. The method is based on building a graph
out of a set of related data and applying social network theory to discover similarities.
There are many ways to use the link structure of the Web to create notions of authority. The
main goal in developing applications for link mining is to make good use of the understanding
of this intrinsic social organization of the Web.
3.4 Semantic Web Mining
Related to Web content mining is the effort to organize the semi-structured Web data into
structured collections of resources, leading to more efficient querying mechanisms and more
efficient information collection or extraction. This effort is the main characteristic of the
Semantic Web, which is considered the next Web generation, and it provides a method for
semantic analysis of Web pages. The Semantic Web is based on ontologies, which are metadata
related to the Web page content that make the site meaningful to search engines. Analysis of
Web pages is performed with regard to the unwritten and empirically proven agreement between
users and Web designers, using Web patterns. This method is based on the extraction of
patterns that are characteristic of a concrete domain; patterns provide a formalization of the
agreement and allow the assignment of semantics to parts of Web pages. In the Semantic Web,
adding semantics to a Web resource is accomplished through explicit annotation (based on an
ontology).
Semantic annotation is the process of adding formal semantics (metadata, knowledge) to Web
content for the purpose of more efficient access and management; its goals are, first, to
simplify querying and, second, to improve the relevance of answers. Currently, researchers are
working on the development of fully automatic methods for semantic annotation. We consider
semantic annotation, and the tracing of user behavior when querying in search engines, to be
important. There are currently two trends in the field of semantic analysis: one provides
mechanisms for semi-automatic page annotation and the creation of Semantic Web documents,
while the second approach prefers automatic annotation of real Internet pages. Web
content-mining techniques can accomplish the annotation process through ontology learning,
mapping, merging and instance learning. With the Semantic Web, page ranking is decided not
just by the approximated semantics of the link structure, but also by explicitly defined link
semantics expressed in OWL; thus, page ranking will vary depending on the content domain. Data
modeling of a complete Web site with an explicit ontology can enhance usage-mining analysis
through enhanced queries and more meaningful visualizations.
4 Different Approaches for Information Extraction on the Web
Word-based search in which keyword indices are used to find documents with specified
keywords or topics;
Querying deep Web sources where information hides behind searchable database query forms
and that cannot be accessed through static URL links;
Web linkage pointers are very useful in recent page ranking algorithms used in search
engines.
5 Important Operations on the Web
Pattern discovery is the key component of Web mining. It covers algorithms and techniques from
several research areas, such as data mining, machine learning, statistics and pattern
recognition. It has separate subsections as follows.
5.1 Classification
Classification is a method of assigning data items to one of a set of predefined classes.
There are several algorithms which can be used to classify data items or pages; some of them
are decision tree classifiers, naive Bayesian classifiers, k-nearest neighbor classifiers and
Support Vector Machines.
5.2 Clustering
Clustering is the grouping of similar data items or pages. Clustering of user information or
pages can facilitate the development and execution of future marketing strategies.
5.3 Association Rules
Association rule mining techniques can be used to discover unordered correlations between
items found in a database of transactions.
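As a small worked example with hypothetical numbers: if 40% of all user sessions contain both page A and page B, and 80% of the sessions that contain page A also contain page B, then the rule A -> B has support 0.4 and confidence 0.8; rules whose support or confidence fall below chosen thresholds are discarded during pattern analysis.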
5.4 Statistical Analysis
Statistical analysts may perform different kinds of descriptive statistical analyses based on
different variables when analyzing the session file. By analyzing the statistical information
contained in the periodic Web system report, the extracted report can be potentially useful
for improving system performance, enhancing the security of the system, facilitating the site
modification task and providing support for marketing decisions.
5.5 Sequential Pattern
This technique intends to find the inter-session pattern, such that a set of the items follows the
presence of another in a time-ordered set of sessions or episodes. Sequential patterns also
include some other types of temporal analysis such as trend analysis, change point detection,
or similarity analysis.
6 Problems with Web Mining
1. Due to the lack of a uniform structure, Web page complexity far exceeds the complexity
of any traditional text document collection. Moreover, a tremendous number of documents
on the Web have not been indexed, which makes searching the data they contain extremely
difficult.
2. The Web constitutes a highly dynamic information source. Not only does the Web
continue to grow rapidly, the information it holds also receives constant updates.
Linkage information and access records also undergo frequent updates.
3. The Web serves a broad spectrum of user communities. The Internet's rapidly
expanding user community connects millions of workstations. These users have
markedly different backgrounds, interests and usage purposes. Many lack good
knowledge of the information network's structure, are unaware of a particular
search's heavy cost, frequently get lost within the Web's ocean of information, and
can chafe at the many access hops and lengthy waits required to retrieve search results.
Only a small portion of the Web's pages contain truly relevant or useful information. A given
user generally focuses on only a tiny portion of the Web, dismissing the rest as uninteresting
data that serves only to swamp the desired search results. How can a search identify the
portion of the Web that is truly relevant to one user's interests? How can a search find
high-quality Web pages on a specified topic?
7 Conclusion and Future Directions
In fact, Web mining can be considered as the application of general data mining techniques to
the Web. In the face of information overload, Web mining is a new and promising research issue
that helps users gain insight into the overwhelming amount of information on the Web. We have
discussed the key component of Web mining, i.e. the mining process itself. In this paper we
presented a preliminary discussion of Web mining, including its definition, process and
taxonomy, and introduced Semantic Web mining and link mining. Web mining is a new research
field with great prospects, and its technology has wide application in the world, such as text
data mining on the Web, temporal and spatial sequence data mining on the Web, Web mining for
e-commerce systems, hyperlink structure mining of Web sites, and so on. A lot of work still
remains to be done in adapting known mining techniques as well as developing new ones.
Firstly, even though the Web contains a huge volume of data, that data is distributed across
the Internet, so before mining we need to gather the Web documents together. Secondly, Web
pages are semi-structured; for easy processing, documents should be extracted and represented
in some common format. Thirdly, Web information tends to be diverse in meaning, so training or
testing data sets should be large enough. Despite the difficulties above, the Web also
provides other ways to support mining; for example, the links among Web pages are an important
resource to be used.
information, users could also find other difficulties when interacting with the Web such as the
degree of quality of the information found, the creation of new knowledge out of the
information available on the Web, personalization of the information found and learning
about other users. The increasing demand of Web service can not be matched with the
increase in the server capability and network speed. Therefore, many alternative solutions,
such as cluster-based Web servers, P2P technologies and Grid computing have been
developed to reduce the response time observed by Web users. Accordingly, mining
distributed Web data is becoming recognized as a fundamental scientific challenge. Web
mining technology still faces many challenges. The following issues must be addressed:
1. There is a continual need to figure out new kinds of knowledge about user behavior
that needs to be mined for.
2. There will always be a need to improve the performance of mining algorithms along
both these dimensions.
3. There is a need to develop mining algorithms and develop a new model in an efficient
manner.
4. There is a need of integrated logs where all the relevant information from various
diversified sources in the Web environment can be kept to mine the knowledge more
comprehensively.
References
[1] http://acsr.anu.edu.au/staff/ackland/papers/political_web_graphs.pdf (accessed 15 January 2007).
[2] Facca, F.M., and Lanzi, P.L. (2005). Mining Interesting Knowledge from Weblogs: A Survey. Data & Knowledge Engineering, 53(3):225-241.
[3] Badia, A., and Kantardzic, M. (2005). Graph Building as a Mining Activity: Finding Links in the Small. Proceedings of the 3rd International Workshop on Link Discovery, ACM Press, pages 17-24.
[4] Chen, H., and Chau, M. (2004). Web Mining: Machine Learning for Web Applications. Annual Review of Information Science and Technology (ARIST), 38:289-329.
[5] Baldi, P., Frasconi, P., and Smyth, P. (2003). Modeling the Internet and the Web: Probabilistic Methods.
[6] L. Getoor, Link Mining: A New Data Mining Challenge. SIGKDD Explorations, vol. 4, issue 2, 2003.
[7] B. Berendt, A. Hotho, and G. Stumme, Towards Semantic Web Mining. Proc. US Nat'l Science Foundation Workshop on Next-Generation Data Mining (NGDM), Nat'l Science Foundation, 2002.
[8] J. Srivastava, P. Desikan, and V. Kumar, Web Mining: Accomplishments and Future Directions. Proc. US Nat'l Science Foundation Workshop on Next-Generation Data Mining (NGDM), Nat'l Science Foundation, 2002.
[9] Chakrabarti, S. (2000). Data Mining for Hypertext: A Tutorial Survey. ACM SIGKDD Explorations, 1(2):1-11.
[10] Wang Xiaoyan, Web Usage Mining, Ph.D. thesis, 2000.
[11] Cooley, R., Mobasher, B., and Srivastava, J. (1997). Web Mining: Information and Pattern Discovery on the World Wide Web. 9th International Conference on Tools with Artificial Intelligence (ICTAI '97), Newport Beach, CA, USA, IEEE Computer Society, pages 558-567.
Internet Based Production and Marketing
Decision Support System of Vegetable
Crops in Central India

Gigi A. Abraham, KVK, JNKVV, Jabalpur, gigiannee@gmail.com
B. Dass, IDSC, JNKVV, Jabalpur, bharati_dass@rediffmail.com
A.K. Rai, IDSC, JNKVV, Jabalpur, akrai_jnau@yahoo.co.in
A. Khare, IDSC, JNKVV, Jabalpur

Abstract

Given the ever-growing demand for vegetables for domestic consumption and the
enormous scope for exports, the per-hectare yield of vegetables can be
increased by using advanced technology. The role of the research system in
horticulture is to provide technological support to ever-expanding vegetable
production. To do this, agriculturists must be well informed about the
availability of different techniques such as high-yielding varieties, soil
fertility evaluation, fertilizer application, the importance of organic
manure, pest management, harvest and post-harvest technologies, and marketing.
In spite of successful research on new agricultural practices concerning crop
cultivation, the majority of farmers are not getting the best possible yield,
for several reasons. One of the reasons is that expert advice regarding crop
cultivation is not reaching the farming community in a timely manner. Indian
farmers need timely expert advice to make them more productive and
competitive. By exploiting the recent Internet revolution, we aim to build a
web-based information system, an IT-based effort to improve vegetable
production and its marketing. It will provide production and marketing
decision support to vegetable-growing farmers in Central India through the
Internet, and aims to provide dynamic and functional web-based vegetable
production and marketing support for farmers, agricultural extension agencies,
State agricultural departments, agricultural universities, etc. The
information system is developed using PHP and MySQL.
Keywords: Internet, vegetable production, marketing, Decision support
1 Introduction
India has surfaced as a country with a sound foothold in the field of Information technology.
Use of Internet has given the globe a shrinking effect. Every kind of information is only a few
clicks away. The graphical user interface and multimedia has simplified one of the most
complex issues in the world. The time has come to exploit this medium to the best-suited
interests in the other fields of life such as agriculture.
Today one can observe that progress in information technology (IT) is affecting all spheres of
our life. In any sector, information is the key to its development, and agriculture is no
exception. Giving relevant and correct information to farmers at the right time can help
agriculture a lot: it helps them to take timely action, prepare strategies for the next season
or year, anticipate market changes, and avoid unfavorable circumstances. So the development of
agriculture may depend on how quickly relevant information is provided to the end users. There
are traditional methods of providing information to the end users, but mostly they are
outdated and untimely, and the communication is one-way only; it takes a long time to provide
the information and to get feedback from the end users. Now it is time to look at new
technologies and methodologies.
The application of information technology to agriculture and rural development has to be
strengthened. It can help in optimal farm production, pest management, agro-environmental
resource management by way of effective information and knowledge transfer. Vegetables
form the most important component of a balanced diet. The recent emphasis on horticulture
in our country consequent to the recognition of the need for attaining nutrition security
provide ample scope for enhancing quality production of vegetable crops. In India there is a
wide gap in available technology and their dissipation to its end users. Due to this, majority
of the farmers are still using traditional agricultural practices. The skills of the farmers have
to be improved in the field of vegetable production and acquiring marketing intelligence. This
is possible only when complete and up to date knowledge of available technologies in all the
aspects of vegetable crop production, is made available to them with an easy and user-
friendly access from knowledge based resource. It is expected that if communications
through the Internet and the World Wide Web are looked into seriously, the efficiency of
delivering information could be increased.
The agriculture extension workers and farmers will be able to use the system for finding
answers to their problems and also for growing better varieties of vegetable crops having very
good marketing potential. Rural people can use the two-way communication through online
service for crop information, purchases of agri-inputs, consumer durable and sale of rural
produce online at reasonable price. The system will be developed in such a way that the user
can interact with the software for obtaining information for a set of vegetable crops.
2 Material and Methods
Data are collected from the farmers by on-site inspection of fields and from secondary sources
such as expert interviews, the Internet, literature and manuals. Data are gathered on
agro-climatic regions, economic and field information, crop information, recommended varieties
for each zone, nursery management details, fertilizer management, irrigation, intercropping,
weed management, insect management and disease management. Photographs, video clips and audio
clips are collected. Marketing details for the last 10 years are also collected.
The information collected is stored in a MySQL database. The decision support system is
bilingual (English and Hindi), so that the end users can make use of it effectively. The
system is developed in PHP (Hypertext Preprocessor) running on an Apache server.
Information collected from farmers and experts is compiled and stored in a relational database
(MySQL). The decision support system is built on this database and is made available to end
users through the Internet.

Fig. 1: Information Flow of the System
3 System Description
This user friendly decision making system is developed in both English and Hindi.



This system provides production technologies and guidelines to the farmers according to the
agro-climatic zone and the crop they cultivate. It also provides marketing information such as
present crop demand, prices, etc., as well as information regarding pest, disease, weed and
nutrient management.
Nursery management is a very important aspect of vegetable crops, and farmers generally lack
this information to a great extent; the system will cover this feature in detail by
incorporating video clips supported by audio and text. Fertilizer management of the vegetable
crop will also be covered in the system: it will display the rate of fertilizer application,
including organic manure and compost application, at the different stages of plant growth,
along with the application methods. The irrigation requirements of vegetable crops will be
incorporated in the software, with emphasis on the timing and method of irrigation as per the
requirements of each crop.
The diagnostic module deals with solving the problems of vegetable growers using the database
intelligence. In the database for each vegetable crop, the different problems that occur in
that crop will be stored along with their best possible solutions and remedies. All of this
will be used to give answers to the vegetable growers/farmers.
4 Conclusion
In this paper, we make an effort to improve the utilization and performance of agriculture
technology by exploiting recent progress in Information technology (Internet) by developing
a decision support system on production technologies and marketing of vegetables. This
system may serve as an aid to farmers and experts in furnishing knowledge on various
aspects of vegetable production; however, it will not replace the experts. The vegetable
sector suffers from a lack of good quality planting material, low use of hybrid seeds and poor
farm management. Hence the production technology needs qualitative and quantitative
improvement, so that the standards of the vegetable produce can be further improved to meet
the requirements of the international market.
In this system, the recent technologies available in the country will be collected and
incorporated in different modules so that the user can easily access all the modules, or the
module of his or her interest. The decision support system is available for the following
vegetables: Tomato, Brinjal, Chilli, Cauliflower, Cabbage, Bottle gourd, Cucumber, Sponge
gourd, Bitter gourd, Onion, Garlic, Okra (Bhindi), Garden pea, Cow pea and French bean.
Information technology will play a pivotal role in agricultural extension activities, and this
will definitely help the extension workers, as most village panchayats may have personal
computers in the near future. The ICAR is going to connect all the Krishi Vigyan Kendras
(KVKs) through a VSAT network very soon.
Fault Tolerant AODV Routing Protocol
in Wireless Mesh Networks

V. Srikanth, Department of IST, KLCE, Vaddeswaram, Srikanth_ist@klce.ac.in
T. Sai Kiran, Department of IST, KLCE, Vaddeswaram, y6it282@klce.ac.in
A. Chenchu Jeevan, Department of IST, KLCE, Vaddeswaram, y6it224@klce.ac.in
S. Suresh Babu, Department of IST, KLCE, Vaddeswaram, y6it314@klce.ac.in

Abstract

Wireless Mesh Networks (WMNs) are believed to be a highly promising
technology and will play an increasingly important role in future-generation
wireless mobile networks. In any network, finding the destination node is the
fundamental task. This can be achieved by various routing protocols. AODV
(Ad-hoc On-demand Distance Vector) is one of the widely used routing
protocols and is currently undergoing extensive research and development. The
AODV routing protocol is efficient in establishing a path to the destination
node, but when a link in the path crashes or breaks, the protocol takes more
time to find another efficient path to the destination.
We extend AODV to resolve the above problem by making use of sequence numbers
and hop counts, which increases the efficiency of the protocol. This paper
also explains how the use of sequence numbers has additional advantages in
finding the shortest path to the destination node.
1 Introduction
Wireless mesh networks (WMNs) have emerged as a key technology for next-generation
wireless networking [1]. WMNs are characterized by dynamic self-organization, self-
configuration and self-healing, which enable quick deployment, easy maintenance, low cost,
high scalability and reliable services, as well as enhanced network capacity, connectivity and
resilience.
Routing plays an important role in any type of network. The main task of routing protocols
is path selection between the source node and the destination node. This has to be done
reliably, quickly and with minimal overhead. In general, routing protocols can be classified
into topology-based and position-based routing protocols [2]. Topology-based routing
protocols select paths based on topological information, such as links between nodes.
Position-based routing protocols select paths based on geographical information using
geometric algorithms. There are also routing protocols that combine these two concepts.
Topology-based routing protocols are further distinguished into reactive, proactive and
hybrid routing protocols. Reactive protocols, e.g. AODV and DSR, compute a route only
when it is needed. This reduces the control overhead but introduces latency for the first
packet to be sent, due to the time needed for the on-demand route setup. In proactive
routing protocols, e.g. OLSR, every node knows a route to every other node at all times.
There is no latency, but permanent maintenance of unused routes increases the control
overhead. Hybrid routing protocols try to combine the advantages of both philosophies:
proactive routing is used for nearby nodes or frequently used paths, while reactive routing
is used for more distant nodes or less frequently used paths.
2 AODV Protocol
AODV is a very popular reactive routing protocol. Routes are set up on demand, and only
active routes are maintained [3]. This reduces the routing overhead but introduces some
initial latency due to the on-demand route setup.
AODV uses a simple request-reply mechanism for the discovery of routes. It can use hello
messages for connectivity information, and it signals link breaks on active routes with error
messages. Every piece of routing information has a timeout associated with it as well as a
sequence number. The use of sequence numbers allows outdated data to be detected, so that
only the most current available routing information is used. This ensures freedom from
routing loops and avoids problems known from classical distance vector protocols, such as
counting to infinity. When a source node wants to send data packets to a destination node
but does not have a route to the destination in its routing table, a route discovery has to be
performed by the source node. The data packets are buffered during the route discovery.
2.1 Working of AODV Protocol
2.1.1 Broadcasting RREQ Packet
The source node broadcasts a route request (RREQ) throughout the network. In addition to
several flags, an RREQ packet contains the hop count, an RREQ identifier, the destination IP
address, the destination sequence number, the originator IP address and the originator
sequence number. The hop count gives the number of hops that the RREQ has traveled so
far. The RREQ ID combined with the originator IP address uniquely identifies a route
request. This is used to ensure that a node rebroadcasts a route request only once, in order to
avoid broadcast storms, even if the node receives the RREQ several times from its neighbors.
The Destination Sequence Number field in the RREQ message is the last known destination
sequence number for this destination and is copied from the Destination Sequence Number
field in the routing table of the source node [4]. If no sequence number is known, the
unknown sequence number (U) flag MUST be set. The Originator Sequence Number in the
RREQ message is the node's own sequence number, which is incremented prior to insertion
in the RREQ. The RREQ ID field is incremented by one from the last RREQ ID used by the
current node; each node maintains only one RREQ ID. The Hop Count field is set to zero.
Before broadcasting the RREQ, the originating node buffers the RREQ ID and the Originator
IP address (its own address) of the RREQ for PATH_DISCOVERY_TIME. In this way, when
the node receives the packet again from its neighbors, it will not reprocess and re-forward
the packet. The originating node often expects bidirectional communication with the
destination node. In this case a two-way path must be discovered between the destination
and the source node, so the gratuitous flag (G) in the RREQ packet must be set. This
indicates that the destination must send an RREP packet to the source after discovering a
route.
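As a quick illustration, the sketch below models the RREQ fields listed above as a plain data
structure. The field names and the example values are illustrative assumptions loosely
following the AODV specification; they are not code from this paper.

    from dataclasses import dataclass

    @dataclass
    class RREQ:
        rreq_id: int               # incremented by one for each new RREQ from this node
        hop_count: int             # hops traveled so far; set to zero by the originator
        dest_addr: str             # destination IP address
        dest_seq_num: int          # last known destination sequence number
        orig_addr: str             # originator IP address
        orig_seq_num: int          # originator's own sequence number, incremented before insertion
        unknown_seq: bool = False  # 'U' flag: no destination sequence number is known
        gratuitous: bool = False   # 'G' flag: the destination should also be sent an RREP

    # Example: source 10.0.0.1 starts a discovery for 10.0.0.7 with no known
    # destination sequence number and bidirectional communication expected.
    rreq = RREQ(rreq_id=5, hop_count=0, dest_addr="10.0.0.7", dest_seq_num=0,
                orig_addr="10.0.0.1", orig_seq_num=12, unknown_seq=True, gratuitous=True)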
When an RREQ packet is received by an intermediate node, the node compares the source
sequence number in the RREQ packet with the sequence number in its own routing table. If
the sequence number in the RREQ packet is greater than the one in the routing table, the
intermediate node recognizes that a fresh route is required by the source; otherwise the
packet is discarded. If this condition is satisfied, the intermediate node checks whether its
routing table contains a valid path to the destination. If such a path exists, it checks the
gratuitous flag in the RREQ; if the G flag is set, the intermediate node sends an RREP packet
to both the source node and the destination. When there is no path to the destination from
the intermediate node, or when a link in the active route breaks, an RERR packet is sent to
the source node indicating that the destination cannot be reached. The source node can
restart the discovery process if it still needs a route; the source node or an intermediate node
can rebuild the route by sending out a new route request.
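The following sketch summarizes this intermediate-node behaviour. The routing-table
layout and the send_rrep/rebroadcast callbacks are assumptions made for illustration; the
checks follow the prose above rather than the complete AODV specification.

    def handle_rreq(rreq, routing_table, seen_rreqs, send_rrep, rebroadcast):
        # Process each (originator address, RREQ ID) pair only once to avoid broadcast storms.
        key = (rreq.orig_addr, rreq.rreq_id)
        if key in seen_rreqs:
            return
        seen_rreqs.add(key)

        # Freshness check: the source is asking for a route fresher than what is already known.
        origin = routing_table.get(rreq.orig_addr)
        if origin is not None and rreq.orig_seq_num <= origin.seq_num:
            return  # stale request, discard

        dest = routing_table.get(rreq.dest_addr)
        if dest is not None and dest.valid:
            # A valid route to the destination is already known: reply towards the source,
            # and also towards the destination when the gratuitous (G) flag is set.
            send_rrep(rreq.orig_addr, dest)
            if rreq.gratuitous:
                send_rrep(rreq.dest_addr, origin)
        else:
            # No usable route here: count this hop and keep the request moving.
            rreq.hop_count += 1
            rebroadcast(rreq)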
2.1.2 Handling RERR Packet
When an RERR is received by the source, the source retries by rebroadcasting the RREQ
message, subject to the following conditions:
A source node should not broadcast more than RREQ_RATELIMIT RREQ messages per
second.
A source node that generates an RREQ message waits for NET_TRAVERSAL_TIME
milliseconds to receive any control message regarding the route.
If no control message regarding the path is received, the source node broadcasts the RREQ
message again (2nd try), up to a maximum of RREQ_RETRIES attempts.
At each new attempt the RREQ ID must be incremented.
After sending the RREQ packet, the source node buffers data packets in first-in, first-out
(FIFO) order.
To reduce congestion in the network, repeated attempts by a source node use binary
exponential backoff. The first time a source node broadcasts an RREQ, it waits
NET_TRAVERSAL_TIME milliseconds for the reception of an RREP. If an RREP is not
received within that time, the source node sends a new RREQ. When calculating the time to
wait for the RREP after sending the second RREQ, the source node must wait for
2 * NET_TRAVERSAL_TIME milliseconds, and so on; a sketch of this backoff loop follows.
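A minimal sketch of that retry-and-backoff loop is given below. The timing values and the
broadcast_rreq/lookup_route callbacks are assumptions for illustration only, not parameters
taken from the paper.

    import time

    NET_TRAVERSAL_TIME = 2.8   # seconds to wait after the first RREQ (illustrative value)
    RREQ_RETRIES = 2           # maximum retries after the first attempt (illustrative value)

    def discover_route(broadcast_rreq, lookup_route, dest_addr):
        # Each unanswered RREQ doubles the waiting time: binary exponential backoff.
        wait = NET_TRAVERSAL_TIME
        for attempt in range(1 + RREQ_RETRIES):
            broadcast_rreq(dest_addr)          # the RREQ ID is incremented on every attempt
            deadline = time.time() + wait
            while time.time() < deadline:
                route = lookup_route(dest_addr)
                if route is not None:          # an RREP (or other control message) arrived
                    return route
                time.sleep(0.05)
            wait *= 2                          # 2 * NET_TRAVERSAL_TIME, then 4 *, and so on
        return None                            # give up; the buffered data packets are dropped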
When a node detects that it cannot communicate with one of its neighbors, it looks in its
routing table for routes that use that neighbor as the next hop and marks them as invalid. It
then sends out an RERR listing the neighbor and the invalidated routes. When such route
errors and link breakages do not occur, the RREQ packets reach the destination; the same
procedure is followed until the RREQ packet reaches the destination.
When a new route to the destination is discovered, the destination sequence number is
updated to the maximum of the current sequence number and the destination sequence
number in the RREQ packet. The destination sequence number is incremented by one
immediately before the RREP packet is sent. When the destination increments its sequence
number, it must do so by treating the sequence number value as if it were an unsigned
number. To accomplish sequence number rollover, if the sequence number has already been
assigned the largest possible value representable as a 32-bit unsigned integer (i.e.,
4294967295) [7], then when it is incremented it wraps around to zero (0). On the other hand,
if the sequence number currently has the value 2147483647, which is the largest possible
positive integer if 2's complement arithmetic is in use with 32-bit integers, the next value
will be 2147483648, which is the most negative possible integer in the same numbering
system. The representation of negative numbers is not relevant to the increment of AODV
sequence numbers; this is in contrast to the manner in which the result of comparing two
AODV sequence numbers is to be treated. After setting the destination sequence number,
the destination sends the RREP packet to the source through the intermediate nodes. In each
intermediate node's routing table, the destination sequence number in the RREP packet is
compared with the destination sequence number in the routing table, and the intermediate
node updates its table entry with the sequence number in the RREP packet. Finally, a valid
path is set up from source to destination.
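The increment and comparison rules described above can be made concrete with a short
sketch. The helper names are illustrative assumptions; the arithmetic follows the 32-bit
behaviour just described, with the comparison treating the difference as a signed value.

    SEQ_MASK = 0xFFFFFFFF  # AODV sequence numbers are 32-bit values

    def increment_seq(seq):
        # Unsigned 32-bit increment: 4294967295 rolls over to 0,
        # and 2147483647 simply becomes 2147483648 (signedness is ignored).
        return (seq + 1) & SEQ_MASK

    def seq_is_newer(a, b):
        # Comparison, by contrast, treats the difference as a signed 32-bit value,
        # so freshness checks keep working across the rollover point.
        diff = (a - b) & SEQ_MASK
        if diff >= 0x80000000:
            diff -= 0x100000000
        return diff > 0

    print(increment_seq(4294967295))      # -> 0
    print(seq_is_newer(0, 4294967295))    # -> True: 0 is the fresher number after rollover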
2.2 Drawback
The AODV routing protocol is efficient in establishing a path to the destination node.
However, when a link in the path fails, the intermediate node, according to this protocol,
sends an RERR packet to the source, and the source again broadcasts an RREQ packet to
find a new route to the destination. This process takes considerable time to find another
efficient path to the destination.
Another drawback of the AODV protocol is that whenever there is heavy traffic on the
shortest path, the RREQ packet chooses another path to the destination, so the resulting
path may not be the shortest one.
3 Proposed Protocol
In the AODV protocol explained above, the destination accepts only the first RREQ packet
that reaches it and ignores all packets that arrive later. This causes the two drawbacks above.
To eliminate these drawbacks and make the protocol more efficient in case of link failure, we
extend AODV with the following changes. The idea is to introduce a new field, the 2nd route
hop, in every node's routing table.
As in the original protocol, the RREQ packets are broadcast by the source node in all
directions [7].
All RREQ packets are received by the destination, and the destination generates RREP
packets for all RREQs, incrementing the destination sequence number only for the first
RREQ packet, unlike the original protocol.
All the RREP packets generated by the destination are received by the intermediate nodes.
An intermediate node that receives more than one RREP packet updates its routing table
fields (next hop, 2nd route hop) with the paths having the least and the next-least hop count
values, respectively (see the sketch after this list).
So when a link breaks at an intermediate node, the alternative path to the destination can be
found from that intermediate node itself, without sending a route error to the source node.
This resolves both of the drawbacks above.
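The sketch below shows one way an intermediate node might merge several RREPs into the
two hop fields, keeping the smallest and the next-smallest hop counts. The dictionary-based
entry and the helper name are assumptions for illustration, not the paper's implementation.

    def record_rrep(entry, next_hop, hop_count):
        if hop_count < entry.get("hop_count", float("inf")):
            # New shortest path: the previous best drops down to the 2nd route hop.
            entry["second_hop"] = entry.get("next_hop")
            entry["second_hop_count"] = entry.get("hop_count", float("inf"))
            entry["next_hop"], entry["hop_count"] = next_hop, hop_count
        elif hop_count < entry.get("second_hop_count", float("inf")):
            entry["second_hop"], entry["second_hop_count"] = next_hop, hop_count

    # RREPs for destination 6 arrive at node 2 via node 5 (2 hops) and node 3 (3 hops):
    entry = {}
    record_rrep(entry, next_hop=5, hop_count=2)
    record_rrep(entry, next_hop=3, hop_count=3)
    print(entry["next_hop"], entry["second_hop"])   # -> 5 3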
The following example illustrates the proposed idea.

Actual routing table for node 2:
Node   Next hop   Seq #   Hop count
1      1          120     1
3      3          136     1
4      3          140     2
5      5          115     1
6      5          141     2
If link from 2 to 5 crashes then RERR is sent to source node and again RREQ is generated.
This can be eliminated by using the following routing table.
Proposed routing table for node 2:
Node   Next hop   Seq #   Hop count   2nd route hop
1      1          120     1           --
3      3          136     1           --
4      3          140     2           --
5      5          115     1           --
6      5          141     2           3

Whenever the link from 2 to 5 crashes, we choose the alternative path, i.e. from 2 to 3, 3 to 4
and 4 to 6, to reach the destination without broadcasting the RREQ packets from the source
again.
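A minimal sketch of this fallback lookup is shown below. The table layout and the link_up
callback are illustrative assumptions; the values mirror node 2's proposed table above.

    def next_hop_for(table, dest, link_up):
        # Prefer the primary next hop; if that link is down, fall back to the
        # 2nd route hop instead of sending an RERR back to the source.
        entry = table[dest]
        if link_up(entry["next_hop"]):
            return entry["next_hop"]
        second = entry.get("second_hop")
        if second is not None and link_up(second):
            return second              # use the alternative path locally
        return None                    # only now would an RERR be generated

    # Node 2's entry for destination 6 from the proposed table above:
    table2 = {6: {"next_hop": 5, "seq_num": 141, "hop_count": 2, "second_hop": 3}}

    # With the 2-5 link down, traffic for node 6 is forwarded via node 3 instead:
    print(next_hop_for(table2, 6, link_up=lambda n: n != 5))   # -> 3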
4 Conclusion
The fundamental goal of any routing algorithm is to find the destination in as little time as
possible. The proposed protocol effectively finds the destination in very little time even
when a link breaks along the active path. We also aim to store more than two paths by
constructing a dynamic routing table.
References
[1] Yan Zhang, Jijun Luo and Honglin Hu, Wireless Mesh Networking, Auerbach Publications.
[2] Luke Klein-Berndt, A Quick Guide to AODV Routing, NIST, US Dept. of Commerce.
[3] Kullberg, Performance of the Ad hoc On-Demand Distance Vector Routing Protocol.
[4] Manel Zapata, Secure Ad hoc On-Demand Distance Vector (SAODV) Routing, Internet Draft
    draft-guerrero-manet-saodv-06.txt, September 2006.
[5] C. Perkins (Nokia Research Center) and S. Das (University of Cincinnati), Ad hoc On-Demand Distance
    Vector (AODV) Routing.
[6] IETF MANET Working Group AODV Draft, http://www.ietf.org/internet-drafts/draft-ietf-manet-aodv-08.txt
[7] C.E. Perkins, Ad Hoc Networking, Addison-Wesley Professional, Reading, MA, 2001.
Author Index
A
Abraham, Gigi A., 495
Ahmad, Nesar, 344
Ahmad, Rehan, 209
Ahmad, S. Kashif, 260
Ahmad, Tauseef, 209
Akhtar, Nadeem, 344
Anitha, S., 423
Anuradha, T., 76
Anusha, P. Sai, 480
Arunakumari, D., 76
B
Babu, D. Ramesh, 283
Babu, G. Anjan, 308
Babu, S. Suresh, 500

Balaji, S., 313, 473
Balasubramanian, V., 8
Baliarsingh, R., 239
Basal, G.P., 486
Bawane, N.G., 273, 464
Bhanu, J. Sasi, 13
Bhargavi, R. Lakshmi, 480
Bhattacharya, M., 338
Bindu, C. Shoba, 195, 351
Bisht, Kumar Saurabh, 46
C
Chakraborty, Partha Sarathi, 295
Chand, K. Ram, 435
Chandrasekharam, R., 407
Chaudhary, Sanjay, 46
Chhaware, Shaikh Phiroj, 429
Choudhury, Shubham Roy, 203
D
Damodaram, A., 366
Das, Pradip K., 338
Dasari, Praveen, 117
Dass, B., 495
David, K., 8
Deiv, D. Shakina, 338
Dumbre, Swapnili A., 461
G
Gadiraju, N.V.G. Sirisha, 39
Garg, Nipur, 57
Gawate, Sachin P., 273
Govardhan, A., 67
Gupta, Surendra, 186
H
Hande, K.N., 109
Hari, V.M.K., 29
Harinee, N.U., 423
Hong, Yan, 301
I
Iqbal, Arshad, 251
J
Jasutkar, R.W., 461
Jeevan, A. Chenchu, 500
Jena, Gunamani, 239
Jiwani, Moiaz, 203
Joglekar, Nilesh, 273
Jonna, Shanti Priyadarshini, 148
Juneja, Mamta, 359
Jyothi, Ch. Ratna, 383
K
Karmore, Pravin Y., 461
Kasana, Robin, 266
Khaliq, Mohammed Abdul, 117
Khan, M. Siddique, 209
Khare, A., 495

Kiran, K. Ravi, 366
Kiran, K.V.D., 283
Kiran, P. Sai, 226
Kiran, T. Sai, 500
Kota, Rukmini Ravali, 29
Kothari, Amit D., 71
Krishna, T.V. Sai, 139
Krishna, V. Phani, 415
Krishna, Y. Rama, 473
Krishnan, R., 216
Kumar, K. Sarat, 443
Kumar, K. Hemantha, 333
Kumar, V. Kiran, 3
Kumaravelan, G., 8
L
Lakshmi, D. Rajya, 366
M
Mallik, Latesh G., 429
Mangla, Monika, 52

N
Niyaz, Quamar, 260

O
Ong, J.T., 301
P
P.K., Chande, 486
Padaganur, K., 234
Pagoti, A.R. Ebhendra, 117
Patel, Dharmendra T., 71
Patra, Manas Ranjan, 203
Prakash, V. Chandra, 13
Pramod, Dhanya, 216
Prasad, E.V., 101, 161
Prasad, G.M.V., 239
Prasad, Lalji, 180
Prasad, V. Kamakshi, 156
Praveena, N., 313
Pujitha, M., 480
Q
Qadeer, Mohammad A., 209, 251, 260, 266
R
Radhika, P., 21
Rai, A.K., 495
Rajeshwari, 234
Raju, G.V. Padma, 39
Ramadevi, Y., 383
Ramesh, N.V.K., 301
Ramesh, R., 327
Ramu, Y., 83
Rani, T. Sudha, 139
Rao, G. Gowriswara, 351
Rao, H. D. Narayana, 443
Rao, S. Vijaya Bhaskara, 443
Rao, B. Mouleswara, 313
Rao, B. Thirumala, 399
Rao, B.V. Subba, 452
Rao, D.N. Mallikarjuna, 156
Rao, G. Rama Koteswara, 435
Rao, G. Sambasiva, 101
Rao, G. Siva Nageswara, 435
Rao, K. Rajasekhara, 3, 13, 415
Rao, K. Thirupathi, 283, 399
Rao, K.V. Sambasiva, 452
Rao, K.V.S.N. Rama, 203
Rao, P. Srinivas, 90
Rao, S. Srinivasa, 283
Rao, S. Vijaya Bhaskara, 301
Rao, S.N. Tirumala, 101
Ravi, K.S., 301, 376, 473
Reddy, C.H. Pradeep, 327
Reddy, K. Shyam Sunder, 195
Reddy, M. Babu, 407
Reddy, K. Krishna, 376
Reddy, K. Sudheer, 125
Reddy, L.S.S., 393, 399
Reddy, V. Krishna, 399
Reddy, P. Ashok, 125
Reddy, V. Venu Gopalal, 376
Renuga, R., 423
Riyazuddiny, Y. Md., 376
S
Sadasivam, Sudha, 423
Saikiran, P., 399
Samand, Vidhya, 180
Santhaiah, 308
Saritha, K., 366
Sarje, A.K., 174
Sastry, J.K.R., 13
Satya, Sridevi P., 322
Satyanarayan, S. Patil, 234
Sayeed, Sarvat, 266
Shaikh, Sadeque Imam, 168
Shanmugam, G., 301
Sheti, Mahendra A., 464
Shuklam, Rajesh K., 486
Singh, Inderjeet, 57
Soma, Ganesh, 148
Somavanshi, Manisha, 216
Soni, Preeti, 57
Sowmya, R., 423
Srikanth, V., 500
Srinivas, G., 29
Srinivas, M., 161
Srinivasulu, D., 327
Sriranjani, B., 423
Suman, M., 76, 480
Supreethi, K.P., 161
Surendrababu, K., 186
Suresh, K., 67, 90
Surywanshi, J.R., 109
T
Tewari, Vandan, 57
Thrimurthy, P., 407
V
Varma, G.P.S., 125
Varma, T. Siddartha, 29
Vasumathi, D., 67, 90
Venkateswarlu, N.B., 101
Venkatram, N., 393
Vishnuvardhan, M., 283
Y
Yadav, Rashmi, 180
Yarramalle, Srinivas, 322
Yerajana, Rambabu, 174
Z
Zahid, Mohammad, 251
