G.KALPANA
(Reg.No.106094050089)
of
JAYA ENGINEERING COLLEGE
A PROJECT REPORT
Submitted to the
FACULTY OF INFORMATION AND COMMUNICATION ENGINEERING
In partial fulfillment of the requirements
for the award of the degree
of
MASTER OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING
ANNA UNIVERSITY
CHENNAI 600025
December 2010
M. KUMARAN
Department of CSE,
Thiruninravur-602024.
R. PRASANNA KUMAR
Department of CSE,
Thiruninravur-602024.
INTERNAL EXAMINER
EXTERNAL EXAMINER
ABSTRACT
Record matching is an essential step in duplicate detection, as it identifies records representing
the same real-world entity. Supervised record matching methods require users to provide training
data and therefore cannot be applied to web databases, where query results are generated
on-the-fly. To overcome this problem, a new record matching method named Unsupervised Duplicate
Elimination (UDE) is proposed for identifying and eliminating duplicates among records in
dynamic query results. The idea of this paper is to adjust the weights of record fields when
calculating similarities among records. Three classifiers, namely a weighted component similarity
summing classifier, a support vector machine classifier and a one-class support vector machine
classifier, are employed iteratively within UDE. The first classifier uses the field weights to
match records from different data sources. With the matched records as the positive set and the
non-duplicate records as the negative set, the second classifier identifies new duplicates. Then, the
one-class support vector machine classifier is employed to detect further duplicates. The iteration
stops when no more duplicates can be identified. Thus, this paper takes advantage of the dissimilarity
among records from web databases and solves the online duplicate detection problem.
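The first classifier described above can be illustrated with a minimal sketch. It is not the implementation used in this project: the string measure (Python's difflib.SequenceMatcher), the function names, the example weights and the threshold of 0.85 are all assumptions chosen only to show how field weights enter the similarity of two records.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity of two field values in [0, 1].

    SequenceMatcher is a stand-in measure assumed for this sketch.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1, r2, weights):
    """Weighted sum of per-field similarities.

    The weights are normalised so that they sum to 1, which keeps the
    record similarity in [0, 1].
    """
    total = sum(weights)
    return sum(
        (w / total) * field_similarity(x, y)
        for w, x, y in zip(weights, r1, r2)
    )


def is_duplicate(r1, r2, weights, threshold=0.85):
    """Declare a duplicate when the weighted similarity reaches the threshold.

    The threshold value is hypothetical.
    """
    return record_similarity(r1, r2, weights) >= threshold


# Hypothetical records with fields (title, author, year).
a = ("Data Mining Concepts", "J. Han", "2006")
b = ("Data Mining: Concepts", "Jiawei Han", "2006")
print(record_similarity(a, b, weights=[2, 1, 1]))
```

Raising the weight of a field makes disagreement in that field count more heavily against a match, which is the effect the weight adjustment in UDE relies on.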
ACKNOWLEDGEMENT
At the outset, I would like to express my sincere thanks to Prof. Dr. R. Raja, Principal,
for his valuable support. It is with a deep sense of gratitude that I record my sincere thanks to
Asst. Prof. M. Kumaran, Head of the Department, Computer Science and Engineering, for his
valuable guidance and support throughout the course.
With immense pleasure I record my deep sense of indebtedness and gratitude to the
coordinators, Asst. Prof. A. Fidal Castro and Asst. Prof. V. Vijayaraja, who were a source of
inspiration. I take this opportunity to thank my Supervisor, Asst. Prof. R. Prasanna Kumar, for
motivating me to study this field and for his illuminating guidance and continuous support in
the planning and execution of this thesis.
I also thank my parents, who aided me in completing the project. I owe my acknowledgements
to one and all who directly or indirectly aided me in completing the project. Although it is
impossible to thank every helpful faculty member and everyone else involved individually,
I take this opportunity to express my gratitude to them.
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENT
LIST OF FIGURES
1. INTRODUCTION
5. TESTING
6. RESULTS
7.1 Conclusion
APPENDIX
REFERENCES
LIST OF FIGURES
S. No.    Fig. No.    Figure Name
1         3.1         System Architecture
2         3.1.1
3         3.1.2
4         3.1.3       Activity Diagram
5         3.1.4       Sequence Diagram
6         3.1.5       Class Diagram
7         3.2.1       Module 1: Authentication
8         3.2.2
9         3.2.3
10        6.1
11        A1          Login Page
12        A2          Registration Page
13        A3          Main Page
14        A4          Search Results
15        A5
16        A6
SYMBOLS
ABBREVIATIONS
Entity
Condition
Process
Flow of Operation
Actor