Title, Front Page Etc by Annaunihub - Blogspot.in

CLASSIFIER BASED DUPLICATE RECORD ELIMINATION
FOR QUERY RESULTS FROM WEB DATABASES

By
G.KALPANA
(Reg.No.106094050089)
of
JAYA ENGINEERING COLLEGE
A PROJECT REPORT
Submitted to the
FACULTY OF INFORMATION AND COMMUNICATION ENGINEERING
In partial fulfillment of the requirements
for the award of the degree
of
MASTER OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING
ANNA UNIVERSITY
CHENNAI 600025
December 2010
ANNA UNIVERSITY: CHENNAI 600 025

BONAFIDE CERTIFICATE
Certified that this project titled CLASSIFIER BASED DUPLICATE RECORD
ELIMINATION FOR QUERY RESULTS FROM WEB DATABASES is the
bonafide work of Ms. G.KALPANA (10609405005), who carried out the research under my
supervision. Certified further that, to the best of my knowledge, the work reported herein does not
form part of any other project report or dissertation on the basis of which a degree or award was
conferred on an earlier occasion on this or any other candidate.
M. KUMARAN
Head of the Department,
Department of CSE,
Jaya Engineering College,
Thiruninravur-602024.

R. PRASANNA KUMAR
Supervisor & Asst. Professor,
Department of CSE,
Jaya Engineering College,
Thiruninravur-602024.
VIVA - VOCE EXAMINATION

The Viva - Voce examination of the project work CLASSIFIER BASED DUPLICATE
RECORD ELIMINATION FOR QUERY RESULTS FROM WEB DATABASES
submitted by G.KALPANA (10609405005), held on .
INTERNAL EXAMINER
EXTERNAL EXAMINER
ABSTRACT
Record matching is an essential step in duplicate detection because it identifies records that
represent the same real-world entity. Supervised record matching methods require users to provide
training data and therefore cannot be applied to web databases, where query results are generated
on the fly. To overcome this problem, a new record matching method named Unsupervised Duplicate
Elimination (UDE) is proposed for identifying and eliminating duplicates among records in
dynamic query results. The idea of this paper is to adjust the weights of record fields when
calculating similarities among records. Three classifiers, namely the weighted component similarity
summing (WCSS) classifier, the support vector machine (SVM) classifier and the one-class support
vector machine (OSVM) classifier, are employed iteratively with UDE. The first classifier uses the
field weights to match records from different data sources. With the matched records as the positive
set and the non-duplicate records as the negative set, the second classifier identifies new duplicates.
The one-class support vector machine classifier is then employed to detect further duplicates. The
iteration stops when no new duplicates can be identified. Thus, this paper takes advantage of the
dissimilarity among records from web databases and solves the online duplicate detection problem.
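As an illustration of the first classifier only, the following is a minimal sketch of weighted component similarity summing, not the implementation used in this project. The field similarity measure (difflib's SequenceMatcher), the fixed weights, the threshold of 0.85 and the sample records are all assumptions introduced for the example; in UDE the weights are adjusted rather than fixed.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two field values (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def wcss_similarity(rec_a, rec_b, weights) -> float:
    """Weighted sum of per-field similarities; weights are assumed to sum to 1."""
    return sum(w * field_similarity(x, y)
               for x, y, w in zip(rec_a, rec_b, weights))


def find_duplicates(source_a, source_b, weights, threshold=0.85):
    """Return index pairs (i, j) whose weighted similarity reaches the threshold."""
    return [(i, j)
            for i, rec_a in enumerate(source_a)
            for j, rec_b in enumerate(source_b)
            if wcss_similarity(rec_a, rec_b, weights) >= threshold]


# Hypothetical query results from two web databases: (title, author, year).
source_a = [("Data Mining", "J. Han", "2006")]
source_b = [("data mining", "J. Han", "2006"),
            ("Compilers", "A. Aho", "1986")]
weights = [0.5, 0.3, 0.2]  # assumed field weights

pairs = find_duplicates(source_a, source_b, weights)
print(pairs)
```

In the full method, pairs found this way would serve as the positive set for the SVM classifier, and the two classifiers would alternate until no new duplicates are found.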
ACKNOWLEDGEMENT
At the outset, I would like to submit my sincere thanks to Prof. Dr. R. Raja, Principal,
for his valuable support. It is with a deep sense of gratitude that I record my sincere thanks to
Asst. Prof. M. Kumaran, Head of the Department, Computer Science and Engineering, for his
valuable guidance and support throughout the course.
With immense pleasure I record my deep sense of indebtedness and gratitude to the
coordinators Asst. Prof. A. Fidal Castro and Asst. Prof. V. Vijayaraja, who were a source of
inspiration. I take this opportunity to thank my Supervisor, Asst. Prof. R. Prasanna Kumar, for
motivating me to study this field and for his illuminating guidance and continuous support in
the planning and execution of this thesis.
I also thank my parents, who aided me in completing the project. I owe acknowledgement to
one and all who directly or indirectly aided me in completing the project. Although it is
impossible to thank individually all the helpful faculty members and those connected with them,
I take this opportunity to express my gratitude to them.
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENT
LIST OF FIGURES
LIST OF SYMBOLS & ABBREVIATIONS
1. INTRODUCTION
1.1 System Overview
1.2 Objective of the project
1.3 Existing System
1.4 Proposed System
1.5 Literature Survey
1.6 Organization of the Report
2. SYSTEM REQUIREMENT SPECIFICATIONS
2.1 External Interface Requirements
2.1.1 Software Interface
2.2 System Features
2.3 Other Non-Functional Requirements
2.3.1 Software Quality Attributes
2.3.2 Performance Requirements
2.4 Other Requirements
3. SYSTEM DESIGN
3.1 System Architecture
3.1.1 Data Flow Diagram
3.1.2 Use Case Diagram
3.1.3 Activity Diagram
3.1.4 Sequence Diagram
3.1.5 Class Diagram
3.2 Decomposition Description
3.2.1 Module 1: Authentication
3.2.2 Module 2: Element Identification
3.2.3 Module 3: Unsupervised Duplicate Elimination
3.3 Component Design
3.3.1 Tool Description
3.3.1.1 Java
3.3.1.2 MySQL
3.4 Human Interface Design
4. IMPLEMENTATION
5. TESTING
5.1 Unit Testing
5.2 Integration Testing
5.3 Validation Testing
5.4 System Testing
5.5 White box Testing
5.6 Black box Testing
6. RESULTS
7. CONCLUSION AND FUTURE WORK
7.1 Conclusion
7.2 Future Enhancement
APPENDIX
REFERENCES
LIST OF FIGURES
S. No.   Fig. No.   Figure Name
1        3.1        System Architecture
2        3.1.1      Data Flow Diagram (DFD)
3        3.1.2      Use Case Diagram
4        3.1.3      Activity Diagram
5        3.1.4      Sequence Diagram
6        3.1.5      Class Diagram
7        3.2.1      Module 1: Authentication
8        3.2.2      Module 2: Element Identification
9        3.2.3      Module 3: Unsupervised Duplicate Elimination
10       6.1        Search Results After Duplicate Elimination
11       A1         Login Page
12       A2         Registration Page
13       A3         Main Page
14       A4         Search Results
15       A5         Search Results Before Dust Filtering
16       A6         Search Results After Dust Filtering
LIST OF SYMBOLS & ABBREVIATIONS
SYMBOLS
Entity
Condition
Process
Flow of Operation
Actor
ABBREVIATIONS
HTML - Hyper Text Markup Language
XML - Extensible Markup Language
UDE - Unsupervised Duplicate Elimination
WCSS - Weighted Component Similarity Summing
SVM - Support Vector Machine
OSVM - One-class Support Vector Machine
HTTP - Hyper Text Transfer Protocol
FTP - File Transfer Protocol
SMTP - Simple Mail Transfer Protocol
DFD - Data Flow Diagram
GUI - Graphical User Interface
JDBC - Java Database Connectivity
