
CLASSIFIER BASED DUPLICATE RECORD ELIMINATION

FOR QUERY RESULTS FROM WEB DATABASES


By

G.KALPANA
(Reg.No.106094050089)
of
JAYA ENGINEERING COLLEGE

A PROJECT REPORT
Submitted to the
FACULTY OF INFORMATION AND COMMUNICATION ENGINEERING
In partial fulfillment of the requirements
for the award of the degree
of

MASTER OF ENGINEERING
IN
COMPUTER SCIENCE AND ENGINEERING

ANNA UNIVERSITY
CHENNAI 600025

December 2010


ANNA UNIVERSITY: CHENNAI 600 025


BONAFIDE CERTIFICATE

Certified that this project titled CLASSIFIER BASED DUPLICATE RECORD ELIMINATION FOR QUERY
RESULTS FROM WEB DATABASES is the bonafide work of Ms. G.KALPANA (10609405005), who carried
out the research under my
supervision. Certified further, that to the best of my knowledge the work reported herein does not
form part of any other project report or dissertation on the basis of which a degree or award was
conferred on an earlier occasion on this or any other candidate.

M. KUMARAN
Head of the Department,
Department of CSE,
Jaya Engineering College,
Thiruninravur-602024.

R. PRASANNA KUMAR
Supervisor & Asst. Professor,
Department of CSE,
Jaya Engineering College,
Thiruninravur-602024.


VIVA - VOCE EXAMINATION


The Viva - Voce examination of the project work CLASSIFIER BASED DUPLICATE
RECORD ELIMINATION FOR QUERY RESULTS FROM WEB DATABASES
submitted by G.KALPANA (10609405005) was held on .

INTERNAL EXAMINER

EXTERNAL EXAMINER


ABSTRACT
Record matching is an essential step in duplicate detection, as it identifies records that represent
the same real-world entity. Supervised record matching methods require users to provide training
data and therefore cannot be applied to web databases, where query results are generated on the
fly. To overcome this problem, a new record matching method named Unsupervised Duplicate
Elimination (UDE) is proposed for identifying and eliminating duplicates among records in
dynamic query results. The central idea is to adjust the weights of record fields when
calculating similarities among records. Three classifiers, namely a weighted component similarity
summing (WCSS) classifier, a support vector machine (SVM) classifier and a one-class support
vector machine (OSVM) classifier, are employed iteratively within UDE. The first classifier uses
the field weights to match records from different data sources. With the matched records as the
positive set and the non-duplicate records as the negative set, the second classifier identifies new
duplicates. The one-class support vector machine classifier is then employed to detect further
duplicates. The iteration stops when no new duplicates can be identified. Thus, this work takes
advantage of the dissimilarity among records from web databases to solve the online duplicate
detection problem.
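
The weighted similarity step described in the abstract can be sketched in Java, the implementation
language named in this report. This is a minimal illustration under stated assumptions, not the
project's actual code: the class name WcssSketch, the token-level Jaccard measure used per field,
the example weights and the 0.75 threshold are all hypothetical choices made for the example, and
the SVM and one-class SVM stages are not shown.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch of a weighted component similarity summing (WCSS) decision. */
final class WcssSketch {

    /** Token-level Jaccard similarity of two field values (an assumed field measure). */
    static double fieldSimilarity(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 1.0; // two empty fields are treated as identical
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) intersection.size() / union.size();
    }

    /** Lower-cases a value and splits it on whitespace into a set of tokens. */
    private static Set<String> tokens(String value) {
        Set<String> result = new HashSet<>();
        for (String t : value.toLowerCase().trim().split("\\s+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    /** Weighted sum of per-field similarities; the weights are assumed to sum to 1. */
    static double recordSimilarity(String[] r1, String[] r2, double[] weights) {
        if (r1.length != r2.length || r1.length != weights.length) {
            throw new IllegalArgumentException("records and weights must have equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * fieldSimilarity(r1[i], r2[i]);
        }
        return sum;
    }

    /** Declares a pair a duplicate when the weighted similarity reaches the threshold. */
    static boolean isDuplicate(String[] r1, String[] r2, double[] weights, double threshold) {
        return recordSimilarity(r1, r2, weights) >= threshold;
    }

    public static void main(String[] args) {
        // Hypothetical records with fields: title, author, year.
        String[] a = {"Data Mining Concepts", "Jiawei Han", "2006"};
        String[] b = {"data mining concepts", "Han Jiawei", "2006"};
        String[] c = {"Operating System Concepts", "Abraham Silberschatz", "2008"};
        double[] weights = {0.5, 0.3, 0.2}; // assumed field weights
        double threshold = 0.75;            // assumed decision threshold

        System.out.println("a vs b duplicate: " + isDuplicate(a, b, weights, threshold));
        System.out.println("a vs c duplicate: " + isDuplicate(a, c, weights, threshold));
    }
}
```

The weights are fixed here only to keep the sketch short. In the method summarised above they are
adjusted during the iteration, so a fuller version would recompute them from the current sets of
matched and non-duplicate records before each pass.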

ACKNOWLEDGEMENT
At the outset, I would like to express my sincere thanks to Prof. Dr. R. Raja, Principal,
for his valuable support. It is with a deep sense of gratitude that I record my sincere thanks to
Asst. Prof. M. Kumaran, Head of the Department, Computer Science and Engineering, for his
valuable guidance and support throughout the course.
With immense pleasure I record my deep sense of indebtedness and gratitude to the
coordinators, Asst. Prof. A. Fidal Castro and Asst. Prof. V. Vijayaraja, who were a source of
inspiration. I take this opportunity to thank my Supervisor, Asst. Prof. R. Prasanna Kumar, for
motivating me to study this field and for his illuminating guidance and continuous support in
the planning and execution of this thesis.
I also thank my parents, who aided me in completing the project. I owe my acknowledgements
to one and all who directly or indirectly aided me in completing the project. Although it is
impossible to thank individually all the helpful faculty members and others connected with the
project, I take this opportunity to express my gratitude to them.


TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENT
LIST OF FIGURES
LIST OF SYMBOLS & ABBREVIATIONS

1. INTRODUCTION
1.1 System Overview
1.2 Objective of the project
1.3 Existing System
1.4 Proposed System
1.5 Literature Survey
1.6 Organization of the Report

2. SYSTEM REQUIREMENT SPECIFICATIONS
2.1 External Interface Requirements
2.1.1 Software Interface
2.2 System Features
2.3 Other Non-Functional Requirements
2.3.1 Software Quality Attributes
2.3.2 Performance Requirements
2.4 Other Requirements

3. SYSTEM DESIGN
3.1 System Architecture
3.1.1 Data Flow Diagram
3.1.2 Use Case Diagram
3.1.3 Activity Diagram
3.1.4 Sequence Diagram
3.1.5 Class Diagram
3.2 Decomposition Description
3.2.1 Module 1: Authentication
3.2.2 Module 2: Element Identification
3.2.3 Module 3: Unsupervised Duplicate Elimination
3.3 Component Design
3.3.1 Tool Description
3.3.1.1 Java
3.3.1.2 MySQL
3.4 Human Interface Design

4. IMPLEMENTATION

5. TESTING
5.1 Unit Testing
5.2 Integration Testing
5.3 Validation Testing
5.4 System Testing
5.5 White box Testing
5.6 Black box Testing

6. RESULTS

7. CONCLUSION AND FUTURE WORK
7.1 Conclusion
7.2 Future Enhancement

APPENDIX
REFERENCES

LIST OF FIGURES

S. No.   Fig. No.   Figure Name
1        3.1        System Architecture
2        3.1.1      Data Flow Diagram (DFD)
3        3.1.2      Use Case Diagram
4        3.1.3      Activity Diagram
5        3.1.4      Sequence Diagram
6        3.1.5      Class Diagram
7        3.2.1      Module 1: Authentication
8        3.2.2      Module 2: Element Identification
9        3.2.3      Module 3: Unsupervised Duplicate Elimination
10       6.1        Search Results After Duplicate Elimination
11       A1         Login Page
12       A2         Registration Page
13       A3         Main Page
14       A4         Search Results
15       A5         Search Results Before Dust Filtering
16       A6         Search Results After Dust Filtering

LIST OF SYMBOLS & ABBREVIATIONS

SYMBOLS

Entity
Condition
Process
Flow of Operation
Actor

ABBREVIATIONS

HTML - Hyper Text Markup Language
XML - Extensible Markup Language
UDE - Unsupervised Duplicate Elimination
WCSS - Weighted Component Similarity Summing
SVM - Support Vector Machine
OSVM - One-Class Support Vector Machine
HTTP - Hyper Text Transfer Protocol
FTP - File Transfer Protocol
SMTP - Simple Mail Transfer Protocol
DFD - Data Flow Diagram
GUI - Graphical User Interface
JDBC - Java Database Connectivity
