A Novel Two Approach For Record Pair Classification

A NOVEL TWO STEP APPROACH FOR RECORD PAIR CLASSIFICATION IN RECORD LINKAGE PROCESS
PRESENTED BY, K.GOMATHI M.E(CSE)/ II YEAR
ARCHANA INSTITUTE OF TECHNOLOGY
ABSTRACT
Record linkage matches records that refer to the same entity. Traditional classification methods rely on setting a threshold manually, which is a cumbersome and time-consuming process. This paper proposes a two-step approach that classifies candidate record pairs automatically. In the first step, a training set is selected automatically from the compared candidate record pairs using a weight vector classifier. In the second step, a Support Vector classifier is used to improve on the performance obtained with the training set from the first step.
INTRODUCTION
RECORD LINKAGE PROCESS: Record linkage supports data quality and data integrity by identifying records that refer to the same entity, which reduces the time spent reconciling data.
Steps Involved:
Data Cleaning and Standardization
Indexing and Comparison
Data Classification
Evaluation Process
EXISTING SYSTEM
The existing system bases classification on a threshold that is set manually or by a static procedure. It uses k-means clustering for record pair classification.
Drawbacks:
It requires training data, which is not available for most real datasets.
Preparing training data manually takes a long time.
It classifies on pairwise attribute similarities, which is time consuming and error prone.
PROPOSED SYSTEM
A novel two-step approach is used for record pair classification; it creates the training data automatically using weight vectors.
Indexing Technique: Sorted Blocking Technique
Comparison Function: Edit Distance
Algorithms: NN Classification Algorithm and Iterative SVM Algorithm
Classifiers Used: Neighborhood based and SVM
Advantages:
Using weight vectors to create the training data automatically avoids time-consuming manual preparation.
The proposed classifiers do not classify on individual pairwise attribute similarities.
This approach outperforms unsupervised approaches.
ARCHITECTURE DESIGN
[Architecture diagram: Dataset A and Dataset B -> Cleaning Process -> Indexing Process -> Comparison Process -> Classification (Matches, Possible Matches, Non-Matches) -> Evaluation]
Algorithm 1. Iterative SVM Classification Algorithm
Input: Weight vector sets W, WM and WN; increment percentage ip; total training percentage tp
Output: Weight vectors classified as matches ZM; weight vectors classified as non-matches ZN

TM := WM and TN := WN
WU := W \ (WM ∪ WN)
svm0 := train_svm(TM, TN)
i := 0
while (|TM| + |TN|) < (|W| * tp / 100) do
    XM, XN := svm_classify(svmi, WU)
    Sort XM and XN according to distance from svmi
    YM := |XM| * (ip / 100) vectors from XM
    YN := |XN| * (ip / 100) vectors from XN
    TM := TM ∪ YM
    TN := TN ∪ YN
    i := i + 1
    svmi := train_svm(TM, TN)
    WU := WU \ (YM ∪ YN)
end while
XM, XN := svm_classify(svmi, WU)
ZM := TM ∪ XM and ZN := TN ∪ XN
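The loop in Algorithm 1 can be sketched in Python as below. This is a minimal illustration, not the presenter's implementation. To stay dependency-free, the SVM training step is replaced by a nearest-centroid scorer (`train_centroid`, a hypothetical stand-in), and any function with the same shape could be passed as `train`. Three further assumptions: weight vectors are referred to by index, the vectors added in each round are those scored furthest from the decision boundary, and at least one vector is added per non-empty class so the loop always terminates.

```python
from typing import Callable, List, Sequence, Set, Tuple

Vector = Tuple[float, ...]
Model = Callable[[Vector], float]  # signed score: > 0 match, <= 0 non-match


def train_centroid(matches: Sequence[Vector], non_matches: Sequence[Vector]) -> Model:
    """Stand-in for the SVM training step: a nearest-centroid scorer, NOT an SVM."""
    dim = len(matches[0])
    cm = [sum(v[d] for v in matches) / len(matches) for d in range(dim)]
    cn = [sum(v[d] for v in non_matches) / len(non_matches) for d in range(dim)]

    def score(v: Vector) -> float:
        dist_m = sum((a - b) ** 2 for a, b in zip(v, cm))
        dist_n = sum((a - b) ** 2 for a, b in zip(v, cn))
        return dist_n - dist_m  # positive when v is closer to the match centroid

    return score


def iterative_classify(
    w: List[Vector],
    seed_m: Set[int],
    seed_n: Set[int],
    ip: float,
    tp: float,
    train: Callable[[Sequence[Vector], Sequence[Vector]], Model] = train_centroid,
) -> Tuple[Set[int], Set[int]]:
    """Return (match indices, non-match indices) into w. Seeds must be non-empty."""
    tm, tn = set(seed_m), set(seed_n)
    wu = set(range(len(w))) - tm - tn  # unlabelled weight vectors
    model = train([w[i] for i in tm], [w[i] for i in tn])

    while len(tm) + len(tn) < len(w) * tp / 100 and wu:
        # Classify the unlabelled vectors and sort each class, most confident first.
        xm = sorted((i for i in wu if model(w[i]) > 0), key=lambda i: -model(w[i]))
        xn = sorted((i for i in wu if model(w[i]) <= 0), key=lambda i: model(w[i]))
        # Take ip percent of each class (at least one, an assumption of this sketch).
        ym = xm[: max(1, int(len(xm) * ip / 100))]
        yn = xn[: max(1, int(len(xn) * ip / 100))]
        tm.update(ym)
        tn.update(yn)
        wu -= set(ym) | set(yn)
        model = train([w[i] for i in tm], [w[i] for i in tn])

    # Final pass: classify whatever is still unlabelled.
    zm = tm | {i for i in wu if model(w[i]) > 0}
    zn = tn | {i for i in wu if model(w[i]) <= 0}
    return zm, zn
```

With a real SVM library, `train` would wrap that library's fit and decision-function calls; the loop itself would stay the same.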
CLEANING AND STANDARDISATION

Removes noisy data from the input.
The Data tab shows the type of dataset.
Two input data sets can be read.
CLEANING AND STANDARDISATION

The Explorer tab displays the records of the two datasets.
Each field is set to a standard form.
Each attribute of both records is displayed in standard form.
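A cleaning and standardisation step of this kind might look like the sketch below. The slides do not list the tool's actual rules, so the ones here are assumptions for illustration: lower-casing, replacing punctuation with spaces, collapsing whitespace, and treating missing values as empty strings. The pattern keeps only ASCII letters and digits, which assumes ASCII input.

```python
import re


def standardise(value: str) -> str:
    """Lower-case, replace punctuation with spaces, and collapse whitespace."""
    value = value.lower()
    value = re.sub(r"[^a-z0-9\s]", " ", value)  # assumes ASCII data
    return " ".join(value.split())


def clean_record(record: dict) -> dict:
    """Standardise every field of a record; missing values become empty strings."""
    return {
        field: "" if value is None else standardise(str(value))
        for field, value in record.items()
    }
```

Applying `clean_record` to every record of both datasets before indexing puts the same attribute in the same form on both sides.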
INDEXING PROCESS
Creates candidate record pairs using the sorted block indexing technique.
Creates blocks that hold the records from both datasets that have similar attribute values.
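The blocking idea can be sketched as follows. The slides do not say which blocking key is used, so `blocking_key` (first three letters of a `surname` field) is hypothetical. Records from each dataset are grouped by key, the keys are visited in sorted order, and only records sharing a key become candidate pairs.

```python
from collections import defaultdict
from itertools import product
from typing import Callable, Dict, List, Tuple


def blocking_key(record: dict) -> str:
    """Hypothetical blocking key: first three letters of the surname."""
    return record["surname"][:3].lower()


def candidate_pairs(
    dataset_a: List[dict],
    dataset_b: List[dict],
    key: Callable[[dict], str] = blocking_key,
) -> List[Tuple[int, int]]:
    """Return (index in A, index in B) pairs for records that share a block."""
    blocks_a: Dict[str, List[int]] = defaultdict(list)
    blocks_b: Dict[str, List[int]] = defaultdict(list)
    for idx, rec in enumerate(dataset_a):
        blocks_a[key(rec)].append(idx)
    for idx, rec in enumerate(dataset_b):
        blocks_b[key(rec)].append(idx)

    pairs: List[Tuple[int, int]] = []
    # Visit the blocks present in both datasets in sorted key order.
    for k in sorted(blocks_a.keys() & blocks_b.keys()):
        pairs.extend(product(blocks_a[k], blocks_b[k]))
    return pairs
```

Only pairs inside a shared block go on to the comparison step, which is what keeps the number of comparisons well below the full cross product.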
COMPARISON STEP
Compares two strings using the edit distance comparison function.
Finds similarities and dissimilarities and assigns a cost.
Comparison is based on the Soundex encoding method.
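As an illustration of the edit distance comparison, the sketch below computes the Levenshtein distance with unit costs and turns it into a similarity between 0 and 1. Dividing by the longer string's length is a common normalisation but is an assumption here; the slides do not state the cost scheme used.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(s: str, t: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(s, t) / longest
```

One similarity value per compared attribute gives the weight vector for a candidate record pair.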
CLASSIFICATION STEP
Classifies the record pairs based on weight vectors.
Weight vector values range from 0 to 1 and correspond to possible matches.
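One way to pick training examples automatically from the weight vectors is sketched below. The selection rule is an assumption for illustration, not the rule from the slides: the vectors nearest to the all-ones vector are taken as match seeds, and the vectors nearest to the all-zeros vector as non-match seeds. The names `select_seeds`, `n_match` and `n_non_match` are hypothetical.

```python
from typing import List, Sequence, Tuple


def select_seeds(
    weight_vectors: Sequence[Sequence[float]],
    n_match: int,
    n_non_match: int,
) -> Tuple[List[int], List[int]]:
    """Return indices of likely matches and likely non-matches to use as seeds."""

    def dist_to(vec: Sequence[float], target: float) -> float:
        # Squared Euclidean distance to the vector whose components all equal target.
        return sum((x - target) ** 2 for x in vec)

    indices = range(len(weight_vectors))
    by_match = sorted(indices, key=lambda i: dist_to(weight_vectors[i], 1.0))
    by_non_match = sorted(indices, key=lambda i: dist_to(weight_vectors[i], 0.0))

    seed_m = by_match[:n_match]
    # Never reuse a match seed as a non-match seed.
    seed_n = [i for i in by_non_match if i not in seed_m][:n_non_match]
    return seed_m, seed_n
```

The two index lists can then serve as the initial training sets for the second-step classifier.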
EVALUATION STEP
[Figures: F-score for 2500 records; F-score for 5000 records]
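For reference, the F-score is the harmonic mean of precision and recall. The sketch below computes it from a set of predicted match pairs and a set of true match pairs; representing pairs as tuples of record identifiers is an assumption of this example.

```python
from typing import Hashable, Set


def f_score(predicted: Set[Hashable], actual: Set[Hashable]) -> float:
    """F-score = 2 * precision * recall / (precision + recall)."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0  # also covers empty inputs and avoids division by zero
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)
```

In the case of 3 predicted pairs and 4 true pairs with 2 in common, precision is 2/3, recall is 1/2, and the F-score is 4/7.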
CONCLUSION AND FUTURE WORK

Presents a two-step approach for record pair classification.
Allows automatic generation of training data.
Future work involves checking the scalability of the current classifiers by conducting more experiments with different data sets.
