
A NOVEL TWO STEP APPROACH FOR RECORD PAIR CLASSIFICATION IN RECORD LINKAGE PROCESS

PRESENTED BY, K.GOMATHI M.E(CSE)/ II YEAR

ARCHANA INSTITUTE OF TECHNOLOGY

ABSTRACT
Record linkage is used to match records that refer to the same entity. Traditional classification methods are based on setting the threshold level manually, which is a cumbersome and time-consuming process. In this paper we propose a two-step approach that classifies the candidate record pairs automatically. In the first step, a training set is automatically selected from the compared candidate record pairs using a weight vector classifier. In the second step, a Support Vector Classifier is used to improve on the performance of the training set selected in the first step.

INTRODUCTION
RECORD LINKAGE PROCESS: Record linkage supports data quality and data integrity, and saves time by bringing together the data that refer to the same entity.
Steps Involved:
Data Cleaning and Standardization
Indexing and Comparison
Data Classification
Evaluation Process

EXISTING SYSTEM
In the existing system, classification is based on setting the threshold level manually or statically with some fixed procedure. It uses k-means clustering for record pair classification.
Drawbacks:
It requires training data, which is not available for most real datasets.
Preparing training data manually takes considerable time.
It classifies based on pairwise attribute similarities, which leads to a time-consuming and error-prone process.

PROPOSED SYSTEM
A novel two-step approach is used for record pair classification, which can create the training data automatically using weight vectors.
Indexing Technique: Sorted Blocking Technique
Comparison Function: Edit Distance Comparison
Algorithms: NN Classification Algorithm and Iterative SVM Algorithm
Classifier Used: Neighborhood based and SVM

Advantages:
Using weight vectors to create the training data avoids the time-consuming manual preparation.
The proposed classifiers do not classify by pairwise attribute similarity.
This approach outperforms unsupervised approaches.
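The automatic creation of training data from weight vectors can be illustrated with a small sketch. It assumes that each weight vector holds per-attribute similarities between 0 and 1, and it picks seeds with two cut-offs, `match_min` and `non_match_max`. Those two parameters and their default values are hypothetical, chosen only for illustration, and are not taken from the slides.

```python
def select_seeds(weight_vectors, match_min=0.9, non_match_max=0.1):
    """Split weight vectors into match seeds, non-match seeds, and the rest.

    match_min and non_match_max are illustrative cut-offs (assumptions).
    """
    match_seeds, non_match_seeds, unlabelled = [], [], []
    for wv in weight_vectors:
        if all(w >= match_min for w in wv):
            match_seeds.append(wv)        # every attribute agrees closely
        elif all(w <= non_match_max for w in wv):
            non_match_seeds.append(wv)    # every attribute disagrees
        else:
            unlabelled.append(wv)         # left for the second step
    return match_seeds, non_match_seeds, unlabelled


vectors = [(1.0, 0.95), (0.0, 0.05), (0.6, 0.4)]
wm, wn, wu = select_seeds(vectors)
```

Only the vectors that are clearly similar or clearly dissimilar on every attribute become training examples; the ambiguous ones are left for the classifier.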

ARCHITECTURE DESIGN
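[Architecture diagram: Dataset A and Dataset B pass through the Cleaning Process, Indexing Process, Comparison Process and Classification stages; Classification yields Matched, Possible Match and Non Matches, followed by Evaluation.]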

Algorithm 1. Iterative SVM Classification Algorithm
Input: Weight vector sets W, WM and WN; increment percentage ip; total training percentage tp
Output: Weight vectors classified as matches ZM; weight vectors classified as non-matches ZN
TM := WM and TN := WN
WU := W \ (WM ∪ WN)
svm_0 := train_svm(TM, TN)
i := 0
while (|TM| + |TN|) < (|W| * tp/100) do
    XM, XN := svm_classifier(svm_i, WU)
    Sort XM and XN according to distance from svm_i
    YM := |XM| * (ip/100) vectors from XM
    YN := |XN| * (ip/100) vectors from XN
    TM := TM ∪ YM
    TN := TN ∪ YN
    i := i + 1
    svm_i := train_svm(TM, TN)
    WU := WU \ (YM ∪ YN)
end while
XM, XN := svm_classifier(svm_i, WU)
ZM := TM ∪ XM and ZN := TN ∪ XN
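A rough Python rendering of Algorithm 1 may help in following the loop. This is a sketch under several assumptions, not the presenter's implementation: the SVM is a small linear one trained by batch sub-gradient descent (the slides do not say which SVM is used), the vectors taken in each round are those furthest from the decision boundary, and the loop also stops when no unclassified vectors remain. The helper names and the `epochs`, `lr` and `reg` parameters are hypothetical.

```python
import math


def train_svm(matches, non_matches, epochs=200, lr=0.1, reg=0.01):
    """Fit a linear SVM by batch sub-gradient descent on the hinge loss."""
    data = [(x, 1.0) for x in matches] + [(x, -1.0) for x in non_matches]
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [reg * wi for wi in w], 0.0
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score < 1:  # inside the margin: contributes to the gradient
                grad_w = [g - y * xi / len(data) for g, xi in zip(grad_w, x)]
                grad_b -= y / len(data)
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def decision(model, x):
    """Signed score of x; positive means the match side of the boundary."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def svm_classifier(model, vectors):
    """Split vectors into (matches, non-matches) by the sign of the score."""
    xm = [x for x in vectors if decision(model, x) >= 0]
    xn = [x for x in vectors if decision(model, x) < 0]
    return xm, xn


def iterative_svm(W, WM, WN, ip, tp):
    TM, TN = list(WM), list(WN)
    WU = [x for x in W if x not in WM and x not in WN]
    model = train_svm(TM, TN)
    # "and WU" is an added guard so the loop ends once nothing is left.
    while len(TM) + len(TN) < len(W) * tp / 100 and WU:
        XM, XN = svm_classifier(model, WU)
        # Assumption: most confident first, i.e. furthest from the boundary.
        XM.sort(key=lambda x: -decision(model, x))
        XN.sort(key=lambda x: decision(model, x))
        YM = XM[:math.ceil(len(XM) * ip / 100)]
        YN = XN[:math.ceil(len(XN) * ip / 100)]
        TM, TN = TM + YM, TN + YN
        model = train_svm(TM, TN)
        WU = [x for x in WU if x not in YM and x not in YN]
    XM, XN = svm_classifier(model, WU)
    return TM + XM, TN + XN


W = [(1.0, 1.0), (0.9, 1.0), (0.0, 0.0), (0.1, 0.0),
     (0.8, 0.9), (0.2, 0.1), (0.7, 0.8), (0.1, 0.3)]
ZM, ZN = iterative_svm(W, WM=W[:2], WN=W[2:4], ip=50, tp=75)
```

In practice a library SVM would normally replace `train_svm`; the hand-written trainer is used here only to keep the sketch free of dependencies.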

CLEANING AND STANDARDISATION


Purpose: to remove noisy data.
The Data tab shows the type of dataset.
It allows two input data sets to be read.
It removes the noisy data.

CLEANING AND STANDARDISATION


The Explorer tab displays the records of the two datasets.
Each field is set in a standard form.
Each attribute of both records is displayed in standard form.
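Setting each field in a standard form can be sketched as follows. The specific rules used here (lower-casing, replacing punctuation with spaces, collapsing whitespace) are assumptions made for illustration; the slides do not list the actual standardisation rules, and the field names are hypothetical.

```python
import re


def standardise(value: str) -> str:
    """Lower-case, drop punctuation, and collapse whitespace in one field."""
    value = value.lower()
    value = re.sub(r"[^a-z0-9\s]", " ", value)  # punctuation becomes a space
    return " ".join(value.split())              # collapse repeated whitespace


def standardise_record(record: dict) -> dict:
    """Apply the same cleaning rule to every field of a record."""
    return {field: standardise(str(val)) for field, val in record.items()}


raw = {"name": "  O'Brien,  JOHN ", "city": "New-York"}
clean = standardise_record(raw)
```

Applying identical rules to both datasets is what makes the later string comparison meaningful.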

INDEXING PROCESS
Creates candidate record pairs using the sorted block indexing technique.
Creates blocks that store the records having similar attribute values in both datasets.
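A minimal sketch of block-based candidate generation is given below. It assumes a blocking key made of the first three letters of a `surname` field and walks the blocks in sorted key order; both the key and the field names are hypothetical, and the exact sorted blocking variant used in the presentation may differ.

```python
from collections import defaultdict
from itertools import product


def blocking_key(record: dict) -> str:
    # Hypothetical key: first three letters of the surname field.
    return record["surname"][:3].lower()


def candidate_pairs(dataset_a: list, dataset_b: list) -> list:
    """Return (id_a, id_b) pairs for records that share a blocking key."""
    blocks = defaultdict(lambda: ([], []))
    for rec in dataset_a:
        blocks[blocking_key(rec)][0].append(rec["id"])
    for rec in dataset_b:
        blocks[blocking_key(rec)][1].append(rec["id"])
    pairs = []
    for key in sorted(blocks):               # walk the blocks in sorted key order
        ids_a, ids_b = blocks[key]
        pairs.extend(product(ids_a, ids_b))  # compare only within a block
    return pairs


a = [{"id": "a1", "surname": "Smith"}, {"id": "a2", "surname": "Jones"}]
b = [{"id": "b1", "surname": "Smithe"}, {"id": "b2", "surname": "Brown"}]
pairs = candidate_pairs(a, b)
```

Only records that fall into the same block are paired, which avoids comparing every record of one dataset with every record of the other.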

COMPARISON STEP
Compares two strings using the edit distance comparison function.
Finds the similarities and dissimilarities between them and assigns a cost.
Comparison is based on the Soundex encoding method.
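The edit distance comparison can be sketched with the standard Levenshtein dynamic programme, using a unit cost for each insertion, deletion and substitution. Scaling the distance by the longer string's length to obtain a similarity between 0 and 1 is an assumption made here so that the result can serve as a weight vector entry. The Soundex encoding is not covered by this sketch.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: unit cost per insert, delete, or substitute."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(s: str, t: str) -> float:
    """Scale the distance into [0, 1]; 1.0 means identical strings."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s, t) / longest


def weight_vector(rec_a: dict, rec_b: dict, fields: list) -> list:
    """One similarity value per compared attribute."""
    return [similarity(rec_a[f], rec_b[f]) for f in fields]


wv = weight_vector({"name": "smith", "city": "perth"},
                   {"name": "smyth", "city": "perth"},
                   ["name", "city"])
```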

CLASSIFICATION STEP

Classifies the record pairs based on their weight vectors.
The values in the weight vectors range from 1 to 0 and correspond to possible matches.
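Classification from weight vectors with a neighbourhood-based classifier might look like the following sketch. It assumes a plain 1-nearest-neighbour rule with Euclidean distance over already selected seed vectors; the NN classification algorithm used in the presentation may work differently, and the function and argument names are hypothetical.

```python
import math


def nearest_neighbour_classify(match_seeds, non_match_seeds, unlabelled):
    """Label each unlabelled weight vector by its closest seed vector."""
    matches, non_matches = [], []
    for wv in unlabelled:
        d_match = min(math.dist(wv, s) for s in match_seeds)
        d_non = min(math.dist(wv, s) for s in non_match_seeds)
        # Ties are resolved in favour of the match class (an arbitrary choice).
        (matches if d_match <= d_non else non_matches).append(wv)
    return matches, non_matches


m, n = nearest_neighbour_classify(
    match_seeds=[(1.0, 1.0)],
    non_match_seeds=[(0.0, 0.0)],
    unlabelled=[(0.8, 0.7), (0.2, 0.3)],
)
```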

EVALUATION STEP
[Figure: F-score for 2500 records]
[Figure: F-score for 5000 records]
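The F-score is conventionally the harmonic mean of precision and recall. The sketch below computes it from a set of predicted matching pairs and a set of true matching pairs. The record identifiers are made up for illustration and are not results from the presentation.

```python
def f_score(predicted_matches: set, true_matches: set) -> float:
    """Harmonic mean of precision and recall over matched record pairs."""
    true_positives = len(predicted_matches & true_matches)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_matches)
    recall = true_positives / len(true_matches)
    return 2 * precision * recall / (precision + recall)


predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
truth = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}
score = f_score(predicted, truth)
```

Here two of the three predicted pairs are correct and two of the three true pairs are found, so precision and recall are both two thirds.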

CONCLUSION AND FUTURE WORK


This work presents a two-step approach for record pair classification.
It allows automatic generation of training data.
Future work involves checking the scalability of the current classifiers by conducting more experiments with different data sets.

