ABSTRACT
Record linkage is used to match records that refer to the same entity. Traditional classification methods are based on setting the threshold level manually, which is a cumbersome and time-consuming process. In this paper we propose a two-step approach that classifies the candidate record
INTRODUCTION
RECORD LINKAGE PROCESS: Record linkage supports data quality and data integrity, and saves time by bringing together the records that refer to the same entity. Steps involved:
Data Cleaning and Standardization
Indexing and Comparison
Data Classification
Evaluation Process
EXISTING SYSTEM
In the existing system, classification is based on setting the threshold
PROPOSED SYSTEM
A novel two-step approach is used for record pair classification, which can create the training data automatically using weight vectors.
Indexing Technique: Sorted Blocking Technique
Comparison Function: Edit Distance
Classification Algorithms: NN Classification Algorithm and Iterative SVM Algorithm
Classifiers Used: Neighborhood-based and SVM
Advantages:
Using weight vectors to create the training data automatically avoids the time-consuming manual process. The proposed classifiers do not classify attribute similarity pairwise.
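One possible way to create training data automatically from weight vectors is sketched below. It assumes that each weight vector holds per-attribute similarities in the range 0 to 1, that vectors with every component close to 1 can serve as match seeds, and that vectors with every component close to 0 can serve as non-match seeds. The function name `select_seeds`, the thresholds `high` and `low`, and the selection rule itself are illustrative assumptions; the slides do not state how the seeds are chosen.

```python
def select_seeds(weight_vectors, high=0.9, low=0.1):
    """Pick seed training examples from weight vectors (hypothetical rule).

    weight_vectors: iterable of tuples of similarities in [0, 1].
    Returns (match_seeds, non_match_seeds).
    """
    # Assumption: a pair that agrees on every attribute is very likely a match.
    match_seeds = [w for w in weight_vectors if all(x >= high for x in w)]
    # Assumption: a pair that disagrees on every attribute is very likely a non-match.
    non_match_seeds = [w for w in weight_vectors if all(x <= low for x in w)]
    return match_seeds, non_match_seeds
```

Vectors that fall in neither seed set remain unlabeled and are left for the classification step.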
ARCHITECTURE DESIGN
[Figure: architecture diagram. Dataset A and Dataset B -> Cleaning Process -> Indexing Process -> Comparison Process -> Classification -> Matches / Possible Matches / Non-Matches -> Evaluation]
Algorithm 1. Iterative SVM Classification Algorithm
Input: Weight vector sets W, WM, and WN; increment percentage ip; total training percentage tp
Output: Weight vectors classified as matches ZM; weight vectors classified as non-matches ZN
    TM := WM and TN := WN
    WU := W \ (WM ∪ WN)
    svm_0 := train_svm(TM, TN)
    i := 0
    while (|TM| + |TN|) < (|W| * tp / 100) do
        XM, XN := svm_classifier(svm_i, WU)
        Sort XM and XN according to distance from svm_i
        YM := |XM| * (ip / 100) vectors from XM
        YN := |XN| * (ip / 100) vectors from XN
        TM := TM ∪ YM
        TN := TN ∪ YN
        i := i + 1
        svm_i := train_svm(TM, TN)
        WU := WU \ (YM ∪ YN)
    end while
    XM, XN := svm_classifier(svm_i, WU)
    ZM := TM ∪ XM and ZN := TN ∪ XN
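The iterative loop of Algorithm 1 can be sketched in Python as follows. To keep the example runnable with only the standard library, a simple centroid-based linear scorer stands in for the SVM (`train_centroid` replaces the SVM training step), so this is not an SVM implementation. Several details are assumptions that the algorithm listing does not fix: weight vectors are tuples, the vectors added in each round are those furthest from the decision boundary, the increment is rounded up, and the loop also stops when no unlabeled vectors remain.

```python
import math

def train_centroid(tm, tn):
    """Stand-in for SVM training: a linear scorer built from class centroids."""
    dim = len(tm[0])
    cm = [sum(v[k] for v in tm) / len(tm) for k in range(dim)]
    cn = [sum(v[k] for v in tn) / len(tn) for k in range(dim)]
    normal = [a - b for a, b in zip(cm, cn)]
    mid = [(a + b) / 2 for a, b in zip(cm, cn)]
    offset = sum(n * m for n, m in zip(normal, mid))
    # A positive score puts a vector on the match side of the boundary.
    return lambda v: sum(n * x for n, x in zip(normal, v)) - offset

def classify(model, vectors):
    """Split vectors into (matches, non_matches), furthest from the boundary first."""
    xm = sorted((v for v in vectors if model(v) > 0), key=model, reverse=True)
    xn = sorted((v for v in vectors if model(v) <= 0), key=model)
    return xm, xn

def iterative_classification(w, wm, wn, ip, tp):
    """Grow the training sets TM and TN from the seeds WM and WN, then classify the rest."""
    tm, tn = list(wm), list(wn)
    seeds = set(wm) | set(wn)
    wu = [v for v in w if v not in seeds]          # unlabeled vectors
    model = train_centroid(tm, tn)
    # Extra guard (assumption): stop if nothing is left to label.
    while len(tm) + len(tn) < len(w) * tp / 100 and wu:
        xm, xn = classify(model, wu)
        ym = xm[:math.ceil(len(xm) * ip / 100)]    # most confident matches
        yn = xn[:math.ceil(len(xn) * ip / 100)]    # most confident non-matches
        tm += ym
        tn += yn
        model = train_centroid(tm, tn)             # retrain on the enlarged sets
        wu = xm[len(ym):] + xn[len(yn):]           # remove the newly labeled vectors
    xm, xn = classify(model, wu)
    return tm + xm, tn + xn
```

Swapping in a real SVM would only require replacing `train_centroid` with a function that returns a signed distance to the learned decision boundary.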
DATA CLEANING STEP
Reads the two input data sets.
Removes the noisy data.
Converts the data to a standard form.
Displays each attribute of the data set.
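As a minimal sketch of the cleaning and standardization operations above, the following Python functions lowercase a value, replace punctuation with spaces, and collapse whitespace. The names `standardize` and `clean_record` and the specific rules are assumptions for illustration; the slides do not say which cleaning rules the system applies.

```python
import re

def standardize(value: str) -> str:
    """Return a cleaned, standardized form of a raw attribute value."""
    value = value.lower()                        # case-fold
    value = re.sub(r"[^a-z0-9\s]", " ", value)   # replace punctuation and noise with spaces
    value = re.sub(r"\s+", " ", value)           # collapse runs of whitespace
    return value.strip()

def clean_record(record: dict) -> dict:
    """Standardize every attribute of a record (attribute name -> raw string)."""
    return {name: standardize(raw) for name, raw in record.items()}
```

For example, `standardize("  O'Brien,  JOHN ")` yields `"o brien john"`.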
INDEXING PROCESS
Creates candidate record pairs with similar attributes in both records.
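Candidate generation of this kind can be sketched with a sorted-neighbourhood style index: records from both data sets are sorted by a key, and only records that fall within the same sliding window are paired. This assumes that the sorted blocking technique named in the slides works along these lines; the function name, the `key` callable, and the `window` parameter are hypothetical.

```python
def sorted_neighbourhood_pairs(records_a, records_b, key, window=3):
    """Return candidate (id_a, id_b) pairs whose sorted keys share a window.

    records_a and records_b map a record id to a record dict;
    key builds the sorting key from a record.
    """
    tagged = [(key(rec), "A", rid) for rid, rec in records_a.items()]
    tagged += [(key(rec), "B", rid) for rid, rec in records_b.items()]
    tagged.sort()                                   # order all records by key

    pairs = set()
    for start in range(len(tagged)):
        window_items = tagged[start:start + window]
        _, first_src, first_id = window_items[0]
        for _, src, rid in window_items[1:]:
            if src != first_src:                    # only pair records across data sets
                pairs.add((first_id, rid) if first_src == "A" else (rid, first_id))
    return pairs
```

A larger window produces more candidate pairs and fewer missed matches, at the cost of more comparisons.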
COMPARISON STEP
Compares two strings using the edit distance comparison function. Finds similarities and dissimilarities and assigns a cost. Comparison is based on the Soundex encoding method.
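The edit distance comparison can be illustrated with the standard Levenshtein dynamic program, where each insertion, deletion, or substitution costs 1. The unit costs and the normalization into a similarity in `edit_similarity` are assumptions; the slides do not give the cost scheme the system uses.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: minimum number of single-character edits."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def edit_similarity(s: str, t: str) -> float:
    """Scale the distance into [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s, t) / longest
```

Computing `edit_similarity` for each compared attribute of a candidate pair gives one component of that pair's weight vector.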
CLASSIFICATION STEP
EVALUATION STEP
[Figures: F-score for 2500 records; F-score for 5000 records]
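The F-score used in the evaluation is the harmonic mean of precision and recall. A minimal sketch follows, assuming that the predicted and true matches are available as sets of record pairs; the function name and the input representation are illustrative.

```python
def f_score(predicted_matches: set, true_matches: set) -> float:
    """F-score (harmonic mean of precision and recall) over sets of record pairs."""
    true_positives = len(predicted_matches & true_matches)
    if true_positives == 0:
        return 0.0                      # avoids division by zero
    precision = true_positives / len(predicted_matches)
    recall = true_positives / len(true_matches)
    return 2 * precision * recall / (precision + recall)
```

With 2 correct pairs out of 3 predicted and 4 true matches, precision is 2/3, recall is 1/2, and the F-score is 4/7.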