Journal of Applied Sciences, 2013, Vol. 31, Issue (2): 212-220. doi: 10.3969/j.issn.0255-8297.2013.02.017

• Computer Science and Applications •

Duplicate Field Matching for Data Cleaning of Chinese Placenames

YE Ou1, ZHANG Jing1,2, LI Jun-huai1   

  1. School of Computer Science and Engineering, Xi’an University of Technology, Xi’an 710048, China
  2. State Key Laboratory for Manufacturing Systems Engineering, Xi’an Jiaotong University, Xi’an 710048, China
  • Received: 2012-03-15   Revised: 2012-07-18   Online: 2013-03-25   Published: 2012-07-18

Abstract: To improve the accuracy of field matching for Chinese placenames, this paper proposes an approximate duplicate detection and cleaning strategy built on a matrix approximate duplicate matching method. The frequencies of the Chinese characters or words shared by two placenames are first calculated with a matrix operation. Semantic similarity and structure similarity are then calculated from these frequencies, and the two similarities are combined to serve as the basis for duplicate detection and data cleaning. Simulation experiments demonstrate the feasibility and validity of the method and show that it outperforms other existing methods in terms of precision and recall.

Key words: data cleaning, field matching, matrix approximate duplicate matching, Chinese placename, semantic similarity, structure similarity
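
The matching idea outlined in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything below is an assumption made for illustration: the function names are hypothetical, a Dice-style character overlap stands in for the semantic similarity, the longest diagonal run of the match matrix (the longest shared substring) stands in for the structure similarity, and the equal weighting is arbitrary. It is not the authors' method.

```python
def match_matrix(a, b):
    """Binary matrix M with M[i][j] = 1 when character a[i] equals b[j]."""
    return [[1 if ca == cb else 0 for cb in b] for ca in a]


def shared_count(m):
    """Count shared characters, using each column (character of b) at most once."""
    used = set()
    count = 0
    for row in m:
        for j, hit in enumerate(row):
            if hit and j not in used:
                used.add(j)
                count += 1
                break
    return count


def longest_diagonal_run(m):
    """Length of the longest run of 1s along any diagonal (longest shared substring)."""
    best = 0
    prev = []
    for row in m:
        cur = [0] * len(row)
        for j, hit in enumerate(row):
            if hit:
                cur[j] = (prev[j - 1] if prev and j > 0 else 0) + 1
                best = max(best, cur[j])
        prev = cur
    return best


def combined_similarity(a, b, weight=0.5):
    """Weighted mix of the two stand-in scores; `weight` is an assumed parameter."""
    if not a or not b:
        return 0.0
    m = match_matrix(a, b)
    # Assumed stand-in for semantic similarity: Dice-style character overlap.
    overlap = 2 * shared_count(m) / (len(a) + len(b))
    # Assumed stand-in for structure similarity: shared-substring length ratio.
    structure = longest_diagonal_run(m) / max(len(a), len(b))
    return weight * overlap + (1 - weight) * structure


if __name__ == "__main__":
    print(combined_similarity("陕西省西安市", "西安市"))
```

For the pair "陕西省西安市" and "西安市", the match matrix yields three shared characters and a longest diagonal run of three. A pair whose combined score exceeds a chosen threshold would be flagged as a candidate duplicate; the threshold itself would need to be tuned on real placename data.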
