Journal of Applied Sciences (应用科学学报) ›› 2005, Vol. 23 ›› Issue (3): 292-296.

• Articles •

An Online Data Cleaning Method

HAN Jing-yu (韩京宇), HU Kong-fa (胡孔法), XU Li-zhen (徐立臻), DONG Yi-sheng (董逸生)

  1. Department of Computer Science and Engineering, Southeast University, Nanjing 210096, Jiangsu, China
  • Received: 2004-03-05 Revised: 2004-10-13 Online: 2005-05-31 Published: 2005-05-31
  • About the authors: HAN Jing-yu (born 1976), male, from Baishan, Jilin, Ph.D. candidate, E-mail: hjy3789759@163.com; DONG Yi-sheng (born 1940), male, from Qidong, Jiangsu, professor and doctoral supervisor.
  • Funding:
    Jiangsu Province Tenth Five-Year Plan High-Tech Project (BG2001013)

Abstract: A new method for online data cleaning is presented. First, each record string in a reference table known to be clean is mapped to a point in a high-dimensional metric space under the Manhattan distance. Next, the points are partitioned by clustering, and the points in each partition are indexed with a B+ tree, so that a query in the high-dimensional space is converted into a range query in one-dimensional space. For each tuple of the input table, a branch-and-bound search over the index retrieves its KNN (K nearest neighbors) in the reference table, which identifies the K reference records that best match it. Theoretical analysis and experiments indicate that this is an effective approach to online data cleaning.

Key words: data cleaning, branch and bound, B+ tree
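The pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how strings are mapped to points, how clusters are formed, or how the one-dimensional key is defined. The sketch therefore makes several assumptions: a letter-frequency embedding (`embed`), cluster centers supplied by the caller, a one-dimensional key equal to each point's Manhattan distance to its cluster center, and a sorted list searched with `bisect` standing in for a B+ tree. All names (`ReferenceIndex`, `embed`, `knn`) are hypothetical.

```python
import bisect
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def embed(record):
    """Assumed mapping (not from the paper): letter-frequency vector."""
    counts = Counter(record.lower())
    return tuple(counts.get(ch, 0) for ch in ALPHABET)

def manhattan(p, q):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))

class ReferenceIndex:
    """Clustered one-dimensional index over a clean reference table."""

    def __init__(self, records, centers):
        # Assumption: cluster centers are given as representative strings.
        self.centers = [embed(c) for c in centers]
        self.clusters = [[] for _ in self.centers]
        for rec in records:
            p = embed(rec)
            dists = [manhattan(p, c) for c in self.centers]
            cid = dists.index(min(dists))
            # One-dimensional key: distance from the point to its center.
            self.clusters[cid].append((dists[cid], rec, p))
        for entries in self.clusters:
            entries.sort(key=lambda e: e[0])
        # Sorted key lists stand in for the B+ tree leaf level.
        self.keys = [[e[0] for e in entries] for entries in self.clusters]

    def knn(self, query, k):
        """Return up to k (distance, record) pairs, nearest first."""
        q = embed(query)
        best = []  # kept sorted, never longer than k
        center_dist = [manhattan(q, c) for c in self.centers]
        # Visit the most promising clusters first.
        for cid in sorted(range(len(self.centers)), key=center_dist.__getitem__):
            keys = self.keys[cid]
            if not keys:
                continue
            dq = center_dist[cid]
            radius = best[-1][0] if len(best) == k else float("inf")
            # Bound: by the triangle inequality, d(q, p) >= |dq - key(p)|.
            if dq - keys[-1] > radius:
                continue  # prune the whole cluster
            # One-dimensional range query over the keys.
            lo = bisect.bisect_left(keys, dq - radius)
            hi = bisect.bisect_right(keys, dq + radius)
            for _, rec, p in self.clusters[cid][lo:hi]:
                d = manhattan(q, p)
                if len(best) < k or d < best[-1][0]:
                    bisect.insort(best, (d, rec))
                    del best[k:]
        return best

reference = ["nanjing", "beijing", "shanghai", "nanking",
             "tianjin", "shenzhen", "hangzhou"]
index = ReferenceIndex(reference, centers=["nanjing", "shanghai"])
matches = index.knn("nanjng", k=2)  # a misspelled incoming record
```

Pruning relies only on the triangle inequality, which holds for the Manhattan distance, so the range query never discards a point that could still enter the current top K. A real B+ tree would replace the sorted lists when the reference table does not fit in memory.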

CLC number: