Journal of Applied Sciences ›› 2005, Vol. 23 ›› Issue (4): 399-403.

• Articles

PBC: A Path-Based Method to Clustering XML Documents

LIANG Zuo-peng, YE Ning, DONG Yi-sheng

  1. Department of Computer Science & Engineering, Southeast University, Nanjing 210096, Jiangsu, China
  • Received: 2004-04-07 Revised: 2004-10-29 Online: 2005-07-31 Published: 2005-07-31
  • About the authors: LIANG Zuo-peng (b. 1973), male, from Feicheng, Shandong, Ph.D. candidate, E-mail: L_zuopeng@seu.edu.cn; DONG Yi-sheng (b. 1940), male, from Qidong, Jiangsu, professor and doctoral supervisor, E-mail: ysdong@seu.edu.cn

Abstract: Most existing clustering techniques for XML are based on edit distance and have two main disadvantages: (1) very high time complexity and (2) cluster descriptions that are difficult to interpret. This paper presents path-based clustering (PBC), a method for clustering XML documents by structure. Instead of computing structural distances between XML documents and clustering the documents directly, PBC clusters the paths contained in the documents and thereby clusters the documents indirectly. First, for each path, the set of documents containing that path forms an initial cluster, which is labeled by that path. Then a hierarchical clustering procedure merges initial clusters that contain similar sets of documents, according to a preset criterion, until the process terminates. Each resulting cluster is described by the paths that its documents contain, so the documents in a cluster share a similar set of paths and the result is intuitive and easy to understand. The complexity of the algorithm is O(n), where n is the size of the documents. Experiments show that the algorithm preserves clustering accuracy while substantially improving computation speed.

Key words: information retrieval, XML, document clustering
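
The two steps described in the abstract (initial clusters per path, then hierarchical merging) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the merge criterion, so the Jaccard similarity between document sets, the `threshold` parameter, and the names `root_to_node_paths`, `jaccard`, and `pbc_sketch` are all assumptions made for illustration. The naive pairwise merge loop below also makes no attempt to reach the O(n) complexity reported for PBC.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def root_to_node_paths(xml_text):
    """Return the set of root-to-node tag paths found in one XML document."""
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets (assumed merge criterion)."""
    return len(a & b) / len(a | b)


def pbc_sketch(docs, threshold=0.5):
    """Cluster documents through their paths.

    docs maps a document id to an XML string. The result is a list of
    (paths, doc_ids) pairs; the path set serves as the cluster's label.
    """
    # Step 1: every path defines an initial cluster holding the
    # documents that contain that path.
    initial = {}
    for doc_id, text in docs.items():
        for path in root_to_node_paths(text):
            initial.setdefault(path, set()).add(doc_id)
    clusters = [({path}, set(ids)) for path, ids in sorted(initial.items())]

    # Step 2: repeatedly merge the pair of clusters whose document sets
    # are most similar, stopping once the best pair falls below the
    # (hypothetical) threshold.
    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), jaccard(clusters[i][1], clusters[j][1]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        merged = (clusters[i][0] | clusters[j][0],
                  clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

Calling `pbc_sketch` on a dictionary of XML strings returns each cluster together with the paths that label it, which mirrors the abstract's point that clusters are described by paths rather than by pairwise distances. In this sketch a document can appear in more than one cluster when its paths fall into different merged groups; how the original method handles that case is not stated in the abstract.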
