Journal of Applied Sciences (应用科学学报) ›› 2005, Vol. 23 ›› Issue (1): 71-74.

• Article

一种基于结构信息总结树的XML文档聚类方法

梁作鹏, 吴文明, 董逸生   

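  1. Department of Computer Science and Engineering, Southeast University, Nanjing, Jiangsu 210096
  • Received: 2003-11-01 Revised: 2004-03-15 Published: 2005-01-31 Online: 2005-01-31
  • About the authors: LIANG Zuo-peng (b. 1973), male, a native of Feicheng, Shandong, PhD candidate, E-mail: l_zuopeng@sed.edu.cn; DONG Yi-sheng (b. 1940), male, a native of Qidong, Jiangsu, professor and doctoral supervisor, E-mail: ysdong@sed.edu.cn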

Clustering XML Documents Based on a Structural Summary Tree

LIANG Zuo-peng, WU Wen-ming, DONG Yi-sheng   

  1. Department of Computer Science & Engineering, Southeast University, Nanjing 210096, China
  • Received: 2003-11-01 Revised: 2004-03-15 Online: 2005-01-31 Published: 2005-01-31

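Abstract (Chinese version): An effective method for representing the structural information of XML documents is proposed. The structural information of each document is encoded as a numerically labeled structural summary tree (SST). A structural distance is defined on this encoding, and a genetic algorithm is used to cluster the documents. Experiments show that the method achieves high classification accuracy, is easy to implement, and requires no prior DTD knowledge.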

Keywords (Chinese version): document clustering, SST (structural summary tree), information retrieval, genetic algorithm, XML

Abstract: An approach for calculating the structural similarity between XML documents is proposed. The structural information of an XML document is captured with a structural summary tree (SST). By encoding elements as numbers, an SST is transformed into a digit-labeled tree. After normalization, the numbers at the different tree levels are concatenated to form a vector, so that each XML document is represented as an m-dimensional vector. A clustering algorithm based on a genetic algorithm (GA) is adopted because it can provide good results irrespective of the starting configuration. Experimental results show the effectiveness and scalability of the approach.

Key words: XML, GA, SST (structural summary tree), information retrieval, document clustering
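
The pipeline described in the abstract (summarize the structure, label elements with numbers, concatenate the levels into a vector) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact encoding, so the merging rule in `summary_tree`, the code table in `build_codes`, the fixed `depth` and `width` padding in `to_vector`, and the Euclidean `distance` are all assumptions made here for illustration. The GA-based clustering step is not shown.

```python
import math
import xml.etree.ElementTree as ET


def summary_tree(elem):
    """Collapse repeated sibling tags into a single node, recursively.

    Returns (tag, children); children holds one summary tree per distinct
    child tag, in order of first appearance. (Assumed summarization rule.)
    """
    groups = {}  # tag -> all child elements carrying that tag
    for child in elem:
        groups.setdefault(child.tag, []).append(child)
    children = []
    for tag, same_tag in groups.items():
        merged = ET.Element(tag)
        for node in same_tag:
            merged.extend(list(node))  # pool the grandchildren under one node
        children.append(summary_tree(merged))
    return (elem.tag, children)


def levels(tree):
    """Return the tags of a summary tree grouped by depth (root first)."""
    rows, frontier = [], [tree]
    while frontier:
        rows.append([tag for tag, _ in frontier])
        frontier = [kid for _, kids in frontier for kid in kids]
    return rows


def build_codes(trees):
    """Assign each distinct tag an integer code 1, 2, 3, ... (assumed scheme)."""
    codes = {}
    for tree in trees:
        for row in levels(tree):
            for tag in row:
                codes.setdefault(tag, len(codes) + 1)
    return codes


def to_vector(tree, codes, depth=3, width=3):
    """Concatenate normalized per-level codes into a depth*width vector.

    Each level is sorted, truncated to `width` entries, and zero-padded;
    this fixed layout is a simplification chosen for the sketch.
    """
    top = len(codes)
    rows = levels(tree)
    vec = []
    for d in range(depth):
        tags = rows[d] if d < len(rows) else []
        row = sorted(codes[t] / top for t in tags)[:width]
        vec.extend(row + [0.0] * (width - len(row)))
    return vec


def distance(u, v):
    """Euclidean distance between two document vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
```

With this sketch, two documents that differ only in how often an element repeats (for example, one `<author>` versus two) reduce to the same summary tree and therefore to the same vector, while documents with different element vocabularies end up at a nonzero distance. No DTD is consulted at any point, which matches the claim that the approach needs no prior schema knowledge.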
