Journal of Applied Sciences ›› 2005, Vol. 23 ›› Issue (4): 399-403.

• Articles •

PBC: A Path-Based Method to Clustering XML Documents

LIANG Zuo-peng, YE Ning, DONG Yi-sheng   

  1. Department of Computer Science & Engineering, Southeast University, Nanjing 210096, China
  • Received: 2004-04-07 Revised: 2004-10-29 Online: 2005-07-31 Published: 2005-07-31

Abstract: Most existing clustering techniques for XML are based on the concept of edit distance and have two main disadvantages: (1) very high time complexity and (2) resulting clusters whose descriptions are difficult to understand. In this paper, a novel approach called path-based clustering (PBC) is presented. Instead of comparing the structures of XML documents and clustering the documents directly, PBC clusters the paths contained in these documents. For each path, a cluster is formed that contains every document having that path. Clusters that contain similar sets of documents are then combined, so that each resulting cluster contains documents that share a similar set of paths. Experimental results show the effectiveness and efficiency of this approach.
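
The two steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how paths are represented, how similarity between document sets is measured, or how clusters are merged. The sketch assumes root-to-element tag paths, Jaccard similarity, a fixed merge threshold, and a greedy pairwise merge; the names extract_paths, jaccard, path_based_clusters, and threshold are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def extract_paths(xml_text):
    """Return the set of root-to-element tag paths in one XML document.

    Assumption: a "path" is the sequence of element tags from the root,
    e.g. "/book/title". The paper may define paths differently.
    """
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets (an assumed measure)."""
    return len(a & b) / len(a | b)


def path_based_clusters(docs, threshold=0.5):
    """Cluster documents by the paths they contain.

    docs maps a document id to its XML text. Returns a list of
    (paths, document_ids) pairs.
    """
    # Step 1: for each path, form a cluster of the documents that have it.
    docs_by_path = defaultdict(set)
    for doc_id, xml_text in docs.items():
        for path in extract_paths(xml_text):
            docs_by_path[path].add(doc_id)
    clusters = [({p}, set(d)) for p, d in sorted(docs_by_path.items())]

    # Step 2: combine clusters whose document sets are similar.
    # Assumption: greedy merging, repeated until no pair reaches the threshold.
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if jaccard(clusters[i][1], clusters[j][1]) >= threshold:
                    paths_i, docs_i = clusters[i]
                    paths_j, docs_j = clusters[j]
                    clusters[i] = (paths_i | paths_j, docs_i | docs_j)
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters


if __name__ == "__main__":
    sample = {
        "a": "<book><title/><author/></book>",
        "b": "<book><title/><author/></book>",
        "c": "<cd><artist/></cd>",
    }
    for paths, doc_ids in path_based_clusters(sample):
        print(sorted(doc_ids), sorted(paths))
```

Because each final cluster carries the set of paths that produced it, the cluster can be described directly by those paths, which is the readability advantage the abstract claims over edit-distance methods. The greedy quadratic merge here is chosen only for brevity and says nothing about the efficiency reported in the paper.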

Key words: information retrieval, XML, document clustering
