Journal of Applied Sciences ›› 2006, Vol. 24 ›› Issue (4): 396-400.

• Article

An Efficient Clustering Algorithm of Large Scale and High Dimensional Data Set

ZHOU Xiao-yun, SUN Zhi-hui, ZHANG Bai-li

  1. Department of Computer Science and Engineering, Southeast University, Nanjing 210096, Jiangsu, China
  • Received: 2005-02-25 Revised: 2005-05-24 Online: 2006-07-31 Published: 2006-07-31
  • About the authors: ZHOU Xiao-yun, PhD candidate, research interests: data mining and knowledge discovery, E-mail: ZXY0724@seu.edu.cn; SUN Zhi-hui, professor and doctoral supervisor, research interests: complex system integration, knowledge discovery, and data mining, E-mail: zhsun@seu.edu.cn
  • Supported by:
    National Natural Science Foundation of China (70371015); Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (20040286009)

Abstract: Clustering large, high-dimensional data sets is a serious challenge for clustering algorithms and an active topic in current clustering research. Traditional clustering algorithms often fail to detect meaningful clusters because most real-world data sets are high-dimensional and have an inherently sparse feature space. The clusters are instead often hidden in various subspaces of the original feature space. In addition, high-dimensional data often contain a large amount of random noise, which causes additional efficiency problems. To overcome these problems, this paper proposes OpCluster, a new algorithm that builds on CLIQUE and is based on optimal interval partitioning and data set partitioning. Experiments on a synthetic data set show that OpCluster clusters large, high-dimensional data sets effectively and efficiently.

Key words: clustering algorithms, subspace clustering, optimal partition, data partition
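
As background for readers unfamiliar with CLIQUE, the algorithm this paper builds on: CLIQUE divides each dimension into equal-width intervals and treats a unit as dense when it holds more than a threshold fraction of the points. The following is a minimal sketch of that one-dimensional dense-unit test only. It is not OpCluster, whose optimal interval partitioning and data set partitioning steps are not described in this abstract. The function name `dense_units_1d`, its parameters, and the sample points are all hypothetical and chosen for illustration.

```python
from collections import Counter


def dense_units_1d(points, dim, n_intervals, lo, hi, min_fraction):
    """Return the indices of equal-width intervals along `dim` that hold
    more than `min_fraction` of all points (hypothetical helper)."""
    width = (hi - lo) / n_intervals
    counts = Counter()
    for p in points:
        # Clamp so that a value equal to `hi` falls in the last interval.
        idx = min(int((p[dim] - lo) / width), n_intervals - 1)
        counts[idx] += 1
    threshold = min_fraction * len(points)
    return sorted(i for i, c in counts.items() if c > threshold)


# Illustrative data: four of five points share a narrow range in
# dimension 0 but are spread out in dimension 1.
points = [(0.10, 0.90), (0.15, 0.25), (0.12, 0.50), (0.85, 0.55), (0.18, 0.70)]

dense_x = dense_units_1d(points, dim=0, n_intervals=5, lo=0.0, hi=1.0, min_fraction=0.5)
dense_y = dense_units_1d(points, dim=1, n_intervals=5, lo=0.0, hi=1.0, min_fraction=0.5)
```

In this toy data the points concentrate in a single interval of dimension 0 and in no interval of dimension 1, which illustrates how a cluster can be visible in one subspace while being diluted in the full space.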

CLC number: