Journal of Applied Sciences (应用科学学报) ›› 2018, Vol. 36 ›› Issue (3): 524-534. doi: 10.3969/j.issn.0255-8297.2018.03.012

• Computer Science and Applications •

HKM Clustering Algorithm Design and Research Based on Hadoop Platform

ZHANG Shu-fen¹, DONG Yan-yan¹, CHEN Xue-bin¹,²

  1. College of Science, North China University of Science and Technology, Tangshan 063009, Hebei Province, China;
  2. Hebei Key Laboratory of Data Science & Application, Tangshan 063009, Hebei Province, China
  • Received: 2016-12-27 Revised: 2017-02-04 Online: 2018-05-31 Published: 2018-05-31
  • About the author: ZHANG Shu-fen, professor; research interests: cloud computing and big data. E-mail: hblgzhsf@qq.com

Abstract:

To address the scalability problem that the traditional K-means clustering algorithm faces when processing large-scale data sets, a Hadoop K-means (HKM) clustering algorithm is proposed. First, the algorithm uses sample density to remove the influence of isolated points and noise points in the data set. Second, it selects K initial centers by maximizing the minimum distance, which optimizes the initial cluster centers. Finally, it uses the MapReduce programming model of the Hadoop cloud computing platform to parallelize the algorithm. Experimental results show that the proposed algorithm not only achieves high accuracy and stability in its clustering results, but can also effectively address the scalability problem that traditional clustering algorithms encounter on large-scale data.

Key words: K-means algorithm, sample density, maximizing the minimum distance, Hadoop platform, parallel computing
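The three stages named in the abstract can be illustrated with a minimal single-machine sketch in Python. It is not the authors' implementation: the abstract does not define sample density, so the sketch assumes a simple neighbor count within a radius; it assumes the first point seeds the max-min selection; and it only imitates the MapReduce map and reduce steps with ordinary functions rather than running on Hadoop. All function and parameter names (`filter_by_density`, `maxmin_centers`, `radius`, `min_neighbors`, and so on) are hypothetical.

```python
import math
from collections import defaultdict


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filter_by_density(points, radius, min_neighbors):
    """Drop isolated points. Assumed density definition: the number of
    other points within `radius`; points below `min_neighbors` are removed."""
    kept = []
    for i, p in enumerate(points):
        count = sum(1 for j, q in enumerate(points)
                    if j != i and dist(p, q) <= radius)
        if count >= min_neighbors:
            kept.append(p)
    return kept


def maxmin_centers(points, k):
    """Choose k initial centers by maximizing the minimum distance.
    Assumption: the first point is used as the first center."""
    centers = [points[0]]
    while len(centers) < k:
        # The next center is the point farthest from its nearest chosen center.
        nxt = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(nxt)
    return centers


def map_assign(point, centers):
    """Map step: emit (index of the nearest center, point)."""
    idx = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return idx, point


def reduce_mean(cluster_points):
    """Reduce step: the new center is the coordinate-wise mean."""
    n = len(cluster_points)
    return tuple(sum(coords) / n for coords in zip(*cluster_points))


def hkm_sketch(points, k, radius, min_neighbors, max_iter=100):
    """Run density filtering, max-min initialization, then K-means
    iterations written as map and reduce steps."""
    data = filter_by_density(points, radius, min_neighbors)
    centers = maxmin_centers(data, k)
    for _ in range(max_iter):
        groups = defaultdict(list)
        for p in data:  # on a cluster, mappers would process splits in parallel
            idx, pt = map_assign(p, centers)
            groups[idx].append(pt)
        # An empty cluster keeps its previous center.
        new_centers = [reduce_mean(groups[i]) if groups[i] else centers[i]
                       for i in range(k)]
        if new_centers == centers:  # converged: centers no longer move
            break
        centers = new_centers
    return centers, data
```

Calling `hkm_sketch(points, k=2, radius=2.0, min_neighbors=2)` on a list of 2-D tuples returns the final centers together with the filtered data set. In this sketch the density filter and the center selection are quadratic in the number of points, so they are shown only to make the order of the stages concrete.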

CLC number: