HKM Clustering Algorithm Design and Research Based on Hadoop Platform

Journal of Applied Sciences, 2018, Vol. 36, Issue (3): 524-534. doi: 10.3969/j.issn.0255-8297.2018.03.012

ZHANG Shu-fen¹, DONG Yan-yan¹, CHEN Xue-bin¹,²

1. College of Science, North China University of Science and Technology, Tangshan 063009, Hebei Province, China;
2. Hebei Key Laboratory of Data Science & Application, Tangshan 063009, Hebei Province, China

Received: 2016-12-27 Revised: 2017-02-04 Online: 2018-05-31 Published: 2018-05-31
About the author: ZHANG Shu-fen, professor; research interests: cloud computing, big data. E-mail: hblgzhsf@qq.com

Abstract:

To address the scalability problem that the traditional K-means clustering algorithm faces on large-scale data sets, a Hadoop K-means (HKM) clustering algorithm is proposed. First, the algorithm uses sample density to eliminate the influence of isolated points and noise points in the data set. Second, it selects K initial centers by maximizing the minimum distance between them, so that the initial cluster centers are optimized. Finally, the MapReduce programming model of the Hadoop cloud computing platform is used to parallelize the algorithm. Experimental results show that the proposed algorithm not only achieves high accuracy and stability in its clustering results, but also resolves the scalability problems that traditional clustering algorithms encounter on large-scale data.

Key words: K-means algorithm, maximum minimum distance, Hadoop platform, parallel computing, sample density
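
The three steps named in the abstract can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation. The abstract does not define the density measure, so the sketch assumes that density is the number of neighbors within a hypothetical radius `eps` and that points with fewer than `min_neighbors` neighbors are treated as noise. It also assumes the first center is the densest point, and it uses ordinary functions as stand-ins for the Hadoop MapReduce map and reduce jobs. All function and parameter names are hypothetical.

```python
import math
from collections import defaultdict


def density(points, eps):
    """Neighbor count within radius eps for each point (assumed density measure)."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= eps)
        for i, p in enumerate(points)
    ]


def drop_outliers(points, eps, min_neighbors):
    """Step 1: discard isolated or noise points whose density is too low."""
    dens = density(points, eps)
    return [p for p, d in zip(points, dens) if d >= min_neighbors]


def max_min_centers(points, k, eps):
    """Step 2: choose k initial centers by maximizing the minimum distance."""
    dens = density(points, eps)
    centers = [points[dens.index(max(dens))]]  # assumption: start at the densest point
    while len(centers) < k:
        # The next center is the point farthest from its nearest chosen center.
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def map_phase(split, centers):
    """Map: emit (nearest center index, (point, 1)) for every point in one split."""
    for p in split:
        idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        yield idx, (p, 1)


def reduce_phase(pairs, centers):
    """Reduce: average the points emitted for each center index."""
    sums = {}
    counts = defaultdict(int)
    for idx, (p, n) in pairs:
        acc = sums.get(idx, (0.0,) * len(p))
        sums[idx] = tuple(a + x for a, x in zip(acc, p))
        counts[idx] += n
    # A center that received no points keeps its previous position.
    return [
        tuple(s / counts[i] for s in sums[i]) if i in sums else centers[i]
        for i in range(len(centers))
    ]


def hkm(points, k, eps, min_neighbors, n_splits=2, max_iter=100):
    """Step 3: iterate map and reduce over data splits until centers stop moving."""
    data = drop_outliers(points, eps, min_neighbors)
    centers = max_min_centers(data, k, eps)
    splits = [data[i::n_splits] for i in range(n_splits)]  # stand-in for input splits
    for _ in range(max_iter):
        pairs = [kv for split in splits for kv in map_phase(split, centers)]
        new_centers = reduce_phase(pairs, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

For example, calling `hkm(points, k=2, eps=1.5, min_neighbors=2)` on two tight groups of points plus one far-away point first discards the far-away point and then converges on one center per group. On a real cluster the map and reduce functions would run as distributed tasks, with only the current centers shared between iterations.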

CLC number: TP391.1

ZHANG Shu-fen, DONG Yan-yan, CHEN Xue-bin. HKM Clustering Algorithm Design and Research Based on Hadoop Platform[J]. Journal of Applied Sciences, 2018, 36(3): 524-534.

