HKM Clustering Algorithm Design and Research Based on Hadoop Platform

ZHANG Shu-fen; DONG Yan-yan; Chen Xue-bin

doi:10.3969/j.issn.0255-8297.2018.03.012

Journal of Applied Sciences, 2018, Vol. 36, Issue 3: 524-534

Computer Science and Applications

1. College of Science, North China University of Science and Technology, Tangshan 063009, Hebei Province, China;
2. Hebei Key Laboratory of Data Science & Application, Tangshan 063009, Hebei Province, China

Received date: 2016-12-27

Revised date: 2017-02-04

Online published: 2018-05-31

Abstract

To address the difficulty that the traditional K-means clustering algorithm has with large-scale data sets, a Hadoop K-means (HKM) clustering algorithm is proposed. First, the algorithm uses sample density to exclude the influence of isolated points and noise in the data set. Second, it optimizes the initial cluster centers by selecting K initial centers according to the principle of maximizing the minimum distance. Finally, the MapReduce programming model of the Hadoop cloud computing platform is used to parallelize the algorithm. Experimental results show that the proposed algorithm not only produces accurate and stable clustering results, but also overcomes the scalability problems that traditional clustering algorithms encounter on large-scale data.
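
The three steps named in the abstract can be illustrated with a minimal single-machine sketch in Python. It is not the authors' implementation: the abstract does not give the density definition, the choice of the first center, or the MapReduce job layout, so the neighbor-count density test, the parameters `radius` and `min_neighbors`, the use of the first point as the starting center, and all function names below are assumptions made for illustration. The `map_assign` and `reduce_mean` functions only mimic the roles a mapper and a reducer would play in one K-means iteration.

```python
import math
from collections import defaultdict


def filter_by_density(points, radius, min_neighbors):
    """Keep points with at least min_neighbors other points within radius.

    Assumed density criterion; the paper's exact definition is not given.
    """
    kept = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        )
        if neighbors >= min_neighbors:
            kept.append(p)
    return kept


def max_min_centers(points, k):
    """Choose k centers by repeatedly maximizing the minimum distance.

    Starts from the first point (an assumption), then adds the point whose
    distance to its nearest already-chosen center is largest.
    """
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(
            points,
            key=lambda p: min(math.dist(p, c) for c in centers),
        )
        centers.append(farthest)
    return centers


def map_assign(point, centers):
    """Mapper role: emit (index of nearest center, point)."""
    idx = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    return idx, point


def reduce_mean(cluster_points):
    """Reducer role: average the points assigned to one center."""
    n = len(cluster_points)
    return tuple(sum(coords) / n for coords in zip(*cluster_points))


def kmeans_iteration(points, centers):
    """One K-means iteration written as a map step followed by a reduce step."""
    groups = defaultdict(list)
    for p in points:
        idx, value = map_assign(p, centers)
        groups[idx].append(value)
    # A center that received no points keeps its previous position.
    return [
        reduce_mean(groups[i]) if groups[i] else centers[i]
        for i in range(len(centers))
    ]


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    dense = filter_by_density(data, radius=2.0, min_neighbors=1)
    centers = max_min_centers(dense, k=2)
    print(kmeans_iteration(dense, centers))
```

In this sketch the density filter runs before center selection, so an isolated point such as `(50, 50)` cannot be picked as an initial center, which is the kind of failure the max-min rule alone would be prone to. On a real Hadoop cluster the map and reduce roles would be separate tasks over data partitions, with the current centers distributed to every mapper.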

Keywords: K-means algorithm; maximum minimum distance; Hadoop platform; parallel computing; sample density

Cite this article

ZHANG Shu-fen, DONG Yan-yan, Chen Xue-bin. HKM Clustering Algorithm Design and Research Based on Hadoop Platform[J]. Journal of Applied Sciences, 2018, 36(3): 524-534. DOI: 10.3969/j.issn.0255-8297.2018.03.012
