Journal of Applied Sciences ›› 2018, Vol. 36 ›› Issue (3): 524-534. doi: 10.3969/j.issn.0255-8297.2018.03.012

• Computer Science and Applications •

HKM Clustering Algorithm Design and Research Based on Hadoop Platform

ZHANG Shu-fen1, DONG Yan-yan1, CHEN Xue-bin1,2

  1. College of Science, North China University of Science and Technology, Tangshan 063009, Hebei Province, China;
  2. Hebei Key Laboratory of Data Science & Application, Tangshan 063009, Hebei Province, China
  • Received: 2016-12-27 Revised: 2017-02-04 Online: 2018-05-31 Published: 2018-05-31

Abstract:

To address the difficulty that the traditional K-means clustering algorithm has with large-scale data sets, a Hadoop K-means (HKM) clustering algorithm is proposed. First, the algorithm uses sample density to exclude the influence of noise points in the data set. Second, the initial cluster centers are optimized by selecting K initial centers according to the principle of maximizing the minimum distance. Finally, the MapReduce programming model of the Hadoop cloud computing platform is used to parallelize the algorithm. Experimental results show that the proposed algorithm not only produces accurate and stable clustering results, but also overcomes the scalability problems that traditional clustering algorithms encounter on large-scale data.
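
The three stages named in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the abstract gives no formulas, so the density test (a neighbor count within a fixed radius), the choice of the first point as the seed center, and all function and parameter names (`density_filter`, `maxmin_centers`, `map_assign`, `reduce_mean`, `radius`, `min_neighbors`) are assumptions made for illustration. The map and reduce steps are emulated in-process rather than run on Hadoop.

```python
import math
from collections import defaultdict


def density_filter(points, radius, min_neighbors):
    """Keep points with at least `min_neighbors` other points within `radius`.

    Assumed density criterion; the paper's exact definition is not given here.
    """
    kept = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        )
        if neighbors >= min_neighbors:
            kept.append(p)
    return kept


def maxmin_centers(points, k):
    """Pick k initial centers by repeatedly maximizing the minimum distance."""
    centers = [points[0]]  # assumption: seed with the first remaining point
    while len(centers) < k:
        # Next center: the point farthest from its nearest chosen center.
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(nxt)
    return centers


def map_assign(point, centers):
    """Map step: emit (index of nearest center, point)."""
    idx = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    return idx, point


def reduce_mean(members):
    """Reduce step: the new center is the coordinate-wise mean of its members."""
    n = len(members)
    return tuple(sum(coord) / n for coord in zip(*members))


def kmeans_round(points, centers):
    """One K-means iteration written as map, group by key, reduce."""
    groups = defaultdict(list)
    for p in points:
        idx, value = map_assign(p, centers)
        groups[idx].append(value)
    # A center with no assigned points is left where it was.
    return [
        reduce_mean(groups[i]) if i in groups else centers[i]
        for i in range(len(centers))
    ]
```

In this sketch `kmeans_round` would be called repeatedly until the centers stop moving. On a real cluster, the grouping by center index would be done by the MapReduce shuffle between the map and reduce phases.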

Key words: K-means algorithm, maximum minimum distance, Hadoop platform, parallel computing, sample density
