Incremental clustering algorithms can handle data sets that are too large to be read into memory at one time. The traditional incremental multiple medoids based fuzzy clustering (IMMFC) algorithm selects only one center point, or the same fixed number of center points, for each data block, which leads to poor clustering performance when the object weights within a cluster are small. This paper proposes a new incremental fuzzy clustering algorithm for processing large data sets. First, the algorithm divides the large data set into multiple small data blocks and performs fuzzy clustering on each block. Then, target center points are selected from each cluster of each block; the number of center points selected for a cluster is the minimum number of objects in that cluster whose weights sum to more than a threshold. Finally, all selected center points are merged into a final data block, which is fuzzy clustered to obtain the final center points. Experimental results show that the proposed algorithm outperforms IMMFC when each data block accounts for more than 10% of the total data.
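The center point selection rule can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_center_points`, the dictionary of per-object weights, and the greedy highest-weight-first ordering are all assumptions made for illustration. The abstract specifies only that the number of center points is the minimum number of objects whose weights sum to more than a threshold.

```python
from typing import Dict, List


def select_center_points(weights: Dict[str, float], threshold: float) -> List[str]:
    """Pick the fewest objects of one cluster whose weights sum to more than `threshold`.

    Assumption (not stated in the source): weights are non-negative, so taking
    objects in descending weight order yields the minimum count.
    """
    # Rank objects from highest to lowest weight (sorted() is stable for ties).
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)

    chosen: List[str] = []
    total = 0.0
    for obj, weight in ranked:
        chosen.append(obj)
        total += weight
        if total > threshold:  # strictly greater, as in the abstract
            break
    # If the threshold is never exceeded, every object in the cluster is returned.
    return chosen


# Hypothetical clusters: one with a few dominant objects, one with flat weights.
peaked = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
flat = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

peaked_centers = select_center_points(peaked, threshold=0.7)
flat_centers = select_center_points(flat, threshold=0.7)
```

Under these assumed weights, the cluster with dominant objects needs fewer center points than the cluster whose weights are uniformly small, which is the situation where a fixed number of center points per cluster is said to perform poorly.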