To process the large volumes of agricultural environmental data held in distributed storage and to improve agricultural production efficiency, this paper modifies the Gaussian mixture model (GMM) clustering algorithm and, on that basis, proposes a distributed-clustering method for detecting anomalies in environmental data during crop growth. The method runs on the Spark distributed computing framework. First, a coarse pre-clustering pass produces the initial models. Second, Spark updates the models iteratively until they stabilize: in each iteration, the Map phase assigns sample points to the models, and the Reduce phase updates the number of models and their parameters. Finally, the clustering result is used to detect anomalous environmental values. Experimental results show that the method is feasible and effective.
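The Map/Reduce iteration summarized above can be illustrated with a minimal single-machine sketch. It is not the paper's implementation: plain Python lists stand in for Spark's distributed map and reduce, the data are one-dimensional, and every name (`map_assign`, `reduce_update`, `fit`, `is_anomaly`) is hypothetical. The hard assignment of each sample to one component, the `min_weight` pruning rule, the hand-picked initial models that stand in for the coarse pre-clustering, and the 3-sigma anomaly test are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def log_score(x, model):
    """Log of weight * Gaussian density for one (weight, mean, var) model."""
    w, m, v = model
    return math.log(w) - 0.5 * math.log(2.0 * math.pi * v) - (x - m) ** 2 / (2.0 * v)

def map_assign(x, models):
    """Map step: emit (component index, sufficient statistics) for one sample.
    Assumption: hard assignment to the best-scoring component."""
    k = max(range(len(models)), key=lambda i: log_score(x, models[i]))
    return k, (1, x, x * x)

def reduce_update(pairs, n_total, min_weight=0.05, var_floor=1e-3):
    """Reduce step: sum the statistics per component, drop components that
    received too few samples (this changes the model count), then recompute
    weight, mean and variance. min_weight and var_floor are assumed values."""
    sums = defaultdict(lambda: [0, 0.0, 0.0])
    for k, (n, sx, sxx) in pairs:
        sums[k][0] += n
        sums[k][1] += sx
        sums[k][2] += sxx
    kept = []
    for k in sorted(sums):
        n, sx, sxx = sums[k]
        if n / n_total < min_weight:
            continue
        mean = sx / n
        kept.append((n / n_total, mean, max(sxx / n - mean * mean, var_floor)))
    total = sum(w for w, _, _ in kept)
    return [(w / total, m, v) for w, m, v in kept]

def fit(data, models, max_iter=50, tol=1e-6):
    """Repeat map and reduce until the model count and the means stop changing."""
    for _ in range(max_iter):
        pairs = [map_assign(x, models) for x in data]
        new_models = reduce_update(pairs, len(data))
        stable = len(new_models) == len(models) and all(
            abs(a[1] - b[1]) < tol for a, b in zip(new_models, models))
        models = new_models
        if stable:
            break
    return models

def is_anomaly(x, models, n_sigma=3.0):
    """Assumed rule: flag x if it is more than n_sigma standard deviations
    away from every fitted component."""
    return all(abs(x - m) > n_sigma * math.sqrt(v) for _, m, v in models)

# Made-up temperature readings with two regimes (about 20 and about 30).
readings = [19.5, 20.0, 20.5, 20.2, 19.8, 20.1, 19.9, 20.3,
            29.6, 30.0, 30.4, 30.1, 29.9, 30.2, 29.8, 30.3]
# Stand-in for the coarse pre-clustering: three guesses, one of them spurious.
initial = [(1 / 3, 18.0, 4.0), (1 / 3, 25.0, 4.0), (1 / 3, 32.0, 4.0)]

models = fit(readings, initial)
print(len(models), is_anomaly(45.0, models), is_anomaly(20.1, models))
```

In this sketch the middle initial component attracts no samples, so the reduce step removes it and the model count shrinks from three to two. In a Spark job the list comprehension in `fit` would correspond to a distributed map over the data set and the per-component summation to a keyed reduce.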