收稿日期: 2016-10-05
修回日期: 2017-02-27
网络出版日期: 2017-09-30
基金资助
国家自然科学基金(No.61300195);河北省自然科学基金(No.F2014501078,No.F2016501079)资助
Data Stream Classifcation with Data Uncertainty and Concept Drift
Received date: 2016-10-05
Revised date: 2017-02-27
Online published: 2017-09-30
隐私保护、数据丢失、网络错误等原因导致网络中大量数据存在不确定性.数据流系统中数据连续不断到达系统,故不能一次性获得全部数据,此外数据的概念特征经常发生变化.针对这种情况,构建了一个增量式分类模型来处理数据具有不确定性的隐含概念漂移的数据流分类问题.该模型采用非常快速决策树算法,在学习阶段使用霍夫丁边界理论迅速构建能处理数据不确定性的决策树模型;在分类阶段将加权贝叶斯分类器应用于决策树的叶子节点,以提高不确定数据分类的准确率;采用滑动窗口技术和替换树来处理数据流中的概念漂移现象.实验表明,无论对人工数据还是实际数据,该算法均有较高的分类准确率和执行效率.
吕艳霞, 王翠容, 王聪, 苑迎 . 一种基于数据不确定性的概念漂移数据流分类算法[J]. 应用科学学报, 2017 , 35(5) : 559 -569 . DOI: 10.3969/j.issn.0255-8297.2017.05.003
Data in the Web have much uncertainty because of privacy protection, data loss, network errors, etc. In a data stream system, data arrive continuously and therefore one cannot obtain all data in any time. In addition, the concept drift often occurs in the data stream. This paper constructs an incremental classifcation model to deal with data stream classifcation with data uncertainty and concept drift. In this model, a fast decision tree algorithm is used. It can analyze uncertain information quickly and effectively both in the learning stage and the classifcation stage. In the learning stage, it uses the Hoeffding bound theory to quickly construct a decision tree model for the data stream with data uncertainty. In the classifcation stage, it uses a weighted Bayes classifer in the tree leaves to improve precision of the classifcation. The use of a sliding window to replace the tree ensures that the algorithm can deal with concept drift. Experimental results show that the algorithm has good classifcation accuracy and execution efciency both on artifcial and real data.
Key words: concept drift; classifcation; data stream; data uncertainty; decision tree
[1] Tsang S, Kao B, Yip K Y, Ho W S, Lee S D. Decision trees for uncertain data[J]. IEEE Transactions on Knowledge & Data Engineering, 2009, 231:64-78.
[2] Hulten G, Spencer L, Domingos P. Mining time-changing data streams[J]. ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2001:220-224.
[3] Charu C A, Philip S Y. A survey of uncertain data algorithms and applications[J]. IEEE Transactions on Knowledge and Data Engineering, 2009, 21:609-623.
[4] Ding X, Lian X, Chen L, Jin H. Continuous monitoring of skylines over uncertain data streams[J]. Information Sciences, 2012:196-214.
[5] Gao C, Wang J. Direct mining of discriminative patterns for classifying uncertain data[C]//Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010:861-870.
[6] Cao K Y, Wang G, Han D. An algorithm for classifcation over uncertain data based on extreme learning machine[J]. Neurocomputing, 2016, 174:194-202.
[7] Qin B, Xia Y, Wang S, Du X. A novel Bayesian classifcation for uncertain data[J]. Knowledge-Based System, 2011, 24:1151-1158.
[8] Liu J, Li X, Zhong W. Ambiguous decision trees for mining concept-drifting data streams[J]. Pattern Recognition Letters, 2009:1347-1355.
[9] Ahmadi Z, Kramer S. Prototype-based learning on concept-drifting data streams[J]. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014:412-421.
[10] Sidhu P, Bhatia M P S. A novel online ensemble approach to handle concept drifting data streams-diversifed dynamic weighted majority[J]. International Journal of Machine Learning & Cybernetics, 2015:1-25.
[11] Liu S, Sun Z, Liu T. Research of incremental data stream classifcation based on sample uncertainty[J]. Journal of Chinese Computer Systems, 2015:193-196.
[12] Liang C, Zhang Y, Shi P, Hu Z. Learning accurate very fast decision trees from uncertain data streams[J]. International Journal of Systems Science, 2014:1-19.
[13] Lü Y, Wang C R, Wang C, Yuan Y. Online classifcation algorithm for uncertain data stream in big data[J]. Journal of Northeastern University (Natural Science), 2016, 37(9):1245-1249.
[14] Hoeffding W. Probability inequalities for sums of bounded random variables[J]. Journal of the American Statistical Association, 1962:13-30.
[15] Qin B, Xia Y, Li F. Dtu:a decision tree for uncertain data[J]. Lecture Notes in Computer Science, 2009:4-15.
[16] He J, Zhang Y, Shi X L P. Learning naive Bayes classifers from positive and unlabelled examples with uncertainty[J]. International Journal of Systems Science, 2012:1805-1825.
[17] West D H D. Updating mean and variance estimates:an improved method[J]. Communication of ACM, 1979:532-535.
/
| 〈 |
|
〉 |