Data Stream Classification with Data Uncertainty and Concept Drift

doi:10.3969/j.issn.0255-8297.2017.05.003

Journal of Applied Sciences, 2017, Vol. 35, Issue (5): 559-569.

• Selected Papers from the 2016 China Computer Applications Conference •

LÜ Yan-xia^1,2, WANG Cui-rong^1,2, WANG Cong^2, YUAN Ying^2

1. College of Computer Science and Engineering, Northeastern University, Shenyang 110819, China;
2. School of Computer and Communication Engineering, Northeastern University at Qinhuangdao, Qinhuangdao 066004, Hebei Province, China

Received:2016-10-05 Revised:2017-02-27 Online:2017-09-30 Published:2017-09-30
About the author: LÜ Yan-xia, Ph.D. candidate, lecturer; research interests: big data analysis, distributed computing, online learning; E-mail: shaoqilyx@163.com
Funding:
Supported by the National Natural Science Foundation of China (No. 61300195) and the Natural Science Foundation of Hebei Province (No. F2014501078, No. F2016501079)

Abstract:

Much of the data on networks is uncertain because of privacy protection, data loss, network errors, and similar causes. In a data stream system, data arrive continuously, so the full data set is never available at once; in addition, the concept underlying the data often changes over time. To address this, the paper constructs an incremental classification model for data streams that exhibit both data uncertainty and concept drift. The model is based on the very fast decision tree (VFDT) algorithm. In the learning stage, it uses the Hoeffding bound to quickly build a decision tree that can handle uncertain data. In the classification stage, it applies a weighted Bayes classifier at the leaves of the tree to improve classification accuracy on uncertain data. A sliding window and alternate trees are used to handle concept drift in the stream. Experimental results show that the algorithm achieves relatively high classification accuracy and execution efficiency on both synthetic and real data.

Key words: data uncertainty, data stream, decision tree, concept drift, classification
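As an illustration of the Hoeffding bound step mentioned in the abstract, the sketch below shows the standard split test used by very fast decision trees on ordinary (certain) data. It is not the paper's method: the abstract does not give the uncertainty-aware split criterion, so the function names `hoeffding_bound` and `should_split`, the information-gain range `log2(n_classes)`, and the `tie_threshold` parameter are all assumptions chosen for illustration.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon such that, with probability 1 - delta, the true mean of a
    random variable with the given range is within epsilon of the mean
    observed over n independent samples."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float, n_classes: int,
                 delta: float, n: int, tie_threshold: float = 0.05) -> bool:
    """Standard VFDT-style test (assumed here, not taken from the paper):
    split a leaf once the best attribute beats the runner-up by more than
    epsilon, or once epsilon is so small that the two are effectively tied."""
    value_range = math.log2(n_classes)  # range of information gain (assumption)
    eps = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > eps or eps < tie_threshold


if __name__ == "__main__":
    # Two classes, 1000 examples seen at the leaf, confidence 1 - 1e-7.
    print(hoeffding_bound(1.0, 1e-7, 1000))
    print(should_split(0.30, 0.15, n_classes=2, delta=1e-7, n=1000))
```

Because epsilon shrinks as `n` grows, a leaf that cannot yet separate its two best attributes simply waits for more examples, which is what lets the tree be built incrementally in a single pass over the stream.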

CLC number:

TP311

LÜ Yan-xia, WANG Cui-rong, WANG Cong, YUAN Ying. Data Stream Classification with Data Uncertainty and Concept Drift[J]. Journal of Applied Sciences, 2017, 35(5): 559-569.


