Journal of Applied Sciences ›› 2017, Vol. 35 ›› Issue (5): 559-569. doi: 10.3969/j.issn.0255-8297.2017.05.003

• Selected papers from the 2016 China Conference on Computer Applications •

A Classification Algorithm for Concept-Drifting Data Streams Based on Data Uncertainty

吕艳霞1,2, 王翠容1,2, 王聪2, 苑迎2   

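  1. College of Computer Science and Engineering, Northeastern University, Shenyang 110819, China;
    2. School of Computer and Communication Engineering, Northeastern University at Qinhuangdao, Qinhuangdao 066004, Hebei Province, China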
  • Received: 2016-10-05 Revised: 2017-02-27 Online: 2017-09-30 Published: 2017-09-30
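  • About the author: LÜ Yan-xia, Ph.D. candidate and lecturer; research interests: big data analytics, distributed computing, online learning. E-mail: shaoqilyx@163.com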
  • Supported by:

    National Natural Science Foundation of China (No. 61300195); Natural Science Foundation of Hebei Province (No. F2014501078, No. F2016501079)

Data Stream Classification with Data Uncertainty and Concept Drift

LÜ Yan-xia1,2, WANG Cui-rong1,2, WANG Cong2, YUAN Ying2   

  1. College of Computer Science and Engineering, Northeastern University, Shenyang 110819, China;
    2. School of Computer and Communication Engineering, Northeastern University at Qinhuangdao, Northeastern University, Qinhuangdao 066004, Hebei Province, China
  • Received: 2016-10-05 Revised: 2017-02-27 Online: 2017-09-30 Published: 2017-09-30

Abstract (Chinese version):

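Privacy protection, data loss, network errors, and similar causes leave much of the data in networks uncertain. In a data stream system, data arrive continuously, so the full data set can never be obtained at once; moreover, the concept underlying the data often changes. To address this, an incremental classification model is constructed for classifying data streams that contain uncertain data and hidden concept drift. The model uses the very fast decision tree algorithm. In the learning stage, it applies the Hoeffding bound to quickly build a decision tree that can handle data uncertainty. In the classification stage, it applies a weighted Bayes classifier at the leaf nodes of the decision tree to improve classification accuracy on uncertain data. It uses a sliding window together with alternate (replacement) trees to handle concept drift in the stream. Experiments show that the algorithm achieves high classification accuracy and execution efficiency on both synthetic and real data.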

Key words (Chinese version): data uncertainty, data stream, decision tree, concept drift, classification

Abstract:

Data on the Web are often uncertain because of privacy protection, data loss, network errors, etc. In a data stream system, data arrive continuously, so one can never obtain all the data at once. In addition, concept drift often occurs in the data stream. This paper constructs an incremental classification model for data stream classification under data uncertainty and concept drift. The model uses a very fast decision tree algorithm that can analyze uncertain information quickly and effectively in both the learning stage and the classification stage. In the learning stage, it uses the Hoeffding bound to quickly construct a decision tree model for the data stream with data uncertainty. In the classification stage, it uses a weighted Bayes classifier in the tree leaves to improve classification accuracy. A sliding window and alternate (replacement) trees allow the algorithm to deal with concept drift. Experimental results show that the algorithm has good classification accuracy and execution efficiency on both artificial and real data.

Key words: concept drift, classification, data stream, data uncertainty, decision tree
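The abstract names the Hoeffding bound as the tool for building the tree quickly but does not give the formula. As a rough illustration only, the sketch below shows the standard Hoeffding-bound split test used in Hoeffding-tree learners: with probability 1 - δ, the true mean of a variable with range R lies within ε = sqrt(R² ln(1/δ) / (2n)) of the mean of n observations. The function names `hoeffding_bound` and `should_split` and all parameter names are hypothetical, and the paper's uncertainty-aware variant of the split test is not described in the abstract, so it is not reproduced here.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon for the standard Hoeffding bound.

    With probability 1 - delta, the true mean of a random variable whose
    values span `value_range` is within epsilon of the mean observed over
    n independent samples.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Decide whether a leaf has seen enough examples to split.

    Assumption: the split heuristic (for example information gain) has
    already been evaluated for every attribute at this leaf; only the two
    highest scores are passed in. The leaf splits once their gap exceeds
    epsilon. Tie-breaking thresholds used by practical learners are omitted.
    """
    epsilon = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > epsilon


if __name__ == "__main__":
    # Epsilon shrinks as more examples reach the leaf.
    for n in (100, 1000, 10000):
        print(n, round(hoeffding_bound(1.0, 1e-7, n), 4))
```

Because ε shrinks as n grows, a leaf that cannot yet separate its two best attributes simply waits for more stream examples, which is what lets such a learner process each example once without storing the whole stream.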

CLC number: