Selected paper from the 2016 China Conference on Computer Applications

Analysis of News Topic Evolution Based on DTS-ILDA Model and Association Filtering

  • 1. School of Information Engineering, Northeast Dianli University, Jilin 132012, Jilin Province, China;
    2. Jilin Power Supply Company, State Grid Jilin Province Electric Power Supply Company, Jilin 132000, Jilin Province, China;
    3. Jilin Fengman Power Plant, Jilin 132012, Jilin Province, China;
    4. Jilin Branch, China Mobile Communications Group Jilin Co., Ltd., Jilin 132012, Jilin Province, China
Zhou Zilan, master's student; research interests: text information processing and text visualization; E-mail: 1422076216@qq.com

Received date: 2016-10-02

  Revised date: 2017-03-07

  Online published: 2017-09-30

Funding

Supported by the National Natural Science Foundation of China (No.51277023) and a project fund of the Science and Technology Department of Jilin Province (No.20150307020GX)

Cite this article

Guo Xiaoli, Zhou Zilan, Liu Yaowei, Du Jianhong, Huang Yan. Analysis of news topic evolution based on DTS-ILDA model and association filtering[J]. Journal of Applied Sciences, 2017, 35(5): 634-646. DOI: 10.3969/j.issn.0255-8297.2017.05.009

Abstract

In topic evolution tracking, topic models use a fixed time-slice size and a fixed number of topics K, which makes it hard to locate important temporal turning points. To address this, we propose a dynamic temporal segmentation-infinite latent Dirichlet allocation (DTS-ILDA) model. To address the erroneous topic associations that evolution analysis tends to produce, we also propose an association filtering mechanism. Topics are first extracted with the DTS-ILDA model, which combines an improved dynamic time segmentation algorithm with the infinite latent Dirichlet allocation (ILDA) model. The dynamic time segmentation algorithm traverses the data set in chronological order and uses a contingency table to analyze the topic distributions of adjacent time slices, measuring the quality of each segmentation and thereby finding suitable split points. The ILDA model extracts a different number of topics in each time slice, and the extracted topics then undergo evolutionary association analysis. The association filtering method removes weak associations, and finally five types of evolution relationship graphs are built for the subtopics from the remaining associations, following their time order. Experiments show that the method effectively finds the time points at which topic content changes substantially, prevents meaningless topics from being generated, reduces interference from erroneous topic associations, and uncovers accurate deep relationships between topics.
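The two mechanisms described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify which contingency-table statistic, similarity measure, or threshold is used, so the Pearson chi-square statistic, cosine similarity, and the threshold value here are assumptions chosen for illustration. The function names (`chi_square`, `best_split`, `filter_associations`) are hypothetical, and the per-slice topic counts and topic-word distributions are assumed to come from a topic model such as ILDA, which is not implemented here.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows).

    Assumes the table contains at least one non-zero count.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # skip empty rows/columns
                stat += (observed - expected) ** 2 / expected
    return stat


def best_split(slices):
    """Return (boundary, scores) for the boundary with the largest statistic.

    slices[t][k] is the number of documents in time slice t assigned to
    topic k (assumed input). Boundary b separates slices[b - 1] from
    slices[b]; a larger statistic means the topic distribution changes more.
    """
    scores = {
        b: chi_square([slices[b - 1], slices[b]])
        for b in range(1, len(slices))
    }
    return max(scores, key=scores.get), scores


def cosine(p, q):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def filter_associations(earlier, later, threshold=0.5):
    """Keep (i, j, similarity) links whose similarity reaches the threshold.

    earlier and later are lists of topic-word distributions from two
    consecutive time slices; the threshold is an illustrative assumption.
    """
    links = []
    for i, p in enumerate(earlier):
        for j, q in enumerate(later):
            sim = cosine(p, q)
            if sim >= threshold:
                links.append((i, j, sim))
    return links


if __name__ == "__main__":
    # Toy topic counts for four consecutive time slices (made-up numbers).
    counts = [[30, 10, 5], [28, 12, 6], [5, 10, 35], [6, 9, 33]]
    boundary, scores = best_split(counts)
    print("largest change at boundary", boundary)

    # Toy topic-word distributions for two consecutive slices.
    earlier = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
    later = [[0.6, 0.3, 0.1], [0.3, 0.4, 0.3]]
    for i, j, sim in filter_associations(earlier, later, threshold=0.8):
        print(f"topic {i} -> topic {j}: similarity {sim:.3f}")
```

In this sketch a high chi-square value between adjacent slices marks a candidate split point, and raising the threshold in `filter_associations` discards more of the weaker topic links before any evolution graph is built.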

