Chinese FastText Short Text Classification Method Integrating TF-IDF and LDA

Journal of Applied Sciences, 2019, Vol. 37, Issue (3): 378-388. doi: 10.3969/j.issn.0255-8297.2019.03.008

FENG Yong¹, QU Bohao¹, XU Hongyan¹, WANG Rongbing¹, ZHANG Yonggang²

1. College of Information, Liaoning University, Shenyang 110036, China;
2. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China

Received: 2018-09-28 Revised: 2018-10-29 Online: 2019-05-31 Published: 2019-05-31
Corresponding author: WANG Rongbing, associate professor; research interests: data mining and big data technology. E-mail: wrb@lnu.edu.cn
Funding:
Supported by the National Natural Science Foundation of China (No. 71771110); the China Postdoctoral Science Foundation (No. 2018M631814); the Liaoning Provincial Social Science Planning Fund (No. L18AGL007); and the Project Fund of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education (No. 93K172018K01).

Abstract

The FastText text classification model is fast and efficient, but applying it directly to Chinese short text classification yields low precision. To address this, a Chinese FastText short text classification method integrating term frequency-inverse document frequency (TF-IDF) and latent Dirichlet allocation (LDA) is proposed. In the input stage of the FastText text classification model, the dictionary produced by n-gram processing is filtered by TF-IDF, the corpus is analyzed for topics with an LDA model, and the feature dictionary is supplemented according to the results. As a result, the mean of the input word sequence vectors is biased toward highly discriminative entries, which makes the model better suited to Chinese short text classification. Comparative experiments show that the proposed method achieves higher precision in Chinese short text classification.

Key words: FastText, word vector, Chinese short text classification, term frequency-inverse document frequency (TF-IDF), latent Dirichlet allocation (LDA)
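
The input-stage pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract gives no formulas, so the TF-IDF variant, the `min_score` threshold, and all function names are hypothetical. The LDA step is not run here; `topic_terms` is a hand-written stand-in for the terms an LDA topic analysis might return, and `embeddings` is a toy lookup table rather than a trained FastText model.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all 1..n_max-grams of a token list, joined with a space."""
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf_dictionary(docs, n_max=2, min_score=0.1):
    """Keep n-grams whose best per-document TF-IDF exceeds min_score (assumed rule)."""
    doc_grams = [ngrams(d, n_max) for d in docs]
    df = Counter()
    for grams in doc_grams:
        df.update(set(grams))  # document frequency of each n-gram
    n_docs = len(docs)
    best = {}
    for grams in doc_grams:
        for term, count in Counter(grams).items():
            score = (count / len(grams)) * math.log(n_docs / df[term])
            best[term] = max(best.get(term, 0.0), score)
    return {term for term, score in best.items() if score > min_score}


def build_features(tokens, dictionary, topic_terms, n_max=2):
    """Keep n-grams that pass the TF-IDF filter or were added from topic_terms."""
    allowed = dictionary | topic_terms
    return [g for g in ngrams(tokens, n_max) if g in allowed]


def mean_vector(features, embeddings):
    """Average the vectors of the features that have an embedding."""
    vectors = [embeddings[f] for f in features if f in embeddings]
    if not vectors:
        return []
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy corpus of pre-segmented short texts (stock market / sports headlines).
docs = [
    ["股市", "今日", "上涨"],
    ["球队", "今日", "获胜"],
    ["股市", "今日", "下跌"],
]
dictionary = tfidf_dictionary(docs)
topic_terms = {"股市"}  # hypothetical LDA output, hard-coded for illustration
features = build_features(docs[0], dictionary, topic_terms)
embeddings = {"股市": [1.0, 0.0], "上涨": [0.0, 1.0]}  # toy vectors
doc_vector = mean_vector(features, embeddings)
print(features, doc_vector)
```

In this toy corpus the word "今日" occurs in every document, so its TF-IDF score is zero and it is dropped, while "股市" falls below the threshold and is restored only through `topic_terms`. The averaged vector therefore depends only on the more discriminative entries, which is the effect the abstract attributes to the method.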

CLC number: TP311

FENG Yong, QU Bohao, XU Hongyan, WANG Rongbing, ZHANG Yonggang. Chinese FastText Short Text Classification Method Integrating TF-IDF and LDA[J]. Journal of Applied Sciences, 2019, 37(3): 378-388.

