The FastText text classification model is fast and efficient, but it achieves low precision when applied to Chinese short text classification. To address this problem, a Chinese FastText short text classification method integrating TF-IDF and LDA is proposed. In the input phase of the FastText model, the dictionary generated by n-gram processing is filtered with TF-IDF, the corpus is analyzed for topics with an LDA model, and the feature dictionary is then supplemented according to the results. As a result, the mean of the input word sequence vectors is biased toward highly discriminative entries, which makes the model better suited to Chinese short text classification. Experimental results show that the proposed method achieves higher precision in Chinese short text classification.
FENG Yong, QU Bohao, XU Hongyan, WANG Rongbing, ZHANG Yonggang. Chinese FastText Short Text Classification Method Integrating TF-IDF and LDA[J]. Journal of Applied Sciences, 2019, 37(3): 378-388.
DOI: 10.3969/j.issn.0255-8297.2019.03.008
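The TF-IDF filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the scoring formula, the n-gram order, or the cut-off, so character bigrams, a max-over-documents TF-IDF score, and a fixed `keep` count are all assumptions, and the function names `char_ngrams` and `tfidf_filter` are hypothetical. The LDA-based supplement of the dictionary is omitted here.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return the character n-grams of a string (assumed tokenization)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_filter(docs, n=2, keep=5):
    """Keep the `keep` n-grams with the highest TF-IDF score.

    Each n-gram is scored by its maximum TF-IDF over all documents
    (an assumption; the abstract does not specify the aggregation).
    """
    grams_per_doc = [char_ngrams(d, n) for d in docs]

    # Document frequency: number of documents containing each n-gram.
    df = Counter()
    for grams in grams_per_doc:
        df.update(set(grams))

    num_docs = len(docs)
    best = {}
    for grams in grams_per_doc:
        if not grams:
            continue
        tf = Counter(grams)
        for gram, count in tf.items():
            score = (count / len(grams)) * math.log(num_docs / df[gram])
            best[gram] = max(best.get(gram, 0.0), score)

    # Sort by descending score; break ties by the n-gram itself.
    ranked = sorted(best, key=lambda g: (-best[g], g))
    return ranked[:keep]


# Toy corpus of three short texts (illustrative data, not from the paper).
docs = ["足球比赛", "篮球比赛", "股票市场"]
print(tfidf_filter(docs, n=2, keep=5))
```

In this toy corpus, bigrams shared by two documents score lower than bigrams unique to one document, so the shared ones are the first to be filtered out of the dictionary.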