Journal of Applied Sciences ›› 2019, Vol. 37 ›› Issue (3): 378-388. doi: 10.3969/j.issn.0255-8297.2019.03.008

• Signal and Information Processing •

Chinese FastText Short Text Classification Method Integrating TF-IDF and LDA

FENG Yong1, QU Bohao1, XU Hongyan1, WANG Rongbing1, ZHANG Yonggang2

  1. College of Information, Liaoning University, Shenyang 110036, China;
  2. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
  • Received: 2018-09-28 Revised: 2018-10-29 Online: 2019-05-31 Published: 2019-05-31
  • Corresponding author: WANG Rongbing, associate professor; research interests: data mining, big data technology. E-mail: wrb@lnu.edu.cn
  • Funding:
    Supported by the National Natural Science Foundation of China (No.71771110); the China Postdoctoral Science Foundation (No.2018M631814); the Social Science Planning Fund of Liaoning Province (No.L18AGL007); the Project Fund of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education (No.93K172018K01)

Abstract: The FastText text classification model is fast and efficient, but applying it directly to Chinese short text classification yields low precision. To address this problem, a Chinese FastText short text classification method integrating term frequency-inverse document frequency (TF-IDF) and latent Dirichlet allocation (LDA) is proposed. In the input stage of the FastText text classification model, the dictionary produced by n-gram processing is filtered by TF-IDF, the topics of the corpus are analyzed with an LDA model, and the feature dictionary is supplemented according to the results. As a result, the mean of the input word sequence vectors is biased toward highly discriminative terms, which makes the model better suited to Chinese short text classification. Comparative experiments show that the proposed method achieves higher precision in Chinese short text classification.

Key words: Chinese short text classification, FastText, term frequency-inverse document frequency (TF-IDF), word vector, latent Dirichlet allocation (LDA)
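The input-stage idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `keep` cutoff, the toy corpus, and the use of each term's maximum TF-IDF score as the filtering criterion are all assumptions made for illustration. The LDA step is not run here; `topic_words` is a hypothetical stand-in for the words an LDA topic analysis would return, and the texts are assumed to be already word-segmented.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=2):
    """Return all word n-grams of length 1..n_max, joined with '_'."""
    return ["_".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf_scores(docs):
    """Map each term to its highest TF-IDF score over the corpus."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    best = {}
    for doc in docs:
        for term, count in Counter(doc).items():
            score = (count / len(doc)) * math.log(n_docs / df[term])
            best[term] = max(best.get(term, 0.0), score)
    return best

def build_dictionary(docs, topic_words, keep):
    """Keep the `keep` highest-scoring terms, then add the topic words."""
    scores = tfidf_scores(docs)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return set(ranked[:keep]) | set(topic_words)

def filter_features(doc, dictionary):
    """Drop the n-grams of one document that are not in the dictionary."""
    return [term for term in doc if term in dictionary]

# Toy corpus of pre-segmented short texts (illustrative data only).
segmented = [
    ["股市", "今天", "上涨"],   # stock market / today / rises
    ["球队", "今天", "获胜"],   # team / today / wins
    ["股市", "今天", "下跌"],   # stock market / today / falls
]
docs = [ngrams(tokens) for tokens in segmented]

# Hypothetical LDA output: words assumed to characterize a "finance" topic.
topic_words = ["股市"]

dictionary = build_dictionary(docs, topic_words, keep=8)
print(filter_features(docs[0], dictionary))
```

In this toy corpus "今天" occurs in every text, so its TF-IDF score is zero and it is filtered out, while "股市" falls below the cutoff and is restored through the topic word list. A FastText-style classifier would then average the vectors of the retained features only, which is how the mean becomes biased toward more discriminative terms.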
