Journal of Applied Sciences ›› 2019, Vol. 37 ›› Issue (3): 378-388. doi: 10.3969/j.issn.0255-8297.2019.03.008

• Signal and Information Processing •

Chinese FastText Short Text Classification Method Integrating TF-IDF and LDA

FENG Yong1, QU Bohao1, XU Hongyan1, WANG Rongbing1, ZHANG Yonggang2   

  1. College of Information, Liaoning University, Shenyang 110036, China;
    2. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
  • Received: 2018-09-28 Revised: 2018-10-29 Online: 2019-05-31 Published: 2019-05-31

Abstract: The FastText text classification model is fast and efficient, but its precision is low when it is applied to Chinese short text classification. To address this problem, a Chinese FastText short text classification method integrating TF-IDF and LDA is proposed. In the input phase of the FastText model, the dictionary generated by n-gram processing is filtered using TF-IDF, the corpus is analyzed for topics with an LDA model, and the feature dictionary is then supplemented according to the topic analysis results. As a result, the computation of the mean of the input word sequence vectors is biased toward highly discriminative entries, which makes the model better suited to Chinese short text classification. Experimental results show that the proposed method achieves higher precision in Chinese short text classification.
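The dictionary-building step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the TF-IDF formula, the filtering threshold, or the LDA settings, so the function names, the `keep_ratio` parameter, the use of word-level n-grams over pre-segmented text, and the hand-written `topic_words` list (a stand-in for the output of a trained LDA model) are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_ngrams(tokens, n_max=2):
    """Return all word n-grams of length 1..n_max, joined with '_'."""
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append("_".join(tokens[i:i + n]))
    return grams


def tfidf_scores(docs):
    """Return, for each n-gram, the highest TF-IDF value it reaches in any document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for grams in docs:
        doc_freq.update(set(grams))
    best = {}
    for grams in docs:
        for gram, count in Counter(grams).items():
            tf = count / len(grams)
            idf = math.log(n_docs / doc_freq[gram])
            best[gram] = max(best.get(gram, 0.0), tf * idf)
    return best


def build_feature_dictionary(docs, topic_words, keep_ratio=0.5):
    """Keep the top-scoring n-grams by TF-IDF, then add the topic words.

    topic_words is assumed to come from an LDA model trained elsewhere;
    keep_ratio is a hypothetical threshold, not a value from the paper.
    """
    scores = tfidf_scores(docs)
    ranked = sorted(scores, key=lambda g: (-scores[g], g))
    kept = set(ranked[:max(1, int(len(ranked) * keep_ratio))])
    return kept | set(topic_words)


# Toy corpus of already-segmented short texts (illustrative data only).
corpus = [
    ["股市", "今日", "上涨"],
    ["球队", "今日", "夺冠"],
    ["股市", "今日", "下跌"],
]
docs = [word_ngrams(tokens) for tokens in corpus]

# "股市" is a hand-picked stand-in for a word that LDA might rank highly in a topic.
features = build_feature_dictionary(docs, topic_words=["股市"], keep_ratio=0.8)
print(sorted(features))
```

In this toy corpus, "今日" occurs in every text, so its IDF is zero and TF-IDF filtering drops it. "股市" is also filtered out because it occurs in two of the three texts, but it is restored through the topic-word list, which mirrors how the abstract describes supplementing the filtered dictionary with the results of topic analysis.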

Key words: Chinese short text classification, FastText, term frequency-inverse document frequency (TF-IDF), word vector, latent Dirichlet allocation (LDA)
