Text Categorization Based on Concept Knowledge

Journal of Applied Sciences, 2013, Vol. 31, Issue 2: 197-203. doi: 10.3969/j.issn.0255-8297.2013.02.015

DING Ze-ya (1,2), ZHANG Quan (1)

1. Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China
2. Graduate University of Chinese Academy of Sciences, Beijing 100039, China

Received: 2011-08-26 Revised: 2012-01-08 Online: 2013-03-25 Published: 2012-01-08
Corresponding author: DING Ze-ya, Ph.D. candidate; research interests: text categorization, HNC theory; E-mail: zeya.ding@gmail.com
About the authors: DING Ze-ya, Ph.D. candidate; research interests: text categorization, HNC theory; E-mail: zeya.ding@gmail.com. ZHANG Quan, research professor and doctoral supervisor; research interests: natural language processing, HNC theory, etc.; E-mail: zhq@mail.ioa.ac.cn
Funding: Supported by the National High Technology Research and Development Program of China ("863" Program) (No. 2012AA011102); the "Twelfth Five-Year Plan" research project of the State Language Commission (No. YB125-53); the Knowledge Innovation Program of the Institute of Acoustics, Chinese Academy of Sciences (No. Y154141431); and the consulting project of the Academic Divisions of the Chinese Academy of Sciences (No. Y129091211).

Abstract

Abstract: Statistical methods cannot categorize text on the basis of semantic understanding. To address this problem, this paper proposes a text categorization method that uses concept knowledge from the hierarchical network of concepts (HNC). The method has two parts: concept-based feature selection and categorization by category relatedness degree. For feature selection, category key concepts are identified by computing the discrimination degree between concepts and categories, and these key concepts are then used to refine the set of feature terms. Based on the category semantic information associated with the category key concepts, a method is proposed for computing the relatedness degree between a document and a category, and this relatedness degree determines the category assigned to the document. Experiments show that the method effectively reduces the dimensionality of the feature space and improves categorization efficiency while preserving effectiveness, with a slight increase in F1. Compared with SVM, KNN, and Bayes classifiers, the method achieves clearly higher F1 values than all three when the number of features is small. In overall performance it is comparable to SVM and better than KNN and Bayes.

Keywords: text categorization, hierarchical network of concepts, concept, concept discrimination, category relatedness
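
The abstract does not state the formulas for concept discrimination or category relatedness, so the following Python sketch only illustrates the two-stage idea under explicit assumptions. Documents are assumed to be already mapped to concept labels (the HNC mapping itself is not shown, and the labels here are placeholders). The discrimination score is a stand-in: the share of in-category documents containing a concept minus the share of out-of-category documents containing it. The relatedness degree is assumed to be a length-normalized sum of key-concept weights. All function names and the `top_m` parameter are hypothetical.

```python
from collections import Counter


def discrimination_scores(docs, labels):
    """Score each (category, concept) pair.

    Stand-in measure (an assumption, not the paper's formula): fraction of
    in-category documents containing the concept minus the fraction of
    out-of-category documents containing it.
    """
    categories = sorted(set(labels))
    n_docs = Counter(labels)                  # number of documents per category
    df = {k: Counter() for k in categories}   # per-category document frequency
    for concepts, label in zip(docs, labels):
        df[label].update(set(concepts))
    total_df = Counter()
    for k in categories:
        total_df.update(df[k])
    scores = {}
    for k in categories:
        n_in = n_docs[k]
        n_out = len(docs) - n_in
        scores[k] = {}
        for c in total_df:
            p_in = df[k][c] / n_in
            p_out = (total_df[c] - df[k][c]) / n_out if n_out else 0.0
            scores[k][c] = p_in - p_out
    return scores


def key_concepts(scores, top_m=2):
    """Keep the top_m positively scored concepts per category as weights."""
    selected = {}
    for k, concept_scores in scores.items():
        ranked = sorted(concept_scores.items(), key=lambda kv: (-kv[1], kv[0]))
        selected[k] = {c: s for c, s in ranked[:top_m] if s > 0}
    return selected


def relatedness(doc, weights):
    """Length-normalized sum of key-concept weights found in the document."""
    if not doc:
        return 0.0
    counts = Counter(doc)
    return sum(counts[c] * w for c, w in weights.items()) / len(doc)


def classify(doc, selected):
    """Assign the category with the highest relatedness degree."""
    return max(sorted(selected), key=lambda k: relatedness(doc, selected[k]))


# Toy data: each document is a list of placeholder concept labels.
train_docs = [
    ["match", "team", "score"],
    ["team", "coach", "match"],
    ["stock", "market", "price"],
    ["market", "bank", "price"],
]
train_labels = ["sports", "sports", "finance", "finance"]

selected = key_concepts(discrimination_scores(train_docs, train_labels))
prediction = classify(["team", "match", "price"], selected)
print(prediction)  # sports
```

In this toy run, only two key concepts per category survive selection, which mirrors the dimensionality reduction the abstract describes: the classifier ignores every concept outside the selected sets.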

CLC number: TP391

Cite this article: DING Ze-ya, ZHANG Quan. Text Categorization Based on Concept Knowledge[J]. Journal of Applied Sciences, 2013, 31(2): 197-203.

