Traditional feature selection algorithms focus only on feature relevance and redundancy and ignore interactions between features. To address this problem, this paper proposes a hybrid feature selection algorithm based on mutual information (MIHFS). The algorithm uses the classification accuracy of the K-nearest neighbor (KNN) algorithm as the index for evaluating the classification performance of the selected features; it removes redundant and irrelevant features while retaining features that interact with one another. To evaluate its performance, MIHFS is compared with seven other feature selection algorithms, including minimal redundancy maximal relevance (mRMR) and joint mutual information (JMI), on eight datasets in terms of classification accuracy, number of selected features, and algorithm stability. The experimental results show that MIHFS is highly stable, effectively reduces the dimension of the feature space, and selects features whose classification performance is clearly better than that of the other algorithms. Finally, MIHFS is combined with the grey relational analysis (GRA) method and the technique for order preference by similarity to ideal solution (TOPSIS) and applied to the geological evaluation of the first member of the Dainan Formation in the Yong’an area of the Gaoyou Sag. The evaluation accuracy is 80%, which is largely consistent with actual drilling results, indicating that the method is reliable and can effectively guide oil and gas geological evaluation.
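The idea summarized above, scoring features by relevance plus interaction and then checking each candidate against KNN accuracy, can be illustrated with a minimal sketch. This is not the MIHFS algorithm itself, whose exact criterion the abstract does not give. The scoring rule (relevance plus summed interaction gain), the leave-one-out 1-NN evaluation with Hamming distance, the toy XOR data, and all function names below are assumptions made for illustration.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (bits) of a sequence of discrete values."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for discrete sequences."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def interaction_gain(x, y, c):
    """I(X,Y;C) - I(X;C) - I(Y;C); positive values indicate synergy."""
    joint = list(zip(x, y))
    return (mutual_information(joint, c)
            - mutual_information(x, c)
            - mutual_information(y, c))

def knn_loo_accuracy(columns, labels, k=1):
    """Leave-one-out k-NN accuracy using Hamming distance (assumed metric)."""
    rows = list(zip(*columns))
    n = len(labels)
    correct = 0
    for i in range(n):
        # Sort the other samples by (distance, index) and keep the k nearest.
        neighbours = sorted(
            (sum(a != b for a, b in zip(rows[i], rows[j])), j)
            for j in range(n) if j != i
        )[:k]
        votes = Counter(labels[j] for _, j in neighbours)
        correct += votes.most_common(1)[0][0] == labels[i]
    return correct / n

def select_features(features, labels, k=1):
    """Hypothetical hybrid loop: rank by an information score (filter step),
    keep a candidate only if k-NN accuracy improves (wrapper step)."""
    selected, best_acc = [], 0.0
    remaining = list(range(len(features)))
    while remaining:
        def score(j):
            relevance = mutual_information(features[j], labels)
            interaction = sum(
                interaction_gain(features[j], features[s], labels)
                for s in selected
            )
            return relevance + interaction
        candidate = max(remaining, key=score)
        remaining.remove(candidate)
        subset = [features[s] for s in selected + [candidate]]
        acc = knn_loo_accuracy(subset, labels, k)
        if acc > best_acc:
            selected.append(candidate)
            best_acc = acc
    return selected, best_acc

# Toy data (made up): the label is f0 XOR f1, so neither feature is
# relevant alone, but together they determine the class.
f0 = [0, 0, 1, 1, 0, 0, 1, 1]
f1 = [0, 1, 0, 1, 0, 1, 0, 1]
f2 = list(f0)                     # redundant copy of f0
f3 = [0, 0, 0, 0, 1, 1, 1, 1]     # irrelevant feature
labels = [a ^ b for a, b in zip(f0, f1)]

selected, accuracy = select_features([f0, f1, f2, f3], labels)
print("selected feature indices:", selected, "accuracy:", accuracy)
```

On this toy data, a ranking based only on relevance and redundancy would see no difference between f0, f1, and the irrelevant f3, since each has zero mutual information with the label. The interaction term is what singles out the pair (f0, f1), and the accuracy check then rejects both the redundant copy and the irrelevant feature.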