Traditional feature selection algorithms consider only feature relevance and feature redundancy, ignoring interactions among features. This paper proposes a hybrid feature selection algorithm based on mutual information (MIHFS). The algorithm uses the classification accuracy of the K-nearest neighbor (KNN) classifier as the evaluation index for candidate feature subsets, effectively removing redundant and irrelevant features while retaining interactive features. To evaluate its performance, MIHFS is compared with seven other feature selection algorithms, including minimal redundancy maximal relevance (mRMR) and joint mutual information (JMI), on eight datasets in terms of classification accuracy, number of selected features, and stability. Experimental results show that MIHFS is highly stable: it not only reduces the dimensionality of the feature space effectively but also achieves better classification performance than the other feature selection algorithms. Finally, combined with the grey relational analysis (GRA) and technique for order preference by similarity to ideal solution (TOPSIS) methods, MIHFS is applied to the geological evaluation of the first member of the Dainan Formation in the Yong'an area, Gaoyou Sag. The evaluation accuracy reaches 80% with high reliability, which is basically consistent with actual drilling results and demonstrates the effectiveness of MIHFS in oil and gas geological evaluation.
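The hybrid filter-wrapper idea described above can be illustrated with a minimal sketch: features are first ranked by mutual information with the class label (filter stage), then greedily retained only if they improve KNN cross-validated accuracy (wrapper stage). The exact MIHFS selection criterion, in particular how it scores feature interaction, is not specified here, so this is a generic illustrative approximation using scikit-learn, not the authors' algorithm; the function name `hybrid_mi_knn_select` and all parameter choices are assumptions.

```python
# Illustrative hybrid (filter + wrapper) feature selection sketch.
# Filter: rank features by mutual information with the label.
# Wrapper: keep a feature only if it raises KNN cross-validated accuracy.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def hybrid_mi_knn_select(X, y, cv=5, random_state=0):
    """Greedy forward selection over an MI-ranked feature ordering."""
    mi = mutual_info_classif(X, y, random_state=random_state)
    order = np.argsort(mi)[::-1]  # most informative feature first
    knn = KNeighborsClassifier(n_neighbors=5)
    selected, best_acc = [], 0.0
    for f in order:
        trial = selected + [f]
        acc = cross_val_score(knn, X[:, trial], y, cv=cv).mean()
        if acc > best_acc:  # keep the feature only if KNN accuracy improves
            selected, best_acc = trial, acc
    return selected, best_acc

X, y = load_iris(return_X_y=True)
subset, acc = hybrid_mi_knn_select(X, y)
```

Because the wrapper stage scores whole subsets rather than individual features, a feature that is weak on its own can still be kept when it helps in combination with already-selected features, which is the intuition behind preserving interactive features.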