CCF NCCA 2020 Special Issue

Accurately Identify Zombie Enterprises Based on Decision Tree-Logistic Regression Model

  • 1. School of Computer and Information, Hohai University, Nanjing 211100, Jiangsu, China;
    2. School of Business, Hohai University, Nanjing 211100, Jiangsu, China

Received date: 2020-08-26

Online published: 2021-08-04

Funding

Supported by the Fundamental Research Funds for the Central Universities (No. B200202185); the "Six Talent Peaks" Project of Jiangsu Province (No. XYDXX-078); and the Natural Science Foundation of the Jiangsu Higher Education Institutions (No. 19KJB630006)

Cite this article

WU Dongpeng, WANG Zheng, TONG Wei, YE Feng, SONG Chuqiao. Accurately Identify Zombie Enterprises Based on Decision Tree-Logistic Regression Model[J]. Journal of Applied Sciences, 2021, 39(4): 569-580. DOI: 10.3969/j.issn.0255-8297.2021.04.005

Abstract

To address the problem of accurately identifying zombie enterprises, this paper proposes an identification method based on a decision tree-logistic regression model, using the enterprise information data set published by Hunan Kechuang Information Co., Ltd. The method fills missing values and outliers with the median, analyzes the data set and derives new features, and then completes feature selection using multiple linear regression, the chi-square test, and related methods. To verify its effectiveness, the method is compared with the over-borrowing method, the continuous-loss method, the random forest algorithm, the BP neural network algorithm, and the XGBoost algorithm in both an Alibaba Cloud environment and a local environment. Each model is trained 50 times, each time on data randomly selected according to a given proportion, and the average of each metric is taken as the final result. The experimental results show that the proposed decision tree-logistic regression model achieves the highest accuracy in identifying zombie enterprises, reaching 99.98%, and that it runs considerably faster than the ensemble models, with an average execution time of about 1.5 s. The results differ little across the experimental environments, which verifies the effectiveness and stability of the model.
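The preprocessing and evaluation protocol described in the abstract can be illustrated with a minimal standard-library sketch. It is not the authors' implementation and does not include the decision tree-logistic regression model itself. Several details are assumptions, because the abstract does not state them: outliers are detected with a 1.5 × IQR fence, the test share is 30%, and `median_fill`, `mean_accuracy`, and the `fit` callback are hypothetical names introduced only for illustration.

```python
import random
import statistics


def median_fill(values, k=1.5):
    """Replace missing entries (None) and outliers with the median.

    Assumption: a value is an outlier if it lies outside the k * IQR
    fences; the abstract does not say which outlier rule was used.
    """
    observed = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    center = statistics.median(observed)
    return [
        center if v is None or not (low <= v <= high) else v
        for v in values
    ]


def mean_accuracy(rows, labels, fit, runs=50, test_ratio=0.3, seed=0):
    """Average test accuracy over `runs` random train/test splits.

    `fit(train_rows, train_labels)` is a hypothetical callback that
    returns a function mapping one row to a predicted label.
    The 30% test share is an assumed value.
    """
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    n_test = max(1, round(len(rows) * test_ratio))
    cut = len(rows) - n_test
    scores = []
    for _ in range(runs):
        rng.shuffle(indices)  # draw a fresh random split for this run
        train, test = indices[:cut], indices[cut:]
        predict = fit([rows[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(rows[i]) == labels[i] for i in test)
        scores.append(hits / len(test))
    return statistics.mean(scores)
```

In practice, `median_fill` would be applied to each numeric column before feature derivation, and `fit` would wrap whichever classifier is being compared, so that every model is scored under the same 50-run averaging scheme.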
