Journal of Applied Sciences, 2021, Vol. 39, Issue (4): 569-580. doi: 10.3969/j.issn.0255-8297.2021.04.005

• Special Issue on CCF NCCA 2020 •

Accurately Identify Zombie Enterprises Based on Decision Tree-Logistic Regression Model

WU Dongpeng1, WANG Zheng1, TONG Wei1, YE Feng1, SONG Chuqiao2   

  1. School of Computer and Information, Hohai University, Nanjing 211100, Jiangsu, China;
    2. School of Business, Hohai University, Nanjing 211100, Jiangsu, China
  • Received: 2020-08-26 Published: 2021-08-04

Abstract: To address the problem of accurately identifying zombie enterprises, this paper proposes an identification method based on a decision tree-logistic regression model, using the enterprise information data set published by Hunan Kechuang Information Co., Ltd. The method fills missing values and replaces outliers with the median, analyzes the data set to derive new features, and then screens features using multiple linear regression and the chi-square test. To verify its effectiveness, the method is compared with the over-borrowing method, the continuous loss method, the random forest algorithm, the BP neural network algorithm, and the XGBoost algorithm in both the Alibaba Cloud environment and a local environment. Each model is trained 50 times, with the data for each run randomly sampled according to a given proportion, and the average of each metric is reported as the final result. The experimental results show that the proposed decision tree-logistic regression model achieves the highest accuracy in identifying zombie enterprises, 99.98%, and runs faster than the other ensemble models, with an average execution time of about 1.5 s. Its results differ only slightly across all scenarios, which supports the effectiveness and stability of the model.
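
As a rough illustration of two steps named in the abstract, median filling and averaging a metric over repeated random splits, the following Python sketch may help. It is not the authors' implementation: the function names `fill_with_median` and `repeated_holdout`, the `is_outlier` rule, the 0.8 training ratio, and the `fit` callback are all assumptions introduced here, since the abstract does not specify the outlier criterion or the split proportion.

```python
import random
import statistics


def fill_with_median(values, is_outlier=None):
    """Replace missing entries (None) and flagged outliers with the median.

    The median is computed from the remaining valid values only.
    `is_outlier` is a hypothetical user-supplied rule; the abstract does
    not say how outliers are detected.
    """
    def bad(v):
        return v is None or (is_outlier is not None and is_outlier(v))

    valid = [v for v in values if not bad(v)]
    median = statistics.median(valid)
    return [median if bad(v) else v for v in values]


def repeated_holdout(rows, labels, fit, n_runs=50, train_ratio=0.8, seed=0):
    """Average test accuracy over `n_runs` random train/test splits.

    `fit(train_rows, train_labels)` must return a function that maps one
    row to a predicted label. The 0.8 ratio is an assumed placeholder.
    """
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    n_train = int(len(rows) * train_ratio)
    accuracies = []
    for _ in range(n_runs):
        rng.shuffle(indices)  # new random split for every run
        train, test = indices[:n_train], indices[n_train:]
        predict = fit([rows[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(rows[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return statistics.mean(accuracies)
```

Any classifier can be plugged in through `fit`, so the same loop could compare several models under identical sampling; fixing `seed` keeps the runs reproducible.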

Key words: zombie enterprise, machine learning, feature engineering, decision tree-logistic regression
