Journal of Applied Sciences ›› 2021, Vol. 39 ›› Issue (4): 569-580. doi: 10.3969/j.issn.0255-8297.2021.04.005

• CCF NCCA 2020 Special Issue •

Accurately Identify Zombie Enterprises Based on Decision Tree-Logistic Regression Model

WU Dongpeng1, WANG Zheng1, TONG Wei1, YE Feng1, SONG Chuqiao2

  1. School of Computer and Information, Hohai University, Nanjing 211100, Jiangsu, China;
  2. School of Business, Hohai University, Nanjing 211100, Jiangsu, China
  • Received: 2020-08-26 Published: 2021-08-04
  • Corresponding author: YE Feng, Ph.D., lecturer; research interest: big data. E-mail: yefeng1022@hhu.edu.cn
  • Funding:
    Supported by the Fundamental Research Funds for the Central Universities (No.B200202185), the Jiangsu Province "Six Talent Peaks" Project (No.XYDXX-078), and the Natural Science Foundation of the Higher Education Institutions of Jiangsu Province (No.19KJB630006)

Abstract: To identify zombie enterprises accurately, this paper proposes a decision tree-logistic regression identification method built on the enterprise information data set published by Hunan Kechuang Information Co., LTD. The method fills missing values and outliers with the median, analyzes the data set to derive new features, and then screens features using multiple linear regression, the chi-square test, and related methods. To verify its effectiveness, the method is compared with the over-borrowing method, the continuous loss method, the random forest algorithm, the BP neural network algorithm, and the XGBoost algorithm in both an Alibaba Cloud environment and a local environment. Each model is trained 50 times, each run draws its data at random according to a given proportion, and the mean of each metric over the runs is reported as the final result. The results show that the proposed decision tree-logistic regression model achieves the highest accuracy in identifying zombie enterprises, 99.98%, and runs considerably faster than the ensemble models, with an average execution time of about 1.5 s. The results differ only slightly across the experimental environments, which supports the effectiveness and stability of the model.

Key words: zombie enterprise, machine learning, feature engineering, decision tree-logistic regression
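
The preprocessing step and the evaluation protocol described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The abstract does not say how outliers are detected or what split proportion is used, so the 1.5×IQR rule, the 30% test ratio, and the names `fill_with_median`, `average_over_runs`, and `majority_baseline_accuracy` are all assumptions made for illustration. The majority-class scorer is only a stand-in for a real model such as the decision tree-logistic regression classifier.

```python
import random
import statistics


def fill_with_median(values, k=1.5):
    """Replace missing entries (None) and outliers with the median.

    Assumption: a value is an outlier if it lies more than k * IQR
    outside the quartiles. The abstract does not specify the rule.
    """
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    q1, _, q3 = statistics.quantiles(observed, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [
        median if v is None or not (low <= v <= high) else v
        for v in values
    ]


def majority_baseline_accuracy(train, test):
    """Stand-in scorer: predict the most frequent training label.

    Each sample is a (features, label) tuple. A real experiment would
    fit the model under study on `train` and score it on `test`.
    """
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return sum(label == majority for _, label in test) / len(test)


def average_over_runs(samples, score_fn, runs=50, test_ratio=0.3, seed=0):
    """Average a metric over repeated random train/test splits.

    Mirrors the protocol in the abstract: 50 runs, each with a fresh
    random split, and the mean metric reported. The 30% test ratio is
    an assumed value.
    """
    rng = random.Random(seed)
    n_test = max(1, int(len(samples) * test_ratio))
    scores = []
    for _ in range(runs):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        scores.append(score_fn(train, test))
    return statistics.mean(scores)


if __name__ == "__main__":
    # Hypothetical feature column with one missing value and one outlier.
    column = [10, 12, None, 11, 13, 500, 12, 11, 10, 13]
    print(fill_with_median(column))

    # Hypothetical labelled samples: 8 normal (0) and 2 zombie (1) firms.
    samples = [((i,), 0) for i in range(8)] + [((i,), 1) for i in range(2)]
    print(average_over_runs(samples, majority_baseline_accuracy))
```

Averaging over many random splits, as the abstract describes, reduces the influence of any single lucky or unlucky partition on the reported metric. Fixing the seed keeps the comparison between models reproducible.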

CLC Number: