Existing smart contract vulnerability detection techniques suffer from low detection efficiency and limited automation, and they cannot scale to large sets of smart contract samples. To address these problems, this study proposes a machine learning-based method for smart contract vulnerability detection. The method first preprocesses the smart contract dataset: the Solidity source code of each contract is converted into an opcode sequence, and a set of opcode abstraction and simplification rules is defined to reduce that sequence. On this basis, an N-gram model is used to extract 2025-dimensional bigram features from the simplified opcode sequences, and the embedded method for feature selection and principal component analysis for dimensionality reduction are applied separately to construct three feature representations. The Borderline SMOTE method, an improved variant of SMOTE, is then used to balance the dataset, in which positive and negative samples are unevenly distributed. Finally, four algorithms (decision tree, support vector machine, random forest, and XGBoost) are each used to build a vulnerability detection model. Experimental results show that the random forest model achieves an average accuracy of 93.60% and an overall Macro-F1 of 93.91%, and can detect multiple types of vulnerabilities efficiently.
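As an illustration of the opcode simplification and bigram extraction steps described above, the following is a minimal sketch in Python and not the authors' implementation. The vocabulary `VOCAB`, the suffix-stripping rule in `simplify`, and the choice of relative frequencies as feature values are all assumptions made for this example; the abstract does not specify the actual rules. Note that 2025 equals 45 squared, which would be consistent with keeping every ordered pair over a vocabulary of 45 abstract opcodes, although the abstract does not state the vocabulary size.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical abstract opcode vocabulary (assumption for illustration only).
VOCAB = ["PUSH", "DUP", "SWAP", "MSTORE", "ADD"]


def simplify(opcode):
    """Hypothetical simplification rule: merge numbered opcode families.

    PUSH1..PUSH32 -> PUSH, DUP1..DUP16 -> DUP, SWAP1..SWAP16 -> SWAP,
    LOG0..LOG4 -> LOG. Other opcodes (e.g. SHA3) are left unchanged.
    """
    return re.sub(r"^(PUSH|DUP|SWAP|LOG)\d+$", r"\1", opcode)


def bigram_features(opcodes, vocab=VOCAB):
    """Return a len(vocab)**2 vector of relative bigram frequencies."""
    simplified = [simplify(op) for op in opcodes]
    # Fixed feature index: one slot per ordered pair of vocabulary items.
    index = {pair: i for i, pair in enumerate(product(vocab, repeat=2))}
    # Count adjacent opcode pairs in the simplified sequence.
    counts = Counter(zip(simplified, simplified[1:]))
    total = sum(counts.values())
    vector = [0.0] * len(index)
    for pair, n in counts.items():
        if pair in index:  # pairs with out-of-vocabulary opcodes are skipped
            vector[index[pair]] = n / total
    return vector


if __name__ == "__main__":
    sequence = ["PUSH1", "PUSH1", "MSTORE", "DUP2", "SWAP1", "ADD"]
    features = bigram_features(sequence)
    print(len(features), sum(1 for v in features if v > 0))
```

With a fixed pair index, every contract maps to a vector of the same length, which is what the later feature selection, dimensionality reduction, and classifier steps require as input.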