Protein Small Molecule Affinity Prediction Based on Natural Language Processing

欧阳志友, 陈晨, 王愉茜, 陈金刚, 殷昭, 周青松

doi:10.3969/j.issn.0255-8297.2019.03.003

应用科学学报 (Journal of Applied Sciences), 2019, Vol. 37, Issue 3: 327-335

Signal and Information Processing

1. Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, Nanjing 210023, China;
2. School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China;
3. School of Economics, Nanjing University of Posts and Telecommunications, Nanjing 210023, China;
4. School of Petroleum Engineering, China University of Petroleum (East China), Qingdao 266580, Shandong Province, China;
5. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

欧阳志友, PhD candidate; research interests: machine learning and electric power big data analysis. E-mail: ouyang@njupt.edu.cn

Received: 2018-10-10

Revised: 2018-10-25

Published online: 2019-05-31

Funding

Supported by the National Natural Science Foundation of China (No. 61533010)

Cite this article

欧阳志友, 陈晨, 王愉茜, 陈金刚, 殷昭, 周青松. Protein small molecule affinity prediction based on natural language processing[J]. 应用科学学报 (Journal of Applied Sciences), 2019, 37(3): 327-335. DOI: 10.3969/j.issn.0255-8297.2019.03.003

Abstract

The interaction between proteins and small molecules is important for drug research and development. However, existing methods for predicting protein-small molecule affinity suffer from problems such as high cost and low accuracy. This paper proposes a new affinity prediction method that uses natural language processing (NLP) techniques to process protein structure data and small-molecule fingerprint data, and then uses a gradient boosting decision tree (GBDT) model to predict the affinity. Experiments show that the proposed method is considerably more accurate than existing methods.

Key words: natural language processing; machine learning; gradient boosting decision tree (GBDT); protein small molecule affinity value
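The abstract does not give implementation details, so the following is only a minimal sketch of the general idea, not the authors' pipeline. It assumes that a protein sequence can be treated as text by cutting it into overlapping 3-character "words" and weighting them with one common TF-IDF variant, that a fingerprint is a plain bit string, and that a small gradient-boosted ensemble of one-split regression trees stands in for a full GBDT library. All function names, sequences, fingerprints, and affinity values below are hypothetical toy data made up for illustration.

```python
import math
from collections import Counter


def kmers(seq, k=3):
    """Split a sequence into overlapping k-character 'words'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of vocab[j] in docs[i]."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency (one common variant).
    idf = {tok: math.log((1 + n) / (1 + doc_freq[tok])) + 1.0 for tok in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc) or 1
        rows.append([tf[tok] / total * idf[tok] for tok in vocab])
    return vocab, rows


def fit_stump(X, residuals):
    """Find the one-split regression tree that minimizes squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    return best[1:] if best else None


def fit_gbdt(X, y, rounds=50, lr=0.1):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        j, t, lm, rm = stump
        stumps.append(stump)
        pred = [p + lr * (lm if row[j] <= t else rm) for p, row in zip(pred, X)]
    return base, lr, stumps


def predict(model, row):
    """Sum the base value and the shrunken contribution of every stump."""
    base, lr, stumps = model
    return base + sum(lr * (lm if row[j] <= t else rm) for j, t, lm, rm in stumps)


# Hypothetical toy data: sequences, fingerprint bits, and affinity values are invented.
proteins = ["MKTAYIAKQR", "MKTAYLAKQR", "GGSGGSGGSG", "GGSGGAGGSG"]
fingerprints = ["1010", "1001", "0110", "0101"]
affinity = [7.2, 6.9, 4.1, 4.4]

vocab, seq_features = tfidf_matrix([kmers(p) for p in proteins])
fp_features = [[int(bit) for bit in fp] for fp in fingerprints]
X = [s + f for s, f in zip(seq_features, fp_features)]

model = fit_gbdt(X, affinity)
mean_affinity = sum(affinity) / len(affinity)
baseline_mse = sum((mean_affinity - yi) ** 2 for yi in affinity) / len(affinity)
train_mse = sum((predict(model, row) - yi) ** 2 for row, yi in zip(X, affinity)) / len(X)
print("fit improves on the mean baseline:", train_mse < baseline_mse)
```

The comparison at the end only checks that the boosted model fits the toy training set better than predicting the mean; it says nothing about accuracy on real data. In practice the feature matrix would be passed to an established GBDT implementation and evaluated on held-out protein-ligand pairs.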

