Studying the interactions between proteins and small molecules is important for drug research and development, yet existing methods for predicting protein-small molecule binding affinity suffer from high cost and low accuracy. This paper proposes a new affinity prediction method that uses natural language processing (NLP) techniques to process protein structure data and small-molecule fingerprint data, and then applies a gradient boosting decision tree (GBDT) model to predict the affinity. Experiments show that the proposed method is considerably more accurate than existing methods.
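The pipeline summarized above can be illustrated with a minimal sketch. It is not the paper's implementation, and it rests on several assumptions: the protein is represented as a residue string split into overlapping 3-mers that serve as "words", TF-IDF is used as the NLP featurization, the fingerprint is a fixed-length bit vector, and scikit-learn's `GradientBoostingRegressor` stands in for the GBDT model. The helper names `kmer_sentence` and `build_features` are hypothetical, and the data are randomly generated purely so the example runs.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction.text import TfidfVectorizer


def kmer_sentence(sequence, k=3):
    """Turn a residue string into a space-separated 'sentence' of overlapping k-mers."""
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def build_features(sequences, fingerprints, vectorizer=None):
    """Concatenate TF-IDF protein features with fingerprint bits (hypothetical helper)."""
    sentences = [kmer_sentence(s) for s in sequences]
    if vectorizer is None:
        # Fit a new vocabulary; keep case and treat every whitespace-separated k-mer as a token.
        vectorizer = TfidfVectorizer(lowercase=False, token_pattern=r"\S+")
        protein_part = vectorizer.fit_transform(sentences).toarray()
    else:
        # Reuse a fitted vocabulary, e.g. when featurizing held-out data.
        protein_part = vectorizer.transform(sentences).toarray()
    features = np.hstack([protein_part, np.asarray(fingerprints, dtype=float)])
    return features, vectorizer


# Synthetic data for illustration only: random sequences, random bits, random targets.
rng = np.random.default_rng(0)
alphabet = list("ACDEFGHIKLMNPQRSTVWY")
sequences = ["".join(rng.choice(alphabet, size=30)) for _ in range(40)]
fingerprints = rng.integers(0, 2, size=(40, 16))
affinity = rng.normal(loc=6.0, scale=1.0, size=40)

X, vectorizer = build_features(sequences, fingerprints)

# Gradient boosted regression trees fitted to the (synthetic) affinity values.
model = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)
model.fit(X, affinity)
predictions = model.predict(X)
```

Because the targets here are random noise, the fitted model carries no predictive meaning. The sketch only shows how text-style protein features and fingerprint bits can be joined into one feature matrix for a tree-based regressor.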