CN112164428B - Method and device for predicting properties of small drug molecules based on deep learning - Google Patents

Method and device for predicting properties of small drug molecules based on deep learning

Info

Publication number
CN112164428B
Authority
CN
China
Prior art keywords
drug
data
small molecule
prediction model
small
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011007504.8A
Other languages
Chinese (zh)
Other versions
CN112164428A (en)
Inventor
宋怡然
马元巍
李泽朋
顾徐波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Weiyizhi Technology Co Ltd
Original Assignee
Changzhou Weiyizhi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Weiyizhi Technology Co Ltd
Priority to CN202011007504.8A
Publication of CN112164428A
Application granted
Publication of CN112164428B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Public Health (AREA)
  • Bioethics (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Epidemiology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method and a device for predicting the properties of small drug molecules based on deep learning. The method comprises the following steps: acquiring a data set containing data for a plurality of small drug molecules, and acquiring structural feature data of each small drug molecule in the data set to form a sample set; preprocessing the sample set; performing ensemble learning on the preprocessed sample set through a plurality of neural networks to obtain an integrated prediction model; and acquiring data of the small drug molecule to be predicted, and inputting that data into the integrated prediction model to obtain a property prediction result for the small drug molecule. The invention can predict the properties of small drug molecules conveniently, efficiently and accurately, thereby effectively improving the efficiency of drug research and development and accelerating the virtual screening process.

Description

Method and device for predicting properties of small drug molecules based on deep learning
Technical Field
The invention relates to the technical field of deep learning, in particular to a method for predicting the property of a small drug molecule based on deep learning, a device for predicting the property of the small drug molecule based on deep learning, computer equipment and a non-transitory computer-readable storage medium.
Background
The development cycle of a new drug is very long. It comprises target selection and validation, progression from a hit compound (Hit) to a lead compound (Lead), and finally the discovery and optimization of a candidate drug (Candidate); screening proceeds layer by layer and is costly. It is therefore hoped that virtual screening can be carried out by computer, fully mining the rules behind drug molecules from existing drug-related biochemical databases and accelerating drug discovery and development, for example the discovery and evaluation of hit compounds for new target proteins.
The traditional method for predicting the properties of small drug molecules builds quantitative structure-activity or structure-property relationship (QSAR/QSPR) models using molecular descriptors (1D, 2D, 3D and higher-dimensional descriptors, for example physicochemical properties such as molecular weight) as input features; more than 5000 molecular descriptors have been developed to date. However, the predictive performance of models built on large numbers of molecular descriptors depends heavily on whether effective descriptor features can be selected. Feature engineering is time-consuming and labor-intensive, and the generalization and universality of the algorithm model also depend greatly on the quality of the trained model.
Therefore, it is highly desirable to provide a convenient, efficient and accurate scheme for predicting the properties of small molecules of drugs.
Disclosure of Invention
The invention provides a method and a device for predicting the property of a small drug molecule based on deep learning, which can conveniently, efficiently and accurately predict the property of the small drug molecule, thereby effectively improving the efficiency of drug research and development and accelerating the virtual screening process.
The technical scheme adopted by the invention is as follows:
A method for predicting the properties of small drug molecules based on deep learning comprises the following steps: acquiring a data set containing data for a plurality of small drug molecules, and acquiring structural feature data of each small drug molecule in the data set to form a sample set; preprocessing the sample set; performing ensemble learning on the preprocessed sample set through a plurality of neural networks to obtain an integrated prediction model; and acquiring data of the small drug molecule to be predicted, and inputting that data into the integrated prediction model to obtain a property prediction result for the small drug molecule.
The method for predicting the properties of the small molecules of the medicine based on deep learning further comprises the following steps: acquiring different types of drug databases; and performing property prediction on the small drug molecule data in the different types of drug databases through the integrated prediction model to detect the universal capability of the integrated prediction model.
Acquiring structural feature data of each small drug molecule in the data set specifically comprises: acquiring the SMILES (Simplified Molecular Input Line Entry System) character string corresponding to each small drug molecule; and either encoding each SMILES character string in one-hot form and recording the character positions of each SMILES character string to construct a small drug molecule feature vector, or learning a molecular substructure vector representation of each small drug molecule with mol2vec to construct a small drug molecule feature vector.
Preprocessing the sample set specifically comprises: oversampling the structural feature data of the small drug molecules in the sample set with the SMOTE (Synthetic Minority Oversampling Technique) algorithm, and dividing the oversampled data into a training set and a test set.
The plurality of neural networks include a multilayer CNN (convolutional neural network), CNN_GRU, and CNN_LSTM.
The CNN_GRU adopts a bidirectional GRU (gated recurrent unit), and the CNN_LSTM adopts a bidirectional LSTM (long short-term memory network).
A device for predicting properties of small molecules of a drug based on deep learning, comprising: the first acquisition module is used for acquiring a data set containing a plurality of drug small molecule data and acquiring structural feature data of each drug small molecule in the data set to form a sample set; a pre-processing module to pre-process the sample set; the learning module is used for performing ensemble learning on the basis of the preprocessed sample set through various neural networks to obtain an ensemble prediction model; and the prediction module is used for acquiring the small molecule data of the drug to be predicted and inputting the small molecule data of the drug to be predicted into the integrated prediction model so as to obtain the property prediction result of the small molecule of the drug.
The device for predicting the property of the small molecule of the drug based on deep learning further comprises: a second acquisition module for acquiring different types of drug databases; a detection module for performing property prediction on the small drug molecule data in the different types of drug databases through the integrated prediction model to detect the universal capability of the integrated prediction model.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the method for predicting the property of the small molecule of the drug based on deep learning.
A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above deep learning-based drug small molecule property prediction method.
The invention has the beneficial effects that:
According to the invention, a data set containing data for a plurality of small drug molecules is acquired, and the structural feature data of each small drug molecule in the data set is acquired; ensemble learning is then performed on the structural feature data set through a plurality of neural networks to obtain an integrated prediction model; finally, the data of the small drug molecule to be predicted is input into the integrated prediction model to obtain the property prediction result for the small drug molecule.
Drawings
FIG. 1 is a flowchart of a method for predicting the properties of a small molecule of a drug based on deep learning according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a multilayer CNN according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a CNN + RNN structure according to an embodiment of the present invention;
FIG. 4 is a block diagram of a device for predicting the property of a small molecule of a drug based on deep learning according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, the method for predicting the property of a small molecule of a drug based on deep learning according to an embodiment of the present invention includes the following steps:
S1: acquiring a data set containing data for a plurality of small drug molecules, and acquiring structural feature data of each small drug molecule in the data set to form a sample set.
In one embodiment of the invention, the data set may contain chemical structure data for a large number of small drug molecules. It should be understood that the larger the number of drug small molecules in the data set, the better the training effect on subsequent models, as the calculation and storage capabilities allow.
After the data set is obtained, SMILES character strings corresponding to each drug small molecule data in the data set can be obtained, the maximum length value of the character strings and the number of the types of characters are counted, then each SMILES character string can be coded in a One-hot coding mode, and the character position of each SMILES character string is counted to construct a drug small molecule feature vector. Alternatively, a molecular substructure vector representation of each drug small molecule data can be learned using mol2vec to construct a drug small molecule feature vector. According to the two feature extraction modes provided by the embodiment of the invention, the constructed drug small molecule feature vector can be suitable for different neural network algorithm models. In addition, the two feature extraction modes do not need to calculate additional molecular structures or physical and chemical features and perform feature engineering, so that the feature extraction time is greatly shortened, and the cost of subsequent model training in the prediction process is reduced.
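The one-hot route described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `build_vocab` and `one_hot_encode` are hypothetical, the three SMILES strings are arbitrary examples, and tokenization is per character (so a two-letter atom such as `Cl` would be split into two tokens, a simplification).

```python
def build_vocab(smiles_list):
    """Collect the distinct characters and the maximum string length."""
    chars = sorted({ch for s in smiles_list for ch in s})
    char_to_idx = {ch: i for i, ch in enumerate(chars)}
    max_len = max(len(s) for s in smiles_list)
    return char_to_idx, max_len


def one_hot_encode(smiles, char_to_idx, max_len):
    """Return a max_len x vocab_size matrix.

    Row `pos` has a single 1 in the column of the character found at
    that position; rows past the end of the string stay all-zero (padding).
    """
    matrix = [[0] * len(char_to_idx) for _ in range(max_len)]
    for pos, ch in enumerate(smiles[:max_len]):
        matrix[pos][char_to_idx[ch]] = 1
    return matrix


# Arbitrary example molecules: ethanol, benzene, acetic acid.
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
char_to_idx, max_len = build_vocab(smiles_list)
encoded = [one_hot_encode(s, char_to_idx, max_len) for s in smiles_list]
```

Every molecule ends up with the same matrix shape (maximum length by number of character classes), which is what a convolutional or recurrent layer expects as input.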
S2: preprocessing the sample set.
Specifically, the SMOTE algorithm may be adopted to oversample the structural feature data of the drug small molecules in the sample set, and the oversampled data is divided into a training set and a test set. Oversampling based on the SMOTE algorithm can solve the data imbalance problem.
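The idea behind this preprocessing step can be sketched in plain Python. The code below is a simplified SMOTE-style interpolation written for illustration, not the reference SMOTE algorithm and not necessarily what the embodiment uses; `smote_like_oversample`, `split_train_test`, the toy 2-D points and all parameter values are assumptions. In practice a library implementation such as `SMOTE` from imbalanced-learn would normally be used.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def split_train_test(samples, labels, test_fraction=0.25, seed=0):
    """Shuffle indices and split into (train_x, train_y), (test_x, test_y)."""
    rng = random.Random(seed)
    order = list(range(len(samples)))
    rng.shuffle(order)
    n_test = round(len(order) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]

    def pick(idx):
        return [samples[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(test_idx)


# Toy feature vectors: 8 majority-class and 4 minority-class samples.
majority = [[float(i), float(i % 3)] for i in range(8)]
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_like_oversample(minority, n_new=len(majority) - len(minority))

features = majority + minority + synthetic
labels = [0] * len(majority) + [1] * (len(minority) + len(synthetic))
(train_x, train_y), (test_x, test_y) = split_train_test(features, labels)
```

The sketch follows the order given in the text (oversample, then split). Note that with this order synthetic points derived from a training sample can land in the test set, so some practitioners oversample only the training portion.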
S3: performing ensemble learning on the preprocessed sample set through a plurality of neural networks to obtain an integrated prediction model.
In one embodiment of the invention, the plurality of neural networks include a multilayer CNN, CNN_GRU and CNN_LSTM, i.e. one pure CNN structure and two CNN + RNN structures.
The embodiment of the invention optimizes the network structure and model parameters of the various neural networks: important parameters such as the learning rate, activation function and dimensionality are optimized by adjusting the network structure and searching the full parameter space, and the best-performing algorithm model is selected. For example, a bidirectional RNN (recurrent neural network) is used so that the current output for a small drug molecule structure depends on both the preceding and the following states. The bidirectional RNN is formed by stacking two RNNs, one above the other. As another example, a 1D convolutional layer is added between the embedding layer and the RNN layer.
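To make the bidirectional idea concrete, here is a toy sketch in which two simple tanh RNNs read the same sequence in opposite directions and their states are paired at each position. It is an illustration only: the scalar state, the weight values and the names `rnn_pass` and `bidirectional_rnn` are assumptions, and the embodiment uses GRU and LSTM cells rather than this plain recurrence.

```python
import math


def rnn_pass(inputs, w_in, w_rec):
    """Scalar-state tanh RNN: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def bidirectional_rnn(inputs, fwd=(0.5, 0.3), bwd=(0.4, 0.2)):
    """Pair, at each position, a forward state (summarising earlier inputs)
    with a backward state (summarising later inputs)."""
    forward = rnn_pass(inputs, *fwd)
    backward = rnn_pass(inputs[::-1], *bwd)[::-1]
    return list(zip(forward, backward))


sequence = [1.0, 0.0, -1.0]
states = bidirectional_rnn(sequence)
```

At position 0 the forward state has seen only the first input, while the backward state has already been influenced by every later input, which is the sense in which the output depends on both directions.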
As shown in fig. 2, the multilayer CNN is arranged in a sequential structure and consists of one convolutional layer, one normalization layer, two convolutional layers, one flatten layer, and two fully connected layers.
As shown in fig. 3, CNN_GRU adopts a bidirectional GRU, is arranged in a sequential structure, and consists of one convolutional layer, one bidirectional GRU layer, one normalization layer, two convolutional layers, one flatten layer, and two fully connected layers. CNN_LSTM adopts a bidirectional LSTM, is arranged in a sequential structure, and consists of one convolutional layer, one bidirectional LSTM layer, one normalization layer, two convolutional layers, one flatten layer, and two fully connected layers.
In addition, Dropout is added to the multilayer CNN, CNN_GRU and CNN_LSTM to reduce overfitting through regularization. The Adam optimizer is used to adjust the learning rate during training, which to some extent avoids the situation where the loss fails to decrease as the number of iterations grows because of the learning rate, and so helps prevent underfitting.
According to the embodiment of the invention, multiple different neural network models are combined into one integrated prediction model (Ensemble Model) through ensemble learning (Ensemble Learning), so that both the bias and the variance of the final model can be reduced, further improving prediction accuracy. The specific process of ensemble learning can refer to the prior art and is not described here in detail.
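Since the text leaves the combination rule to the prior art, the following is only one common possibility, soft voting, in which the predicted probabilities of the member models are averaged. The function name `soft_vote`, the optional weights, the 0.5 decision threshold and the probability values are all illustrative assumptions, not details from the source.

```python
def soft_vote(prob_lists, weights=None):
    """Average per-sample positive-class probabilities across models.

    prob_lists: one list of probabilities per model, all the same length.
    weights: optional per-model weights; equal weighting if omitted.
    """
    n_models = len(prob_lists)
    if weights is None:
        weights = [1.0] * n_models
    total = sum(weights)
    n_samples = len(prob_lists[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists)) / total
        for i in range(n_samples)
    ]


# Made-up outputs of the three member networks for three molecules.
cnn_probs = [0.9, 0.2, 0.6]
cnn_gru_probs = [0.8, 0.4, 0.7]
cnn_lstm_probs = [0.7, 0.3, 0.5]

ensemble_probs = soft_vote([cnn_probs, cnn_gru_probs, cnn_lstm_probs])
ensemble_labels = [int(p >= 0.5) for p in ensemble_probs]
```

Averaging tends to smooth out the errors of individual models, which is the usual intuition for the variance reduction mentioned above.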
S4: acquiring data of the small drug molecule to be predicted, and inputting that data into the integrated prediction model to obtain the property prediction result for the small drug molecule.
By obtaining the chemical structure data of the drug small molecule to be predicted, extracting the characteristic vector of the drug small molecule, and inputting the characteristic vector into the integrated prediction model of the embodiment of the invention, the prediction result of the drug small molecule such as target binding property, activity or toxicity can be obtained.
In addition, in one embodiment of the invention, different types of drug databases can be obtained, and the integrated prediction model is used for performing property prediction on drug small molecule data in the different types of drug databases so as to detect the universal capability of the integrated prediction model.
Specifically, public drug databases such as DrugBank and BindingDB can be obtained, and the small drug molecule data in these databases are input into the integrated prediction model to obtain the corresponding property prediction results. The performance of the integrated prediction model on each drug database is evaluated with at least one evaluation index. If the performance difference between databases is within a preset threshold, the integrated prediction model has strong universal capability and meets the requirement; if the performance difference exceeds the preset threshold, the universal capability of the integrated prediction model is poor and the requirement is not met.
In one embodiment of the invention, the evaluation index comprises one or more of accuracy, precision, recall, area under the ROC curve, confusion matrix.
Accuracy is the proportion of correctly predicted samples among all samples. In the embodiment of the invention, the evaluation is performed on actual prediction data in a historical prediction database; that is, the samples involved in model evaluation are the set of historical actual prediction data. Accuracy is computed as follows:
Accuracy=(TP+TN)/(TP+TN+FP+FN)
wherein TP (True Positive) refers to Positive samples predicted to be Positive by the model, FP (False Positive) refers to Negative samples predicted to be Positive by the model, FN (False Negative) refers to Positive samples predicted to be Negative by the model, and TN (True Negative) refers to Negative samples predicted to be Negative by the model.
Precision (also called the precision ratio) is defined with respect to the prediction results: it is the probability that a sample predicted to be positive is actually positive, i.e. how much confidence can be placed in a positive prediction. Precision is computed as follows:
Precision=TP/(TP+FP)
Recall (also called the recall ratio) is defined with respect to the original samples: it is the probability that an actual positive sample is predicted to be positive. Recall is computed as follows:
Recall=TP/(TP+FN)
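The three formulas above translate directly into code. The sketch below assumes the four counts are already known; the function names and the example counts are made up for illustration, and returning 0.0 for an empty denominator is a convention chosen here, not something the source specifies.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Precision = TP / (TP + FP); 0.0 if nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Recall = TP / (TP + FN); 0.0 if there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up counts for 100 evaluated molecules.
tp, tn, fp, fn = 40, 45, 5, 10
acc = accuracy(tp, tn, fp, fn)   # 85 correct out of 100
prec = precision(tp, fp)         # 40 of the 45 predicted positives
rec = recall(tp, fn)             # 40 of the 50 actual positives
```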
The ROC (Receiver Operating Characteristic) curve is also called the receiver operating characteristic curve. The two main indicators in the ROC curve are the true positive rate TPR and the false positive rate FPR; the abscissa is FPR and the ordinate is TPR.
The true positive rate TPR (True Positive Rate), also called sensitivity, is as follows:
TPR=TP/(TP+FN)
The false positive rate FPR (False Positive Rate), equal to 1 - specificity, is as follows:
FPR=FP/(TN+FP)
The ROC curve has a useful property: when the distribution of positive and negative samples changes, the curve remains essentially unchanged, so the influence of class imbalance on the index result is largely eliminated.
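A ROC curve is obtained by sweeping the decision threshold over the predicted scores and recording (FPR, TPR) at each step. The sketch below shows that sweep in plain Python; `roc_points`, the labels and the scores are illustrative assumptions, and the input is assumed to contain at least one positive and one negative sample.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct threshold, starting at (0, 0).

    A sample is predicted positive when its score is >= the threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Made-up labels (1 = positive) and model scores for six molecules.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
```

Because TPR is normalised by the number of positives and FPR by the number of negatives, each axis is computed within a single class, which is why the curve is insensitive to the class ratio.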
The area under the ROC curve, AUC (Area Under Curve), is the size of the region lying under the ROC curve. A larger area under the ROC curve indicates better model performance, and AUC is the evaluation index derived from it. AUC values typically range from 0.5 to 1.0, with a larger AUC representing better prediction performance. If the model is perfect, its AUC is 1, meaning that every positive example is ranked ahead of every negative example; if the model is a simple random-guess binary classifier, its AUC is 0.5. If one model is better than another, its area under the curve is larger and its AUC value is correspondingly larger.
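The ranking interpretation of AUC gives a short way to compute it: AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, counting ties as one half. The sketch below uses that pairwise definition; `auc_pairwise` and the example data are illustrative assumptions, and the quadratic loop is meant for clarity rather than speed.

```python
def auc_pairwise(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up labels and scores: one negative (0.7) outranks one positive (0.6).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = auc_pairwise(y_true, scores)  # 8 of the 9 pairs are ordered correctly

# A perfect ranking puts every positive ahead of every negative.
perfect_auc = auc_pairwise([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
```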
The confusion matrix (Confusion Matrix) is also referred to as an error matrix, and allows the effect of the algorithm to be observed visually. Each column is a predicted class of the samples and each row is a true class of the samples (or vice versa), so the matrix reflects how far the classification results are confused between classes.
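A minimal sketch of building such a matrix, using the rows-are-true, columns-are-predicted convention described above. The function name, the integer class labels and the example predictions are assumptions made for illustration.

```python
def confusion_matrix(y_true, y_pred, n_classes=2):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Made-up true and predicted labels (0 = negative, 1 = positive).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
matrix = confusion_matrix(y_true, y_pred)

# For the binary case the four cells are TN, FP, FN, TP.
tn, fp = matrix[0]
fn, tp = matrix[1]
```

The counts read off here are exactly the TP, FP, FN and TN values used by the accuracy, precision and recall formulas.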
The embodiment of the invention can evaluate the performance of the integrated prediction model for predicting various drug databases by using any one evaluation index to further obtain the universal capability of the integrated prediction model, and can also evaluate the performance of the integrated prediction model for predicting various drug databases by using a plurality of evaluation indexes to further obtain the universal capability of the integrated prediction model.
In the preferred embodiment of the invention, the performance of the integrated prediction model for predicting various drug databases is evaluated by the area AUC under the ROC curve, and if the difference between the AUC corresponding to various drug databases is within the preset area threshold, the general capability of the integrated prediction model meets the requirement.
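The AUC-based check can be sketched as a comparison of the spread of AUC values against a threshold. The database names, AUC values and threshold below are placeholders, not results from the source, and taking "difference" to mean the gap between the largest and smallest AUC is an interpretation.

```python
def auc_gap(auc_by_database):
    """Largest difference in AUC between any two databases."""
    values = list(auc_by_database.values())
    return max(values) - min(values)


def meets_generalization_requirement(auc_by_database, max_gap):
    """True when the AUC spread across databases stays within max_gap."""
    return auc_gap(auc_by_database) <= max_gap


# Placeholder AUC values per database (not measured results).
auc_by_database = {"database_a": 0.91, "database_b": 0.88, "database_c": 0.90}
ok = meets_generalization_requirement(auc_by_database, max_gap=0.05)
```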
When the performance of the integrated prediction model on the various drug databases is evaluated with a plurality of evaluation indexes, each evaluation index can be given a corresponding weight, and the weighted average of the results obtained from the evaluation indexes is used as the basis for evaluating performance on each drug database. If the difference between the weighted averages for the various drug databases is within a preset threshold, the universal capability of the integrated prediction model meets the requirement. Preferably, the area under the ROC curve, AUC, has the highest weight.
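A weighted combination of several evaluation indexes might look like the sketch below. The weights (with AUC weighted highest, as the text prefers), the metric values and the database names are all placeholders chosen for illustration, and `weighted_score` is a hypothetical helper.

```python
def weighted_score(metrics, weights):
    """Weighted average of the metrics named in `weights`."""
    total = sum(weights.values())
    return sum(weights[name] * metrics[name] for name in weights) / total


# Placeholder weights; AUC is given the largest share.
weights = {"auc": 0.4, "accuracy": 0.2, "precision": 0.2, "recall": 0.2}

# Placeholder metric values per database (not measured results).
metrics_by_database = {
    "database_a": {"auc": 0.90, "accuracy": 0.85, "precision": 0.80, "recall": 0.75},
    "database_b": {"auc": 0.88, "accuracy": 0.83, "precision": 0.82, "recall": 0.77},
}

scores = {
    name: weighted_score(metrics, weights)
    for name, metrics in metrics_by_database.items()
}
score_gap = max(scores.values()) - min(scores.values())
```

The resulting `score_gap` can then be compared with a preset threshold in the same way as a single index.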
In an embodiment of the present invention, the step of detecting the general capability of the model may be performed before step S4, and if it is detected that the general capability of the integrated prediction model satisfies the requirement, for example, the accuracy of predicting the properties of most or all kinds of drug small molecules is greater than a preset threshold, step S4 is directly performed to perform the property prediction on the drug small molecules to be predicted by the integrated prediction model.
If it is detected that the universal capability of the integrated prediction model does not meet the requirement, the method may return to step S1; that is, steps S1-S3 and the step of detecting the universal capability of the model are re-executed until the universal capability meets the requirement, after which step S4 is executed. Alternatively, the types of small drug molecules for which the integrated prediction model is suitable can be determined from the prediction data for the different types of drug databases. For example, predicting the properties of the small drug molecules in each drug database may show that the accuracy of the integrated prediction model for a certain class or classes of small drug molecules is higher than a preset threshold. Then, in step S4, after the data of the small drug molecule to be predicted is acquired, it can be determined whether that molecule belongs to a class whose prediction accuracy is greater than the preset threshold. If so, the data of the small drug molecule to be predicted is input into the integrated prediction model to obtain the property prediction result. If not, the method returns to step S1 and ensemble learning is performed again to obtain a new integrated prediction model, until the universal capability of the integrated prediction model meets the requirement or the prediction accuracy of the model for the class of the small drug molecule to be predicted is greater than the preset threshold.
According to the method for predicting the property of the small drug molecules based on deep learning, provided by the embodiment of the invention, the data set containing a plurality of pieces of small drug molecule data is obtained, the structural feature data of each small drug molecule in the data set is obtained, then the integrated learning is carried out on the basis of the structural feature data set through a plurality of neural networks to obtain the integrated prediction model, and finally the small drug molecule data to be predicted is input into the integrated prediction model to obtain the prediction result of the property of the small drug molecules.
Corresponding to the method for predicting the property of the small drug molecule based on deep learning in the embodiment, the invention further provides a device for predicting the property of the small drug molecule based on deep learning.
As shown in fig. 4, the device for predicting the property of a small molecule of a drug based on deep learning according to an embodiment of the present invention includes: a first acquisition module 10, a pre-processing module 20, a learning module 30, and a prediction module 40. The first obtaining module 10 is configured to obtain a data set including data of a plurality of drug small molecules, and obtain structural feature data of each drug small molecule in the data set to form a sample set; the preprocessing module 20 is used for preprocessing the sample set; the learning module 30 is configured to perform ensemble learning based on the preprocessed sample set through multiple neural networks to obtain an ensemble prediction model; the prediction module 40 is configured to obtain small molecule data of a drug to be predicted, and input the small molecule data of the drug to be predicted into the integrated prediction model to obtain a prediction result of the property of the small molecule of the drug.
In one embodiment of the invention, the data set may contain chemical structure data for a large number of small drug molecules. It should be understood that the larger the number of drug small molecules in the data set, the better the training effect on subsequent models, as the calculation and storage capabilities allow.
After the first obtaining module 10 obtains the data set, it may obtain a SMILES character string corresponding to each drug small molecule data in the data set, and count the maximum length of the character string and the number of the classes of the characters, then it may encode each SMILES character string in the form of One-hot encoding, and count the character position of each SMILES character string to construct a drug small molecule feature vector. Alternatively, a molecular substructure vector representation of each drug small molecule data can be learned using mol2vec to construct a drug small molecule feature vector. According to the two feature extraction modes provided by the embodiment of the invention, the constructed drug small molecule feature vector can be suitable for different neural network algorithm models. In addition, the two feature extraction modes do not need to calculate additional molecular structures or physical and chemical features and perform feature engineering, so that the feature extraction time is greatly shortened, and the cost of subsequent model training in the prediction process is reduced.
The preprocessing module 20 may specifically use SMOTE algorithm to oversample the structural feature data of the drug small molecules in the sample set, and divide the oversampled data into a training set and a test set. Oversampling based on the SMOTE algorithm can solve the data imbalance problem.
In one embodiment of the invention, the multiple neural networks include a multilayer CNN, CNN_GRU and CNN_LSTM, that is, one pure CNN structure and two CNN + RNN structures.
The embodiment of the invention optimizes the network structure and model parameters of each neural network: the network structure is adjusted and the full parameter space is searched to tune important parameters such as the learning rate, activation function and dimensionality, and the best-performing model is selected. For example, a bidirectional RNN is used so that the current output for a drug small molecule structural formula depends on both the preceding and the following states; a bidirectional RNN consists of two RNNs stacked one on top of the other. As another example, a 1D convolutional layer is added between the embedding layer and the RNN layer.
As shown in FIG. 2, the multilayer CNN is arranged in a sequential structure and consists of one convolutional layer, one normalization layer, two convolutional layers, one flatten layer, and two fully connected layers.
As shown in FIG. 3, CNN_GRU uses a bidirectional GRU, is arranged in a sequential structure, and consists of one convolutional layer, one bidirectional GRU layer, one normalization layer, two convolutional layers, one flatten layer, and two fully connected layers. CNN_LSTM uses a bidirectional LSTM, is arranged in a sequential structure, and consists of one convolutional layer, one bidirectional LSTM layer, one normalization layer, two convolutional layers, one flatten layer, and two fully connected layers.
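The three layer orderings can be written down compactly. The sketch below only records the order of the layers as listed in the description; the string labels and the helper name are illustrative, and filter counts, kernel sizes and unit counts are left out because the text does not state them. It also makes explicit that each hybrid network is the multilayer CNN with a bidirectional recurrent layer placed directly after the first convolutional layer.

```python
from typing import List

# Layer order of the multilayer CNN as listed in the description.
# The string labels are illustrative names, not identifiers from the patent.
MULTILAYER_CNN: List[str] = ["conv", "norm", "conv", "conv", "flatten", "dense", "dense"]


def insert_recurrent(stack: List[str], recurrent: str) -> List[str]:
    """Return a copy of stack with a recurrent layer placed right after the first layer."""
    return stack[:1] + [recurrent] + stack[1:]


# CNN_GRU and CNN_LSTM differ from the multilayer CNN only by the bidirectional layer.
CNN_GRU = insert_recurrent(MULTILAYER_CNN, "bi_gru")
CNN_LSTM = insert_recurrent(MULTILAYER_CNN, "bi_lstm")
```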
In addition, Dropout is added to the multilayer CNN, CNN_GRU and CNN_LSTM to reduce overfitting through regularization. The Adam optimizer adjusts the learning rate during training, which to some extent avoids the situation where an unsuitable learning rate keeps the loss from decreasing as the number of iterations grows, and thus helps prevent underfitting.
The learning module 30 combines the different neural network models into an integrated prediction model through ensemble learning, so that the bias and variance of the final model can be reduced at the same time, further improving prediction accuracy. The specific ensemble learning process can follow the prior art and is not described in detail here.
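The description leaves the ensemble strategy to the prior art. One common choice, shown here purely as an assumption, is soft voting: the positive-class probabilities of the individual networks are averaged and the mean is compared with a decision threshold. The three stand-in models below return fixed, made-up probabilities so that the example runs without any trained network.

```python
from typing import Callable, Sequence, Tuple

# A model maps a feature vector to the probability of the positive class.
Model = Callable[[Sequence[float]], float]


def ensemble_predict(models: Sequence[Model], features: Sequence[float],
                     threshold: float = 0.5) -> Tuple[float, int]:
    """Soft voting: average the member probabilities, then apply a decision threshold."""
    probs = [model(features) for model in models]
    mean_prob = sum(probs) / len(probs)
    return mean_prob, int(mean_prob >= threshold)


# Hypothetical stand-ins for trained multilayer CNN, CNN_GRU and CNN_LSTM models.
cnn_model: Model = lambda x: 0.75
cnn_gru_model: Model = lambda x: 0.5
cnn_lstm_model: Model = lambda x: 0.625

mean_prob, label = ensemble_predict([cnn_model, cnn_gru_model, cnn_lstm_model], [0.0, 1.0])
```

Other schemes, such as hard voting or stacking, would fit the same interface.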
The prediction module 40 acquires the chemical structure data of the drug small molecule to be predicted, extracts its feature vector, and inputs the feature vector into the integrated prediction model of the embodiment of the present invention to obtain a prediction result for a property of the drug small molecule, such as target binding, activity or toxicity.
In addition, in an embodiment of the present invention, the device for predicting the properties of drug small molecules based on deep learning may further include a second acquisition module, configured to acquire different types of drug databases, and a detection module, configured to predict properties for the drug small molecule data in the different types of drug databases through the integrated prediction model, so as to test the generalization (universal) capability of the integrated prediction model.
Specifically, the second acquisition module can acquire public drug databases such as DrugBank and BindingDB. The detection module inputs the drug small molecule data in these databases into the integrated prediction model to obtain the corresponding property prediction results, and the performance of the integrated prediction model on each type of drug database is evaluated with at least one evaluation index. If the performance difference across the databases is within a preset threshold, the integrated prediction model generalizes well and meets the requirement; if the difference exceeds the preset threshold, the generalization capability of the integrated prediction model is poor and the requirement is not met.
In one embodiment of the invention, the evaluation index comprises one or more of accuracy, precision, recall, area under the ROC curve, and confusion matrix.
The embodiment of the invention can evaluate the performance of the integrated prediction model on the various drug databases with any single evaluation index, or with several evaluation indexes together, and thereby determine the generalization capability of the integrated prediction model.
In a preferred embodiment of the invention, the performance of the integrated prediction model on the various drug databases is evaluated by the area under the ROC curve (AUC); if the differences between the AUC values for the various drug databases are within a preset area threshold, the generalization capability of the integrated prediction model meets the requirement.
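The AUC-based check can be sketched as follows. AUC is computed here from its pairwise definition (the fraction of positive/negative pairs that the model ranks correctly, with ties counted as one half), and the model is considered to generalize when the largest AUC gap between databases does not exceed the threshold. The database names, labels, scores and threshold values are invented for the example.

```python
from typing import Dict, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def generalizes(auc_by_database: Dict[str, float], max_gap: float) -> bool:
    """True if the AUC values of all databases lie within max_gap of each other."""
    values = list(auc_by_database.values())
    return max(values) - min(values) <= max_gap


# Hypothetical predictions on two drug databases.
auc_by_database = {
    "db_a": roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]),
    "db_b": roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.2]),
}
meets_requirement = generalizes(auc_by_database, max_gap=0.1)
```

The pairwise form is quadratic in the number of samples, which is acceptable for a sketch but not for large databases.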
When the performance of the integrated prediction model on the various drug databases is evaluated with several evaluation indexes, each index can be given a weight, and the weighted average of the index results serves as the basis for evaluating performance on each drug database; if the weighted averages corresponding to the various drug databases are within a preset threshold, the generalization capability of the integrated prediction model meets the requirement. Preferably, the area under the ROC curve (AUC) is given the highest weight.
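A minimal sketch of the weighted evaluation, under the assumption that only scalar indexes are combined (a confusion matrix would first have to be reduced to a number). The weights and metric values are made up; the only constraint taken from the text is that AUC carries the highest weight.

```python
from typing import Dict


def weighted_score(metrics: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the evaluation indexes present in metrics."""
    total_weight = sum(weights[name] for name in metrics)
    return sum(metrics[name] * weights[name] for name in metrics) / total_weight


# Hypothetical weights; AUC is given the highest weight.
weights = {"auc": 0.5, "accuracy": 0.25, "recall": 0.25}

# Hypothetical evaluation results for one drug database.
metrics_db_a = {"auc": 0.9, "accuracy": 0.8, "recall": 0.7}
score_db_a = weighted_score(metrics_db_a, weights)
```

Computing one such score per database gives the values that are then compared against the preset threshold.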
In one embodiment of the present invention, the detection module may run before the prediction module 40. If the detection module finds that the generalization capability of the integrated prediction model meets the requirement, for example because the accuracy of property prediction for most or all types of drug small molecules is greater than a preset threshold, the prediction module 40 then directly uses the integrated prediction model to predict the properties of the drug small molecule to be predicted.
If the detection module finds that the generalization capability of the integrated prediction model does not meet the requirement, the first acquisition module 10, the preprocessing module 20 and the learning module 30 can obtain the integrated prediction model again, and the prediction module 40 predicts the properties of the drug small molecule to be predicted only once the generalization capability meets the requirement. Alternatively, the types of drug small molecules for which the integrated prediction model is suitable can be derived from the prediction data for the different types of drug databases. For example, predicting the properties of the drug small molecules in each drug database may show that the prediction accuracy of the integrated prediction model for one or more types of drug small molecules is greater than a preset threshold. After acquiring the data of the drug small molecule to be predicted, the prediction module 40 can then determine whether that molecule belongs to a type whose prediction accuracy is greater than the preset threshold. If so, the data are input into the integrated prediction model to obtain the drug small molecule property prediction result. If not, the first acquisition module 10, the preprocessing module 20 and the learning module 30 obtain the integrated prediction model again, until either the generalization capability of the integrated prediction model meets the requirement or its prediction accuracy for the type to which the drug small molecule to be predicted belongs is greater than the preset threshold.
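The decision logic of the two preceding paragraphs can be summarized in a few lines. This is a sketch only: the function name, the type labels and the accuracy figures are hypothetical, and "retrain" stands for having the first acquisition module, the preprocessing module and the learning module obtain the integrated prediction model again.

```python
from typing import Dict


def choose_action(model_generalizes: bool, accuracy_by_type: Dict[str, float],
                  molecule_type: str, threshold: float) -> str:
    """Decide whether to predict with the current model or to obtain a new one."""
    if model_generalizes:
        return "predict"  # generalization capability meets the requirement
    # Otherwise predict only for types the model is known to handle accurately.
    if accuracy_by_type.get(molecule_type, 0.0) > threshold:
        return "predict"
    return "retrain"


# Hypothetical per-type accuracies measured on the drug databases.
accuracy_by_type = {"type_a": 0.93, "type_b": 0.71}

action_a = choose_action(False, accuracy_by_type, "type_a", threshold=0.9)
action_b = choose_action(False, accuracy_by_type, "type_b", threshold=0.9)
```

A molecule whose type was never evaluated is treated as below the threshold, which is a conservative assumption of this sketch.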
According to the device for predicting the properties of drug small molecules based on deep learning provided by the embodiment of the invention, a data set containing data of a plurality of drug small molecules is acquired, structural feature data of each drug small molecule in the data set is acquired, ensemble learning is then performed on the structural feature data through multiple neural networks to obtain the integrated prediction model, and finally the data of the drug small molecule to be predicted is input into the integrated prediction model to obtain the drug small molecule property prediction result.
The invention further provides a computer device corresponding to the embodiment.
The computer device of the embodiment of the invention comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, and when the processor executes the computer program, the method for predicting the property of the small molecule of the drug based on deep learning according to the embodiment of the invention can be realized.
According to the computer device provided by the embodiment of the invention, when the processor executes the computer program stored on the memory, a data set containing data of a plurality of drug small molecules is acquired, structural feature data of each drug small molecule in the data set is acquired, ensemble learning is then performed on the structural feature data through multiple neural networks to obtain the integrated prediction model, and finally the data of the drug small molecule to be predicted is input into the integrated prediction model to obtain the drug small molecule property prediction result.
The invention also provides a non-transitory computer readable storage medium corresponding to the above embodiment.
A non-transitory computer-readable storage medium of an embodiment of the present invention has stored thereon a computer program, which when executed by a processor, can implement the method for predicting a property of a small molecule of a drug based on deep learning according to the above-described embodiment of the present invention.
According to the non-transitory computer-readable storage medium of the embodiment of the invention, when the processor executes the computer program stored thereon, the data set containing a plurality of drug small molecule data is acquired, the structural feature data of each drug small molecule in the data set is acquired, then the ensemble learning is performed based on the structural feature data set through a plurality of neural networks to obtain the ensemble prediction model, and finally the drug small molecule data to be predicted is input into the ensemble prediction model to obtain the drug small molecule property prediction result.
In the description of the present invention, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. The meaning of "plurality" is two or more unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; either directly or indirectly through intervening media, either internally or in any other relationship. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the present invention, unless otherwise expressly stated or limited, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that they are in indirect contact through an intermediate. Also, a first feature being "on," "over," or "above" a second feature may mean that the first feature is directly or diagonally above the second feature, or may simply indicate that the first feature is at a higher level than the second feature. A first feature being "under," "below," or "beneath" a second feature may mean that the first feature is directly under or obliquely under the second feature, or may simply mean that the first feature is at a lower level than the second feature.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, various embodiments or examples described in this specification, and features of different embodiments or examples, can be combined by one skilled in the art where they do not contradict each other.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (6)

1. A method for predicting the properties of drug small molecules based on deep learning, characterized by comprising the following steps:
S1, acquiring a data set containing data of a plurality of drug small molecules, and acquiring structural feature data of each drug small molecule in the data set to form a sample set;
S2, preprocessing the sample set;
S3, performing ensemble learning on the preprocessed sample set through multiple neural networks to obtain an integrated prediction model, wherein the multiple neural networks comprise a multilayer CNN, CNN_GRU and CNN_LSTM; the multilayer CNN is arranged in a sequential structure and consists of one convolutional layer, one normalization layer, two convolutional layers, one flatten layer and two fully connected layers; the CNN_GRU uses a bidirectional GRU, is arranged in a sequential structure, and consists of one convolutional layer, one bidirectional GRU layer, one normalization layer, two convolutional layers, one flatten layer and two fully connected layers; the CNN_LSTM uses a bidirectional LSTM, is arranged in a sequential structure, and consists of one convolutional layer, one bidirectional LSTM layer, one normalization layer, two convolutional layers, one flatten layer and two fully connected layers;
S4, acquiring data of a drug small molecule to be predicted, and inputting the data of the drug small molecule to be predicted into the integrated prediction model to obtain a drug small molecule property prediction result,
the method for predicting the properties of drug small molecules based on deep learning further comprising:
acquiring different types of drug databases;
performing property prediction on the drug small molecule data in the different types of drug databases through the integrated prediction model to test the generalization capability of the integrated prediction model,
if it is detected that the generalization capability of the integrated prediction model does not meet the requirement, and predicting the properties of the drug small molecules in each drug database shows that the prediction accuracy of the integrated prediction model for one or more types of drug small molecules is greater than a preset threshold, then in step S4, after the data of the drug small molecule to be predicted is acquired, it is determined whether the drug small molecule to be predicted belongs to a type whose prediction accuracy is greater than the preset threshold; if so, the data of the drug small molecule to be predicted is input into the integrated prediction model to obtain the drug small molecule property prediction result; if not, the method returns to step S1 and ensemble learning is performed again to obtain the integrated prediction model, until the generalization capability of the integrated prediction model meets the requirement or the prediction accuracy of the integrated prediction model for the type to which the drug small molecule to be predicted belongs is greater than the preset threshold.
2. The method for predicting the properties of drug small molecules based on deep learning according to claim 1, wherein acquiring the structural feature data of each drug small molecule in the data set specifically comprises:
acquiring the SMILES character string corresponding to each drug small molecule;
encoding each SMILES character string with one-hot encoding and recording the position of each character in the string to construct a drug small molecule feature vector; or,
learning a vector representation of the molecular substructures of each drug small molecule using mol2vec to construct a drug small molecule feature vector.
3. The method for predicting the properties of drug small molecules based on deep learning according to claim 2, wherein preprocessing the sample set specifically comprises:
oversampling the structural feature data of the drug small molecules in the sample set using the SMOTE algorithm, and dividing the oversampled data into a training set and a test set.
4. A device for predicting the properties of drug small molecules based on deep learning, comprising:
a first acquisition module, configured to acquire a data set containing data of a plurality of drug small molecules, and to acquire structural feature data of each drug small molecule in the data set to form a sample set;
a preprocessing module, configured to preprocess the sample set;
a learning module, configured to perform ensemble learning on the preprocessed sample set through multiple neural networks to obtain an integrated prediction model, wherein the multiple neural networks comprise a multilayer CNN, CNN_GRU and CNN_LSTM; the multilayer CNN is arranged in a sequential structure and consists of one convolutional layer, one normalization layer, two convolutional layers, one flatten layer and two fully connected layers; the CNN_GRU uses a bidirectional GRU, is arranged in a sequential structure, and consists of one convolutional layer, one bidirectional GRU layer, one normalization layer, two convolutional layers, one flatten layer and two fully connected layers; the CNN_LSTM uses a bidirectional LSTM, is arranged in a sequential structure, and consists of one convolutional layer, one bidirectional LSTM layer, one normalization layer, two convolutional layers, one flatten layer and two fully connected layers;
a prediction module, configured to acquire data of a drug small molecule to be predicted and to input the data of the drug small molecule to be predicted into the integrated prediction model to obtain a drug small molecule property prediction result,
the device for predicting the properties of drug small molecules based on deep learning further comprising:
a second acquisition module, configured to acquire different types of drug databases;
a detection module, configured to perform property prediction on the drug small molecule data in the different types of drug databases through the integrated prediction model to test the generalization capability of the integrated prediction model,
if the detection module detects that the generalization capability of the integrated prediction model does not meet the requirement, and predicting the properties of the drug small molecules in each drug database shows that the prediction accuracy of the integrated prediction model for one or more types of drug small molecules is greater than a preset threshold, the prediction module, after acquiring the data of the drug small molecule to be predicted, determines whether the drug small molecule to be predicted belongs to a type whose prediction accuracy is greater than the preset threshold; if so, the data of the drug small molecule to be predicted is input into the integrated prediction model to obtain the drug small molecule property prediction result; if not, the first acquisition module, the preprocessing module and the learning module obtain the integrated prediction model again, until the generalization capability of the integrated prediction model meets the requirement or the prediction accuracy of the integrated prediction model for the type to which the drug small molecule to be predicted belongs is greater than the preset threshold.
5. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the method for deep learning based small molecule property prediction of a drug according to any of claims 1-3.
6. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the method for deep learning based drug small molecule property prediction according to any one of claims 1-3.
CN202011007504.8A 2020-09-23 2020-09-23 Method and device for predicting properties of small drug molecules based on deep learning Active CN112164428B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011007504.8A CN112164428B (en) 2020-09-23 2020-09-23 Method and device for predicting properties of small drug molecules based on deep learning

Publications (2)

Publication Number Publication Date
CN112164428A (en) 2021-01-01
CN112164428B (en) 2022-02-18

Family

ID=73864339

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011007504.8A Active CN112164428B (en) 2020-09-23 2020-09-23 Method and device for predicting properties of small drug molecules based on deep learning

Country Status (1)

Country Link
CN (1) CN112164428B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113393911B (en) * 2021-06-23 2022-08-19 石家庄鲜虞数字生物科技有限公司 Ligand compound rapid pre-screening method based on deep learning
CN115050428B (en) * 2022-06-10 2024-06-14 华南理工大学 Drug property prediction method and system based on deep learning fusion molecular graph and fingerprint

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110890137A (en) * 2019-11-18 2020-03-17 上海尔云信息科技有限公司 Modeling method, device and application of compound toxicity prediction model
CN111243682A (en) * 2020-01-10 2020-06-05 京东方科技集团股份有限公司 Method, device, medium and apparatus for predicting toxicity of drug

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019018780A1 (en) * 2017-07-20 2019-01-24 The University Of North Carolina At Chapel Hill Methods, systems and non-transitory computer readable media for automated design of molecules with desired properties using artificial intelligence

Also Published As

Publication number Publication date
CN112164428A (en) 2021-01-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant