CN114067905A

CN114067905A - Drug-target interaction prediction method fusing multilayer drug structure information

Info

Publication number: CN114067905A
Application number: CN202111313022.XA
Authority: CN
Inventors: 车超; 张培良
Original assignee: Dalian University
Current assignee: Dalian University
Priority date: 2021-11-08
Filing date: 2021-11-08
Publication date: 2022-02-18

Abstract

The invention provides a drug-target interaction prediction method fusing multilayer drug structure information. First, the drug and target information in a pharmaco-omics database is preprocessed, and the drugs and targets that have interactions are extracted. Second, the SMILES molecular fingerprint of each drug is expressed as a molecular graph structure, and drug feature information is extracted using a molecular complement graph convolutional neural network and a Transformer network. Then, the target sequence information is processed with a convolutional neural network to extract target feature information. Finally, the extracted drug and target feature information is fed into a classification model for training, the model is saved, and the relationship between drug and target is predicted. The method effectively extracts the feature information in the molecular structure of the drug, predicts drug-target relationships with higher accuracy, improves the efficiency and precision of verifying drug-target relationships, shortens the drug research and development cycle, and greatly reduces the cost of developing new drugs.

Description

Drug-target interaction prediction method fusing multilayer drug structure information

Technical Field

The invention relates to the technical field of medical artificial intelligence and natural language processing, and in particular to a drug-target interaction prediction method fusing multilayer drug structure information.

Background

The development of new drugs is an expensive and time-consuming process. The total average development cost of a new drug is reported to range from $2 to $30 billion, with a total development time of 13-15 years. The field therefore urgently needs a fast and efficient research and development approach that improves efficiency and reduces cost. Numerous studies show that drug repositioning can effectively shorten the drug research and development cycle and greatly reduce the cost of developing new drugs, and drug-target interactions play a key role in drug repositioning studies. The progress of the Human Genome Project has enabled the rapid accumulation of data on drug compounds, targets and their interactions, providing the data basis for predicting drug-target interactions. However, a large number of drug-target interactions have still not been discovered and verified.

At present, verification of drug-target interactions relies mainly on large-scale biological or chemical experiments. The verification process requires a large investment of manpower, materials and funds, and experimental verification is highly subject to chance, which further raises its cost. To reduce this cost, more and more computational methods are being used to predict drug-target interactions, but all of them have shortcomings. For example, similarity-based methods often ignore the deep feature information between drug and target. Deep learning methods can capture more of this feature information, but they have difficulty learning the complex relationships between different entities in complex omics data and offer little practical guidance for drug repositioning. The structural graph of a drug molecule contains the atoms and chemical bonds that make up the drug; it is a concentrated embodiment of the drug's chemical properties and therapeutic effect, and it has an important influence on the prediction of drug-target interactions. However, current methods based on graph neural networks focus only on the interaction relationships in the complex network and ignore the molecular structure of the drug itself.

Disclosure of Invention

In view of the problems in the prior art, the application provides a deep learning model based on drug molecular structure information that automatically predicts the interaction between a drug and a target, thereby improving verification efficiency and reducing verification cost.

To achieve this purpose, the technical scheme of the application is as follows: a method for predicting drug-target interaction fusing multilayer drug structure information, comprising the following steps:

step 1: preprocessing the drug and target information in a pharmaco-omics database, extracting the drugs and targets that have interactions, and constructing drug-target interaction data;

step 2: expressing the SMILES molecular fingerprint of each drug as a molecular graph structure, and extracting drug feature information using a molecular complement graph convolutional neural network and a Transformer network;

step 3: embedding the target sequence information, processing it with a convolutional neural network, and extracting target feature information;

step 4: feeding the extracted drug feature information and target feature information into a classification model for training, and then saving the model;

step 5: loading the model, inputting the drug and target to be predicted, predicting the relationship between the drug and the target, and outputting the prediction result.

Further, step 1 specifically includes:

step 1.1: screening the drug information and target information in the pharmaco-omics database, and deleting the drugs and targets that have no interaction relationship;

step 1.2: integrating the drugs and targets that have an interaction relationship into triples of the form <drug number, target number, label>, with the label set to 1;

step 1.3: acquiring the SMILES molecular fingerprint of each drug and the sequence information of each target from the pharmaco-omics database, and using them as the concrete representations of the drugs and the targets respectively;

step 1.4: randomly constructing unknown drug-target relationships as negative examples at a positive-to-negative ratio of 1:2, and setting the label of the negative examples to 0.
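Steps 1.2 and 1.4 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_interaction_data`, the identifiers `D1`, `T1` and so on, and the fixed random seed are all assumptions made for the example.

```python
import random

def build_interaction_data(positive_pairs, drug_ids, target_ids, neg_ratio=2, seed=0):
    """Return (drug, target, label) triples with neg_ratio negatives per positive."""
    rng = random.Random(seed)
    known = set(positive_pairs)
    n_neg = neg_ratio * len(known)
    # Number of drug-target pairs with no recorded interaction.
    max_unknown = len(drug_ids) * len(target_ids) - len(known)
    if n_neg > max_unknown:
        raise ValueError("not enough unknown pairs to sample negatives")
    negatives = set()
    while len(negatives) < n_neg:
        pair = (rng.choice(drug_ids), rng.choice(target_ids))
        if pair not in known:          # only unknown relationships become negatives
            negatives.add(pair)
    data = [(d, t, 1) for d, t in sorted(known)]
    data += [(d, t, 0) for d, t in sorted(negatives)]
    rng.shuffle(data)
    return data

# Hypothetical toy identifiers, for illustration only.
positives = [("D1", "T1"), ("D2", "T1"), ("D3", "T2")]
drugs = ["D1", "D2", "D3", "D4", "D5"]
targets = ["T1", "T2", "T3"]
dataset = build_interaction_data(positives, drugs, targets)
```

Sampling negatives from unknown pairs treats "not recorded" as "no interaction", which is the assumption stated in step 1.4; some sampled negatives may in reality be undiscovered positives.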

Further, step 2 specifically includes:

step 2.1: expressing the SMILES molecular fingerprint of each drug in graph form by calling the RDKit library in Python, wherein the vertices and edges of the graph represent the atoms and chemical bonds of the drug respectively; each drug molecule is expressed by a feature matrix and an adjacency matrix, and each row of the feature matrix corresponds to the attributes of one atom; the drugs are represented as

$$G = \{(X_i, A_i)\}_{i=1}^{N}, \qquad X_i \in \mathbb{R}^{D_i \times C}, \qquad A_i \in \mathbb{R}^{D_i \times D_i}$$

wherein N represents the number of drugs, $X_i$ represents the feature matrix of the i-th drug, $A_i$ represents the adjacency matrix of the i-th drug, $D_i$ represents the number of atoms of the i-th drug, and C represents the number of feature channels of the atoms;

step 2.2: referring to fig. 2, the molecular complement graph convolutional neural network and the Transformer network are used to extract the drug characteristic information.
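The matrices of step 2.1 can be illustrated with a small sketch. To keep it dependency-free, the atoms and bonds of ethanol (SMILES `CCO`, heavy atoms only) are written out by hand; in the described method they would come from RDKit, for example from `Chem.MolFromSmiles` followed by iterating `GetAtoms()` and `GetBonds()`. The element vocabulary `ELEMENTS` and the one-hot atom attributes are assumptions, since the text does not list the atom features used.

```python
# Hypothetical element vocabulary; its length plays the role of C (feature channels).
ELEMENTS = ["C", "N", "O", "S"]

def molecule_to_matrices(atoms, bonds):
    """Build the feature matrix X (D_i x C) and adjacency matrix A (D_i x D_i)."""
    n = len(atoms)
    # One row per atom: one-hot encoding of the element symbol.
    features = [[1.0 if sym == e else 0.0 for e in ELEMENTS] for sym in atoms]
    adjacency = [[0.0] * n for _ in range(n)]
    for a, b in bonds:
        adjacency[a][b] = 1.0   # chemical bonds are undirected edges
        adjacency[b][a] = 1.0
    return features, adjacency

# Ethanol, heavy atoms only: C-C-O.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
X, A = molecule_to_matrices(atoms, bonds)
```

Here `X` has one row per atom and `A` is symmetric, which matches the description of vertices as atoms and edges as bonds.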

Further, step 2.2 specifically includes:

step 2.2.1: the molecular complement graph convolutional neural network (MCGCN) takes the graph G obtained in step 2.1 as input; by adding a complement graph to the original graph of each drug molecule, the MCGCN makes the adjacency matrices and feature matrices of all drug molecules the same size, wherein the original graph and the complement graph are independent of each other; after completion, the molecular graph of the i-th drug is expressed as:

$$A_i^{new} = \begin{bmatrix} A_i & B_i \\ B_i^{\top} & A_i' \end{bmatrix}, \qquad X_i^{new} = \begin{bmatrix} X_i \\ X_i' \end{bmatrix}$$

wherein $A_i$ and $X_i$ are the adjacency matrix and feature matrix of the original graph, $A_i'$ and $X_i'$ are those of the complement graph, $B_i$ represents the connection matrix between the original graph G and the complement graph G' of the i-th drug, and $A_i^{new}$ and $X_i^{new}$ respectively represent the adjacency matrix and the feature matrix after completion; through the completion operation, all drug molecules are represented as a graph $G_{new}$ with a consistent number of nodes and size; the MCGCN comprises two hidden layers, and the drug is represented in each hidden layer using formula (2):

$$H^{(l+1)} = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} \Theta^{(l)}\right) \tag{2}$$

wherein $\hat{A}$ is the adjacency matrix with self-attention added, $\hat{D}$ is the weight matrix of $\hat{A}$, and $H^{(l)}$ and $\Theta^{(l)}$ are the convolved signal and the filter parameters of the l-th layer; the activation of each hidden layer is σ(·), which is set to ReLU(·) = max(0, ·); max pooling is applied at the end of the MCGCN to reduce the dimensionality of the data;
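The completion and hidden-layer computation of step 2.2.1 can be sketched in plain Python. Several points are assumptions rather than statements from the text: the complement nodes are taken to be isolated with zero features (so the connection block is all zeros), the modified adjacency matrix is formed by adding the identity as in the standard graph convolution formulation, and the weights `theta1` and `theta2` are hypothetical fixed values instead of trained parameters.

```python
import math

def pad_graph(features, adjacency, size):
    """Pad a molecular graph to `size` nodes with isolated, zero-feature nodes."""
    n, c = len(features), len(features[0])
    x_new = [row[:] for row in features] + [[0.0] * c for _ in range(size - n)]
    a_new = [row[:] + [0.0] * (size - n) for row in adjacency]
    a_new += [[0.0] * size for _ in range(size - n)]
    return x_new, a_new

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(h, adjacency, theta):
    """One hidden layer: ReLU(D^-1/2 (A + I) D^-1/2 H Theta). A + I is an assumption."""
    n = len(adjacency)
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    norm = [[inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)]
            for i in range(n)]
    z = matmul(matmul(norm, h), theta)
    return [[max(0.0, v) for v in row] for row in z]   # ReLU

def max_pool(h):
    """Max pooling over nodes: one value per feature channel."""
    return [max(col) for col in zip(*h)]

# Toy 3-atom path graph with 2 feature channels, padded to 5 nodes.
X = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
A = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
x_new, a_new = pad_graph(X, A, size=5)
theta1 = [[1.0, 0.0], [0.0, 1.0]]      # hypothetical weights
theta2 = [[0.5, -0.5], [0.5, 0.5]]     # hypothetical weights
h1 = gcn_layer(x_new, a_new, theta1)   # first hidden layer
h2 = gcn_layer(h1, a_new, theta2)      # second hidden layer
pooled = max_pool(h2)
```

Because the padded nodes are isolated and carry zero features, their rows stay zero after each layer, so padding changes the matrix size without changing the representation of the original atoms.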

step 2.2.2: the Transformer network takes the output vectors $H^{(1)}$ and $H^{(2)}$ of the hidden layers of the MCGCN as input and extracts features with the Encoder part of the Transformer network; in the Transformer network, the vectors from different hidden layers of the MCGCN are processed by different multi-head attention modules: the vector $H^{(1)}$ from the first hidden layer is processed by a multi-head attention module with 6 heads, and the vector $H^{(2)}$ from the second hidden layer is processed by a multi-head attention module with 4 heads; features are extracted in each multi-head attention module using formula (3):

$$\mathrm{MultiHead}_j(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_i) \tag{3}$$

the feature vectors produced by the multi-head attention modules are concatenated using formula (4) and sent to the original layer normalization part of the Transformer network; the output of the layer normalization is then sent to a fully connected feed-forward neural network, whose output is taken as the final drug feature vector $M_{alldrug}$:

$$\mathrm{AllMultiHead} = \mathrm{Concat}(\mathrm{MultiHead}_1, \ldots, \mathrm{MultiHead}_j) \tag{4}$$
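Formulas (3) and (4) can be illustrated with a small scaled dot-product attention sketch. The projection matrices are random and untrained, the model width of 12 is chosen only because it is divisible by both 6 and 4, and the inputs `h1` and `h2` are random stand-ins for the two MCGCN hidden-layer outputs; all of these are assumptions for illustration. Layer normalization and the feed-forward network are omitted.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head(x, num_heads, rng):
    """Formula (3): concatenate the outputs of num_heads attention heads."""
    d_model = len(x[0])
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        wq, wk, wv = (rand_matrix(d_model, d_head, rng) for _ in range(3))
        heads.append(attention(matmul(x, wq), matmul(x, wk), matmul(x, wv)))
    return [sum((h[r] for h in heads), []) for r in range(len(x))]

rng = random.Random(0)
n_nodes, d_model = 5, 12                  # hypothetical sizes
h1 = rand_matrix(n_nodes, d_model, rng)   # stand-in for first hidden layer output
h2 = rand_matrix(n_nodes, d_model, rng)   # stand-in for second hidden layer output
multi_head_1 = multi_head(h1, 6, rng)     # 6 heads for the first hidden layer
multi_head_2 = multi_head(h2, 4, rng)     # 4 heads for the second hidden layer
# Formula (4): concatenate the two multi-head outputs along the feature axis.
all_multi_head = [r1 + r2 for r1, r2 in zip(multi_head_1, multi_head_2)]
```

Using a separate attention module per hidden layer lets each layer's representation be weighted independently before the two are fused.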

Further, the step 3 specifically includes:

step 3.1: randomly initializing a lookup table covering all amino acids that appear in the target sequences, wherein the size of the lookup table is 26 × 20; mapping the amino acids of each target sequence through the lookup table to construct the embedding matrix $M_{tar}$ of the target sequence; the length of the embedding matrix $M_{tar}$ is the maximum target sequence length, set to 2500, and its width equals the width of the lookup table; the embedding vectors are optimized continuously during model training, so the contents of the lookup table change as the model is optimized;

step 3.2: referring to fig. 3, using a convolutional neural network to extract the feature information in the target sequence, with the embedding matrix $M_{tar}$ obtained in step 3.1 as the input of the convolutional neural network; target sequences shorter than the length of the embedding matrix are automatically padded with empty tokens.
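The lookup-table embedding and padding of steps 3.1 and 3.2 can be sketched as follows. The text gives the table size (26 × 20) and the maximum length (2500) but not how the 26 rows are assigned, so the split used here (row 0 for the padding token, rows 1 to 25 for amino-acid letter codes) is an assumption, as is the initialization range.

```python
import random

MAX_LEN = 2500      # maximum target sequence length, from the text
EMBED_DIM = 20      # lookup table width, from the text
PAD = 0             # assumed index of the empty (padding) token
# Assumed alphabet: 20 standard amino acids plus B, O, U, X, Z (25 codes).
CODES = "ACDEFGHIKLMNPQRSTVWYBOUXZ"
INDEX = {aa: i + 1 for i, aa in enumerate(CODES)}

def init_lookup_table(seed=0):
    """Randomly initialize a (len(CODES) + 1) x EMBED_DIM table, i.e. 26 x 20."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
            for _ in range(len(CODES) + 1)]

def encode(sequence):
    """Map residues to table indices, truncate to MAX_LEN and pad with PAD."""
    ids = [INDEX[aa] for aa in sequence[:MAX_LEN]]
    return ids + [PAD] * (MAX_LEN - len(ids))

def embed(sequence, table):
    """Build the MAX_LEN x EMBED_DIM embedding matrix of one target sequence."""
    return [table[i] for i in encode(sequence)]

table = init_lookup_table()
m_tar = embed("MKTAYIAKQR", table)   # hypothetical short sequence
```

In a trainable model the table rows would be parameters updated by backpropagation, which is what the text means by the lookup table changing during optimization.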

Further, the step 3.2 specifically includes:

step 3.2.1: inputting the embedding matrix $M_{tar}$ obtained in step 3.1 into convolution layers with kernel sizes of 10, 15 and 20 respectively and a stride of 1 to extract features, and passing the extracted feature vectors through the ELU activation function, which is defined as follows:

$$\mathrm{ELU}(x) = \begin{cases} x, & x > 0 \\ \alpha\,(e^{x} - 1), & x \le 0 \end{cases}$$

step 3.2.2: sending the ELU-activated vectors to a global max pooling layer to extract the most important local features; after the global max pooling layer, the resulting vector has a dimension of 128;

step 3.2.3: concatenating the output vectors of the max pooling layers to obtain a concatenated vector of dimension 384, and inputting the concatenated vector into a fully connected neural network to obtain a vector of dimension 128, which is used as the final target feature vector $M_{alltar}$.
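The three-branch convolution of step 3.2 can be sketched in plain Python at a reduced scale. The kernel sizes (10, 15, 20) and stride follow the text; the filter count of 8 per branch (instead of 128), the sequence length of 40 (instead of 2500), the ELU parameter `alpha=1.0`, and the random untrained weights are assumptions chosen to keep the example fast.

```python
import math
import random

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def conv1d(matrix, filters, stride=1):
    """Valid 1-D convolution along the sequence axis.
    matrix: L x E, filters: F x k x E, result: (L - k + 1) x F for stride 1."""
    k = len(filters[0])
    out = []
    for start in range(0, len(matrix) - k + 1, stride):
        window = matrix[start:start + k]
        out.append([
            sum(w * x for w_row, x_row in zip(f, window) for w, x in zip(w_row, x_row))
            for f in filters
        ])
    return out

def global_max_pool(feature_map):
    """Keep the largest activation of each filter over all positions."""
    return [max(col) for col in zip(*feature_map)]

def target_features(matrix, kernel_sizes, n_filters, rng):
    """Run one conv branch per kernel size and concatenate the pooled outputs."""
    embed_dim = len(matrix[0])
    pooled = []
    for k in kernel_sizes:
        filters = [[[rng.uniform(-0.1, 0.1) for _ in range(embed_dim)]
                    for _ in range(k)] for _ in range(n_filters)]
        fmap = [[elu(v) for v in row] for row in conv1d(matrix, filters)]
        pooled += global_max_pool(fmap)
    return pooled

def dense(vec, weights):
    """Fully connected layer without bias: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

rng = random.Random(0)
m_tar = [[rng.uniform(-1, 1) for _ in range(20)] for _ in range(40)]  # toy 40 x 20 input
features = target_features(m_tar, (10, 15, 20), n_filters=8, rng=rng)  # 3 x 8 = 24
fc = [[rng.uniform(-0.1, 0.1) for _ in range(len(features))] for _ in range(8)]
m_alltar = dense(features, fc)
```

With 128 filters per branch the concatenated vector would have 3 × 128 = 384 entries, matching the dimensions given in the text.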

Further, the step 4 specifically includes:

step 4.1: concatenating the drug feature vector $M_{alldrug}$ obtained in step 2 and the target feature vector $M_{alltar}$ obtained in step 3 to obtain the final vector representation $M_{all}$ of the input data, and taking the label of the original drug-target relationship as the label of the final vector;

step 4.2: inputting the final vector representation $M_{all}$ obtained in step 4.1, together with its label, into a fully connected neural network and training the model; to obtain the best model performance, the model is optimized using a binary cross-entropy function regularized with the L2 norm, and the best-performing model, model_best, is saved.
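A binary cross-entropy loss with an L2 penalty, as named in step 4.2, can be sketched as below. The exact form of the penalty and its coefficient are not given in the text, so the term `lam * sum(w**2)` and the default `lam` are assumptions; `eps` is only a numerical guard against `log(0)`.

```python
import math

def bce_with_l2(y_true, y_prob, weights, lam=1e-4, eps=1e-12):
    """Mean binary cross-entropy plus lam * ||weights||^2 (assumed penalty form)."""
    n = len(y_true)
    bce = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # keep log() finite
        bce -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    penalty = lam * sum(w * w for w in weights)
    return bce / n + penalty

# Hypothetical labels, predicted probabilities and flattened model weights.
loss_plain = bce_with_l2([1, 0], [0.5, 0.5], weights=[])
loss_reg = bce_with_l2([1, 0], [0.5, 0.5], weights=[1.0, 2.0], lam=0.1)
```

The penalty grows with the squared size of the weights, so minimizing the combined loss discourages large weights while fitting the labels.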

Further, step 5 specifically includes:

loading the model_best saved in step 4.2, inputting the drug-target information in the verification data into the model, judging whether the drug and the target interact, and outputting the corresponding evaluation indexes.

By adopting the above technical scheme, the invention achieves the following technical effects: the invention uses a deep learning model, draws on the drug and target information in the drug database, combines the structural features of the drugs and targets, and automatically predicts drug-target interactions. The method effectively extracts the feature information in the molecular structure of the drug, predicts drug-target relationships with higher accuracy and robustness, improves the efficiency and precision of verifying drug-target relationships, shortens the drug research and development cycle, greatly reduces the cost of developing new drugs, and provides an important basis and guarantee for new drug development and drug repositioning.

Drawings

FIG. 1 is a flow chart of the drug-target interaction prediction method fusing multilayer drug structure information;

FIG. 2 is a flow chart of drug characteristic information extraction;

FIG. 3 is a flow chart of target feature information extraction.

Detailed Description

The embodiments of the present invention are implemented on the premise of the technical solution of the present invention, and detailed embodiments and specific operation procedures are given, but the scope of the present invention is not limited to the following embodiments.

Example 1

The present invention is described in detail below with reference to examples so that those skilled in the art can practice the invention with reference to the present specification.

In this embodiment, Windows is used as the development environment, PyCharm as the development platform and Python as the development language; the drug-target interaction prediction method fusing multilayer drug structure information is used to predict drug-target interaction relationships and to predict potential therapeutic drugs for COVID-19.

In this embodiment, a method for predicting drug-target interaction by fusing multilayer drug structure information includes the following steps:

step 1: given a target, namely the Delta target of COVID-19, finding the 54 drugs that interact with the target in the PubChem database, and setting their data label to 1;

step 2: randomly selecting 108 drugs that do not interact with the Delta target from the PubChem database to construct negative examples, and setting their data label to 0;

step 3: acquiring the chemical structures of the drugs in steps 1 and 2 and the sequence structure of the Delta target from the PubChem database;

step 4: converting the chemical structure of each drug into a molecular structure graph using the RDKit toolkit in Python, and saving the graph as a file in hkl format, one file per drug;

step 5: taking the hkl files of the drugs and the sequence structure of the Delta target as input, and loading the saved model to obtain the evaluation indexes and the predicted score (Score) of the interaction between each drug and the Delta target, wherein the evaluation indexes comprise accuracy (ACC), F1 value and AUC:

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}, \qquad F1 = \frac{2\,TP}{2\,TP + FP + FN}$$

wherein TP (true positives) is the number of positive examples correctly predicted as positive; FP (false positives) is the number of negative examples wrongly predicted as positive; FN (false negatives) is the number of positive examples wrongly predicted as negative; TN (true negatives) is the number of negative examples correctly predicted as negative; AUC is the area under the ROC curve;

step 6: sorting the prediction scores (Score) from step 5 in descending order to obtain the top 5 drugs.
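The evaluation indexes of step 5 and the ranking of step 6 can be sketched as follows. The labels, scores, drug names and the 0.5 decision threshold are hypothetical toy values, not results from the patent; AUC is computed with the pairwise-ranking formulation, which equals the area under the ROC curve.

```python
def confusion(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def top_k(drug_scores, k=5):
    """Sort drugs by predicted score in descending order and keep the first k."""
    return sorted(drug_scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Hypothetical toy data.
y_true = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
tp, fp, fn, tn = confusion(y_true, y_pred)
acc = accuracy(tp, fp, fn, tn)
f1 = f1_score(tp, fp, fn)
area = auc(y_true, scores)
names = ["drug_a", "drug_b", "drug_c", "drug_d", "drug_e", "drug_f"]
ranking = top_k(dict(zip(names, scores)), k=5)
```

ACC and F1 depend on the chosen threshold, while AUC depends only on how the scores order positives relative to negatives.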

Following the above steps, the drug-target relationship prediction performance is compared with the DeepDTA, DeepDTI, DeepConv-DTI and ML-DTI models. As can be seen from Table 1, the method proposed in the present invention outperforms the other methods in AUC, F1 value and prediction accuracy.

TABLE 1 comparison of prediction results for different models for drug-target relationship

The method of the invention is used to predict potential therapeutic drugs for COVID-19 against the Delta target. In the experimental results, four of the top five drugs, including Tramadol, have either been used clinically to treat COVID-19 or are supported by the literature as having an inhibitory effect on COVID-19, as shown in Table 2. Tramadol, Amitriptyline and Dextromethorphan all have a close interaction with the Delta target. Among the recommended drugs, Dexamethasone and Dextromethorphan are widely used in the clinical treatment of COVID-19 and have successfully alleviated its complications. Tramadol can protect COVID-19 patients from disease complications by increasing the antioxidant enzymes superoxide dismutase and glutathione peroxidase while reducing the effects of malondialdehyde. It has been shown that the probability of cell infection is reduced by 90% after treatment with different concentrations of Amitriptyline, which also provides a basis for using Amitriptyline in COVID-19 therapy.

TABLE 2 Top five COVID-19-related therapeutic drugs recommended by the present invention

The foregoing descriptions of specific exemplary embodiments of the present invention have been presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain certain principles of the invention and its practical application to enable one skilled in the art to make and use various exemplary embodiments of the invention and various alternatives and modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.

Claims

1. A method for predicting drug-target interaction fused with multilayer drug structure information is characterized by comprising the following steps:

2. The method for predicting the drug-target interaction fusing the structural information of the multilayer drug according to claim 1, wherein the step 1 specifically comprises:

3. The method for predicting the drug-target interaction fusing the structural information of the multilayer drug according to claim 1, wherein the step 2 specifically comprises:

wherein N represents the number of drugs and $X_i$ represents the feature matrix of the i-th drug;

step 2.2: extracting drug feature information using a molecular complement graph convolutional neural network and a Transformer network.

4. The method for predicting the drug-target interaction fusing the structural information of the multilayer drug according to claim 3, wherein the step 2.2 specifically comprises:

wherein $B_i$ represents the connection matrix between the original graph G and the complement graph G' of the i-th drug;

wherein $\hat{A}$ is the adjacency matrix with self-attention added, and $\hat{D}$ is the weight matrix of $\hat{A}$;

Taking an Encoder part in a Transformer network as an input to extract features; in a Transformer network, vector information from different hidden layers in an MCGCN is processed by different multi-head attention modules, and for a vector from a first hidden layer in the MCGCN

MultiHead_j(Q,K,V)＝Concat(head₁,…,head_i) (3)

concatenating the feature vectors processed by the multi-head attention modules using formula (4), sending them to the original layer normalization part of the Transformer network, sending the output of the layer normalization to a fully connected feed-forward neural network, and taking the output of the fully connected feed-forward neural network as the final drug feature vector $M_{alldrug}$;

AllMultiHead＝Concat(MultiHead₁,…,MultiHead_j) (4)。

5. The method for predicting drug-target interaction fusing structural information of multilayer drugs according to claim 1, wherein the step 3 specifically comprises:

step 3.1: randomly initializing a lookup table covering all amino acids that appear in the target sequences; mapping the amino acids of each target sequence through the lookup table to construct the embedding matrix $M_{tar}$ of the target sequence; the length of the embedding matrix $M_{tar}$ is the maximum target sequence length, and its width equals the width of the lookup table;

step 3.2: using a convolutional neural network to extract the feature information in the target sequence, with the embedding matrix $M_{tar}$ obtained in step 3.1 as the input of the convolutional neural network; target sequences shorter than the length of the embedding matrix are automatically padded with empty tokens.

6. The method for predicting drug-target interaction fusing structural information of multilayer drugs according to claim 5, wherein the step 3.2 specifically comprises:

step 3.2.2: sending the ELU-activated vectors to a global max pooling layer to extract the most important local features;

step 3.2.3: concatenating the output vectors of the max pooling layers, and inputting the concatenated vector into a fully connected neural network, the output of which is used as the final target feature vector $M_{alltar}$.

7. The method for predicting drug-target interaction fusing structural information of multilayer drugs according to claim 1, wherein the step 4 specifically comprises:

step 4.2: inputting the final vector representation $M_{all}$ obtained in step 4.1, together with its label, into a fully connected neural network and training the model; optimizing the model using a binary cross-entropy function regularized with the L2 norm, and saving the best-performing model, model_best.

8. The method for predicting drug-target interaction fusing structural information of multilayer drugs according to claim 7, wherein the step 5 specifically comprises:

loading the model_best saved in step 4.2, inputting the drug-target information in the verification data into the model, judging whether the drug and the target interact, and outputting the corresponding evaluation indexes.