Background
Protein-ligand interactions are ubiquitous and indispensable in life processes, and play a very important role in the recognition and signaling of biomolecules. Accurately identifying the ligand-binding residues in a protein sequence therefore benefits the study of protein structure, the annotation of protein function and the design of drug target proteins, and is of considerable biological significance.
Many methods for predicting ligand-binding residues in proteins have been proposed, for example: FTSite (Ngan C H, Hall D R, Zerbe B, et al. FTSite: high accuracy detection of ligand binding sites on unbound protein structures [J]. Bioinformatics, 2011, 28(2): 286-287); DeepSite (Jiménez J, Doerr S, Martínez-Rosell G, et al. DeepSite: protein-binding site predictor using 3D-convolutional neural networks [J]. Bioinformatics, 2017, 33(19)); the energy-based protein-ligand binding site prediction method of Laurie et al. (Laurie A T R, et al. [J]. Bioinformatics, 2005, 21(9): 1908); and the method of Zhou et al., which uses a convolutional neural network with sequence features to predict DNA-binding residues in proteins (Zhou J, et al. [C]// 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2016: 78-85). Although these existing methods can predict ligand-binding residues in protein sequences, they generally rely on large amounts of experimental data and on machine learning algorithms, which makes them computationally expensive. In addition, they pay insufficient attention to noise in the training set, so their prediction accuracy is not guaranteed to be optimal and needs further improvement.
In summary, existing ligand-binding residue prediction methods fall well short of the requirements of practical application in terms of computational cost and prediction accuracy, and urgently need improvement.
Disclosure of Invention
In order to overcome the shortcomings of existing ligand-binding residue prediction methods in computational cost and prediction accuracy, the invention provides a ligand-binding residue prediction method based on a deep convolutional neural network that has low computational cost and high prediction accuracy.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a method for ligand-binding residue prediction based on a deep convolutional neural network, the method comprising the steps of:
1) inputting a protein sequence S with L residues for which ligand-binding residues are to be predicted;
2) for the input protein sequence S, searching the protein sequence database UniRef90 (ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/) with the HHblits program (https://toolkit.tuebingen.mpg.de/#/HHblits) to generate multiple sequence alignment information comprising M sequences, denoted MSA;
3) for the residue at row i and column j of the MSA, counting the frequency with which its residue type occurs in that row and in that column, recorded as:
where N_i and N_j denote the number of times the residue type occurs in the i-th row and the j-th column, respectively;
4) expanding the MSA into an L × M × 21 three-dimensional residue cube, in which the residue type at any position is denoted:
where P(i), P(j) ∈ (0,1); δ(P(i) × 21) and δ(P(j) × 21) are the integers obtained by rounding P(i) × 21 and P(j) × 21; and AA_1, AA_2, …, AA_21 denote the 20 common amino acid types and the gap type;
5) dividing the three-dimensional residue cube obtained in step 4) into 21 planes of size L × M, and counting the frequency with which the residue type at any position in the cube occurs in the plane containing that position, recorded as:
where Q_{x,y,z} denotes the number of times the residue type at position (x, y, z) in the cube occurs in the plane in which it lies;
6) taking G(x, y, z) calculated in step 5) as the element at the corresponding position in three-dimensional space to form a three-dimensional feature cube;
7) building a deep convolutional neural network to predict the binding residues of the protein sequence S, wherein the network comprises five layers in order, namely a convolutional layer, a pooling layer, a convolutional layer, a pooling layer and a fully connected layer; the output of each layer serves as the input of the next layer; the fully connected layer uses a sigmoid activation function so that the output values lie in the range (0,1); and the output of the network is recorded as:
g(I)=net(pool2(conv2(pool1(conv1(I))))),
where I denotes the input of the network; conv1 and conv2 denote the operations of the first and second convolutional layers; pool1 and pool2 denote the operations of the first and second pooling layers; and net denotes the operation of the fully connected layer;
8) generating three-dimensional feature cubes from protein sequences with known binding residues through steps 2) to 6), inputting the feature cubes into the constructed deep convolutional neural network, and adjusting the network parameters by means of a cross-entropy loss function to obtain a deep convolutional neural network model, the cross-entropy loss function being recorded as:
where u denotes the true label of the residue to be classified in the protein sequence, the other variable denotes the predicted output value of the network model, and L denotes the difference between the predicted output and the true label;
9) inputting the three-dimensional feature cube generated from the protein sequence S into the deep convolutional neural network model, setting an output probability threshold, denoted threshold, and identifying the positions whose output value is greater than the threshold as binding residues.
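The row and column counting of step 3) can be illustrated with a minimal sketch. The formula for the frequencies is not reproduced in the text, so the normalisation used here (N_i divided by the row length, N_j divided by the column length) is an assumption, as are the function name row_col_frequencies and the toy alignment; each string is treated as one row of the MSA.

```python
from collections import Counter


def row_col_frequencies(msa):
    """Return (p_row, p_col), one value per cell of a rectangular MSA.

    Assumption (not stated in the source): P(i) = N_i / row length and
    P(j) = N_j / column length, where N_i and N_j are the counts of the
    cell's residue type in its row and in its column.
    """
    n_rows = len(msa)
    n_cols = len(msa[0])
    if any(len(seq) != n_cols for seq in msa):
        raise ValueError("all aligned sequences must have the same length")

    # Count each residue type (including the gap '-') per row and per column.
    row_counts = [Counter(seq) for seq in msa]
    col_counts = [Counter(seq[j] for seq in msa) for j in range(n_cols)]

    p_row = [[row_counts[i][msa[i][j]] / n_cols for j in range(n_cols)]
             for i in range(n_rows)]
    p_col = [[col_counts[j][msa[i][j]] / n_rows for j in range(n_cols)]
             for i in range(n_rows)]
    return p_row, p_col


# Hypothetical toy alignment: 3 rows, 4 columns, '-' marks a gap.
toy_msa = ["ACA-", "ACCA", "GCA-"]
p_row, p_col = row_col_frequencies(toy_msa)
print(p_row[0][0], p_col[0][0])
```

In the toy alignment, the residue A at row 0, column 0 occurs twice in its four-character row and twice in its three-character column.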
The technical concept of the invention is as follows: first, for an input protein sequence with L residues whose ligand-binding residues are to be predicted, multiple sequence alignment information containing M sequences is obtained with the HHblits program; second, the frequency with which the residue type at each position of the alignment occurs in its row and in its column is counted, and the alignment is expanded into a three-dimensional residue cube; third, the frequency with which the residue type at each position of the residue cube occurs in the plane containing it is counted, and a three-dimensional feature cube is built from these frequencies; fourth, a deep convolutional neural network is built and trained on protein sequences with known binding residues; finally, the protein sequence to be predicted is converted into a three-dimensional feature cube and input into the trained deep convolutional neural network model, which predicts whether each residue in the sequence is a binding residue.
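The plane-wise counting that turns the residue cube into the feature cube can be sketched as follows. The definition of G(x, y, z) is not reproduced in the text, so dividing the count Q_{x,y,z} by the number of cells in the plane is an assumption; the function name plane_frequencies and the 2 × 2 × 2 toy cube are likewise hypothetical (the method itself uses L × M × 21).

```python
from collections import Counter


def plane_frequencies(cube):
    """cube[x][y][z] holds a residue-type label; return G with the same shape.

    Assumption (not stated in the source): G(x, y, z) = Q / (cells per plane),
    where Q counts how often the label at (x, y, z) occurs in plane z.
    """
    n_x, n_y, n_z = len(cube), len(cube[0]), len(cube[0][0])
    plane_size = n_x * n_y

    # One counter per z-plane: occurrences of each residue type in that plane.
    counts = [
        Counter(cube[x][y][z] for x in range(n_x) for y in range(n_y))
        for z in range(n_z)
    ]
    return [
        [[counts[z][cube[x][y][z]] / plane_size for z in range(n_z)]
         for y in range(n_y)]
        for x in range(n_x)
    ]


# Hypothetical toy cube with two planes (z = 0 and z = 1).
toy_cube = [
    [["A", "C"], ["A", "G"]],
    [["C", "G"], ["A", "G"]],
]
G = plane_frequencies(toy_cube)
print(G[0][0][0], G[0][0][1])
```

Plane z = 0 of the toy cube contains A three times and C once, so the cells holding A receive 0.75 and the cell holding C receives 0.25.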
The beneficial effects of the invention are as follows: on the one hand, a three-dimensional residue cube is constructed from the multiple sequence alignment information, and feature information is further extracted from the residue cube to construct a three-dimensional feature cube, which lays the groundwork for improved prediction accuracy; on the other hand, a deep convolutional neural network is constructed to predict ligand-binding residues, which further improves the efficiency and accuracy of ligand-binding residue prediction.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to Fig. 1 and Fig. 2, a ligand-binding residue prediction method based on a deep convolutional neural network comprises the following steps:
1) inputting a protein sequence S with L residues for which ligand-binding residues are to be predicted;
2) for the input protein sequence S, searching the protein sequence database UniRef90 (ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/) with the HHblits program (https://toolkit.tuebingen.mpg.de/#/HHblits) to generate multiple sequence alignment information comprising M sequences, denoted MSA;
3) for the residue at row i and column j of the MSA, counting the frequency with which its residue type occurs in that row and in that column, recorded as:
where N_i and N_j denote the number of times the residue type occurs in the i-th row and the j-th column, respectively;
4) expanding the MSA into an L × M × 21 three-dimensional residue cube, in which the residue type at any position is denoted:
where P(i), P(j) ∈ (0,1); δ(P(i) × 21) and δ(P(j) × 21) are the integers obtained by rounding P(i) × 21 and P(j) × 21; and AA_1, AA_2, …, AA_21 denote the 20 common amino acid types and the gap type;
5) dividing the three-dimensional residue cube obtained in step 4) into 21 planes of size L × M, and counting the frequency with which the residue type at any position in the cube occurs in the plane containing that position, recorded as:
where Q_{x,y,z} denotes the number of times the residue type at position (x, y, z) in the cube occurs in the plane in which it lies;
6) taking G(x, y, z) calculated in step 5) as the element at the corresponding position in three-dimensional space to form a three-dimensional feature cube;
7) building a deep convolutional neural network to predict the binding residues of the protein sequence S, wherein the network comprises five layers in order, namely a convolutional layer, a pooling layer, a convolutional layer, a pooling layer and a fully connected layer; the output of each layer serves as the input of the next layer; the fully connected layer uses a sigmoid activation function so that the output values lie in the range (0,1); and the output of the network is recorded as:
g(I)=net(pool2(conv2(pool1(conv1(I))))),
where I denotes the input of the network; conv1 and conv2 denote the operations of the first and second convolutional layers; pool1 and pool2 denote the operations of the first and second pooling layers; and net denotes the operation of the fully connected layer;
8) generating three-dimensional feature cubes from protein sequences with known binding residues through steps 2) to 6), inputting the feature cubes into the constructed deep convolutional neural network, and adjusting the network parameters by means of a cross-entropy loss function to obtain a deep convolutional neural network model, the cross-entropy loss function being recorded as:
where u denotes the true label of the residue to be classified in the protein sequence, the other variable denotes the predicted output value of the network model, and L denotes the difference between the predicted output and the true label;
9) inputting the three-dimensional feature cube generated from the protein sequence S into the deep convolutional neural network model, setting an output probability threshold, denoted threshold, and identifying the positions whose output value is greater than the threshold as binding residues.
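The layer composition g(I)=net(pool2(conv2(pool1(conv1(I))))) of step 7) can be traced with a deliberately small sketch. It is a one-dimensional toy, not the network of the invention, which operates on the three-dimensional feature cube. The kernel sizes, the weights, the pooling width, the use of max pooling and the ReLU applied after each convolution are all assumptions, since the text specifies none of them.

```python
import math


def conv1d(signal, kernel, bias=0.0):
    """Valid 1-D cross-correlation followed by ReLU (ReLU is an assumption)."""
    k = len(kernel)
    out = []
    for start in range(len(signal) - k + 1):
        s = sum(signal[start + t] * kernel[t] for t in range(k)) + bias
        out.append(max(0.0, s))
    return out


def max_pool1d(signal, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(signal[i:i + size])
            for i in range(0, len(signal) - size + 1, size)]


def dense_sigmoid(features, weights, bias=0.0):
    """Fully connected layer with a sigmoid, so the output lies in (0, 1)."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def g(I, params):
    """g(I) = net(pool2(conv2(pool1(conv1(I)))))."""
    h = conv1d(I, params["k1"])      # conv1: 12 values -> 10
    h = max_pool1d(h)                # pool1: 10 -> 5
    h = conv1d(h, params["k2"])      # conv2: 5 -> 4
    h = max_pool1d(h)                # pool2: 4 -> 2
    return dense_sigmoid(h, params["w"], params["b"])  # net: 2 -> 1


# Hypothetical input and parameters, chosen only to exercise the pipeline.
toy_input = [0.1, 0.4, 0.2, 0.9, 0.3, 0.7, 0.5, 0.6, 0.8, 0.2, 0.1, 0.3]
toy_params = {"k1": [0.5, -0.2, 0.3], "k2": [1.0, -1.0],
              "w": [0.8, -0.4], "b": 0.1}
y = g(toy_input, toy_params)
print(y)
```

Whatever the parameter values, the final sigmoid keeps the output strictly between 0 and 1, which is what allows it to be compared with a probability threshold in step 9).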
In this embodiment, ligand-binding residue prediction for the protein sequence 1XEF is taken as an example, and the ligand-binding residue prediction method based on a deep convolutional neural network comprises the following steps:
1) inputting the protein sequence 1XEF, which has 241 residues and for which ligand-binding residues are to be predicted, denoted S;
2) for the input protein sequence S, searching the protein sequence database UniRef90 (ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/) with the HHblits program (https://toolkit.tuebingen.mpg.de/#/HHblits) to generate multiple sequence alignment information comprising 120 sequences, denoted MSA;
3) for the residue at row i and column j of the MSA, counting the frequency with which its residue type occurs in that row and in that column, recorded as:
where N_i and N_j denote the number of times the residue type occurs in the i-th row and the j-th column, respectively;
4) expanding the MSA into a 241 × 120 × 21 three-dimensional residue cube, in which the residue type at any position is denoted:
where P(i), P(j) ∈ (0,1); δ(P(i) × 21) and δ(P(j) × 21) are the integers obtained by rounding P(i) × 21 and P(j) × 21; and AA_1, AA_2, …, AA_21 denote the 20 common amino acid types and the gap type, respectively;
5) dividing the three-dimensional residue cube obtained in step 4) into 21 planes of size 241 × 120, and counting the frequency with which the residue type at any position in the cube occurs in the plane containing that position, recorded as:
where Q_{x,y,z} denotes the number of times the residue type at position (x, y, z) in the cube occurs in the plane in which it lies;
6) taking G(x, y, z) calculated in step 5) as the element at the corresponding position in three-dimensional space to form a three-dimensional feature cube;
7) building a deep convolutional neural network to predict the binding residues of the protein sequence S, wherein the network comprises five layers in order, namely a convolutional layer, a pooling layer, a convolutional layer, a pooling layer and a fully connected layer; the output of each layer serves as the input of the next layer; the fully connected layer uses a sigmoid activation function so that the output values lie in the range (0,1); and the output of the network is recorded as:
g(I)=net(pool2(conv2(pool1(conv1(I))))),
where I denotes the input of the network; conv1 and conv2 denote the operations of the first and second convolutional layers; pool1 and pool2 denote the operations of the first and second pooling layers; and net denotes the operation of the fully connected layer;
8) generating three-dimensional feature cubes from protein sequences with known binding residues through steps 2) to 6), inputting the feature cubes into the constructed deep convolutional neural network, and adjusting the network parameters by means of a cross-entropy loss function to obtain a deep convolutional neural network model, the cross-entropy loss function being recorded as:
where u denotes the true label of the residue to be classified in the protein sequence, the other variable denotes the predicted output value of the network model, and L denotes the difference between the predicted output and the true label;
9) inputting the three-dimensional feature cube generated from the protein sequence S into the deep convolutional neural network model, setting the output probability threshold to 0.5, and identifying the positions whose output value is greater than 0.5 as binding residues.
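Steps 8) and 9) can be illustrated with a short sketch of a cross-entropy loss and of the 0.5 threshold. The exact loss formula is not reproduced in the text; the standard binary cross-entropy averaged over residues is assumed here. The function names, the clipping constant eps, the 1-based position numbering and the toy probabilities are hypothetical and are not results for 1XEF.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy between true labels u and predictions p.

    Assumption: the standard form -[u*log(p) + (1-u)*log(1-p)], averaged
    over residues; the source does not reproduce its own formula.
    """
    total = 0.0
    for u, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= u * math.log(p) + (1 - u) * math.log(1.0 - p)
    return total / len(labels)


def call_binding_residues(probs, threshold=0.5):
    """Return 1-based positions whose output strictly exceeds the threshold."""
    return [pos for pos, p in enumerate(probs, start=1) if p > threshold]


# Hypothetical network outputs and labels for a five-residue toy sequence.
toy_probs = [0.10, 0.82, 0.50, 0.67, 0.05]
toy_labels = [0, 1, 0, 1, 0]

binding = call_binding_residues(toy_probs)
loss = binary_cross_entropy(toy_labels, toy_probs)
print(binding, loss)
```

Because the comparison is strict, a residue whose output equals 0.5 exactly (position 3 in the toy data) is not reported as a binding residue, matching the wording "greater than 0.5".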
The above describes the prediction result obtained by the invention, taking ligand-binding residue prediction for the protein sequence 1XEF as an example. It is not intended to limit the scope of the invention, and various modifications and improvements can be made without departing from that scope.