CN112116949B

CN112116949B - Protein fold recognition method based on triplet loss

Info

Publication number: CN112116949B
Application number: CN202010947616.5A
Authority: CN
Inventors: 於东军; 刘岩
Original assignee: Nanjing University of Science and Technology
Current assignee: Nanjing University of Science and Technology
Priority date: 2020-09-10
Filing date: 2020-09-10
Publication date: 2022-08-16
Anticipated expiration: 2040-09-10
Also published as: CN112116949A

Abstract

The invention discloses a protein fold recognition method based on triplet loss, comprising the following steps: encoding the protein with one-hot coding; inputting the encoded protein into the SSA program to obtain a contact map between protein residues; inputting the contact map into a pre-trained deep learning network, whose output serves as the fold-recognition-specific feature of the protein; comparing the feature of the query protein with the features of template proteins of known fold classes in the protein database; and assigning the fold class of the template protein closest to the query protein to the query protein. The invention trains with the triplet loss so that protein structures of the same fold type lie closer together and protein structures of different fold types lie farther apart, which makes the feature representation of the protein more discriminative and the recognition more efficient.

Description

Protein fold recognition method based on triplet loss

Technical Field

The invention belongs to the field of protein structure prediction in bioinformatics, and particularly relates to a protein fold recognition method based on triplet loss.

Background

Determining the fold type of a protein can help reveal the second genetic code of life, namely how the primary structure of a protein determines its spatial structure. The three-dimensional structure of a protein is known to play an important role in studying its function and properties, and correct fold recognition is a key step in predicting that three-dimensional structure. In addition, because the folding pattern of a protein has a profound influence on its heterogeneity and molecular function, fold recognition greatly benefits fields such as artificial protein design, the search for lethal mechanisms, and inclusion body renaturation. Rapid and accurate identification of protein fold types is therefore of great significance to the development of life science and medicine.

In early research, conventional experimental methods such as X-ray crystallography and nuclear magnetic resonance spectroscopy were commonly used to determine protein structures. However, these methods have significant drawbacks: they are costly and time consuming. In addition, with the development of experimental techniques and the continued progress of human structural genomics, a large number of proteins with known fold types have accumulated in protein databases. Building on this accumulated knowledge, bioinformatics urgently needs a method that identifies the protein fold directly and rapidly from the protein sequence; such a method is also of great significance for discovering and understanding protein function.

In previous studies, researchers at home and abroad have proposed various protein fold recognition methods, which fall roughly into two categories: template-based methods and machine learning-based methods.

Template-based protein fold recognition mainly exploits the evolutionary relationships between proteins. The approach can be summarized in three steps. First, proteins of known structure are retrieved from the public protein structure database (PDB) and used as templates for the query protein. To make template-based prediction more reliable, a reduced version of the data is generally used, in which the similarity between sequences must be below 40%. Second, the evolutionary relationship between the template protein and the query protein is detected. Specifically, a multiple sequence alignment algorithm stores the amino acid sequence information in a profile so that evolutionary information can be mined; the HHblits program is the most commonly used tool for this. Third, to obtain the best sequence alignment, a scoring function is typically designed to evaluate the similarity between the query protein and the template protein, and the accuracy of the sequence alignment determines the final prediction. These template matching methods have contributed greatly to the field of protein fold recognition. For example, FFAS (Lukasz J, Zhanwen L, Xiao-Hui C, et al. FFAS server: novel features and applications [J]. Nucleic Acids Research (suppl_2): suppl_2.) performs profile-profile alignments using sequence profiles obtained from the PSI-BLAST program and returns the result with the highest score. Because FFAS does not consider structural information, the later FFAS-3D (Dong X, Lukasz J, Zhanwen L, et al. FFAS-3D: improving fold recognition by including optimized structural features and template re-ranking [J]. Bioinformatics (5):5.) adds protein structure information, which clearly improves the final results.

Compared with machine learning methods, template matching can readily determine some fold types from very little data, but it has limitations: it can only identify known fold types, whereas in reality most fold types are unknown. Such methods also depend heavily on protein homology, and their performance degrades when sequence similarity is low. Machine learning techniques have likewise greatly advanced the field of protein fold recognition. Among them, SVM classification is one of the most widely used approaches, for example ACCFold (Qiwen D, Shuigeng Z, Jihong G. A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation [J]. Bioinformatics (20):20.) and TAXFold (Yang J Y, Chen X. Improving taxonomy-based protein fold recognition by using global and local features [J]. Proteins, 2011, 79(7): 2053-2064.). The two methods differ mainly in feature extraction: ACCFold computes autocovariance and cross-covariance coefficients from the PSSM as the protein sequence feature vector, which effectively extracts protein information, whereas TAXFold extracts global and local structural features, which effectively improves fold recognition accuracy.

In recent years, deep neural networks have brought great changes to the field of artificial intelligence, and computer vision and natural language processing in particular have developed rapidly. In the field of protein fold recognition, Hou et al. (Hou J, Adhikari B, Cheng J. DeepSF: deep convolutional neural network for mapping protein sequences to folds [J]. Bioinformatics, 2018, 34(8): 1295-.) proposed DeepSF, a deep convolutional neural network that maps protein sequences to folds. Thereafter, Zhu et al. (Zhu J, Zhang H, Li S C, et al. Improving protein fold recognition by extracting fold-specific features from predicted residue-residue contacts [J]. Bioinformatics, 2017, 33(23): 3749-3757.) used a 2D convolutional neural network to extract fold-type-specific features from the protein residue-residue contact map, further improving the accuracy of protein fold recognition.

Deep neural network methods extract features from the input data automatically and greatly improve the accuracy of fold recognition. However, current methods can only train the deep neural network on a set of samples with known fold types and then use intermediate-layer features to generalize to proteins of unknown fold types. The main drawbacks of this approach are its indirectness and inefficiency: one can only hope that the intermediate-layer features generalize well to new proteins, and the dimensionality of those features is very large, which can lead to the curse of dimensionality.

Disclosure of Invention

The invention aims to provide a protein fold recognition method based on triplet loss.

The technical scheme for achieving this purpose is as follows: a protein fold recognition method based on triplet loss, comprising the following steps:

Step 1: training data preprocessing: encoding each of the N groups of protein training data with one-hot codes to obtain a numerical representation of the protein sequence;

Step 2: inputting the one-hot code of the protein sequence into the SSA protein residue-residue contact map prediction tool to predict the contact map between protein residues;

Step 3: fixing each contact map to a set size to obtain N fixed-size matrices;

Step 4: generating triplet data from the N matrices, inputting the triplet data into a convolutional neural network, iterating a set number of times with the stochastic gradient descent algorithm using the triplet loss as the objective function, and saving the convolutional neural network model with the smallest triplet loss;

Step 5: processing the query protein and all template proteins according to steps 1-3, inputting each processed protein into the saved convolutional neural network, and taking the network output as the fold-type-specific feature of the protein;

Step 6: calculating the similarity between the query protein and each template protein, and assigning the fold type of the template protein with the highest similarity to the query protein.

Preferably, the contact map size is fixed to 256 × 256 in step 3.

Preferably, the sampling or padding operation is used in step 3 to fix the size of the contact map.

Preferably, each triplet comprises one anchor protein, one positive sample protein, which is of the same fold type as the anchor protein, and one negative sample protein, which is of a different fold type from the anchor protein.

Preferably, the objective function is specifically:

$$L = \left[\, d(a, p) - d(a, n) + m \,\right]_+$$

where $d(a, p)$ is the Euclidean distance between the anchor protein and the positive sample protein, $d(a, n)$ is the Euclidean distance between the anchor protein and the negative sample protein, $m$ is the minimum margin between these two distances, and $[\cdot]_+$ means that when the value in the brackets is greater than 0 that value is taken as the loss, and when it is less than 0 the loss is 0.

Preferably, the cosine distance of the query protein from the template protein is calculated as the similarity score, with smaller distances giving higher scores.

Compared with the prior art, the invention has the following remarkable advantages:

1. Improved recognition accuracy: the method uses the triplet loss to guide the training of the convolutional neural network, so that the model automatically learns strongly discriminative protein structural features from the protein residue contact map, which improves the accuracy of fold type recognition;

2. Faster recognition: although training a deep neural network is slow, once the network parameters have been trained the prediction process is very fast, and faster still with a GPU accelerator, which greatly shortens the time needed for protein fold recognition.

Drawings

FIG. 1 is a diagram of the network structure of protein folding.

FIG. 2 is a flow chart of the protein fold recognition method based on triplet loss.

Detailed Description

As shown in FIG. 2, in the protein fold recognition method based on triplet loss, a protein is first encoded with one-hot coding to obtain a numerical representation of the protein sequence. This representation is then input into the SSA program to obtain the contact map between protein residues, which for convenience is named RRcontact. RRcontact is then input into a pre-trained deep learning network, and the output of the network is the fold-recognition-specific feature of the protein, named f. Finally, the feature f(query) of the query protein is compared with the features f(template) of template proteins of known fold classes in the protein database, and the fold class of the template protein closest to the query protein is assigned to the query protein.

The process is described in more detail below with reference to the accompanying drawings:

Step 2: The one-hot code of the protein sequence is input into the SSA protein residue-residue contact map prediction tool to predict the contact map RRcontact between protein residues. The main reason the invention uses predicted residue-residue contact maps as the feature input of the neural network is that they contain a large amount of structural information. The main reason for using SSA as the contact map prediction tool is that it needs only the protein sequence as input, generates the residue-residue contacts end to end, and is fast and accurate.
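The one-hot encoding that feeds the contact map predictor can be sketched as follows. This is a minimal illustration only: the 20-letter alphabet, its ordering, and the all-zero row for unknown residues are assumptions, since the text does not specify the exact encoding expected by SSA, and `one_hot_encode` is a hypothetical helper name.

```python
# Assumption: the 20 standard amino acids, in this fixed alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return an L x 20 list of lists, one row per residue.

    Residues outside the assumed alphabet (for example 'X') become
    all-zero rows; this handling is an illustrative choice.
    """
    encoded = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(residue)
        if idx is not None:
            row[idx] = 1
        encoded.append(row)
    return encoded


# Example: a three-residue fragment yields a 3 x 20 matrix.
encoded = one_hot_encode("MKV")
```

Each row contains a single 1 at the position of its residue, so the matrix has as many rows as the sequence has residues.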

Step 3: The contact map RRcontact is fixed to a size of 256 × 256 by a sampling or padding operation: when the protein sequence is longer than 256 residues, the residues after position 256 are removed; when it is shorter than 256 residues, the map is padded with 0. All data processed in this way have size 256 × 256, giving N matrices of size 256 × 256.
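The size-fixing operation of step 3 can be sketched in plain Python as below. The function name `fix_contact_map` and the nested-list representation of the contact map are assumptions made for illustration; an array library would normally be used in practice.

```python
def fix_contact_map(contact_map, size=256):
    """Crop a square contact map to the first `size` residues, or zero-pad it.

    `contact_map` is an L x L nested list. Rows and columns beyond `size`
    are dropped; missing rows and columns are filled with 0.0.
    """
    fixed = []
    for row in contact_map[:size]:
        cropped = list(row[:size])
        fixed.append(cropped + [0.0] * (size - len(cropped)))
    while len(fixed) < size:
        fixed.append([0.0] * size)
    return fixed


# Example: a 3 x 3 map padded to 5 x 5 (a small size chosen for readability).
small_map = [[1.0, 0.2, 0.0], [0.2, 1.0, 0.3], [0.0, 0.3, 1.0]]
padded = fix_contact_map(small_map, size=5)
```

With the default `size=256`, a map for a 300-residue protein keeps only its top-left 256 × 256 block, and a map for a shorter protein keeps all its values in the top-left corner.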

Step 4: Network training. Triplet data are generated from the N matrices of size 256 × 256 according to the following requirement: each triplet contains an anchor protein (anchor), a positive sample protein (positive, of the same fold type as the anchor), and a negative sample protein (negative, of a different fold type from the anchor).
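One way to build triplets that satisfy this requirement is sketched below. The text does not say how triplets are selected, so uniform random sampling is an assumption here, and `sample_triplets` and the fold labels in the example are hypothetical.

```python
import random


def sample_triplets(fold_labels, num_triplets, seed=0):
    """Return (anchor, positive, negative) index triplets.

    fold_labels[i] is the fold type of the i-th contact map. The anchor and
    positive share a fold type; the negative comes from a different one.
    Random sampling is an illustrative assumption, not the patented strategy.
    """
    rng = random.Random(seed)
    by_fold = {}
    for idx, fold in enumerate(fold_labels):
        by_fold.setdefault(fold, []).append(idx)
    # An anchor needs at least one other protein of the same fold type.
    anchor_folds = [f for f, members in by_fold.items() if len(members) >= 2]
    if not anchor_folds or len(by_fold) < 2:
        raise ValueError("need a fold with >= 2 proteins and at least 2 folds")
    triplets = []
    for _ in range(num_triplets):
        fold = rng.choice(anchor_folds)
        anchor, positive = rng.sample(by_fold[fold], 2)
        negative_fold = rng.choice([f for f in by_fold if f != fold])
        negative = rng.choice(by_fold[negative_fold])
        triplets.append((anchor, positive, negative))
    return triplets


# Example with made-up fold labels for five training proteins.
labels = ["a.1", "a.1", "b.2", "b.2", "c.3"]
triplets = sample_triplets(labels, 20)
```

The returned indices refer to positions in the list of N fixed-size matrices, so each triplet can be turned into three network inputs by simple lookup.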

The triplet data are input into the convolutional neural network, the triplet loss is used as the objective function, the stochastic gradient descent algorithm is iterated continuously, and the best convolutional neural network model is saved. The triplet loss function is expressed mathematically as:

$$L = \left[\, d(a, p) - d(a, n) + m \,\right]_+$$

where $d(a, p)$ is the Euclidean distance between the anchor protein (anchor) and the positive sample protein (positive), $d(a, n)$ is the Euclidean distance between the anchor protein (anchor) and the negative sample protein (negative), $m$ is the minimum margin between these two distances and is set manually, and $[\cdot]_+$ means that when the value in the brackets is greater than 0 that value is taken as the loss, and when it is less than 0 the loss is 0.
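As a worked illustration, the triplet loss for a single triplet of feature vectors can be computed as below. The function names and the default margin of 1.0 are assumptions (the margin is described only as manually set), and the plain Euclidean distance is used, following the description.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(f_anchor, f_positive, f_negative, margin=1.0):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin.

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is. margin=1.0 is an illustrative value.
    """
    gap = euclidean(f_anchor, f_positive) - euclidean(f_anchor, f_negative)
    return max(0.0, gap + margin)


# Positive at distance 1, negative at distance 5: constraint already satisfied.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# Roles swapped: positive at distance 5, negative at distance 1.
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

In the first case the bracketed value is 1 - 5 + 1 = -3, so the loss is 0; in the second it is 5 - 1 + 1 = 5, which is taken as the loss.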

No labels are input together with the triplet data, which means that during training the network learns the features of the proteins directly rather than classifying them. The core idea of the triplet loss is to use a convolutional neural network to extract fold-type-specific features directly from protein residue-residue contact maps, pulling proteins of the same fold type closer together and pushing proteins of different fold types farther apart.

As shown in FIG. 1, in a further embodiment, the convolutional neural network structure is as follows:

Input layer: the input data size is uniformly set to a 256 × 256 matrix with 1 channel;

Convolutional layers: the first convolutional layer contains 96 convolution kernels of size 7 × 7 with a stride of 2; the second contains 192 convolution kernels of size 3 × 3; the third contains 384 convolution kernels of size 3 × 3; the fourth contains 384 convolution kernels of size 3 × 3; the fifth contains 192 convolution kernels of size 3 × 3;

BatchNorm layer: a technique that can be used to prevent overfitting, typically placed between the convolutional and pooling layers;

Max pooling layer: the pooling size is 2 × 2 with a stride of 2;

Fully connected layers: the first fully connected layer contains 2048 nodes, and the second contains 1024 nodes;

Dropout layer: a technique for preventing overfitting during network training, with its parameter generally set to 0.5.
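To see how the listed layers shrink a 256 × 256 input, the feature-map sizes can be traced with the standard output-size formula. The layer list comes from the description, but several details are assumptions made only for this sketch: stride 1 for the second to fifth convolutional layers, no padding anywhere, and a max pooling layer after every convolutional layer.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


# (output channels, kernel size, stride); strides of 1 are assumed
# for every layer except the first, whose stride of 2 is stated.
CONV_LAYERS = [(96, 7, 2), (192, 3, 1), (384, 3, 1), (384, 3, 1), (192, 3, 1)]


def trace_shapes(input_size=256, pool_kernel=2, pool_stride=2):
    """Return (channels, height, width) after each conv + max-pool stage."""
    shapes = [(1, input_size, input_size)]
    size = input_size
    for channels, kernel, stride in CONV_LAYERS:
        size = conv_out(size, kernel, stride)            # unpadded convolution
        size = conv_out(size, pool_kernel, pool_stride)  # 2 x 2 max pooling
        shapes.append((channels, size, size))
    return shapes


shapes = trace_shapes()
channels, height, width = shapes[-1]
flattened = channels * height * width  # input width of the first FC layer
```

Under these assumptions the final feature map is 192 × 2 × 2, so 768 values would feed the 2048-node fully connected layer; a different padding or pooling placement would change these numbers.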

Step 5: Feature extraction. The query protein and all template proteins are processed according to steps 1-3 and then input into the convolutional neural network trained in step 4; the output of the network is the fold-type-specific feature of each protein.

Step 6: The cosine distance is used as the similarity criterion for protein features. Assuming that the feature of the template protein is $f_1$ and the feature of the query protein is $f_2$, the similarity score of the query protein is:

$$\mathrm{score}(f_1, f_2) = \cos(f_1, f_2) = \frac{f_1 \cdot f_2}{\lVert f_1 \rVert \, \lVert f_2 \rVert}$$

The similarity scores between the query protein and all template proteins are ranked, and the fold type of the most similar template protein is assigned to the query protein.
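The ranking in step 6 can be sketched as a nearest-template lookup. The helper names, the toy two-dimensional features, and the fold labels are hypothetical; real features would be the network outputs from step 5.

```python
import math


def cosine_similarity(f1, f2):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(f1, f2))
    norm = math.sqrt(sum(a * a for a in f1)) * math.sqrt(sum(b * b for b in f2))
    return dot / norm if norm else 0.0


def assign_fold(query_feature, templates):
    """Return the fold type of the template most similar to the query.

    `templates` is a list of (fold_type, feature_vector) pairs.
    """
    best_fold, _ = max(
        templates, key=lambda t: cosine_similarity(query_feature, t[1])
    )
    return best_fold


# Toy example with made-up fold labels and 2-D features.
templates = [("a.1", [1.0, 0.0]), ("b.2", [0.0, 1.0])]
predicted = assign_fold([0.9, 0.1], templates)
```

Here the query points almost in the same direction as the first template, so it receives that template's fold type.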

In conclusion, the invention is a high-accuracy, high-speed protein fold recognition method based on a deep convolutional neural network and the triplet loss. It learns the structural features of proteins automatically without excessive human intervention, runs significantly faster than other methods, and significantly improves recognition accuracy.

The invention uses the triplet loss to guide the convolutional neural network to optimize the output features themselves directly, rather than reusing the features of intermediate layers. Furthermore, the triplet loss brings proteins of the same fold type closer together and pushes proteins of different fold types farther apart, so the protein features extracted by the invention are more discriminative.

Claims

1. A protein fold recognition method based on triplet loss, characterized by comprising the following steps:

Step 1: training data preprocessing: encoding each of the N groups of protein training data with one-hot codes to obtain a numerical representation of the protein sequences;

Step 4: generating triplet data from the N matrices, inputting the triplet data into a convolutional neural network, iterating a set number of times with the stochastic gradient descent algorithm using the triplet loss as the objective function, and selecting and saving the convolutional neural network model with the smallest triplet loss;

2. The protein fold recognition method based on triplet loss according to claim 1, wherein the contact map size is fixed to 256 × 256 in step 3.

3. The protein fold recognition method based on triplet loss according to claim 1, wherein the contact map size is fixed in step 3 by a sampling or padding operation.

4. The protein fold recognition method based on triplet loss according to claim 1, wherein each triplet comprises an anchor protein, a positive sample protein, and a negative sample protein, wherein the fold type of the positive sample protein is the same as that of the anchor protein, and the fold type of the negative sample protein is different from that of the anchor protein.

5. The protein fold recognition method based on triplet loss according to claim 1, wherein the objective function is specifically:

$$L = \left[\, d(a, p) - d(a, n) + m \,\right]_+$$

where $d(a, p)$ is the Euclidean distance between the anchor protein and the positive sample protein, $d(a, n)$ is the Euclidean distance between the anchor protein and the negative sample protein, $m$ is the minimum margin between these two distances, and $[\cdot]_+$ means that when the value in the brackets is greater than 0 that value is taken as the loss, and when it is less than 0 the loss is 0.

6. The protein fold recognition method based on triplet loss according to claim 1, wherein the cosine distance between the query protein and the template protein is calculated as the similarity score, and the smaller the distance, the higher the score.