CN112116949B - Protein folding identification method based on triple loss - Google Patents

Publication number
CN112116949B
Authority
CN
China
Prior art keywords
protein
folding
loss
template
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010947616.5A
Other languages
Chinese (zh)
Other versions
CN112116949A (en)
Inventor
於东军
刘岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology
Priority to CN202010947616.5A
Publication of CN112116949A
Application granted
Publication of CN112116949B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20 Protein or domain folding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a protein fold recognition method based on triplet loss, comprising the following steps: the protein sequence is encoded with one-hot coding; the encoded sequence is input into the SSA program to obtain a contact map between protein residues; the contact map is input into a pre-trained deep learning network, whose output serves as the fold-recognition-specific feature of the protein; the feature of the query protein is compared with those of template proteins of known fold classes in a protein database, and the fold class of the template protein closest to the query protein is assigned to the query protein. Training with triplet loss pulls protein structures of the same type closer together and pushes protein structures of different types further apart, so that the learned protein features are more discriminative and recognition is more efficient.

Description

Protein folding identification method based on triple loss
Technical Field
The invention belongs to the field of protein structure prediction in bioinformatics, and in particular relates to a protein fold recognition method based on triplet loss.
Background
Determining the fold type of a protein can help reveal the second genetic code of life, namely how the primary structure of a protein determines its spatial structure. The three-dimensional structure of a protein plays an important role in studying its function and properties, and correct fold recognition is a key link in predicting that three-dimensional structure. In addition, because the folding pattern of a protein profoundly influences its heterogeneity and molecular function, fold recognition greatly benefits fields such as artificial protein design, the study of lethal mechanisms, and inclusion body renaturation. Rapid and accurate identification of the protein fold type is therefore of great significance to the development of life science and medical science.
In the early stages of research, conventional experimental methods such as X-ray crystallography and nuclear magnetic resonance spectroscopy were commonly used to determine protein structure. These methods, however, are costly and time-consuming. Meanwhile, with the development of experimental techniques and the continued progress of structural genomics, a large number of proteins with known fold types have accumulated in protein databases. Drawing on bioinformatics, there is therefore an urgent need for a method that identifies the protein fold directly and rapidly from the protein sequence; such a method is also of great significance for discovering and understanding protein function.
In previous studies, researchers at home and abroad have proposed various protein folding recognition methods, which can be roughly classified into two types in principle: template-based methods and machine learning-based methods.
Template-based fold recognition mainly exploits the evolutionary relationships between proteins. The approach can be summarized in three steps. First, proteins of known structure are retrieved from a public protein structure database (PDB) and used as templates for the query protein. To make template-based prediction more reliable, a reduced, non-redundant version of the data is generally used, in which pairwise sequence similarity is below 40%. Second, the evolutionary relationship between the template protein and the query protein is detected: a multiple sequence alignment algorithm stores the amino acid sequence information in a profile file so that evolutionary information can be mined, and the HHblits program is the one most commonly used for this purpose. Third, to obtain the best sequence alignment, a scoring function is typically designed to evaluate the similarity between the query protein and the template protein; the accuracy of the sequence alignment determines the final prediction. These template matching methods have contributed greatly to the field of protein fold recognition. For example, FFAS (Lukasz J, Zhanwen L, Xiao-Hui C, et al. FFAS server: novel features and applications [J]. Nucleic Acids Research (suppl_2): suppl_2.) performs profile-profile alignments using sequence profiles obtained from the PSI-BLAST program and returns the result with the highest score. Because FFAS does not consider structural information, the later FFAS-3D (Dong X, Lukasz J, Zhanwen L, et al. FFAS-3D: improving fold recognition by including optimized structural features and template re-ranking [J]. Bioinformatics (5):5.) adds protein structural information, which markedly improves the final result.
Compared with machine learning methods, template matching can readily determine fold types for which very little data is available, but it has limitations: it can only identify known fold types, whereas in reality most fold types are unknown. Such methods also depend heavily on protein homology, and their performance degrades when sequence similarity is low. Machine learning techniques have likewise greatly advanced the field of protein fold recognition. Among them, SVM classification is one of the most widely used approaches, for example ACCFold (Qiwen D, Shuigeng Z, Jihong G. A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation [J]. Bioinformatics (20):20.) and TAXFOLD (Yang J Y, Chen X. Improving taxonomy-based protein fold recognition by using global and local features [J]. Proteins, 2011, 79(7): 2053-2064.). The two methods differ mainly in feature extraction: ACCFold computes autocovariance and cross-covariance coefficients from the PSSM as the protein sequence feature vector, effectively extracting protein information, while TAXFOLD extracts global and local structural features, effectively improving fold recognition accuracy.
In recent years, deep neural networks have brought great changes to the field of artificial intelligence, with computer vision and natural language processing in particular developing rapidly. In the field of protein fold recognition, Hou et al. (Hou J, Adhikari B, Cheng J. DeepSF: deep convolutional neural network for mapping protein sequences to folds [J]. Bioinformatics, 2018, 34(8): 1295-.) proposed DeepSF, a deep convolutional neural network that maps protein sequences to folds. Zhu et al. (Zhu J, Zhang H, Li S C, et al. Improving protein fold recognition by extracting fold-specific features from predicted residue-residue contacts [J]. Bioinformatics, 2017, 33(23): 3749-3757.) used a 2D convolutional neural network to extract fold-type-specific features from the protein residue-residue contact map, further improving the accuracy of protein fold recognition.
Deep neural networks can automatically extract features from input data and have greatly improved the accuracy of fold recognition. However, current methods of this kind can only train the deep neural network on a set of samples of known fold types and then use the features of an intermediate layer to generalize to proteins of unknown fold type. The main drawbacks of this approach are its indirectness and inefficiency: one can only hope that the intermediate-layer features generalize well to new proteins, and the dimensionality of those features is very large, which can lead to the curse of dimensionality.
Disclosure of Invention
The invention aims to provide a protein fold recognition method based on triplet loss.
The technical scheme for realizing the purpose of the invention is as follows: a protein fold recognition method based on triplet loss, comprising the steps of:
Step 1: training data preprocessing: encoding each of N groups of protein training data with one-hot coding to obtain a numerical representation of the protein sequence;
Step 2: inputting the one-hot encoding of the protein sequence into the SSA protein residue-residue contact map prediction tool, and predicting the contact map between protein residues;
Step 3: fixing the contact map to a set size to obtain N fixed-size matrices;
Step 4: generating triplet data from the N matrices, inputting the triplet data into a convolutional neural network, iterating for a set number of times with the stochastic gradient descent algorithm using the triplet loss as the objective function, and selecting and saving the convolutional neural network model with the minimum triplet loss;
Step 5: processing the query protein and all template proteins according to steps 1-3, inputting each of them into the saved convolutional neural network, and taking the output of the convolutional neural network as the fold-type-specific feature of the protein;
Step 6: calculating the similarity between the query protein and each template protein, and assigning the fold type of the template protein with the highest similarity to the query protein.
Preferably, the contact map size is fixed to 256 × 256 in step 3.
Preferably, a sampling (truncation) or padding operation is used in step 3 to fix the size of the contact map.
Preferably, each triplet comprises one anchor protein, one positive sample protein of the same fold type as the anchor protein, and one negative sample protein of a different fold type from the anchor protein.
Preferably, the objective function is specifically:
$$L = \left[\, d(a, p) - d(a, n) + m \,\right]_+$$
where $d(a, p) = \lVert f(a) - f(p) \rVert_2$ is the Euclidean distance between the features of the anchor protein $a$ and the positive sample protein $p$, $d(a, n) = \lVert f(a) - f(n) \rVert_2$ is the Euclidean distance between the features of the anchor protein and the negative sample protein $n$, $m$ is the minimum margin between these two distances, and $[\cdot]_+$ means that when the value inside the brackets is greater than 0 that value is taken as the loss, and when it is less than 0 the loss is 0.
Preferably, the cosine distance between the query protein and the template protein is used to compute the similarity score: the smaller the distance, the higher the score.
Compared with the prior art, the invention has the following remarkable advantages:
1. The accuracy of protein fold recognition is improved: the method uses a training strategy in which triplet loss guides the convolutional neural network, so that the model automatically learns strong protein structural features from the protein residue contact map, improving the accuracy of fold type recognition;
2. Protein fold recognition is accelerated: although training a deep neural network is slow, once the network parameters have been trained the prediction process is very fast, and faster still with a GPU accelerator, which greatly shortens the time needed for fold recognition.
Drawings
FIG. 1 is a diagram of the network structure used for protein fold recognition.
FIG. 2 is a flow chart of a protein fold identification method based on triple loss.
Detailed Description
As shown in FIG. 2, the triplet-loss-based protein fold recognition method proceeds as follows. First, the protein is encoded with one-hot coding to obtain a numerical representation of the protein sequence. This representation is then input into the SSA program to obtain the contact map between protein residues, referred to below as RRcontact. RRcontact is then input into a pre-trained deep learning network, whose output is the fold-recognition-specific feature of the protein, denoted f. Finally, the feature f(query) of the query protein is compared with the features f(template) of template proteins of known fold classes in the protein database, and the fold class of the template protein closest to the query protein is assigned to the query protein.
The process is described in more detail below with reference to the accompanying drawings:
step 1: training data preprocessing: respectively coding N groups of protein training data by using one-hot codes to obtain the digital expression of the protein sequence;
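Step 1 can be illustrated with a minimal sketch. The patent does not specify the alphabet ordering or how non-standard residues are handled; the 20-letter alphabet `AMINO_ACIDS`, the all-zero row for unknown residues, and the function name `one_hot_encode` are assumptions made only for illustration.

```python
# Assumed alphabet: the 20 standard amino acids, one-letter codes in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return an L x 20 list of 0/1 rows, one row per residue."""
    encoded = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(residue)
        if idx is not None:  # assumption: unknown residues (e.g. 'X') stay all-zero
            row[idx] = 1
        encoded.append(row)
    return encoded


example = one_hot_encode("ACX")  # three rows: A, C, and an all-zero row for X
```

Each row has exactly one nonzero entry for a standard residue, so the encoding carries sequence identity only and no evolutionary information.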
step 2: the One-hot code of the protein sequence is input into an SSA protein residue-residue contact map prediction tool, so that a contact map RRcontact between protein residues is predicted. The present invention uses predicted protein residue-to-residue contact patterns as a primary reason for neural network feature inputs because it contains a large amount of structural information. In addition, the main reason for using SSA as a prediction tool of protein residue-to-residue contact map is that it only needs to input protein sequence to generate protein residue-to-residue contact end-to-end, and it is fast and accurate.
Step 3: The contact map RRcontact is fixed to a size of 256 × 256 by a sampling or padding operation. Specifically, when the protein sequence is longer than 256 residues, the residues after position 256 are removed; when it is shorter than 256 residues, the map is padded with zeros. All data processed in this way have size 256 × 256, giving N matrices of size 256 × 256.
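The truncation and zero-padding of Step 3 can be sketched as follows. This is an illustrative sketch only: the function name `fix_contact_map` and the representation of the contact map as a list of rows are assumptions, not part of the patent.

```python
def fix_contact_map(contact_map, size=256):
    """Crop or zero-pad a square L x L contact map to size x size."""
    fixed = []
    for i in range(size):
        if i < len(contact_map):
            # Keep at most the first `size` columns, then pad the row with zeros.
            row = list(contact_map[i][:size])
            row += [0.0] * (size - len(row))
        else:
            # Rows beyond the sequence length are all zeros.
            row = [0.0] * size
        fixed.append(row)
    return fixed
```

Truncation discards contacts that involve residues beyond position 256, so long proteins are represented only by their first 256 residues.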
Step 4: Network training. Triplet data are generated from the N matrices of size 256 × 256 as follows: each triplet contains an anchor protein (anchor), a positive sample protein (positive) of the same fold type as the anchor, and a negative sample protein (negative) of a different fold type from the anchor.
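One way to build such triplets is uniform random sampling over fold labels, sketched below. The patent does not describe its sampling strategy, so the random selection, the function name `make_triplets`, and the `seed` parameter are assumptions for illustration.

```python
import random


def make_triplets(fold_labels, num_triplets, seed=0):
    """Sample (anchor, positive, negative) index triplets from fold labels."""
    rng = random.Random(seed)
    by_fold = {}
    for idx, fold in enumerate(fold_labels):
        by_fold.setdefault(fold, []).append(idx)
    # An anchor needs a distinct positive, so its fold must have >= 2 members.
    anchor_folds = [f for f, members in by_fold.items() if len(members) >= 2]
    if not anchor_folds or len(by_fold) < 2:
        raise ValueError("need a fold with >= 2 proteins and at least 2 folds")
    triplets = []
    for _ in range(num_triplets):
        fold = rng.choice(anchor_folds)
        anchor, positive = rng.sample(by_fold[fold], 2)
        other_fold = rng.choice([f for f in by_fold if f != fold])
        negative = rng.choice(by_fold[other_fold])
        triplets.append((anchor, positive, negative))
    return triplets
```

The returned indices refer to positions in the list of N fixed-size contact maps; fold labels are used only to choose the triplets and are not passed to the network.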
The triplet data are input into the convolutional neural network, which is trained by iterating the stochastic gradient descent algorithm with the triplet loss as the objective function; the best convolutional neural network model is selected and saved. The triplet loss function is:
$$L = \left[\, d(a, p) - d(a, n) + m \,\right]_+$$
where $d(a, p) = \lVert f(a) - f(p) \rVert_2$ is the Euclidean distance between the features of the anchor protein (anchor) and the positive sample protein (positive), $d(a, n) = \lVert f(a) - f(n) \rVert_2$ is the Euclidean distance between the features of the anchor protein and the negative sample protein (negative), and $m$ is a manually set minimum margin between these two distances. The notation $[\cdot]_+$ means that when the value inside the brackets is greater than 0 that value is taken as the loss, and when it is less than 0 the loss is 0.
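The loss for a single triplet can be written out in a few lines. This is a sketch under assumptions: the margin value `1.0` is arbitrary (the patent only says the margin is set manually), and the names `euclidean` and `triplet_loss` are illustrative.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def triplet_loss(f_anchor, f_positive, f_negative, margin=1.0):
    """Hinge of d(anchor, positive) - d(anchor, negative) + margin."""
    value = (euclidean(f_anchor, f_positive)
             - euclidean(f_anchor, f_negative)
             + margin)
    return max(value, 0.0)  # negative values contribute zero loss


# Negative already further away than positive by more than the margin: no loss.
easy = triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])   # 0 - 5 + 1 < 0 -> 0.0
# Positive far away, negative on top of the anchor: large loss.
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])   # 5 - 0 + 1 -> 6.0
```

Triplets that already satisfy the margin give zero loss and therefore no gradient, so only violating triplets drive the network update.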
No labels are input along with the triplet data, which means that during training the network learns protein features directly rather than classifying proteins. The core idea of triplet loss is to use a convolutional neural network to automatically extract fold-type-specific features directly from protein residue-residue contacts, while pulling proteins of the same fold type closer together and pushing proteins of different fold types further apart.
As shown in FIG. 1, in a further embodiment, the convolutional neural network structure is specifically:
Input layer: the input data are uniformly sized as a 256 × 256 matrix with 1 channel;
Convolutional layers: the first convolutional layer has 96 kernels of size 7 × 7 with a stride of 2; the second has 192 kernels of size 3 × 3; the third has 384 kernels of size 3 × 3; the fourth has 384 kernels of size 3 × 3; the fifth has 192 kernels of size 3 × 3;
BatchNorm: a normalization technique that can help prevent overfitting, typically placed between the convolutional layer and the pooling layer;
Max pooling layer: the pooling window is 2 × 2 with a stride of 2;
Fully connected layers: the first fully connected layer has 2048 nodes and the second has 1024 nodes;
Dropout layer: a technique for preventing overfitting during network training, with the parameter generally set to 0.5;
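The spatial size after a convolution or pooling layer follows the usual formula floor((n + 2p - k) / s) + 1. The sketch below applies it to the layer sizes listed above. The patent text does not state the padding or which convolutional layers are followed by pooling, so the padding values (3 for the 7 × 7 kernel, 1 for 3 × 3 kernels) and the single pooling step shown are assumptions for illustration only.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Output width of a convolution or pooling layer on an n-wide input."""
    return (n + 2 * padding - kernel) // stride + 1


# First convolutional layer: 7 x 7 kernels, stride 2, assumed padding 3.
after_conv1 = out_size(256, kernel=7, stride=2, padding=3)   # 128
# Max pooling: 2 x 2 window, stride 2, no padding.
after_pool1 = out_size(after_conv1, kernel=2, stride=2)      # 64
# A 3 x 3 convolution with stride 1 and assumed padding 1 keeps the size.
after_conv2 = out_size(after_pool1, kernel=3, stride=1, padding=1)  # 64
```

Under different padding or pooling placement the intermediate sizes change, which in turn changes the number of inputs to the first 2048-node fully connected layer.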
Step 5: Feature extraction. The query protein and all template proteins are processed according to steps 1-3 and then input separately into the convolutional neural network trained in step 4; the network output is taken as the fold-type-specific feature of each protein.
Step 6: Cosine distance is used as the similarity criterion for protein features. Let the feature of the template protein be $f_1$ and the feature of the query protein be $f_2$; the similarity score of the query protein is then:
$$\mathrm{score}(f_1, f_2) = \frac{f_1 \cdot f_2}{\lVert f_1 \rVert \, \lVert f_2 \rVert}$$
The similarity scores between the query protein and all template proteins are ranked, and the fold type of the most similar template protein is assigned to the query protein.
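The ranking in Step 6 amounts to a nearest-template lookup. The following sketch assumes the score is the cosine of the angle between feature vectors and that templates are supplied as (fold label, feature vector) pairs; the names `cosine_similarity` and `assign_fold`, and the fold labels in the example, are hypothetical.

```python
import math


def cosine_similarity(f1, f2):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(f1, f2))
    norm = math.sqrt(sum(x * x for x in f1)) * math.sqrt(sum(y * y for y in f2))
    return dot / norm if norm else 0.0


def assign_fold(query_feature, templates):
    """Return the fold label of the template most similar to the query.

    templates: list of (fold_label, feature_vector) pairs.
    """
    best_fold, _ = max(
        templates, key=lambda t: cosine_similarity(query_feature, t[1])
    )
    return best_fold


templates = [("a.1", [1.0, 0.0]), ("b.2", [0.0, 1.0])]
predicted = assign_fold([0.9, 0.1], templates)  # closest in angle to "a.1"
```

Because template features do not depend on the query, they can be computed once and cached, leaving only one forward pass and one similarity scan per query.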
In conclusion, the invention is a high-accuracy, high-speed protein fold recognition method based on a deep convolutional neural network and triplet loss. It learns protein structural features automatically without extensive human intervention, is markedly faster than other methods, and markedly improves recognition accuracy.
The invention uses triplet loss to guide the convolutional neural network to optimize the output features themselves directly, rather than relying on the features of intermediate layers. Triplet loss also pulls proteins of the same fold type closer together and pushes proteins of different fold types further apart, so the protein features extracted by the invention are more discriminative.

Claims (6)

1. A protein fold recognition method based on triplet loss, characterized by comprising the following steps:
Step 1: training data preprocessing: encoding each of N groups of protein training data with one-hot coding to obtain a numerical representation of the protein sequence;
Step 2: inputting the one-hot encoding of the protein sequence into the SSA protein residue-residue contact map prediction tool, and predicting the contact map between protein residues;
Step 3: fixing the contact map to a set size to obtain N fixed-size matrices;
Step 4: generating triplet data from the N matrices, inputting the triplet data into a convolutional neural network, iterating for a set number of times with the stochastic gradient descent algorithm using the triplet loss as the objective function, and selecting and saving the convolutional neural network model with the minimum triplet loss;
Step 5: processing the query protein and all template proteins according to steps 1-3, inputting each of them into the saved convolutional neural network, and taking the output of the convolutional neural network as the fold-type-specific feature of the protein;
Step 6: calculating the similarity between the query protein and each template protein, and assigning the fold type of the template protein with the highest similarity to the query protein.
2. The triple-loss based protein fold identification method of claim 1, wherein the contact map size is fixed to 256 x 256 in step 3.
3. The triplet-loss-based protein fold recognition method of claim 1, wherein the contact map size is fixed in step 3 by sampling (truncation) or padding.
4. The triplet-loss-based protein fold recognition method of claim 1, wherein each triplet comprises an anchor protein, a positive sample protein, and a negative sample protein, wherein the fold type of the positive sample protein is the same as that of the anchor protein, and the fold type of the negative sample protein is different from that of the anchor protein.
5. The triplet-loss-based protein fold recognition method according to claim 1, wherein the objective function is specifically:
$$L = \left[\, d(a, p) - d(a, n) + m \,\right]_+$$
where $d(a, p) = \lVert f(a) - f(p) \rVert_2$ is the Euclidean distance between the features of the anchor protein $a$ and the positive sample protein $p$, $d(a, n) = \lVert f(a) - f(n) \rVert_2$ is the Euclidean distance between the features of the anchor protein and the negative sample protein $n$, $m$ is the minimum margin between these two distances, and $[\cdot]_+$ indicates that when the value inside the brackets is greater than 0 that value is taken as the loss, and when it is less than 0 the loss is 0.
6. The triple loss-based protein folding identification method according to claim 1, wherein the cosine distance between the query protein and the template protein is calculated as a similarity score, and the smaller the distance, the higher the score.
CN202010947616.5A 2020-09-10 2020-09-10 Protein folding identification method based on triple loss Active CN112116949B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010947616.5A CN112116949B (en) 2020-09-10 2020-09-10 Protein folding identification method based on triple loss

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010947616.5A CN112116949B (en) 2020-09-10 2020-09-10 Protein folding identification method based on triple loss

Publications (2)

Publication Number Publication Date
CN112116949A CN112116949A (en) 2020-12-22
CN112116949B true CN112116949B (en) 2022-08-16

Family

ID=73801893

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010947616.5A Active CN112116949B (en) 2020-09-10 2020-09-10 Protein folding identification method based on triple loss

Country Status (1)

Country Link
CN (1) CN112116949B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115881211B (en) * 2021-12-23 2024-02-20 上海智峪生物科技有限公司 Protein sequence alignment method, protein sequence alignment device, computer equipment and storage medium
CN114550824B (en) * 2022-01-29 2022-11-22 河南大学 Protein folding identification method and system based on embedding characteristics and unbalanced classification loss

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101794351A (en) * 2010-03-09 2010-08-04 哈尔滨工业大学 Protein secondary structure engineering prediction method based on large margin nearest central point
CN107622182A (en) * 2017-08-04 2018-01-23 中南大学 The Forecasting Methodology and system of protein partial structures feature

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015095122A2 (en) * 2013-12-16 2015-06-25 The Johns Hopkins University Flip (fluorescence immunoprecipitation) for high-throughput immunoprecipitation

Also Published As

Publication number Publication date
CN112116949A (en) 2020-12-22

Similar Documents

Publication Publication Date Title
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN110070909B (en) Deep learning-based multi-feature fusion protein function prediction method
CN113220919B (en) Dam defect image text cross-modal retrieval method and model
CN112116949B (en) Protein folding identification method based on triple loss
Li et al. Protein loop modeling using deep generative adversarial network
CN116417093A (en) Drug target interaction prediction method combining Transformer and graph neural network
CN108984642A (en) A kind of PRINTED FABRIC image search method based on Hash coding
CN112116950B (en) Protein folding identification method based on depth measurement learning
CN114708903A (en) Method for predicting distance between protein residues based on self-attention mechanism
CN113257337A (en) Protein multi-sequence comparison method based on metagenome
CN117421591A (en) Multi-modal characterization learning method based on text-guided image block screening
CN115221281A (en) Intellectual property retrieval system and retrieval method thereof
Villegas-Morcillo et al. Protein fold recognition from sequences using convolutional and recurrent neural networks
CN113392191B (en) Text matching method and device based on multi-dimensional semantic joint learning
CN110765781A (en) Man-machine collaborative construction method for domain term semantic knowledge base
CN113222002A (en) Zero sample classification method based on generative discriminative contrast optimization
CN116386733A (en) Protein function prediction method based on multi-view multi-scale multi-attention mechanism
CN114758721B (en) Deep learning-based transcription factor binding site positioning method
CN113516209B (en) Comparison task adaptive learning method for few-sample intention recognition
CN115713970A (en) Transcription factor identification method based on Transformer-Encoder and multi-scale convolutional neural network
CN115240775A (en) Cas protein prediction method based on stacking ensemble learning strategy
CN113887653A (en) Positioning method and system for tightly-coupled weak supervised learning based on ternary network
CN113707213A (en) Protein-ligand binding site prediction method based on deep learning
CN113257341A (en) Method for predicting distribution of distance between protein residues based on depth residual error network
CN117746997B (en) Cis-regulation die body identification method based on multi-mode priori information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant