CN110390995B - Alpha-helical transmembrane protein topological structure prediction method and device - Google Patents

Info

Publication number
CN110390995B
Authority
CN
China
Prior art keywords
tmh
protein
prediction
training
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910585644.4A
Other languages
Chinese (zh)
Other versions
CN110390995A (en)
Inventor
沈红斌
冯世豪
杨静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201910585644.4A priority Critical patent/CN110390995B/en
Publication of CN110390995A publication Critical patent/CN110390995A/en
Application granted granted Critical
Publication of CN110390995B publication Critical patent/CN110390995B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20Protein or domain folding
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

An alpha-helical transmembrane protein topology prediction method. A training set, a validation set and a test set are organized according to the definition of the transmembrane alpha helix (TMH). Position-specific scoring matrix (PSSM), HMM, water solubility, secondary structure, torsion angle and hydropathic index features are extracted from the sequences in the three sets. A whole-sequence-based deep residual network model and a sliding-window-based deep residual network model are trained on the training set; the outputs of the two networks are averaged, and a dynamic threshold algorithm is applied to the averaged scores to obtain the TMH regions. A support vector machine model is also trained on the training set; its input is the boundary between a non-TMH region and a TMH region, and its output is the position of the non-TMH region relative to the cell membrane. For a given protein, the TMH regions are predicted first, then the positions of the non-TMH regions, and the two results are combined to give the final topology of the protein.

Description

Alpha-helical transmembrane protein topological structure prediction method and device
Technical Field
The invention belongs to the technical field of biological detection, and particularly relates to a method and a device for predicting an alpha helical transmembrane protein topological structure based on multi-scale deep learning.
Background
The cell membrane is the barrier of the cell and isolates the cell's internal environment from the external environment. It consists of a phospholipid bilayer with many membrane proteins embedded in it. Membrane proteins play an important role in a range of biological processes, such as cell signaling, ion conduction, cell aggregation, cell recognition and intercellular communication. Many drugs are therefore designed to bind to membrane proteins and thereby affect these processes.
Alpha-helical transmembrane proteins account for a large share of all membrane proteins. It is estimated that 27% of human proteins are alpha-helical transmembrane proteins. They are usually found in the plasma membrane of eukaryotes and the inner membrane of bacterial cells, and sometimes in the outer membrane. Topological information on a protein's transmembrane alpha helices can help scientists identify binding sites and design new drugs. However, because membrane proteins are difficult to dissolve, purify and crystallize, and are too large for NMR, determining their structure experimentally is very challenging. Membrane protein structures are reported to account for only 1% of all structures in the PDB database. There is therefore a great need in the art for a computational method that can accurately predict the topology of a membrane protein.
Over the past three decades, many prediction methods have been developed in the field. These methods can be divided into three categories:
The first category uses only the hydropathic index to predict TMH. These methods take a sliding window of 19 amino acid residues as the model input. The average hydropathic index of the 19 residues is assigned to the central residue, and a fixed threshold then determines whether that residue lies on a TMH. The well-known positive-inside rule was also proposed at this stage. The rule states that short loops located inside the cell are enriched in Lys and Arg residues. This rule has had a lasting influence on subsequent work;
The second category uses machine learning algorithms and statistical models, such as hidden Markov models, support vector machines and k-nearest neighbor models, to obtain more accurate predictions. In addition to the hydropathic index, these models also use more informative evolutionary features;
The third category consists of fusion methods. Their main idea is to obtain the final result by fusing several topology prediction methods. Experiments show that these methods can markedly improve performance on proteins that are predicted with high reliability.
Although there has been a great deal of research in this area, most of it predicts only the part of the alpha helix that is completely buried within the membrane; that is, these works take TMH to mean a helical segment fully embedded in the cell membrane. For example, in FIG. 3, only the helix region would be regarded as the transmembrane α-helical region, while the remaining tail regions would not. However, these tail regions have been reported to play a crucial role in biological processes such as cell-cell communication and cell recognition, and their positions can also help scientists better understand protein function. In addition, as evaluation criteria become more stringent, the accuracy of conventional prediction algorithms leaves room for improvement. It is therefore important to design an algorithm that can accurately predict the positions of both the helix region and the tail regions.
Disclosure of Invention
The embodiment of the invention provides a topological structure prediction method of an alpha helical transmembrane protein.
The embodiment of the invention discloses an algorithm for predicting the topology of alpha-helical transmembrane proteins based on a multi-scale deep learning model. The algorithm has two main parts: prediction of the TMH regions and prediction of the locations of the non-TMH regions. For TMH region prediction, deep residual networks at two different scales, one based on the whole sequence and one based on a fixed sliding window, extract higher-level features from the PSSM, HMM and structural information features to predict TMH. Deep learning is combined with a conventional machine learning model: a dynamic threshold algorithm is designed to address over-segmentation and under-segmentation, which further improves the prediction accuracy of the deep models. For non-TMH region location prediction, because training samples are few, the algorithm adopts a support vector machine model with the HMM features and the hydropathic index as inputs. The algorithm uses an integration (ensemble) approach to account for possible inaccuracies in the TMH prediction. For each non-TMH region, 10 boundary regions with the adjacent TMH regions are extracted as input, the support vector machine model yields 10 prediction scores, and the average of these 10 scores is taken as the final prediction score. Finally, the maximum-minimum allocation method is applied to all non-TMH regions in a protein to obtain the final prediction. The topology of the alpha-helical transmembrane protein is obtained by combining the predictions of the two parts.
The invention has the following beneficial effects:
1. The definition of TMH used in the present invention differs from that of other works in the field. Referring to FIG. 3, the TMH region of the present invention includes both the helix region completely embedded in the cell membrane and the tail regions that lie outside the cell membrane and are connected to the helix region. These tail regions play an important role in understanding the biological function of proteins.
2. The invention uses multi-scale deep residual networks, namely a network based on the whole sequence and a network based on a fixed-length sliding window. The predictions of the two networks are complementary to some extent, so integrating them can further improve the prediction accuracy of the model.
3. The invention combines deep learning with conventional machine learning. In the TMH position prediction model, a dynamic threshold model post-processes the output of the deep learning models, which addresses the over-segmentation and under-segmentation problems in prediction and improves the performance of the model.
4. The invention makes extensive use of the integration (ensemble) idea in building the models. In the TMH prediction algorithm, two deep learning models of different scales are integrated. In the algorithm for predicting the positions of non-TMH regions, the predictions for 10 boundary regions are integrated, which reduces the influence of inaccurately predicted TMH positions and safeguards prediction accuracy.
5. The invention also achieves better performance on TMH which is difficult to predict.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 is a flow chart of TMH prediction according to one embodiment of the present invention.
FIG. 2 is a flow chart of non-TMH region location prediction according to one embodiment of the present invention.
FIG. 3 is a schematic representation of an alpha helical transmembrane protein.
FIG. 4 is a diagram illustrating the effect of a dynamic threshold algorithm on over-segmentation and under-segmentation problems according to one embodiment of the present invention.
Detailed Description
The invention relates to the field of alpha-helical transmembrane protein biology, and in particular to an alpha-helical transmembrane protein topology prediction algorithm (MemBrain2.1) based on multi-scale deep learning. The algorithm has two main parts: prediction of the transmembrane alpha-helical regions (TMH) and prediction of the locations of the other regions (non-TMH). In the first part, the invention employs two deep learning models at different scales and a dynamic threshold algorithm. The first model predicts TMH positions from the entire sequence, and the second predicts them from a fixed-length sliding window. Because of their different scales the two models complement each other well, and fusing them improves the accuracy of TMH position prediction. The dynamic threshold algorithm detects over-segmentation and under-segmentation and corrects the deep learning prediction accordingly. In the second part, the invention uses a support vector machine model together with a maximum-minimum allocation method to predict the positions of the non-TMH regions. The support vector machine makes the model focus on the decisive training samples, and the maximum-minimum allocation method makes the model rely on the relative rather than the absolute size of the predicted values. Both improve the robustness of the model. Combining the predictions of the two parts gives the topology of the alpha-helical transmembrane protein.
According to one or more embodiments, as shown in FIG. 1 and FIG. 2, an alpha-helical transmembrane protein topology prediction algorithm based on multi-scale deep learning comprises the following steps:
S1, organizing a training set, a validation set and a test set according to the definition of TMH;
S2, extracting position-specific scoring matrix (PSSM), HMM, water solubility, secondary structure, torsion angle and hydropathic index features for the sequences in the training set, validation set and test set, using the PSI-BLAST, HHblits and SPIDER3 tools;
S3, training the whole-sequence-based deep residual network model and the sliding-window-based deep residual network model on the training set. The outputs of the two networks are averaged, and a dynamic threshold algorithm is then applied to obtain the TMH prediction;
S4, training the support vector machine model on the training set. The input of the model is the boundary between a non-TMH region and a TMH region, and the output is a real number between 0 and 1 that indicates whether the current non-TMH region is located outside or inside the cell. The final prediction is then determined by the maximum-minimum allocation method;
S5, for a protein to be predicted, first predicting the TMH positions in the protein and then the positions of the non-TMH regions; combining the two predictions gives the final topology of the protein.
Further, step S1 organizes the training set, the validation set and the test set according to the new definition of TMH as follows:
S11, all α-helical transmembrane protein structures are extracted from the OPM database, giving 1783 PDB files in total. The 1783 files are split by protein chain, so that each resulting PDB file holds a single chain;
S12, the 40 test proteins used in the TMSEG work are selected as the test set of this embodiment. Among the remaining PDB files, a protein chain is discarded if it is broken, if it is shorter than 20 amino acids, or if it contains no transmembrane α helix. This leaves 5741 protein chains;
S13, UniqueProt software is used, with HVAL > 0 as the criterion, to remove redundancy between the 5741 proteins and the test set and then redundancy among the remaining proteins, giving 318 proteins in total. Of these, 39 proteins are randomly selected as the validation set and the remaining 279 proteins form the training set;
S14, from the PDB files, it is determined whether each amino acid residue in a protein belongs to a TMH, together with the position of each non-TMH region. In this example, an amino acid residue belongs to a TMH if it satisfies both of the following requirements: the residue is located on an alpha helix, and this alpha helix lies partly in the cell membrane.
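The random split in S13 can be sketched as follows. This is only an illustration under stated assumptions: the function name, the placeholder chain identifiers and the fixed seed are hypothetical and are not given in the source, which states only the set sizes (318 proteins, 39 for validation, 279 for training).

```python
import random


def split_train_validation(protein_ids, n_validation=39, seed=0):
    """Randomly hold out n_validation proteins; the rest form the training set.

    The seed is an assumption, used only to make the sketch reproducible.
    """
    ids = list(protein_ids)
    if n_validation > len(ids):
        raise ValueError("n_validation exceeds the number of proteins")
    rng = random.Random(seed)
    rng.shuffle(ids)
    return ids[n_validation:], ids[:n_validation]


# Placeholder identifiers standing in for the 318 non-redundant proteins.
proteins = [f"chain_{k:03d}" for k in range(318)]
train, validation = split_train_validation(proteins)
print(len(train), len(validation))
```

The test set is not drawn here because, according to S12, it is fixed in advance as the 40 TMSEG test proteins.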
Further, step S2 uses the BLAST, HHblits and SPIDER3 software to extract features such as PSSM, HMM, secondary structure, water solubility and torsion angle from the protein sequence, and also obtains the hydropathic index. The specific steps are as follows:
Position-specific scoring matrices (PSSMs) are a commonly used representation of motifs in biological sequences. A PSSM contains rich evolutionary information and has proven to be a very useful feature in previous TMH prediction work. To obtain the PSSM, a multiple sequence alignment is first generated. In this example, the NR (non-redundant) database is searched using BLAST software. The specific command and parameters are:
psiblast -query sequence.fasta -db nr -out_ascii_pssm PSSM.matrix -save_pssm_after_last_round -evalue 1e-3 -max_target_seqs 10000 -num_iterations 3 -num_threads 6
The PSSM can be extracted from the multiple sequence alignment by:
[Formula image BDA0002114325330000051, which defines PSSM_{i,j} in terms of PPM_{i,j} and b_j, is not reproduced.]
where i = 1, …, L, with L the length of the protein sequence, and j = 1, …, 20 indexes the 20 amino acids. PPM denotes the position probability matrix: PPM_{i,j} is the probability of the j-th amino acid appearing in the i-th column of the multiple sequence alignment, and b_j is the background frequency of the j-th amino acid. For one amino acid residue, the PSSM feature has 20 dimensions.
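Because the formula itself is not reproduced here, the following is only a sketch of the conventional log-odds form, PSSM_{i,j} = log2(PPM_{i,j} / b_j), which is consistent with the symbols defined above. The base-2 logarithm, the probability floor `floor` and the function name are assumptions, not details taken from the source.

```python
import math


def pssm_from_ppm(ppm, background, floor=1e-9):
    """Log-odds scores log2(PPM[i][j] / b[j]) for an L x 20 probability matrix.

    `floor` avoids log(0) for amino acids never seen in a column; it is an
    assumption of this sketch, as is the base-2 logarithm.
    """
    return [
        [math.log2(max(p, floor) / b) for p, b in zip(row, background)]
        for row in ppm
    ]


# Toy example: uniform background, one column enriched in the first amino acid.
background = [0.05] * 20
ppm = [
    [0.05] * 20,                  # column identical to the background
    [0.10] + [0.90 / 19] * 19,    # first amino acid twice as frequent as background
]
pssm = pssm_from_ppm(ppm, background)
print(round(pssm[0][0], 3), round(pssm[1][0], 3))
```

A column that matches the background scores 0 everywhere, while an amino acid twice as frequent as its background frequency scores 1 under the base-2 assumption.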
HMM features are another feature containing evolutionary information. They are generated by the HHblits sequence alignment tool. Compared with BLAST, HHblits finds homologous sequences with an HMM-HMM alignment algorithm, which is more sensitive and gives more accurate results. For one amino acid residue, the HMM features have 30 dimensions. In the present invention, HHblits is used to search the Uniclust30 database to obtain the HMM features. The specific command and parameters are:
hhblits -i sequence.fasta -n 3 -e 0.001 -d uniclust30_2017_10 -cpu 6 -ohhm sequence.hmm -diff inf -id 99 -cov 50
Structural information features include torsion angle, water solubility and secondary structure. These features are predicted by the SPIDER3 software. For one amino acid residue, the structural information features have 14 dimensions in total.
The hydropathic index describes how hydrophilic or hydrophobic an amino acid side chain is. The larger the hydropathic index, the more hydrophobic the amino acid. The Kyte-Doolittle hydropathic index is used in the examples of the present invention. For one amino acid residue, the hydropathic index feature has 1 dimension.
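As an illustration, the one-dimensional hydropathic feature can be looked up per residue from the published Kyte-Doolittle scale. The second function sketches the 19-residue window average used by the early methods described in the Background. The function names and the truncation of windows at the sequence ends are assumptions of this sketch.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_feature(sequence):
    """One value per residue; raises KeyError for non-standard residues."""
    return [KYTE_DOOLITTLE[aa] for aa in sequence]


def windowed_hydropathy(sequence, window=19):
    """Average hydropathy over a window centered on each residue.

    Windows are truncated at the sequence ends (an assumption of this sketch).
    """
    values = hydropathy_feature(sequence)
    half = window // 2
    averaged = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


print(hydropathy_feature("IVLK"))
```

Hydrophobic residues such as Ile, Val and Leu receive large positive values, while charged residues such as Lys and Arg receive strongly negative ones.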
In the TMH prediction algorithm, embodiments of the present invention use the PSSM, HMM and structural information features. In the non-TMH region location prediction algorithm, embodiments of the present invention use the HMM and hydropathic index features.
Further, step S3 trains the whole-sequence-based deep residual network model and the sliding-window-based deep residual network model on the training set. The outputs of the two networks are averaged, and a dynamic threshold algorithm is applied to obtain the TMH prediction. The specific steps are as follows:
S31, hyperparameters of the whole-sequence-based deep residual model, such as the number of layers, the regularization coefficient, the learning rate and the batch size, are determined from the model's performance on the validation set. The training set contains 279 sequences;
S32, hyperparameters of the sliding-window-based deep residual model, such as the number of layers, the regularization coefficient, the learning rate, the batch size and the sliding window size, are determined from the model's performance on the validation set. The training set consists of 17437 positive samples (the residue at the center of the sliding window lies on a TMH) and 20003 negative samples (the residue at the center of the sliding window lies on a non-TMH region);
S33, for an alpha-helical transmembrane protein sequence, two predictions are obtained from the two deep residual models trained in steps S31 and S32, and the predictions of these two models of different scales are averaged. The parameters of the dynamic threshold model, such as the initial threshold, the merging criterion and the splitting criterion, are tuned on the validation set so as to address over-segmentation and under-segmentation in the prediction. The dynamic threshold algorithm is as follows:
i. The prediction scores are mean-filtered using a sliding window of 5 residues. During filtering, the maximum and minimum values in each window are removed. An initial TMH prediction is then obtained using an initial threshold of 0.55.
ii. For two adjacent TMHs, if the gap between them is no more than 5 residues and the sum of the lengths of the two TMHs is no more than 24 residues, the two TMHs are merged into one TMH.
iii. For each TMH longer than 33 residues, the TMH is re-examined with a threshold that starts from the initial value of 0.55 and is raised in increments of 0.05. If more than one TMH is then identified and the resulting TMHs do not satisfy the merge condition, the TMH is split.
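The three steps above can be sketched in plain Python as follows. The numeric parameters (window of 5, threshold 0.55, gap of 5, combined length of 24, length limit of 33, increment of 0.05) come from the text. Everything else is an assumption of this sketch: the function names, the use of `>=` when comparing a score with the threshold, the truncation of filter windows at the sequence ends, and stopping the split at the first raised threshold that produces more than one non-mergeable segment.

```python
def average_scores(whole_seq_scores, window_scores):
    """Average the per-residue outputs of the two networks."""
    return [(a + b) / 2 for a, b in zip(whole_seq_scores, window_scores)]


def trimmed_mean_filter(scores, window=5):
    """Mean filter that drops the largest and smallest value in each window."""
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        chunk = sorted(scores[max(0, i - half): i + half + 1])
        if len(chunk) > 2:  # end windows are truncated (assumption)
            chunk = chunk[1:-1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def segments_above(scores, threshold, start=0, end=None):
    """Inclusive (start, end) runs within [start, end] scoring >= threshold."""
    end = len(scores) - 1 if end is None else end
    runs, run_start = [], None
    for i in range(start, end + 1):
        if scores[i] >= threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            runs.append((run_start, i - 1))
            run_start = None
    if run_start is not None:
        runs.append((run_start, end))
    return runs


def can_merge(a, b, max_gap=5, max_total=24):
    """Merge condition of step ii for two adjacent segments a before b."""
    gap = b[0] - a[1] - 1
    total = (a[1] - a[0] + 1) + (b[1] - b[0] + 1)
    return gap <= max_gap and total <= max_total


def merge_adjacent(segments):
    merged = []
    for seg in segments:
        if merged and can_merge(merged[-1], seg):
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)
    return merged


def split_long(scores, seg, max_len=33, base=0.55, step=0.05):
    """Step iii: raise the threshold inside an over-long segment until it splits."""
    if seg[1] - seg[0] + 1 <= max_len:
        return [seg]
    k = 1
    while base + k * step <= 1.0:
        parts = merge_adjacent(
            segments_above(scores, base + k * step, seg[0], seg[1]))
        if len(parts) > 1:
            return parts
        if not parts:
            break
        k += 1
    return [seg]


def dynamic_threshold(scores, base=0.55):
    smoothed = trimmed_mean_filter(scores)
    result = []
    for seg in merge_adjacent(segments_above(smoothed, base)):
        result.extend(split_long(smoothed, seg, base=base))
    return result


# Under-segmentation example: one 40-residue stretch with a shallow dip.
fused = [0.9] * 18 + [0.62] * 4 + [0.9] * 18
print(dynamic_threshold(fused))
```

In the example, the initial threshold of 0.55 yields a single 40-residue segment; raising the threshold to 0.65 exposes the dip, and because the two halves are too long to satisfy the merge condition they are reported as two TMHs.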
Further, in step S4, the support vector machine model is trained on the training set as follows:
S41, the boundary between a TMH region and a non-TMH region strongly influences the prediction of the non-TMH position. In the present invention, such a boundary is a window consisting of 6 amino acid residues in the TMH region and 7 amino acid residues in the non-TMH region. A non-TMH region has two kinds of boundary with the TMH regions: one at its front end and one at its rear end. Because these two kinds of boundary differ considerably, the present embodiment trains two support vector machine models and integrates their predictions to obtain the final prediction score. A number of support vector machine models are trained by grid search, and the final model is chosen according to performance on the validation set. The training set includes 646 inside samples and 613 outside samples.
S42, the final prediction is obtained from the prediction scores using the maximum-minimum allocation method. First, the region with the largest prediction score is labeled inside and the region with the smallest score is labeled outside. Each remaining region is labeled inside if its score is closer to the maximum score, and outside otherwise. The maximum-minimum allocation method relies on the relative rather than the absolute size of the prediction scores, so misassignments caused by a fixed cutoff can be avoided.
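The allocation rule in S42 can be sketched as below. The function name, the handling of exact ties (assigned to outside) and the fallback to a 0.5 cutoff when all scores are equal, for example when a protein has a single non-TMH region, are assumptions of this sketch; the source does not specify these cases.

```python
def max_min_allocation(scores, fallback_cutoff=0.5):
    """Label each non-TMH region 'inside' or 'outside' from its prediction score.

    The largest score is labeled inside and the smallest outside; every other
    score takes the label of whichever extreme it is closer to.
    """
    if not scores:
        return []
    highest, lowest = max(scores), min(scores)
    if highest == lowest:
        # Not covered by the source: fall back to a fixed cutoff (assumption).
        label = "inside" if highest >= fallback_cutoff else "outside"
        return [label] * len(scores)
    return [
        "inside" if (highest - s) < (s - lowest) else "outside"
        for s in scores
    ]


# All four scores lie below 0.5, so a fixed 0.5 cutoff would call every
# region outside; the relative rule still separates them.
print(max_min_allocation([0.45, 0.30, 0.42, 0.20]))
```

The example shows why the relative rule is more robust than a fixed cutoff when the scores of a protein are systematically shifted.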
Further, step S5 predicts the topology of a protein as follows:
Given a protein sequence to be predicted, TMH prediction is run first. If no TMH is detected, the protein is considered to be water-soluble. If at least one TMH is detected, the protein is considered to be an alpha-helical transmembrane protein, and the positions of its non-TMH regions are then predicted. The TMH regions predicted in the first step may be inaccurate, which can strongly affect the prediction of the non-TMH region locations. This example therefore uses an integration approach and extracts 5 boundary regions, consisting of 10, 8, 6, 4 and 2 amino acid residues in the TMH region together with 3, 5, 7, 9 and 11 amino acid residues in the non-TMH region, respectively. Since a non-TMH region has two kinds of boundary region, front and rear, a total of 10 boundary regions are extracted as inputs to the support vector machine models. This integration greatly improves the robustness of the model.
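The extraction of the 10 boundary regions can be sketched with residue indices as below. The five (TMH, non-TMH) residue counts are taken from the text. The coordinate convention (0-based, inclusive), the orientation of the front and rear windows, the clipping at the sequence ends and the function names are assumptions of this sketch. It also does not treat terminal regions that have only one neighboring TMH, or regions shorter than the non-TMH part of a window, since the source does not describe those cases.

```python
# (residues taken from the TMH, residues taken from the non-TMH region)
BOUNDARY_SHAPES = [(10, 3), (8, 5), (6, 7), (4, 9), (2, 11)]


def boundary_windows(region, seq_len):
    """Return 10 inclusive index windows around one non-TMH region.

    `region` is the (start, end) of the non-TMH region, assumed to be flanked
    by a TMH ending at start - 1 and a TMH beginning at end + 1.
    """
    start, end = region
    windows = []
    for n_tmh, n_non in BOUNDARY_SHAPES:
        front = (start - n_tmh, start + n_non - 1)   # preceding TMH + region head
        rear = (end - n_non + 1, end + n_tmh)        # region tail + following TMH
        for low, high in (front, rear):
            windows.append((max(0, low), min(seq_len - 1, high)))
    return windows


def ensemble_score(window_scores):
    """Average the per-window scores into one score for the region."""
    return sum(window_scores) / len(window_scores)


windows = boundary_windows((30, 59), seq_len=100)
print(len(windows), windows[0], windows[1])
```

Each window is 13 residues long, matching the 6 + 7 boundary used for training in S41. In a full pipeline the features of each front window would go to the front-boundary model and those of each rear window to the rear-boundary model, and the 10 resulting scores would be averaged by `ensemble_score`.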
According to one or more embodiments, an alpha-helical transmembrane protein topology prediction apparatus comprises a memory and a processor coupled to the memory, the processor being configured to execute instructions stored in the memory and to perform the following RPA operations:
S1, organizing a training set, a validation set and a test set according to the definition of TMH;
S2, obtaining the hydropathic index information of the protein, and using the PSI-BLAST, HHblits and SPIDER3 tools to extract protein features such as PSSM, HMM, water solubility, secondary structure and torsion angle from the sequences in the organized data sets;
S3, training the whole-sequence-based deep residual network model and the sliding-window-based deep residual network model on the training set. The outputs of the two networks are averaged, and a dynamic threshold algorithm is then applied to obtain the TMH prediction;
S4, training the support vector machine model on the training set. The input of the model is the boundary between a non-TMH region and a TMH region, and the output is a real number between 0 and 1 that indicates whether the current non-TMH region is located outside or inside the cell. The final prediction is then determined by the maximum-minimum allocation method;
S5, for a protein to be predicted, first predicting the TMHs in the protein and then the positions of the non-TMH regions; combining the two predictions gives the final topology of the protein.
RPA (Robotic Process Automation) refers to a form of software automation that carries out tasks which, across industries, were originally performed manually on a computer.
According to one or more embodiments, 279 proteins are extracted from the OPM database as training data according to the new TMH definition. The whole-sequence-based network and the fixed-length-sliding-window-based network share the same structure, with 6 convolutional layers, and both use the Adam optimizer. For the whole-sequence-based model, the training data are the 279 proteins, the batch_size is 11 and the number of epochs is 100. For the sliding-window-based model, the training data contain 17437 positive samples and 20003 negative samples, the batch_size is 40, the sliding window size is 17 and the number of epochs is 100. For the model that predicts the positions of non-TMH regions, a total of 646 positive samples and 613 negative samples are extracted from the 279 proteins. Each sample is a boundary region 13 residues long, consisting of 6 amino acid residues in the TMH region and 7 amino acid residues in the non-TMH region.
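As an illustration of how the sliding-window samples might be formed, the sketch below cuts one window of 17 residues around every position of a per-residue feature matrix. The window size comes from the text, and the 64 feature dimensions follow from the 20 PSSM, 30 HMM and 14 structural dimensions stated earlier. The zero padding at the sequence ends, the function name and the toy feature values are assumptions of this sketch.

```python
def sliding_windows(features, window=17):
    """Return one window of per-residue feature vectors centered on each residue.

    `features` is a list of equal-length vectors, one per residue. Positions
    beyond the sequence ends are filled with zero vectors (assumption).
    """
    if not features:
        return []
    dim = len(features[0])
    half = window // 2
    pad = [[0.0] * dim for _ in range(half)]
    padded = pad + list(features) + pad
    return [padded[i:i + window] for i in range(len(features))]


# Toy input: 30 residues, 64 features each (20 PSSM + 30 HMM + 14 structural).
features = [[float(i)] * 64 for i in range(30)]
samples = sliding_windows(features)
print(len(samples), len(samples[0]), len(samples[0][0]))
```

Each sample would then be labeled positive if its central residue lies on a TMH and negative otherwise, as described for the training set in S32.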
The evaluation indexes used were as follows:
[Six formula images defining the evaluation indexes (BDA0002114325330000081 and BDA0002114325330000091 to BDA0002114325330000095) are not reproduced.]
A TMH segment is counted as correctly predicted if each predicted end point deviates from the corresponding true end point by no more than 5 residues, and the overlap between the predicted and true TMH is longer than half the length of the predicted TMH and longer than half the length of the true TMH. The TMHs of an alpha-helical transmembrane protein are correctly predicted if the number of predicted TMHs equals the number of true TMHs and every true TMH is correctly predicted. The topology of an alpha-helical transmembrane protein is correctly predicted if its TMHs are correctly predicted and the locations of all non-TMH regions are correctly predicted.
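These criteria can be written down directly, as in the sketch below. Segments are inclusive (start, end) index pairs. Pairing predicted and true TMHs in sequence order and the function names are assumptions of this sketch.

```python
def tmh_correct(pred, true, max_shift=5):
    """Check the end-point and overlap criteria for one TMH segment."""
    pred_start, pred_end = pred
    true_start, true_end = true
    if abs(pred_start - true_start) > max_shift:
        return False
    if abs(pred_end - true_end) > max_shift:
        return False
    overlap = min(pred_end, true_end) - max(pred_start, true_start) + 1
    pred_len = pred_end - pred_start + 1
    true_len = true_end - true_start + 1
    return overlap > pred_len / 2 and overlap > true_len / 2


def protein_tmh_correct(pred_segments, true_segments):
    """Same TMH count, and every true TMH matched by a correct prediction."""
    if len(pred_segments) != len(true_segments):
        return False
    pairs = zip(sorted(pred_segments), sorted(true_segments))
    return all(tmh_correct(p, t) for p, t in pairs)


def topology_correct(pred_segments, true_segments, pred_sides, true_sides):
    """TMHs correct and every non-TMH region assigned to the correct side."""
    return (protein_tmh_correct(pred_segments, true_segments)
            and list(pred_sides) == list(true_sides))


print(tmh_correct((10, 30), (12, 33)), tmh_correct((10, 30), (17, 36)))
```

In the first call both end points shift by at most 3 residues and the 19-residue overlap exceeds half of either length, so the segment counts as correct; in the second call the start point is off by 7 residues, so it does not.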
The algorithm proposed by the embodiment of the invention is compared with existing algorithms in the field on the test set. The comparative results are shown in Table 1. On the more important indexes (PRE_H, REC_H, V_p, V_top), the proposed algorithm is clearly superior to the other algorithms in the field.
TABLE 1 Performance of the proposed algorithm and existing algorithms in the field on the test set
[Table 1 is provided as images (BDA0002114325330000096, BDA0002114325330000101) and is not reproduced.]
Table 2 shows the effect of using dynamic and fixed thresholds on the test set. With a fixed threshold, after the integrated output of the deep models of different scales is obtained, a single threshold chosen on the validation set is applied to the prediction scores. After switching to the dynamic threshold, the PRE_H and REC_H indexes improve by 4.6% and 4.7%, respectively. FIG. 4 shows examples of over-segmentation and under-segmentation, both of which are resolved by the dynamic threshold algorithm. These experimental results demonstrate the effectiveness of the dynamic threshold.
TABLE 2 Effect of dynamic threshold and fixed threshold on the test set
[Table 2 is provided as an image (BDA0002114325330000102) and is not reproduced.]
Table 3 shows the effect of the fixed threshold method and the maximum-minimum allocation method on the test set. MCC is the Matthews correlation coefficient of the model when predicting the locations of non-TMH regions given the true TMH positions. V_top indicates how many protein topologies are correctly predicted given the true TMH positions. MCC_pred and V_top_pred are defined like MCC and V_top, except that they are computed when the true TMH positions are unknown. As can be seen from Table 3, the maximum-minimum allocation method is superior to the fixed threshold method, especially when the true TMH positions are unknown.
TABLE 3 Effect of the maximum-minimum allocation method and the fixed threshold method on the test set
[Table 3 is provided as an image (BDA0002114325330000103) and is not reproduced.]
Table 4 shows the effect of integrating deep models of different scales. The results of all three compared methods are post-processed with the dynamic threshold method. The deep learning models at different scales are complementary, and performance improves markedly after integration.
TABLE 4 Effect of integrating deep models of different scales
[Table 4 is provided as an image (BDA0002114325330000104) and is not reproduced.]
Table 5 shows the effect of integrating multiple boundary regions on predicting the locations of non-TMH regions when the true TMH positions are unknown. Junction2_11 indicates that the boundary region used consists of 2 amino acid residues in the TMH region and 11 amino acid residues in the non-TMH region; the other names follow the same pattern. Integrating the predictions for several boundary regions reduces the influence of inaccurately predicted TMH positions. The results of all 6 compared methods are processed with the maximum-minimum allocation method.
TABLE 5 Effect of integrating the predictions of multiple boundary regions on the test set
[Table 5 is provided as an image (BDA0002114325330000111) and is not reproduced.]
Table 6 shows the performance of the present invention on TMHs that are difficult to predict. There are two such classes of TMH. The first is the half-transmembrane α-helix, in which the TMH spans only half of the cell membrane, so that the two non-TMH regions before and after it lie on the same side of the membrane. The second is the near-transmembrane α-helix, which refers to a pair of TMHs separated by a gap of no more than 3 amino acid residues. A pair of near-transmembrane α-helices counts as successfully predicted only if both TMHs are predicted correctly. The test set contains 11 pairs of near-transmembrane α-helices and 6 half-transmembrane α-helices. As Table 6 shows, the present invention achieves better results than other algorithms in the field.
TABLE 6 Effect of the invention on the less predictable TMH
Figure BDA0002114325330000112
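The two special classes defined above can be identified from an annotated topology with a few lines of code. The sketch below assumes half-open (start, end) TMH segments and a list of side labels for the non-TMH regions in sequence order; the function names, the "in"/"out" labels and the example values are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # half-open (start, end) residue indices of one TMH


def near_tmh_pairs(segments: List[Segment], max_gap: int = 3) -> List[Tuple[Segment, Segment]]:
    """Consecutive TMHs separated by at most max_gap residues."""
    ordered = sorted(segments)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if 0 <= b[0] - a[1] <= max_gap]


def half_tmh_flags(sides: List[str]) -> List[bool]:
    """sides[k] and sides[k + 1] flank helix k; equal sides mark a half-TMH."""
    return [before == after for before, after in zip(sides, sides[1:])]


segments = [(5, 25), (27, 47), (60, 80)]
sides = ["in", "out", "in", "in"]   # regions before, between and after the helices
pairs = near_tmh_pairs(segments)    # the first two helices are 2 residues apart
flags = half_tmh_flags(sides)       # only the last helix has both flanks on one side
print(pairs, flags)
```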
If the integrated unit is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A method for predicting the topological structure of an alpha-helical transmembrane protein, comprising the following steps:
S1, organizing a training set, a validation set and a test set according to the definition of the transmembrane alpha helix (TMH);
S2, extracting position-specific scoring matrix (PSSM), HMM, water solubility, secondary structure, torsion angle and hydropathy index features from the sequences in the training set, the validation set and the test set;
S3, training a deep residual network model based on the whole protein sequence and a deep residual network model based on a sliding window by using the training set,
averaging the outputs of the two networks to integrate them, and then applying a dynamic threshold algorithm to obtain the TMH prediction result;
S4, training a support vector machine model by using the training set in S1,
wherein the input to the model is the boundary region between non-TMH and TMH,
and the output is a real number between 0 and 1 indicating whether the current non-TMH region tends to be located outside or inside,
and then determining the final prediction result by the max-min allocation method;
S5, for a protein to be predicted, first predicting the TMHs in the protein,
then predicting the location of the non-TMH regions, and combining the two prediction results to obtain the final topological structure of the protein.
2. The alpha-helical transmembrane protein topology prediction method according to claim 1, wherein step S1 further comprises the following steps:
S11, extracting all alpha-helical transmembrane protein structures from the OPM (Orientations of Proteins in Membranes) database, obtaining 1783 PDB files in total, and splitting the 1783 PDB files by protein chain identifier into 7814 PDB files, each containing a single protein chain;
S12, obtaining the three-dimensional coordinates of the amino acid residues in the protein and the coordinates of the cell membrane from the PDB file, and at the same time obtaining the secondary structure of the protein,
wherein a segment of the protein is a TMH when it is an alpha helix and part of it is located within the cell membrane;
S13, selecting the 40 test proteins used in the TMSEG work as the test set,
and, for the remaining 7774 PDB files, removing any protein chain that is broken, is shorter than 20 amino acids, or contains no TMH, thereby obtaining 5741 protein chains;
S14, removing redundancy between the 5741 proteins and the test set using HVAL > 0 as the criterion, then removing redundancy for the test set, obtaining 318 proteins in total, randomly selecting 39 proteins as the validation set, and using the remaining 279 proteins as the training set.
3. The method of predicting the topological structure of an α-helical transmembrane protein according to claim 2, wherein an amino acid residue belongs to a TMH when the residue is located on an alpha helix and part of that alpha helix lies within the cell membrane.
4. The method of predicting the topological structure of an α-helical transmembrane protein according to claim 1, wherein step S3 further comprises the following steps:
S31, determining the number of layers, the regularization coefficient, the learning rate and the batch size of the deep residual network model based on the whole protein sequence according to its performance on the validation set;
S32, determining the number of layers, the regularization coefficient, the learning rate and the batch size of the deep residual network model based on the sliding window according to its performance on the validation set;
S33, integrating the prediction results of the two deep learning models of different scales in S31 and S32 by averaging, and adjusting the parameters of the dynamic threshold model according to the models' performance on the validation set, so as to solve the problems of over-segmentation and under-segmentation in prediction.
5. The α-helical transmembrane protein topology prediction method according to claim 1, wherein in step S5, an integration approach is used in which 5 kinds of boundary regions are extracted, consisting respectively of 10, 8, 6, 4 and 2 amino acid residues in the TMH region and 3, 5, 7, 9 and 11 amino acid residues in the non-TMH region; each non-TMH region has two boundaries, a front boundary and a rear boundary, so that a total of 10 boundary regions are extracted as the input of the support vector machine model.
6. An alpha-helical transmembrane protein topology prediction device, characterized in that the prediction device comprises a memory; and
a processor coupled to the memory, the processor being configured to execute instructions stored in the memory so as to perform the following operations:
S1, organizing a training set, a validation set and a test set according to the definition of the TMH;
S2, extracting position-specific scoring matrix (PSSM), HMM, water solubility, secondary structure, torsion angle and hydropathy index features from the sequences in the training set, the validation set and the test set;
S3, training a deep residual network model based on the whole protein sequence and a deep residual network model based on a sliding window by using the training set, averaging the outputs of the two networks to integrate them, and applying a dynamic threshold algorithm to obtain the TMH prediction result;
S4, training a support vector machine model by using the training set,
wherein the input of the model is the boundary region between TMH and non-TMH, and the output is a real number between 0 and 1 indicating whether the current non-TMH region tends to be located outside or inside, and then determining the final prediction result by the max-min allocation method;
S5, for a protein to be predicted, first predicting the TMHs in the protein, then predicting the location of the non-TMH regions, and combining the two prediction results to obtain the final topological structure of the protein.
CN201910585644.4A 2019-07-01 2019-07-01 Alpha spiral transmembrane protein topological structure prediction method and device Active CN110390995B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910585644.4A CN110390995B (en) 2019-07-01 2019-07-01 Alpha spiral transmembrane protein topological structure prediction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910585644.4A CN110390995B (en) 2019-07-01 2019-07-01 Alpha spiral transmembrane protein topological structure prediction method and device

Publications (2)

Publication Number Publication Date
CN110390995A CN110390995A (en) 2019-10-29
CN110390995B true CN110390995B (en) 2022-03-11

Family

ID=68286124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910585644.4A Active CN110390995B (en) 2019-07-01 2019-07-01 Alpha spiral transmembrane protein topological structure prediction method and device

Country Status (1)

Country Link
CN (1) CN110390995B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant