CN110390995A - Method and device for predicting the topology of α-helical transmembrane proteins - Google Patents

Method and device for predicting the topology of α-helical transmembrane proteins

Info

Publication number
CN110390995A
Authority
CN
China
Prior art keywords
tmh
protein
prediction
model
spiral
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910585644.4A
Other languages
Chinese (zh)
Other versions
CN110390995B (en)
Inventor
沈红斌
冯世豪
杨静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN201910585644.4A
Publication of CN110390995A
Application granted
Publication of CN110390995B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00: ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20: Protein or domain folding
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

A method for predicting the topology of α-helical transmembrane proteins. A training set, validation set, and test set are assembled according to a definition of the transmembrane α-helix (TMH). Position-specific scoring matrix (PSSM), HMM, solvent accessibility, secondary structure, torsion angle, and hydropathy index features are extracted from the sequences in the three sets. The training set is used to train a deep residual network based on the whole sequence and a deep residual network based on a sliding window; the outputs of the two networks are averaged, and a dynamic threshold algorithm is applied to obtain the TMH regions. The training set is also used to train a support vector machine whose input is the junction between a non-TMH region and a TMH region and whose output is the position of the non-TMH region relative to the cell membrane. The TMH regions of a protein are predicted first, then the positions of the non-TMH regions; combining the two predictions gives the final topology of the protein.

Description

Method and device for predicting the topology of α-helical transmembrane proteins
Technical field
The invention belongs to the field of biotechnology, and in particular relates to a method and device for predicting the topology of α-helical transmembrane proteins based on multi-scale deep learning.
Background
The cell membrane is the barrier of the cell, separating the intracellular environment from the external environment. It consists of a phospholipid bilayer and a large number of membrane proteins embedded in it. Membrane proteins play an important role in a range of biological processes such as cell signal transduction, ion conduction, cell adhesion, cell recognition, and cell-cell communication. Many drugs are therefore designed to bind membrane proteins and thereby influence these processes.
Among all membrane proteins, α-helical transmembrane proteins form the majority. An estimated 27% of human proteins are α-helical transmembrane proteins. They are typically found in the plasma membrane of eukaryotes and in the inner membrane, and even the outer membrane, of bacterial cells. Knowing the topology of a protein's transmembrane α-helices helps scientists identify binding sites and design new drugs. However, because membrane proteins are difficult to solubilize, purify, and crystallize, and are too large for NMR, determining their structures experimentally is very challenging. Membrane protein structures are reported to account for only 1% of all structures in the PDB database. The field therefore urgently needs a computational method that accurately predicts membrane protein topology.
Over the past thirty years, many prediction methods have been developed in the field. They fall into three classes:
The first class predicts TMHs using only the hydropathy index. These methods take a sliding window of 19 amino acid residues as the model input and assign the mean hydropathy index of the 19 residues to the central residue. A fixed threshold then decides whether that residue lies on a TMH. The well-known positive-inside rule was also proposed at this stage: short loops on the intracellular side consist mainly of Lys and Arg residues. This rule has had a lasting influence on subsequent work;
The second class obtains more accurate predictions with machine learning algorithms and statistical models, such as hidden Markov models, support vector machines, and k-nearest-neighbor models. In addition to the hydropathy index, these models also use more powerful evolutionary information features;
The third class consists of fusion methods. Their main idea is to obtain the final result by combining several topology prediction methods. Experiments show that, for proteins predicted with high reliability, this approach can markedly improve performance.
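The first class of methods described above can be illustrated with a minimal Python sketch. The Kyte-Doolittle values are the published scale; the threshold of 1.6 and the truncation of windows at the sequence ends are assumptions made for illustration, since the text specifies neither, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (published scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydropathy(seq, window=19):
    """Mean hydropathy of the window centered on each residue.

    Windows are truncated at the sequence ends (an assumption).
    """
    half = window // 2
    scores = []
    for i in range(len(seq)):
        values = [KD[aa] for aa in seq[max(0, i - half): i + half + 1]]
        scores.append(sum(values) / len(values))
    return scores


def predict_tmh_residues(seq, threshold=1.6, window=19):
    """Flag residues whose window mean exceeds a fixed threshold.

    The value 1.6 is an illustrative assumption, not taken from the text.
    """
    return [score > threshold for score in window_hydropathy(seq, window)]
```

A fixed threshold like this is exactly what the later dynamic threshold step is meant to improve on: a single cutoff cannot separate two helices joined by a short, moderately hydrophobic loop.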
Although a great deal of research has been done in the field, most of this work predicts only the α-helical regions that are entirely buried in the membrane; that is, it takes a TMH to be a helical segment lying completely within the cell membrane. In Fig. 3, for example, only the helix region would be regarded as the transmembrane α-helix, while the remaining tail regions would be left out. These tail regions, however, are reported to play a key role in biological processes such as cell-cell communication and cell recognition, and their positions can also help scientists better understand protein function. In addition, as evaluation criteria become stricter, earlier prediction algorithms leave room for improvement in accuracy. It is therefore particularly important to design an algorithm that accurately predicts the positions of both the helix regions and the tail regions.
Summary of the invention
Embodiments of the invention provide a method for predicting the topology of α-helical transmembrane proteins.
The algorithm of the embodiments predicts the topology of α-helical transmembrane proteins with a multi-scale deep learning model. It has two main parts: predicting the TMH regions and predicting the positions of the non-TMH regions. To predict the TMH regions, two deep residual networks of different scales, one based on the whole sequence and one based on a fixed sliding window, extract higher-level features from the PSSM, HMM, and structural information features. Deep learning is combined with a conventional machine learning model. To handle over-segmentation and under-segmentation, a dynamic threshold algorithm is designed, which further improves the accuracy of the deep models. To predict the positions of the non-TMH regions, where training samples are scarce, the algorithm uses a support vector machine with HMM and hydropathy index features as input. Because the TMH prediction may be inaccurate, the algorithm uses an ensemble approach: for each non-TMH region, a total of 10 junction regions with the adjacent TMH regions are extracted as input, the support vector machine yields 10 prediction scores, and their mean is taken as the final score. Finally, over all non-TMH regions of a protein, the final prediction is obtained with the max-min assignment method. Combining the two parts gives the topology of the α-helical transmembrane protein.
The invention has the following beneficial effects:
1. The definition of TMH used in the invention differs from that of other work in the field. As shown in Fig. 3, a TMH region here includes both the helix region entirely buried in the cell membrane and the tail regions outside the membrane that are connected to it. These tail regions are important for understanding the biological function of the protein.
2. The invention uses multi-scale deep residual networks: one network based on the whole sequence and one based on a fixed-length sliding window. The predictions of the two networks are complementary to some degree, so combining them further improves the accuracy of the model.
3. The invention combines deep learning with conventional machine learning. In the TMH prediction model, a dynamic threshold model post-processes the deep learning output, resolving over-segmentation and under-segmentation during prediction and improving the model's performance.
4. The invention makes extensive use of ensembling when building the models. The TMH prediction algorithm combines deep learning models of two different scales. The algorithm for non-TMH region positions combines the predictions for 10 junction regions, which reduces the impact of inaccurate TMH positions and safeguards prediction accuracy.
5. The invention also performs well on some TMHs that are harder to predict.
Brief description of the drawings
The above and other objects, features, and advantages of exemplary embodiments of the invention will become easier to understand from the following detailed description read with reference to the accompanying drawings. The drawings show several embodiments of the invention by way of example and not limitation, in which:
Fig. 1 is a flowchart of TMH prediction according to one embodiment of the invention.
Fig. 2 is a flowchart of non-TMH region position prediction according to one embodiment of the invention.
Fig. 3 is a schematic diagram of an α-helical transmembrane protein.
Fig. 4 is a schematic diagram of the effect of the dynamic threshold algorithm on over-segmentation and under-segmentation according to one embodiment of the invention.
Detailed description of the embodiments
The invention relates to the biology of α-helical transmembrane proteins, and in particular to a multi-scale deep learning algorithm for predicting their topology (MemBrain2.1). The algorithm has two main parts: prediction of the transmembrane α-helix (TMH) regions and prediction of the positions of the other (non-TMH) regions. The first part uses deep learning models of two different scales together with a dynamic threshold algorithm. The first model predicts TMH positions from the whole sequence; the second predicts them from a fixed-length sliding window. Because their scales differ, the two models complement each other well, and fusing them improves the accuracy of TMH position prediction. The dynamic threshold algorithm detects over-segmentation and under-segmentation and corrects the deep learning predictions. The second part uses a support vector machine together with the max-min assignment method to predict the positions of the non-TMH regions. The support vector machine makes the model focus on the decisive training samples, and the max-min assignment method makes it focus on the relative rather than the absolute size of the predicted values; both improve the robustness of the model. Combining the two parts gives the topology of the α-helical transmembrane protein.
According to one or more embodiments, as shown in Fig. 1 and Fig. 2, a multi-scale deep learning algorithm for predicting the topology of α-helical transmembrane proteins comprises the following steps:
S1. Assemble the training set, validation set, and test set according to the definition of TMH;
S2. Using the PSI-BLAST, HHblits, and SPIDER3 tools, extract the position-specific scoring matrix (PSSM), HMM, solvent accessibility, secondary structure, torsion angle, and hydropathy index features from the sequences in the training, validation, and test sets;
S3. Use the training set to train a deep residual network based on the whole sequence and a deep residual network based on a sliding window. Average the outputs of the two networks, then apply the dynamic threshold algorithm to obtain the TMH prediction;
S4. Train a support vector machine on the training set. The model's input is the junction between a non-TMH region and a TMH region; its output is a real number between 0 and 1 indicating whether the current non-TMH region tends to lie outside or inside. The final prediction is then determined with the max-min assignment method;
S5. For a protein to be predicted, first predict the TMH positions, then predict the positions of the non-TMH regions; combining the two predictions gives the final topology of the protein.
Further, step S1, assembling the training, validation, and test sets according to the new definition of TMH, proceeds as follows:
S11. Extract all α-helical transmembrane protein structures from the OPM database, 1783 PDB files in total, and split these 1783 files into per-chain PDB files according to the number of protein chains in each file;
S12. Take the 40 test proteins used in the TMSEG work as the test set of this embodiment. Among the remaining PDB files, discard any chain that is discontinuous, is shorter than 20 amino acids, or contains no transmembrane α-helix. This leaves 5741 protein chains;
S13. Using the UniqueProt software with HVAL > 0 as the redundancy criterion, remove redundancy between the 5741 proteins and the test set, then remove redundancy among the remaining proteins themselves, leaving 318 proteins. Randomly select 39 of them as the validation set; the remaining 279 form the training set.
S14. From the PDB files, determine whether each amino acid residue belongs to a TMH and the position of each non-TMH region. In this embodiment, a residue belongs to a TMH if it meets two requirements: it lies on an α-helix segment, and part of that α-helix segment lies within the cell membrane.
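The labeling rule in S14 can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the helix segments and the per-residue membrane flags are taken as given inputs (the text does not say how they are derived from the PDB files), and `label_tmh` is a hypothetical name.

```python
def label_tmh(length, helices, in_membrane):
    """Label residues as TMH under the broadened definition.

    helices:     list of inclusive (start, end) index pairs of alpha-helix
                 segments (assumed to be known already).
    in_membrane: list of booleans, True where a residue lies within the
                 membrane (assumed to be known already).
    A whole helix segment is labeled TMH as soon as any part of it lies
    in the membrane, so the tail outside the membrane is included too.
    """
    labels = [False] * length
    for start, end in helices:
        if any(in_membrane[start: end + 1]):
            for i in range(start, end + 1):
                labels[i] = True
    return labels
```

Under the narrower definition used by earlier work, only the residues with `in_membrane` set would be labeled; here the whole helix, tail included, is labeled.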
Further, in step S2, PSSM and HMM features and structural information such as secondary structure, solvent accessibility, and torsion angles are extracted from the protein sequence with the PSI-BLAST, HHblits, and SPIDER3 software, and the hydropathy index is obtained at the same time. The details are as follows:
The position-specific scoring matrix (PSSM) is a common representation of motifs in biological sequences. It is rich in evolutionary information and has proven to be a very useful feature in earlier TMH prediction work. Obtaining the PSSM first requires a multiple sequence alignment, which in this embodiment is generated by searching the NR (non-redundant) database with PSI-BLAST. The command and parameters are:
psiblast -query sequence.fasta -db nr -out_ascii_pssm PSSM.matrix -save_pssm_after_last_round -evalue 1e-3 -max_target_seqs 10000 -num_iterations 3 -num_threads 6
The PSSM can be derived from the multiple sequence alignment by the following formula:
where i = 1, ..., L, with L the length of the protein sequence, and j = 1, ..., 20 indexes the 20 amino acid types. PPM denotes the position probability matrix: PPM_{i,j} is the probability that the j-th amino acid appears in the i-th column of the multiple sequence alignment, and b_j is the background frequency of the j-th amino acid. Each amino acid residue thus has a 20-dimensional PSSM feature.
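Read together with the definitions above, a PSSM entry is most plausibly the usual log-odds score of the observed column probability against the background frequency. The following is a reconstruction under that assumption; the base of the logarithm in particular is not stated in the text and is assumed here:

```latex
\mathrm{PSSM}_{i,j} = \log_2 \frac{\mathrm{PPM}_{i,j}}{b_j},
\qquad i = 1, \dots, L, \quad j = 1, \dots, 20
```

With this form, a positive entry means the j-th amino acid occurs in column i more often than the background would suggest, and a negative entry means it occurs less often.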
The HMM feature is another feature that carries evolutionary information. It is generated by the HHblits sequence alignment tool. Compared with BLAST, HHblits finds homologous sequences with an HMM-HMM alignment algorithm, which is more sensitive and gives more accurate results. Each amino acid residue has a 30-dimensional HMM feature. In the invention, the HMM feature is obtained by searching the Uniclust30 database with HHblits. The command and parameters are:
hhblits -i sequence.fasta -n 3 -e 0.001 -d uniclust30_2017_10 -cpu 6 -ohhm sequence.hmm -diff inf -id 99 -cov 50
The structural information features comprise torsion angles, solvent accessibility, and secondary structure, all predicted by the SPIDER3 software. Each amino acid residue has a 14-dimensional structural information feature.
The hydropathy index describes how hydrophilic or hydrophobic an amino acid side chain is: the larger the index, the more hydrophobic the amino acid. The embodiments use the Kyte-Doolittle hydropathy index. Each amino acid residue has a 1-dimensional hydropathy index feature.
The TMH prediction algorithm of the embodiments uses the PSSM, HMM, and structural information features. The algorithm for non-TMH region positions uses the HMM and hydropathy index features.
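The per-residue feature dimensions stated above imply a 64-dimensional input (20 + 30 + 14) for TMH prediction and a 31-dimensional input (30 + 1) for non-TMH position prediction. A minimal Python sketch of assembling such vectors follows; the dictionary layout and all names are hypothetical, and the order of concatenation is an assumption.

```python
# Feature dimensions per residue, as stated in the text.
FEATURE_DIMS = {"pssm": 20, "hmm": 30, "structure": 14, "hydropathy": 1}

# Feature groups used by each of the two predictors.
TMH_FEATURES = ("pssm", "hmm", "structure")
POSITION_FEATURES = ("hmm", "hydropathy")


def residue_vector(features, names):
    """Concatenate the named feature groups for one residue.

    features maps a group name to a list of floats; the concatenation
    order follows `names` (an assumption, the text does not fix it).
    """
    vector = []
    for name in names:
        values = features[name]
        if len(values) != FEATURE_DIMS[name]:
            raise ValueError(f"{name}: expected {FEATURE_DIMS[name]} values")
        vector.extend(values)
    return vector
```

For a sequence of length L, the whole-sequence network would then see an L x 64 matrix, while each sliding-window sample would be a window-size x 64 slice of it.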
Further, in step S3, the training set is used to train a deep residual network based on the whole sequence and a deep residual network based on a sliding window. The outputs of the two networks are averaged, and the dynamic threshold algorithm is applied to obtain the TMH prediction. The details are as follows:
S31. Based on performance on the validation set, determine the number of layers, regularization coefficient, learning rate, batch size, and other parameters of the whole-sequence deep residual model. The training set contains 279 sequences;
S32. Based on performance on the validation set, determine the number of layers, regularization coefficient, learning rate, batch size, sliding window size, and other parameters of the sliding-window deep residual model. The training set contains 17437 positive samples (the residue at the center of the window lies on a TMH) and 20003 negative samples (the residue at the center of the window lies on a non-TMH region);
S33. For an α-helical transmembrane protein sequence, obtain two predictions from the two deep residual models trained in S31 and S32, and combine them by averaging. Based on performance on the validation set, tune the parameters of the dynamic threshold model, such as the initial threshold, the merging criterion, and the splitting criterion, to resolve over-segmentation and under-segmentation in the prediction. The dynamic threshold algorithm is as follows:
i. Apply a mean filter to the prediction scores using a sliding window of 5 residues, discarding the maximum and minimum values in the window when filtering. Apply an initial threshold of 0.55 to obtain the initial TMH prediction.
ii. For two adjacent TMHs, if the gap between them is no more than 5 residues and neither TMH is longer than 24 residues, merge them into one TMH.
iii. For each TMH longer than 33 residues, search for TMHs within it using a threshold that starts at 0.55 and increases in steps of 0.05. If more than one TMH is identified and they do not meet the merging condition, split the TMH.
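Steps i to iii can be sketched in Python as follows. This is one reading of the description, not a reference implementation: the handling of windows at the sequence ends, the use of `>=` at the threshold, the stopping rule when raising the threshold, and all function names are assumptions. The input is the per-residue average of the two network outputs.

```python
def trimmed_mean_filter(scores, window=5):
    """Step i: mean filter that drops the window's max and min."""
    half = window // 2
    out = []
    for i in range(len(scores)):
        vals = sorted(scores[max(0, i - half): i + half + 1])
        if len(vals) > 2:  # truncated end windows keep at least one value
            vals = vals[1:-1]
        out.append(sum(vals) / len(vals))
    return out


def segments(scores, threshold, offset=0):
    """Inclusive (start, end) runs of residues scoring >= threshold."""
    segs, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            segs.append((start + offset, i - 1 + offset))
            start = None
    if start is not None:
        segs.append((start + offset, len(scores) - 1 + offset))
    return segs


def can_merge(a, b, max_gap=5, max_len=24):
    """Step ii condition for adjacent segments a (first) and b (second)."""
    gap = b[0] - a[1] - 1
    return (gap <= max_gap
            and a[1] - a[0] + 1 <= max_len
            and b[1] - b[0] + 1 <= max_len)


def merge_adjacent(segs):
    merged = []
    for seg in segs:
        if merged and can_merge(merged[-1], seg):
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)
    return merged


def split_long(seg, scores, max_len=33, start=0.55, step=0.05):
    """Step iii: re-threshold a long segment with rising thresholds."""
    if seg[1] - seg[0] + 1 <= max_len:
        return [seg]
    region = scores[seg[0]: seg[1] + 1]
    k = 0
    while True:
        threshold = round(start + k * step, 2)
        parts = segments(region, threshold, offset=seg[0])
        if not parts or threshold > 1.0:
            return [seg]  # assumed stopping rule: nothing left to split
        if len(parts) > 1 and not any(
                can_merge(a, b) for a, b in zip(parts, parts[1:])):
            return parts
        k += 1


def predict_tmh(scores, initial_threshold=0.55):
    """Run steps i-iii on the averaged per-residue network scores."""
    smooth = trimmed_mean_filter(scores)
    merged = merge_adjacent(segments(smooth, initial_threshold))
    result = []
    for seg in merged:
        result.extend(split_long(seg, smooth, start=initial_threshold))
    return result
```

Merging addresses over-segmentation (one helix cut in two by a brief dip in score), while splitting addresses under-segmentation (two helices fused because the loop between them still scores above the initial threshold).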
Further, in step S4, a support vector machine is trained on the training set, as follows:
S41. The junction between a TMH region and a non-TMH region strongly influences the prediction of the non-TMH region's position. In the invention, such a junction is a window consisting of 6 amino acid residues in the TMH region and 7 amino acid residues in the non-TMH region. A non-TMH region has two junctions with TMH regions, one before it and one after it. Because the two kinds of junction differ considerably, this embodiment trains two support vector machine models and combines their predictions to obtain the final prediction score. Multiple support vector machines are trained by grid search, and the final model is chosen by performance on the validation set. The training set contains 646 inside samples and 613 outside samples.
S42. Using the max-min assignment method, obtain the final prediction from the prediction scores. Among all prediction scores, the region with the largest score is first assigned inside and the one with the smallest score outside. Each remaining region is assigned inside if its score is closer to the maximum, and outside otherwise. Because the max-min assignment method considers the relative rather than the absolute size of the scores, it can avoid misassignments.
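The assignment rule in S42 is short enough to write out directly. In this Python sketch, the treatment of ties (a score exactly midway goes to outside) and the fallback to a 0.5 cutoff when a protein has a single non-TMH region are assumptions, since the text does not cover either case; `assign_sides` is a hypothetical name.

```python
def assign_sides(scores):
    """Assign 'inside' or 'outside' to each non-TMH region of one protein.

    scores: one prediction score per non-TMH region, higher meaning
    more inside-like.
    """
    if not scores:
        return []
    if len(scores) == 1:
        # Assumed fallback: no relative comparison is possible.
        return ["inside" if scores[0] >= 0.5 else "outside"]
    highest, lowest = max(scores), min(scores)
    labels = [
        "inside" if (highest - s) < (s - lowest) else "outside"
        for s in scores
    ]
    # The extremes are fixed first, as the rule requires.
    labels[scores.index(highest)] = "inside"
    labels[scores.index(lowest)] = "outside"
    return labels
```

For the scores `[0.45, 0.30, 0.42]`, a fixed cutoff of 0.5 would call every region outside, which cannot be a valid topology for a membrane-spanning protein; the relative rule instead gives inside, outside, inside.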
Further, in step S5, the topology of a protein is predicted as follows:
Given a protein sequence to be predicted, first predict its TMHs. If no TMH is detected, the protein is considered a water-soluble protein; if at least one TMH is detected, it is considered an α-helical transmembrane protein, and the positions of its non-TMH regions are then predicted. Because the TMH regions predicted in the first step may be inaccurate, the prediction of non-TMH region positions could be strongly affected. This embodiment therefore uses an ensemble approach: it extracts 5 junction regions consisting of 10, 8, 6, 4, and 2 amino acid residues in the TMH region together with 3, 5, 7, 9, and 11 amino acid residues in the non-TMH region, respectively. Since a non-TMH region has a junction both before and after it, 10 junction regions in total are extracted as input to the support vector machine. This ensemble greatly improves the robustness of the model.
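The extraction of the 10 junction regions can be sketched as below. This Python example works on the raw sequence for readability; in the described method each window would carry the HMM and hydropathy features of its residues. Clipping at the sequence ends and producing only 5 windows for a terminal region are assumptions, and the names are hypothetical.

```python
# (residues taken from the TMH, residues taken from the non-TMH region)
SPLITS = [(10, 3), (8, 5), (6, 7), (4, 9), (2, 11)]


def junction_windows(seq, region, splits=SPLITS):
    """Return the 13-residue junction windows around one non-TMH region.

    region is the inclusive (start, end) index pair of the non-TMH region.
    A region that does not touch a sequence end is assumed to be flanked
    by a TMH on both sides. A region shorter than the non-TMH part of a
    window will reach into the next TMH; the text does not address this.
    """
    start, end = region
    windows = []
    for n_tmh, n_non in splits:
        if start > 0:  # junction with the TMH before the region
            windows.append(seq[max(0, start - n_tmh): start + n_non])
        if end < len(seq) - 1:  # junction with the TMH after the region
            windows.append(seq[max(0, end - n_non + 1): end + 1 + n_tmh])
    return windows


def mean_score(scores):
    """Average the per-window scores into one score for the region."""
    return sum(scores) / len(scores)
```

Shifting the boundary by two residues at a time means that at least some of the windows stay close to the true junction even when the predicted TMH endpoints are off by a few residues.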
According to one or more embodiments, a device for predicting the topology of α-helical transmembrane proteins comprises a memory and a processor coupled to the memory. The processor is configured to execute instructions stored in the memory and performs the following RPA operations:
S1. Assemble the training set, validation set, and test set according to the definition of TMH;
S2. Obtain the hydropathy index of the protein. Using the PSI-BLAST, HHblits, and SPIDER3 tools, extract the PSSM, the HMM, and protein structural information such as solvent accessibility, secondary structure, and torsion angles from the sequences in the assembled data sets;
S3. Use the training set to train a deep residual network based on the whole sequence and a deep residual network based on a sliding window. Average the outputs of the two networks, then apply the dynamic threshold algorithm to obtain the TMH prediction;
S4. Train a support vector machine on the training set. The model's input is the junction between a non-TMH region and a TMH region; its output is a real number between 0 and 1 indicating whether the current non-TMH region tends to lie outside or inside. The final prediction is then determined with the max-min assignment method;
S5. For a protein to be predicted, first predict its TMHs, then predict the positions of the non-TMH regions; combining the two predictions gives the final topology of the protein.
RPA, i.e. Robotic Process Automation, refers to using software automation to carry out tasks that in various industries were originally done by a person operating a computer.
According to one or more embodiments, 279 proteins are extracted from the OPM database as training data according to the new TMH definition. The deep networks based on the whole sequence and on the fixed-length sliding window share the same network structure: each contains 6 convolutional layers and is optimized with Adam. For the whole-sequence model, the training data are the 279 proteins, batch_size is 11, and the number of epochs is 100. For the sliding-window model, the training data comprise 17437 positive samples and 20003 negative samples, batch_size is 40, the sliding window size is 17, and the number of epochs is 100. For the model that predicts non-TMH region positions, 646 positive samples and 613 negative samples in total are extracted from the 279 proteins. Each sample is a junction region of 13 residues, consisting of 6 amino acid residues in the TMH region and 7 amino acid residues in the non-TMH region.
The evaluation metrics used are as follows:
Here, a TMH is correctly predicted if the endpoints of the predicted TMH deviate from the true TMH endpoints by no more than 5 residues, and the overlap between the predicted and true TMH covers more than half of the predicted TMH length and more than half of the true TMH length. The TMHs of an α-helical transmembrane protein are correctly predicted if the number of predicted TMHs equals the number of true TMHs and every true TMH is correctly predicted. The topology of an α-helical transmembrane protein is correctly predicted if its TMHs are correctly predicted and the positions of all non-TMH regions are correctly predicted.
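The per-TMH and per-protein correctness criteria just described can be checked with a short Python sketch. Segments are inclusive (start, end) index pairs; pairing predicted and true TMHs in sequence order and treating "more than half" as a strict inequality are assumptions, and the function names are hypothetical.

```python
def tmh_correct(pred, true, max_shift=5):
    """True if one predicted TMH matches one true TMH."""
    (p_start, p_end), (t_start, t_end) = pred, true
    if abs(p_start - t_start) > max_shift or abs(p_end - t_end) > max_shift:
        return False
    overlap = min(p_end, t_end) - max(p_start, t_start) + 1
    pred_len = p_end - p_start + 1
    true_len = t_end - t_start + 1
    # Overlap must exceed half of both the predicted and the true length.
    return 2 * overlap > pred_len and 2 * overlap > true_len


def protein_tmh_correct(preds, trues):
    """True if the TMH counts match and every true TMH is matched."""
    if len(preds) != len(trues):
        return False
    return all(tmh_correct(p, t) for p, t in zip(sorted(preds), sorted(trues)))


def topology_correct(preds, trues, pred_sides, true_sides):
    """True if the TMHs and all non-TMH region positions are correct."""
    return protein_tmh_correct(preds, trues) and pred_sides == true_sides
```

For example, a predicted TMH at (10, 30) matches a true TMH at (12, 33), whereas a prediction at (10, 30) against a true TMH at (17, 33) fails because the start deviates by 7 residues.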
Existing algorithm in the algorithm of proposition of the embodiment of the present invention and field is compared on test set.Comparing result As shown in table 1.The algorithm that the embodiment of the present invention proposes (PRE in several more important indexsH, RECH, Vp, Vtop) all obvious Better than other algorithms in field.
Have effect of the algorithm on test set in 1. algorithms of different of table and field
Table 2 shows the effect using dynamic threshold and fixed threshold on test set.Fixed threshold refers to obtain different rulers After spending the integrated result of depth model, a fixed threshold process prediction point is determined according to effect of the model on verifying collection Number.As can be seen that after having used dynamic threshold, PREHAnd RECHIndex improves 4.6% and 4.7% respectively.Fig. 4 gave The example of segmentation and less divided, after having used dynamic threshold algorithm, both of these problems are all successfully addressed.The results show The validity of dynamic threshold.
The effect of 2. dynamic threshold of table and fixed threshold on test set
Table 3 shows the effect using fixed threshold and minimax allocation algorithm on test set.Wherein MCC is referred to Behind the known true position TMH, Ma Xiusi coefficient of the model on prediction non-TMH regional location.VtopIt refers to known Behind the true position TMH, the topological structure of how many protein is predicted correctly out.MCCpredWith Vtop_predWith MCC and VtopClass Seemingly, front two indices are the indexs under unknown true TMH situation unlike.From table 3 it can be seen that minimax Distribution method is better than fixed threshold method.Especially in the case where the unknown true position TMH.
The effect of 3. minimax distribution method of table and fixed threshold method on test set
Table 4 shows the effect after the depth model of integrated different scale.The result of these three control methods all have passed through The processing of dynamic threshold method.As can be seen that there is complementarity between the deep learning model of different scale.Effect after integrated It is obviously improved.
Table 4. integrates the effect of the depth model of different scale
Table 5 is shown at the unknown true position TMH, integrates multiple juncture area methods in the prediction region non-TMH position The effect set.Junction2_11 indicates that currently used juncture area is existed by 2 amino acid residues in TMH and 11 Amino acid residue composition in the region non-TMH.Other names are similar.As can be seen that by integrating the pre- of multiple juncture areas It surveys as a result, it is possible to reduce the influence of TMH position prediction inaccuracy bring.The result of this 6 kinds of control methods all have passed through maximum most The processing of small distribution method.
Table 5. integrates effect of the prediction result of multiple juncture areas on test set
Table 6 shows the performance of the present invention on TMHs that are harder to predict. There are two such classes. The first is the half-transmembrane α-helix: a TMH of this kind spans only half of the cell membrane, so the two non-TMH regions before and after it are located on the same side. The second is the closely spaced transmembrane α-helix pair: a pair of TMHs separated by a gap of no more than 3 amino acid residues. A closely spaced pair counts as successfully predicted only if both of its TMHs are predicted correctly. The test set contains 11 pairs of closely spaced transmembrane α-helices and 6 half-transmembrane α-helices. As Table 6 shows, the present invention outperforms the other algorithms in the field.
Table 6. Performance of the present invention on TMHs that are harder to predict
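The pair criterion can be sketched in a few lines. The gap limit of 3 residues and the rule that both helices must be correct come from the text; the function names and the half-open (start, end) representation are assumptions, and the per-helix correctness flags are taken as given because the matching rule for a single TMH is not restated in this passage.

```python
from typing import List, Tuple


def close_pairs(tmhs: List[Tuple[int, int]], max_gap: int = 3) -> List[Tuple[int, int]]:
    """Find consecutive TMHs separated by at most `max_gap` residues.

    `tmhs` holds half-open (start, end) segments sorted by position;
    the result holds index pairs into that list.
    """
    pairs: List[Tuple[int, int]] = []
    for i in range(len(tmhs) - 1):
        gap = tmhs[i + 1][0] - tmhs[i][1]  # residues between the two helices
        if 0 <= gap <= max_gap:
            pairs.append((i, i + 1))
    return pairs


def pair_success(pairs: List[Tuple[int, int]], correct: List[bool]) -> List[bool]:
    """A pair succeeds only when both of its helices are predicted correctly."""
    return [correct[i] and correct[j] for i, j in pairs]
```

For example, helices at (5, 25) and (27, 47) leave a gap of 2 residues and form a closely spaced pair, while a gap of 13 residues does not.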
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or some of the steps of the methods of the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The above is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any equivalent modification or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.

Claims (6)

1. An α-helical transmembrane protein topology prediction method, characterized by comprising the following steps:
S1, organizing a training set, a validation set and a test set according to the definition of the transmembrane α-helix (TMH);
S2, extracting position-specific scoring matrix (PSSM), HMM, water solubility, secondary structure, torsion angle and hydrophilicity index features from the sequences in the training set, validation set and test set;
S3, using the training set to train a deep residual network model based on the whole protein sequence and a deep residual network model based on a sliding window,
integrating the outputs of the two networks by averaging, and obtaining the TMH prediction result with a dynamic threshold algorithm;
S4, using the training set from S1 to train a support vector machine model,
where the input of the model is the junction region between a non-TMH region and a TMH,
and the output is a real number between 0 and 1 that indicates whether the current non-TMH region tends to be located outside or inside,
the final prediction result then being determined with the minimax distribution method;
S5, for a protein to be predicted, first predicting the TMHs in the protein,
then predicting the location of the non-TMH regions, and combining the two prediction results to obtain the final topology of the protein.
2. The α-helical transmembrane protein topology prediction method according to claim 1, characterized in that step S1 further comprises the following steps:
S11, extracting all α-helical transmembrane protein structures from the OPM database (Orientations of Proteins in Membranes database), 1783 PDB files in total, and splitting these 1783 PDB files by protein chain into per-chain PDB files, 7814 in total;
S12, obtaining from each PDB file the three-dimensional coordinates of the amino acid residues in the protein and the coordinates of the cell membrane, together with the secondary structure,
where a segment of the protein that is an α-helix and has a part inside the cell membrane is a TMH;
S13, selecting the 40 test proteins used in the TMSEG work as the test set,
and, for the remaining 7774 PDB files, directly discarding any protein chain that is broken, any protein shorter than 20 amino acids, and any protein containing no TMH, which leaves 5741 protein chains;
S14, using HVAL > 0 as the criterion, removing redundancy between the 5741 proteins and the test set and then removing redundancy among the proteins themselves, which yields 318 proteins in total, of which 39 are selected at random as the validation set and the remaining 279 form the training set.
3. The α-helical transmembrane protein topology prediction method according to claim 2, characterized in that an amino acid residue belongs to a TMH if it meets the following condition: the residue is located in an α-helix segment, and that α-helix segment has a part inside the cell membrane.
4. The α-helical transmembrane protein topology prediction method according to claim 1, characterized in that step S3 further comprises the following steps:
S31, determining the number of layers, regularization coefficient, learning rate and batch size of the deep residual model based on the whole sequence according to its performance on the validation set;
S32, determining the number of layers, regularization coefficient, learning rate and batch size of the deep residual model based on the sliding window according to its performance on the validation set;
S33, integrating the prediction results of the two deep learning models of different scales from S31 and S32 by averaging, and adjusting the parameters of the dynamic threshold model according to the performance on the validation set, so as to resolve over-segmentation and under-segmentation in the prediction.
5. The α-helical transmembrane protein topology prediction method according to claim 1, characterized in that in step S5 an integration method is used: 5 junction regions are extracted, composed respectively of 10, 8, 6, 4, 2 amino acid residues in the TMH region and 3, 5, 7, 9, 11 amino acid residues in the non-TMH region; because a non-TMH region has both a front and a rear boundary region, 10 junction regions in total are extracted as the input of the support vector machine model.
6. An α-helical transmembrane protein topology prediction device, characterized in that the prediction device comprises a memory; and
a processor coupled to the memory, the processor being configured to execute instructions stored in the memory, the processor performing the following operations:
S1, organizing a training set, a validation set and a test set according to the definition of the TMH;
S2, extracting position-specific scoring matrix (PSSM), HMM, water solubility, secondary structure, torsion angle and hydrophilicity index features from the sequences in the training set, validation set and test set;
S3, using the training set to train a deep residual network model based on the whole sequence and a deep residual network model based on a sliding window, integrating the outputs of the two networks by averaging, and obtaining the TMH prediction result with a dynamic threshold algorithm;
S4, using the training set to train a support vector machine model,
where the input of the model is the junction region between a TMH and a non-TMH region and the output is a real number between 0 and 1 that indicates whether the current non-TMH region tends to be located outside or inside, the final prediction result then being determined with the minimax distribution method;
S5, for a protein to be predicted, first predicting the TMHs in the protein, then predicting the location of the non-TMH regions, and combining the two prediction results to obtain the final topology of the protein.
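The chain filtering and the validation split described in claim 2 can be sketched as below. This is only an illustration under stated assumptions: the function names, arguments and the fixed random seed are hypothetical, and the HVAL-based redundancy removal is omitted because it needs pairwise sequence alignments that are outside the scope of a short example.

```python
import random
from typing import List, Sequence, Tuple


def keep_chain(sequence: str, tmh_count: int, is_continuous: bool,
               min_length: int = 20) -> bool:
    """Apply the three rejection rules: broken chain, too short, or no TMH."""
    return is_continuous and len(sequence) >= min_length and tmh_count > 0


def split_validation(chain_ids: Sequence[str], n_validation: int,
                     seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly hold out `n_validation` chains; return (training, validation).

    Assumption: the seed is illustrative, the patent only says the
    validation proteins are selected at random.
    """
    rng = random.Random(seed)
    shuffled = list(chain_ids)
    rng.shuffle(shuffled)
    return shuffled[n_validation:], shuffled[:n_validation]
```

With the 318 non-redundant proteins from S14, calling `split_validation(ids, 39)` would give 279 training and 39 validation proteins.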
CN201910585644.4A 2019-07-01 2019-07-01 Alpha spiral transmembrane protein topological structure prediction method and device Active CN110390995B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910585644.4A CN110390995B (en) 2019-07-01 2019-07-01 Alpha spiral transmembrane protein topological structure prediction method and device

Publications (2)

Publication Number Publication Date
CN110390995A true CN110390995A (en) 2019-10-29
CN110390995B CN110390995B (en) 2022-03-11

Family

ID=68286124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910585644.4A Active CN110390995B (en) 2019-07-01 2019-07-01 Alpha spiral transmembrane protein topological structure prediction method and device

Country Status (1)

Country Link
CN (1) CN110390995B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831332A (en) * 2012-04-16 2012-12-19 南京理工大学常熟研究院有限公司 Interpretation prediction method of transmembrane helix of membrane protein
CN103413068A (en) * 2013-08-28 2013-11-27 苏州大学 Prediction method of transmembrane helix three-dimensional structure of G-protein-coupled receptor based on structure topology
CN104615911A (en) * 2015-01-12 2015-05-13 上海交通大学 Method for predicting membrane protein beta-barrel transmembrane area based on sparse coding and chain training
CN105740646A (en) * 2016-01-13 2016-07-06 湖南工业大学 BP neural network based protein secondary structure prediction method
US20160319339A1 (en) * 2009-11-12 2016-11-03 Esoterix Genetic Laboratories, Llc Copy Number Analysis of Genetic Locus
WO2019006022A1 (en) * 2017-06-27 2019-01-03 The Broad Institute, Inc. Systems and methods for mhc class ii epitope prediction
CN109448787A (en) * 2018-10-12 2019-03-08 云南大学 Based on the protein subnucleus localization method for improving PSSM progress feature extraction with merging
CN109829902A (en) * 2019-01-23 2019-05-31 电子科技大学 A kind of lung CT image tubercle screening technique based on generalized S-transform and Teager attribute

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JING YANG.ET.: "MemBrain-contact 2.0: a new two-stage machine learning model for the prediction enhancement of transmembrane protein residue contacts in the full chain", 《BIOINFORMATICS》 *
肖峰: "Alpha螺旋跨膜蛋白3D结构中的残基可接触性预测研究", 《中国优秀博硕士学位论文全文数据库(硕士)基础科学辑》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667880A (en) * 2020-05-27 2020-09-15 浙江工业大学 Protein residue contact map prediction method based on depth residual error neural network
CN113870941A (en) * 2020-06-30 2021-12-31 苏州浦意智能医疗科技有限公司 Protein structure prediction method based on geometric network
CN113205855A (en) * 2021-06-08 2021-08-03 上海交通大学 Knowledge energy function optimization-based membrane protein three-dimensional structure prediction method
CN113205855B (en) * 2021-06-08 2022-08-05 上海交通大学 Knowledge energy function optimization-based membrane protein three-dimensional structure prediction method
CN113611354A (en) * 2021-07-05 2021-11-05 河南大学 Protein torsion angle prediction method based on lightweight deep convolutional network
CN113611354B (en) * 2021-07-05 2023-06-02 河南大学 Protein torsion angle prediction method based on lightweight deep convolutional network

Also Published As

Publication number Publication date
CN110390995B (en) 2022-03-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant