CN102831332A

CN102831332A - Interpretable method for predicting transmembrane helices of membrane proteins

Info

Publication number: CN102831332A
Application number: CN2012102616131A
Authority: CN
Inventors: 於东军; 沈红斌; 唐振民; 杨静宇
Original assignee: Nanjing University of Science and Technology; Nanjing University of Science and Technology Changshu Research Institute Co Ltd
Current assignee: Nanjing University of Science and Technology; Nanjing University of Science and Technology Changshu Research Institute Co Ltd
Priority date: 2012-04-16
Filing date: 2012-07-27
Publication date: 2012-12-19

Abstract

The invention discloses an interpretable method for predicting the transmembrane helices of membrane proteins. The method comprises the following steps: first, the evolutionary information of a protein is obtained with the PSI-BLAST (Position-Specific Iterated Basic Local Alignment Search Tool) program, and the features of each amino acid residue are extracted with a sliding-window technique; next, a self-organizing map (SOM) learns how transmembrane helices are distributed in the feature space and encodes this knowledge in its weight vectors; then an interpretable fuzzy rule set is extracted from those weight vectors with the Wang-Mendel method; finally, fuzzy inference is applied to each amino acid residue of the protein to be predicted, and once the prediction curve has been obtained, dynamic threshold segmentation determines whether each residue belongs to a transmembrane helix segment. The method has two advantages: (1) SOM learning mines the distribution pattern of transmembrane helices and reduces the noise in the original data; and (2) the transmembrane helix prediction model obtained through fuzzy rule extraction is highly interpretable.

Description

An interpretable method for predicting transmembrane helices of membrane proteins

Technical field

The present invention relates to techniques for predicting transmembrane helices from membrane protein sequences, and in particular to a transmembrane helix prediction method with high interpretability.

Background technology

Membrane proteins (transmembrane proteins) are a very important class of proteins in living organisms. They play important roles in nutrient transport, intercellular signal transduction, and cellular energy exchange. Membrane proteins are also the targets of many drugs, the most typical being the G protein-coupled receptor (GPCR) family: studies have reported that 60% to 70% of the target proteins in drug development are members of this family. In genomic data, 20% to 30% of gene products are predicted to be membrane proteins; regrettably, only about 1% of the structures in the PDB (Protein Data Bank) database are accurately determined transmembrane protein structures. Because membrane proteins are hydrophobic, their structures are very difficult to determine experimentally: a membrane protein must associate with a biological membrane to form a stable native conformation, so crystals are hard to obtain, while the most common ways of determining three-dimensional protein structure are X-ray crystal diffraction and nuclear magnetic resonance (NMR). The special structure of membrane proteins makes both methods hard to apply. Using bioinformatics and computational prediction to study the transmembrane structure of membrane proteins is therefore particularly important, and is of great significance for discovering and characterizing new transmembrane proteins and for studying their structure and physiological function.

Many models for predicting the transmembrane helices of membrane proteins have been proposed, and prediction accuracy is steadily improving. Typical methods include: TMHMM (A. Krogh, B. Larsson, G. von Heijne, and E. L. Sonnhammer, "Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes," J. Mol. Biol., vol. 305, pp. 567-580, 2001) and PHOBIUS (L. Kall, A. Krogh, and E. L. Sonnhammer, "A combined transmembrane topology and signal peptide prediction method," J. Mol. Biol., vol. 338, pp. 1027-1036, 2004), both of which use a hidden Markov model (HMM) to predict transmembrane helices; methods based on neural networks and dynamic programming, such as MEMSAT3 ("Improving the accuracy of transmembrane protein topology prediction using evolutionary information," Bioinformatics, vol. 23, no. 5, pp. 538-544, 2007); and methods based on support vector machines, such as SVMtm (Z. Yuan, J. S. Mattick, and R. D. Teasdale, "SVMtm: Support vector machines to predict transmembrane segments," J. Comput. Chem., vol. 25, pp. 632-636, 2004).

A survey of these prediction models shows, however, that they concentrate on accuracy and generalization ability, give little consideration to how well the computational model incorporates and explains domain knowledge, and neglect interpretability. In operation such a model resembles a black box: it offers no effective explanation of the mechanism behind its results, users find it hard to understand the relationship between the model's inputs and outputs, and communication with biologists is hindered. Effectively strengthening interpretability while preserving accuracy and generalization ability is therefore a pressing demand from experimental biologists.

The method proposed by the present invention uses the evolutionary information of membrane proteins and fuzzy-rule-set inference to predict membrane protein transmembrane helices, and yields a highly interpretable model. The PSI-BLAST program (A. A. Schaffer et al., "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements," Nucleic Acids Res., vol. 29, pp. 2994-3005, 2001) is used to extract the evolutionary information of membrane proteins. In the stage that learns the distribution of pattern features, a self-organizing map neural network is used (T. Kohonen, Self-Organization and Associative Memory, 3rd ed. New York: Springer-Verlag, 1989). In the rule extraction stage, the learning-from-examples technique proposed by Wang and Mendel is used (L. X. Wang and J. M. Mendel, "Generating fuzzy rules by learning from examples," IEEE Trans. Syst., Man, Cybern., vol. 22, no. 6, pp. 1414-1427, 1992).

Summary of the invention

The object of the present invention is to provide a highly interpretable method for predicting transmembrane helices of membrane proteins.

The technical solution of the present invention is an interpretable method for predicting transmembrane helices of membrane proteins, comprising the following steps:

Step 1: feature extraction, which converts each amino acid residue in the protein sequence into a vector representation. For a protein consisting of $L$ amino acids, its position-specific scoring matrix (PSSM) is obtained with the PSI-BLAST algorithm; this matrix has $L$ rows and 20 columns. The PSSM is first standardized row by row, and a sliding-window technique is then used to obtain the feature matrix of each amino acid residue. Averaging the feature matrix column by column gives the 20-dimensional feature vector $\mathbf{x}_i$ of the residue, where $i$ indicates which residue is meant;

Step 2: learning the distribution of pattern features. A self-organizing map (SOM) neural network learns how the protein transmembrane helix samples are distributed in the feature space and removes noise from the original training samples. Given a training sample set $D = \{(\mathbf{x}_i, y_i)\}$ with $y_i \in \{0, 1\}$, where 0 denotes non-transmembrane and 1 denotes transmembrane, the SOM is trained with the batch learning algorithm until it converges or a preset number of learning steps is reached;

Step 3: fuzzy rule extraction. A fuzzy rule set is extracted from the weight vectors (codebook vectors) of the trained SOM with the Wang-Mendel rule extraction method, with particular attention to resolving the inconsistencies and conflicts that arise during extraction;

Step 4: protein transmembrane helix prediction. For a given protein to be predicted, the fuzzy rule set extracted in Step 3 is applied through fuzzy inference to predict, residue by residue, the tendency of each amino acid residue to lie in a transmembrane helix. This produces a prediction curve, to which threshold segmentation is applied to determine whether each residue belongs to a transmembrane helix segment.

Beneficial effect of the present invention:

Compared with existing prediction techniques, the present invention has the following notable advantages. (1) Interpretability: although existing methods identify transmembrane helix segments of membrane proteins with fairly good accuracy, they are generally not interpretable; by using fuzzy rule inference, the present invention makes the resulting prediction model highly interpretable. (2) Improved prediction speed and accuracy: the Wang-Mendel method extracts rules from the prototype vectors of the SOM rather than from the original training samples, which effectively reduces the number of rules, yields a more compact fuzzy rule set, and improves the speed and accuracy of fuzzy prediction.

Description of drawings

Fig. 1 is the flowchart of the interpretable membrane protein transmembrane helix prediction method.

Fig. 2 shows the class labels of the SOM after training.

Fig. 3 shows the definitions of the membership functions of the fuzzy subsets.

Fig. 4 is a visual representation of rule 1.

Fig. 5 is a visual representation of rule 2.

Fig. 6 is a visual representation of rule 3.

Fig. 7 is a visual representation of rule 4.

Fig. 8 is the transmembrane propensity curve of protein P16102.

Embodiment

The present invention is further described below with reference to the accompanying drawings.

Fig. 1 shows the flowchart of the present invention:

In the training stage, PSI-BLAST is first used to obtain the PSSMs of the training proteins; next, a training data set is built from the PSSMs; then an SOM learns from the training data set; finally, the Wang-Mendel method extracts a fuzzy rule set from the weight vectors of the trained SOM. In the prediction stage, for a given protein, fuzzy inference predicts in turn the probability that each residue belongs to a transmembrane helix, and threshold segmentation then decides whether each residue belongs to a transmembrane helix. The details are as follows:

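Step 1: for a protein consisting of $L$ amino acid residues, PSI-BLAST yields a position-specific scoring matrix (PSSM) with $L$ rows and 20 columns, which turns the protein sequence information into matrix form:

$$P = \begin{pmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,20} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ p_{L,1} & p_{L,2} & \cdots & p_{L,20} \end{pmatrix} \tag{1}$$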

The PSSM is then normalized row by row:

(2)

where

(3)

(4)
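The row-wise normalization can be sketched in Python as follows. This is a minimal sketch that assumes z-score standardization (subtract the row mean, divide by the row standard deviation); formulas (2) to (4) are not legible in the source, and per-row min-max scaling would be another plausible reading. The function name `normalize_pssm_rows` and the toy scores are illustrative assumptions, not taken from the patent.

```python
from statistics import fmean, pstdev


def normalize_pssm_rows(pssm):
    """Standardize each row of an L x 20 PSSM (a list of lists).

    Assumption: z-score normalization per row; the source formulas
    (2)-(4) are not recoverable, so this is only one plausible reading.
    """
    normalized = []
    for row in pssm:
        mean = fmean(row)
        std = pstdev(row) or 1.0  # guard against constant rows
        normalized.append([(value - mean) / std for value in row])
    return normalized


# Toy 2 x 20 "PSSM" with made-up integer scores (not real PSI-BLAST output).
toy_pssm = [list(range(-10, 10)), [3] * 20]
z = normalize_pssm_rows(toy_pssm)
```

After this step every non-constant row has zero mean and unit variance, so residues with very different raw score ranges contribute on a comparable scale.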

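Denote the entries of the normalized PSSM by $\hat{p}_{i,j}$. A sliding window of size $W$ (with $W$ odd), centered on residue $i$, extracts the feature matrix of that residue:

$$F_i = \begin{pmatrix} \hat{p}_{i-\frac{W-1}{2},1} & \hat{p}_{i-\frac{W-1}{2},2} & \cdots & \hat{p}_{i-\frac{W-1}{2},20} \\ \vdots & \vdots & \ddots & \vdots \\ \hat{p}_{i,1} & \hat{p}_{i,2} & \cdots & \hat{p}_{i,20} \\ \vdots & \vdots & \ddots & \vdots \\ \hat{p}_{i+\frac{W-1}{2},1} & \hat{p}_{i+\frac{W-1}{2},2} & \cdots & \hat{p}_{i+\frac{W-1}{2},20} \end{pmatrix} \tag{5}$$

The mean of each column is then computed, giving the 20-dimensional feature vector of the residue:

$$\mathbf{x}_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,20}) \tag{6}$$

where

$$x_{i,j} = \frac{1}{W} \sum_{k=i-\frac{W-1}{2}}^{i+\frac{W-1}{2}} \hat{p}_{k,j}, \quad j = 1, 2, \ldots, 20 \tag{7}$$

The feature vectors of all protein residues, each paired with its class label, form the training data set $D = \{(\mathbf{x}_i, y_i)\}$ with $y_i \in \{0, 1\}$, where 0 denotes non-transmembrane and 1 denotes transmembrane.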
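The sliding-window averaging described above might look like the following Python sketch. Several details are assumptions because the source does not state them: the window is taken to be odd and centered on the residue, positions beyond either terminus are padded with zero rows, and the default window size of 15 is purely illustrative. The function name `window_features` is hypothetical.

```python
def window_features(norm_pssm, window=15):
    """Return one feature vector per residue: the column means of the
    window of normalized PSSM rows centered on that residue.

    Assumptions: odd window size; rows beyond either terminus are
    treated as all-zero rows (the source gives no boundary rule).
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd")
    half = window // 2
    n_rows = len(norm_pssm)
    n_cols = len(norm_pssm[0])
    zero_row = [0.0] * n_cols
    features = []
    for i in range(n_rows):
        rows = [
            norm_pssm[k] if 0 <= k < n_rows else zero_row
            for k in range(i - half, i + half + 1)
        ]
        # Column-wise mean over the window rows.
        features.append(
            [sum(row[j] for row in rows) / window for j in range(n_cols)]
        )
    return features


# Toy input: 5 residues, every column of row i holds the value i.
toy_norm_pssm = [[float(i)] * 20 for i in range(5)]
toy_features = window_features(toy_norm_pssm, window=3)
```

Averaging over the window keeps the feature dimension at 20 regardless of window size, which is what lets a 20-input fuzzy rule describe a residue together with its neighbors.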

Step 2: learning the distribution of pattern features. The SOM is trained on the training data set with the batch learning algorithm. In the $t$-th round of batch learning, the Voronoi region of each output node is computed first, that is, the set of training vectors for which that node is the best-matching unit, and the sum of the vectors in each region is recorded. The weight vectors of the SOM are then updated with the following iterative formula:

(8)

After training, each output node of the SOM is labeled with one of two classes, 0 or 1, corresponding to the non-transmembrane class and the transmembrane class respectively, as shown in Fig. 2. The figure shows that the two classes are distributed in a regular pattern.
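Batch SOM training of the kind described in Step 2 can be sketched as below. Because formula (8) is not legible in the source, the sketch uses the commonly published batch update, in which each weight vector becomes a neighborhood-weighted average of the Voronoi-region vector sums. The Gaussian neighborhood, the linearly shrinking radius, the rectangular grid, the random initialization, and all names are assumptions rather than the patent's stated choices.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the weight vector closest to x (squared Euclidean distance)."""
    return min(
        range(len(weights)),
        key=lambda k: sum((wi - xi) ** 2 for wi, xi in zip(weights[k], x)),
    )


def train_batch_som(data, rows=3, cols=3, epochs=20, seed=0):
    """Train a rows x cols SOM with batch updates and return its codebook."""
    rng = random.Random(seed)
    dim = len(data[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Initialize each node with a randomly chosen training vector.
    weights = [list(rng.choice(data)) for _ in grid]
    for epoch in range(epochs):
        # Neighborhood radius shrinks linearly down to 0.5 grid units.
        frac = epoch / max(epochs - 1, 1)
        sigma = (1.0 - frac) * max(rows, cols) / 2.0 + frac * 0.5
        # Voronoi regions: per-node vector sum and sample count.
        sums = [[0.0] * dim for _ in grid]
        counts = [0] * len(grid)
        for x in data:
            k = best_matching_unit(weights, x)
            counts[k] += 1
            for d in range(dim):
                sums[k][d] += x[d]
        # Each node becomes the neighborhood-weighted mean of the region sums.
        new_weights = []
        for j, (rj, cj) in enumerate(grid):
            numerator = [0.0] * dim
            denominator = 0.0
            for k, (rk, ck) in enumerate(grid):
                if counts[k] == 0:
                    continue
                dist2 = (rj - rk) ** 2 + (cj - ck) ** 2
                h = math.exp(-dist2 / (2.0 * sigma ** 2))
                denominator += h * counts[k]
                for d in range(dim):
                    numerator[d] += h * sums[k][d]
            if denominator > 0.0:
                new_weights.append([v / denominator for v in numerator])
            else:
                new_weights.append(weights[j])
        weights = new_weights
    return weights


# Toy data: two made-up 2-D clusters standing in for 20-D residue features.
toy_data = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.15],
            [0.9, 0.8], [0.8, 0.9], [0.85, 0.85]]
codebook = train_batch_som(toy_data, rows=2, cols=2, epochs=10)
```

Since every updated weight vector is a weighted average of training vectors, the codebook acts as a smoothed summary of the data, which is the noise-reduction effect the method relies on before rule extraction.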

Step 3: extracting a fuzzy rule set from the weight vectors of the trained SOM. Let $\Omega = \{(\mathbf{w}_k, c_k)\}$ be the set of weight vectors of the trained SOM, where $\mathbf{w}_k$ is a 20-dimensional weight vector and $c_k \in \{0, 1\}$ indicates whether that weight vector belongs to class 0 or class 1. The Wang-Mendel method is then used to extract a fuzzy rule set from $\Omega$:

(1) Define fuzzy subsets on the input and output spaces.

Let $U_j$ and $V$ be the universes of discourse of the $j$-th input variable and of the output variable, respectively. On each universe, define $N$ fuzzy subsets and the corresponding membership functions. Note that different numbers of fuzzy subsets may be defined on the universes of different variables, and different membership functions may also be used. This implementation takes $N = 2$. Figs. 3(a), 3(b), 3(c), and 3(d) give the fuzzy subset definitions on the 1st, 2nd, and 3rd input dimensions and on the output dimension, respectively: two fuzzy subsets, "small" and "large", are defined on each dimension, and triangular membership functions are used.
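A two-subset partition with triangular membership functions could be written as in the sketch below. The exact shapes in Fig. 3 are not available in the text, so the sketch assumes that "small" peaks at the lower end of the universe and "large" at the upper end, each falling linearly to zero at the opposite end. The function names and the universe bounds `lo` and `hi` are illustrative assumptions.

```python
def triangular(x, left, peak, right):
    """Triangular membership function with feet at left/right and apex at peak."""
    if x == peak:
        return 1.0
    if x <= left or x >= right:
        return 0.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def small_large_memberships(x, lo, hi):
    """Membership of x in the fuzzy subsets "small" and "large" on [lo, hi].

    Assumption: "small" has its apex at lo and "large" at hi, so the two
    memberships always sum to 1 inside the universe.
    """
    width = hi - lo
    x = min(max(x, lo), hi)  # clamp values that fall outside the universe
    return {
        "small": triangular(x, lo - width, lo, hi),
        "large": triangular(x, lo, hi, hi + width),
    }


example = small_large_memberships(0.25, 0.0, 1.0)
```

With only two labels per dimension, each of the 20 inputs reads as either "small" or "large", which keeps every extracted rule short enough to inspect by eye.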

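(2) For each weight vector in $\Omega$, use the Wang-Mendel method to extract a fuzzy rule.

First, for the current training sample $(\mathbf{w}, c)$ with $\mathbf{w} = (w_1, w_2, \ldots, w_{20})$, select on each dimension the fuzzy subset with the largest membership value. Let $A_j^k$ ($j = 1, 2, \ldots, 20$; $k = 1, 2, \ldots, N$, with $N$ the number of fuzzy subsets on a universe) be the $k$-th fuzzy subset defined on the universe of the $j$-th dimension, and let $\mu_{A_j^k}$ be the membership function of $A_j^k$. The fuzzy subset $A_j^{*}$ selected on the $j$-th dimension is then the one with the largest membership value:

$$\mu_{A_j^{*}}(w_j) = \max_{1 \le k \le N} \mu_{A_j^k}(w_j) \tag{9}$$

Similarly, let $B^k$ ($k = 1, 2, \ldots, N$) be the $k$-th fuzzy subset defined on the output universe, and let $\mu_{B^k}$ be the membership function of $B^k$. For the output value $c$, the selected fuzzy subset $B^{*}$ satisfies:

$$\mu_{B^{*}}(c) = \max_{1 \le k \le N} \mu_{B^k}(c) \tag{10}$$

Combining the fuzzy subsets selected on every dimension gives one fuzzy rule:

IF $x_1$ is $A_1^{*}$ and $x_2$ is $A_2^{*}$ and ... and $x_{20}$ is $A_{20}^{*}$ THEN $y$ is $B^{*}$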

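(3) Assign a degree to the rule. The degree of a rule is the product of the membership values of the fuzzy subsets selected on each dimension:

$$D(\text{rule}) = \mu_{B^{*}}(c) \prod_{j=1}^{20} \mu_{A_j^{*}}(w_j) \tag{10}$$

where $\mu_{A_j^{*}}(w_j)$ is the membership value of the $j$-th weight component in its selected input fuzzy subset and $\mu_{B^{*}}(c)$ is the membership value of the class label in the selected output fuzzy subset.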

(4) Create the fuzzy rule base. One rule can be extracted from each sample, so when the number of samples is large, redundant rules and conflicting rules may appear. Redundant rules are rules with the same antecedent and the same consequent; conflicting rules are rules with the same antecedent but different consequents. When a rule is to be added to the rule base, first check whether the rule base already contains a rule with the same antecedent. If not, add the rule to the rule base; if so, keep only the rule with the largest degree. This scheme effectively resolves both redundant and conflicting rules.
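Sub-steps (1) to (4) can be put together in a short Python sketch. It is a simplified illustration under stated assumptions: every input is taken to lie in [0, 1], the two fuzzy subsets "small" and "large" use linear memberships that sum to 1, and the output universe is [0, 1] with the class label as the output value. The codebook values are made up, and the names `memberships` and `extract_rules` are hypothetical.

```python
def memberships(x, lo=0.0, hi=1.0):
    """Memberships of x in "small" and "large" on [lo, hi] (assumed shapes)."""
    t = min(max((x - lo) / (hi - lo), 0.0), 1.0)
    return {"small": 1.0 - t, "large": t}


def extract_rules(codebook):
    """Wang-Mendel style extraction from (weight_vector, class_label) pairs.

    Returns {antecedent: (consequent, degree)}. For rules sharing an
    antecedent, only the one with the largest degree is kept, which
    removes both redundant and conflicting rules.
    """
    rule_base = {}
    for weights, label in codebook:
        antecedent = []
        degree = 1.0
        for value in weights:
            mu = memberships(value)
            best = max(mu, key=mu.get)  # subset with the largest membership
            antecedent.append(best)
            degree *= mu[best]
        mu_out = memberships(float(label))
        consequent = max(mu_out, key=mu_out.get)
        degree *= mu_out[consequent]
        key = tuple(antecedent)
        if key not in rule_base or degree > rule_base[key][1]:
            rule_base[key] = (consequent, degree)
    return rule_base


# Made-up 3-D codebook vectors (the method itself uses 20-D SOM weights).
toy_codebook = [
    ([0.1, 0.9, 0.2], 0),
    ([0.2, 0.8, 0.1], 0),  # redundant: same antecedent, same class, lower degree
    ([0.3, 0.7, 0.4], 1),  # conflicting: same antecedent, other class, lower degree
    ([0.9, 0.1, 0.8], 1),
]
toy_rules = extract_rules(toy_codebook)
```

In this toy run the four codebook vectors collapse to two rules, which mirrors the point that extracting from SOM prototypes rather than raw samples keeps the rule base compact.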

Step 4: protein transmembrane helix prediction. For a given protein to be predicted, fuzzy inference is applied with the extracted fuzzy rule set. Each amino acid residue in the protein has a corresponding feature vector $\mathbf{x} = (x_1, x_2, \ldots, x_{20})$, and fuzzy inference is used to predict from it as follows.

Product inference gives the activation of each rule:

$$\alpha_k = \prod_{j=1}^{20} \mu_{A_j^{(k)}}(x_j), \quad k = 1, 2, \ldots, M \tag{11}$$

where $A_j^{(k)}$ is the fuzzy subset on the $j$-th input dimension in the antecedent of the $k$-th rule.

A center-average defuzzifier then gives the predicted value (the likelihood of belonging to a transmembrane helix):

$$y = \frac{\sum_{k=1}^{M} \bar{y}_k \, \alpha_k}{\sum_{k=1}^{M} \alpha_k} \tag{12}$$

where $\bar{y}_k$ is the center value of the output fuzzy subset in the $k$-th rule, $y$ is the predicted value, and $M$ is the number of rules.
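Product inference followed by center-average defuzzification can be sketched in Python as follows. The sketch assumes two linear fuzzy subsets on [0, 1] per input and output centers of 0 for "small" and 1 for "large"; these values, the toy rules, and the names `memberships`, `CENTERS`, and `predict` are illustrative assumptions and not the patent's parameters.

```python
# Assumed centers of the output fuzzy subsets (non-transmembrane = 0,
# transmembrane = 1); the source does not list the actual values.
CENTERS = {"small": 0.0, "large": 1.0}


def memberships(x, lo=0.0, hi=1.0):
    """Memberships of x in "small" and "large" on [lo, hi] (assumed shapes)."""
    t = min(max((x - lo) / (hi - lo), 0.0), 1.0)
    return {"small": 1.0 - t, "large": t}


def predict(rules, x):
    """Center-average defuzzified output for feature vector x.

    rules is a list of (antecedent_labels, consequent_label) pairs.
    """
    numerator = 0.0
    denominator = 0.0
    for antecedent, consequent in rules:
        # Product inference: multiply the memberships along all inputs.
        activation = 1.0
        for label, value in zip(antecedent, x):
            activation *= memberships(value)[label]
        numerator += CENTERS[consequent] * activation
        denominator += activation
    return numerator / denominator if denominator > 0.0 else 0.0


# Two made-up 2-input rules (the method itself uses 20 inputs per rule).
toy_rules = [
    (("small", "large"), "small"),
    (("large", "small"), "large"),
]
low = predict(toy_rules, [0.2, 0.8])
high = predict(toy_rules, [0.9, 0.1])
```

The output is a weighted average of the rule centers, so it always lies between the smallest and largest center and can be read directly as a transmembrane propensity.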

Finally, the likelihood that each amino acid residue belongs to a transmembrane helix is plotted as a curve, and dynamic threshold segmentation is then applied to determine whether each residue belongs to a transmembrane helix.
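The source names dynamic threshold segmentation without giving its details, so the following Python sketch swaps in a simpler stand-in: a fixed threshold combined with a minimum run length. The threshold of 0.5, the minimum length, the toy curve, and the function name `segments_from_curve` are all assumptions made for illustration.

```python
def segments_from_curve(curve, threshold=0.5, min_length=1):
    """Return 1-based inclusive (start, end) runs where curve >= threshold.

    Assumption: a fixed threshold replaces the dynamic threshold of the
    source, whose rule is not described; runs shorter than min_length
    are discarded.
    """
    segments = []
    start = None
    for pos, value in enumerate(curve, start=1):
        if value >= threshold and start is None:
            start = pos  # a new run begins here
        elif value < threshold and start is not None:
            if pos - start >= min_length:
                segments.append((start, pos - 1))
            start = None
    # Close a run that extends to the last residue.
    if start is not None and len(curve) - start + 1 >= min_length:
        segments.append((start, len(curve)))
    return segments


# Made-up propensity curve for a 9-residue toy sequence.
toy_curve = [0.1, 0.7, 0.8, 0.9, 0.2, 0.6, 0.1, 0.8, 0.9]
toy_segments = segments_from_curve(toy_curve, threshold=0.5, min_length=2)
```

A minimum run length is one simple way to suppress isolated spikes in the curve; a real transmembrane helix spans on the order of 20 residues, so a much larger value than the one used here would be natural in practice.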

Example: the membrane protein data set used contains 131 membrane proteins, of which 92 are prokaryotic membrane proteins, 37 are eukaryotic membrane proteins, and 2 are viral membrane proteins, with 445 transmembrane helix segments in total and an average transmembrane length of 21 residues. Applying the method of this patent yields 191 fuzzy rules in all. Four typical rules are given below:

Rule 1:

IF x1 is F[1,1] and x2 is F[2,1] and x3 is F[3,1] and x4 is F[4,1] and x5 is F[5,1] and x6 is F[6,1] and x7 is F[7,2] and x8 is F[8,2] and x9 is F[9,2] and x10 is F[10,2] and x11 is F[11,1] and x12 is F[12,1] and x13 is F[13,2] and x14 is F[14,2] and x15 is F[15,2] and x16 is F[16,2] and x17 is F[17,2] and x18 is F[18,2] and x19 is F[19,2] and x20 is F[20,2]

THEN y is F[1] with rule degree = 0.002

Rule 2:

IF x1 is F[1,1] and x2 is F[2,1] and x3 is F[3,1] and x4 is F[4,1] and x5 is F[5,1] and x6 is F[6,1] and x7 is F[7,2] and x8 is F[8,1] and x9 is F[9,1] and x10 is F[10,2] and x11 is F[11,1] and x12 is F[12,1] and x13 is F[13,1] and x14 is F[14,2] and x15 is F[15,2] and x16 is F[16,2] and x17 is F[17,2] and x18 is F[18,2] and x19 is F[19,2] and x20 is F[20,2]

THEN y is F[1] with rule degree = 0.013

Rule 3:

IF x1 is F[1,2] and x2 is F[2,2] and x3 is F[3,2] and x4 is F[4,2] and x5 is F[5,2] and x6 is F[6,2] and x7 is F[7,2] and x8 is F[8,2] and x9 is F[9,2] and x10 is F[10,2] and x11 is F[11,2] and x12 is F[12,1] and x13 is F[13,1] and x14 is F[14,1] and x15 is F[15,1] and x16 is F[16,1] and x17 is F[17,1] and x18 is F[18,1] and x19 is F[19,1] and x20 is F[20,1]

THEN y is F[2] with rule degree = 0.005

Rule 4:

IF x1 is F[1,2] and x2 is F[2,2] and x3 is F[3,2] and x4 is F[4,2] and x5 is F[5,2] and x6 is F[6,2] and x7 is F[7,1] and x8 is F[8,1] and x9 is F[9,1] and x10 is F[10,1] and x11 is F[11,2] and x12 is F[12,2] and x13 is F[13,1] and x14 is F[14,1] and x15 is F[15,1] and x16 is F[16,1] and x17 is F[17,1] and x18 is F[18,1] and x19 is F[19,1] and x20 is F[20,1]

THEN y is F[2] with rule degree = 0.029

Rules 1 and 2 are used to decide that a residue does not belong to a transmembrane helix segment, whereas rules 3 and 4 are used to decide that a residue does belong to one. To show that these rules are readily interpretable, the four rules are first converted into graphical form, as shown in Figs. 4 to 7. These figures explain the rules well. Rules 1 and 2 mean: IF a residue has high hydrophilicity and low hydrophobicity, THEN the residue does not belong to a transmembrane helix. Rules 3 and 4 mean: IF a residue has low hydrophilicity and high hydrophobicity, THEN the residue belongs to a transmembrane helix. The knowledge embodied in these rules is fully consistent with current biological understanding.

The prediction of the transmembrane helix segments of protein P16102 is taken as an example below. The amino acid sequence of protein P16102 is as follows:

>P16102

MSITSVPGVVDAGVLGAQSAAAVRENALLSSSLWVNVALAGIAILVFVYM

GRTIRPGRPRLIWGATLMIPLVSISSYLGLLSGLTVGMIEMPAGHALAGE

MVRSQWGRYLTWALSTPMILLALGLLADVDLGSLFTVIAADIGMCVTGLA

AAMTTSALLFRWAFYAISCAFFVVVLSALVTDWAASASSAGTAEIFDTLR

VLTVVLWLGYPIVWAVGVEGLALVQSVGVTSWAYSVLDVFAKYVFAFILL

RWVANNERTVAVAGQTLGTMSSDD

This protein has 7 transmembrane segments: 27-50, 63-82, 106-122, 134-153, 159-179, 201-221, and 227-248.

Applying the method of this patent, the PSSM of the protein is generated first; the sliding-window technique then extracts the features of each residue; the features of each residue are fed to the fuzzy inference engine to obtain the tendency of that residue to belong to a transmembrane helix; finally, the tendencies of all residues in the protein are plotted as a curve, as shown in Fig. 8. Threshold segmentation then gives the predicted transmembrane helix segments: 29-50, 62-86, 106-124, 132-155, 160-181, 197-217, and 225-251. This example shows that the predicted segments agree well with the actual ones.
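The agreement between the actual and predicted segments of P16102 can be checked with a few lines of Python. The segment coordinates are copied from the text; pairing the two lists in the order given and the helper name `overlap` are assumptions made for this illustration.

```python
# Segment boundaries for P16102 as listed in the text (1-based, inclusive).
ACTUAL = [(27, 50), (63, 82), (106, 122), (134, 153),
          (159, 179), (201, 221), (227, 248)]
PREDICTED = [(29, 50), (62, 86), (106, 124), (132, 155),
             (160, 181), (197, 217), (225, 251)]


def overlap(a, b):
    """Number of residues shared by two inclusive (start, end) segments."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


# Assumption: the i-th predicted segment corresponds to the i-th actual one.
overlaps = [overlap(a, p) for a, p in zip(ACTUAL, PREDICTED)]
largest_boundary_shift = max(
    max(abs(a[0] - p[0]), abs(a[1] - p[1])) for a, p in zip(ACTUAL, PREDICTED)
)
print(overlaps)                # [22, 20, 17, 20, 20, 17, 22]
print(largest_boundary_shift)  # 4
```

Every predicted segment overlaps its actual counterpart by at least 17 residues, and no boundary is off by more than 4 residues, which is the sense in which the prediction agrees with the annotation.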

The above embodiment does not limit the present invention in any way; every technical solution obtained by equivalent replacement or equivalent transformation falls within the scope of protection of the present invention.

Claims

1. An interpretable method for predicting transmembrane helices of membrane proteins, characterized by comprising the following steps:

Step 1: feature extraction, which converts each amino acid residue in the protein sequence into a vector representation; for a protein consisting of $L$ amino acids, its position-specific scoring matrix (PSSM) is obtained with the PSI-BLAST algorithm, this matrix having $L$ rows and 20 columns; the PSSM is first standardized row by row; a sliding-window technique is then used to obtain the feature matrix of each amino acid residue; the feature matrix is averaged column by column to obtain the 20-dimensional feature vector $\mathbf{x}_i$ of the residue, where $i$ indicates which residue is meant;

Step 2: learning the distribution of pattern features; a self-organizing map (SOM) neural network is used to learn the distribution of the samples in the feature space and to remove noise from the original training samples; given a training sample set $D = \{(\mathbf{x}_i, y_i)\}$ with $y_i \in \{0, 1\}$, where 0 denotes non-transmembrane and 1 denotes transmembrane, the SOM is trained with the batch learning algorithm until it converges or a preset number of learning steps is reached;

Step 3: fuzzy rule extraction; a fuzzy rule set is extracted from the weight vectors (codebook vectors) of the trained SOM with the Wang-Mendel rule extraction method;

2. The membrane protein transmembrane helix prediction method according to claim 1, characterized in that in said Step 3 different numbers of fuzzy subsets are defined on the universes of discourse of different variables.

3. The membrane protein transmembrane helix prediction method according to claim 1, characterized in that in said Step 4 product inference is used to compute the activation of each rule.