CN108614957B

CN108614957B - Multi-stage protein structure prediction method based on Shannon entropy

Info

Publication number: CN108614957B
Application number: CN201810238703.6A
Authority: CN
Inventors: 张贵军; 谢腾宇; 周晓根; 王柳静; 马来发
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2018-03-22
Filing date: 2018-03-22
Publication date: 2021-06-18
Anticipated expiration: 2038-03-22
Also published as: CN108614957A

Abstract

A multi-stage protein structure prediction method based on Shannon entropy. First, the Rosetta Abinitio protocol is used to explore the conformational search space, and potential native-state regions are identified by clustering the resulting background points. Next, prediction is carried out stage by stage within the framework of a population evolution algorithm: each generation of the population is related to the potential native-state regions, and the evolutionary state of the current population is indicated by classification. Then, the state transition matrix between two successive generations of the population is calculated, and the Shannon entropy is used to measure how the population's states are changing. Finally, the stage is switched according to the accumulated number of times the Shannon entropy falls below a threshold, and the last generation of the population is taken as the final prediction result. Because stages are switched dynamically according to the Shannon entropy, the prediction accuracy and robustness of the algorithm are significantly improved.

Description

Multi-stage protein structure prediction method based on Shannon entropy

Technical Field

The invention relates to the fields of bioinformatics, intelligent optimization and computer application, in particular to a multi-stage protein structure prediction method based on Shannon entropy.

Background

Proteins are the material basis of life. They are organic macromolecules, the basic organic constituents of cells and the main carriers of life activities, and are formed when polypeptide chains, built from amino acids by dehydration condensation, coil and fold into a defined spatial structure. Proteins carry out specific functions by folding or coiling into such a spatial structure, and multiple proteins often bind together to form a stable protein complex. The three-dimensional structure of proteins is of decisive importance in drug design, protein engineering and biotechnology, so protein structure prediction is an important research problem.

Experimental methods for determining protein structure, including X-ray crystallography, nuclear magnetic resonance spectroscopy and electron microscopy, are widely used. Of these, X-ray crystallography is regarded as one of the more practical and accurate. However, it requires a complex crystallization process and cannot be applied to proteins that do not crystallize readily (e.g., membrane proteins). In addition, these experimental methods are extremely time consuming, expensive, and prone to error.

According to the Anfinsen principle, the three-dimensional structure of a protein can be predicted directly from its amino acid sequence by using a computer as a tool and applying an appropriate algorithm; this is currently a main research topic in bioinformatics. The de novo prediction method builds a physics-based or knowledge-based protein energy model on the Anfinsen hypothesis and then designs a suitable optimization algorithm to find the minimum-energy conformation. On one hand, the method helps to reveal the protein folding mechanism in a biological sense and may ultimately clarify the second-genetic-code portion of the central dogma of biology; on the other hand, the approach is general in a practical sense, and de novo prediction is the only choice for targets with sequence similarity < 20% or for oligopeptides (small proteins of < 10 residues). Rosetta, QUARK and similar methods build knowledge-based energy models and have performed prominently in past CASP events. However, when such methods predict a target protein with a long sequence, the search space grows exponentially and the prediction accuracy drops sharply, which leads to insufficient sampling capability, improper stage switching, failure to retain good intermediate results, and wasted computing resources.

Therefore, the existing multi-stage protein structure prediction method based on the energy function has defects in stage switching and prediction accuracy, and needs to be improved.

Disclosure of Invention

In order to overcome the defects of the conventional multi-stage protein structure prediction method based on an energy function in the aspects of stage switching and prediction precision, the invention provides a Shannon entropy guided multi-stage switching protein structure prediction method with reasonable stage switching and high prediction precision.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A multi-stage protein structure prediction method based on Shannon entropy, the method comprising the following steps:

1) giving the input sequence information, and obtaining a fragment library for the sequence by using the Robetta server;

2) constructing a Markov state model by the following process:

2.1) acquiring nstruct background points: running the Rosetta Abinitio protocol nstruct times, and recording the conformation produced by each run as a background point;

2.2) calculating the pairwise root-mean-square deviation (RMSD) between the nstruct background points to form a distance matrix D;

2.3) clustering the nstruct background points with the k-means method according to the distance matrix D to obtain m cluster centers, which serve as the m Markov states;

3) initialization: setting the population size NP, the current stage stage = 1, the Shannon entropy threshold $\alpha$, and the maximum Shannon entropy accumulation count count_max; running the current stage of Rosetta Abinitio NP times according to the input sequence to generate the initial conformation population $P = \{C_1, C_2, \ldots, C_{NP}\}$, where $C_{NP}$ denotes the NP-th individual;

4) calculating the current population state: classifying each individual $C_i$, $i \in \{1, \ldots, NP\}$, in the population: calculating the RMSD distance between $C_i$ and the m cluster centers; if $C_i$ is nearest to the p-th cluster center, the current state of the individual is $state_i = p$, $p \in \{1, 2, \ldots, m\}$; the state of the whole population is denoted $state_{last} = \{state_1, state_2, \ldots, state_{NP}\}$, where $state_{last}$ denotes the population state of the previous generation; then stage = stage + 1;

5) setting the Shannon entropy accumulation count count = 0 and entering the next stage, by the following process:

5.1) performing the prediction of the current stage on the population, by the following process:

5.1.1) performing fragment assembly on individual $C_i$ to obtain $C'_i$, and using the energy function of the current stage to evaluate the energies of the conformation before and after fragment assembly, $E_{stage}(C_i)$ and $E_{stage}(C'_i)$;

5.1.2) if $E_{stage}(C_i) > E_{stage}(C'_i)$, accepting the current fragment assembly, $C_i = C'_i$; otherwise, selecting with the Metropolis criterion: calculating $p = \exp(-(E_{stage}(C_i) - E_{stage}(C'_i)))$ and, if $p > rand(0,1)$, accepting the current fragment assembly, $C_i = C'_i$; otherwise, rejecting the fragment assembly;

5.1.3) executing the steps 5.1.1) to 5.1.2) on all individuals to obtain a next generation population;

5.2) calculating the current population state: classifying each individual $C_i$, $i \in \{1, 2, \ldots, NP\}$, in the population: calculating the RMSD distance between $C_i$ and the m cluster centers; if $C_i$ is nearest to the q-th cluster center, $q \in \{1, 2, \ldots, m\}$, the current state of the individual is $state'_i = q$; the state of the whole population is denoted $state_{now} = \{state'_1, state'_2, \ldots, state'_{NP}\}$, where $state_{now}$ denotes the current population state;

5.3) obtaining the Markov state transition matrix T from the state statistics of the previous and current generations: for each conformation $C_i$, $i \in \{1, \ldots, NP\}$, the two successive states $state_i = p$ and $state'_i = q$ indicate a transition from state p to state q, so $t_{pq} = t_{pq} + 1/m$, where $t_{pq}$, the entry in row p and column q of T, represents the state transition frequency and has an initial value of 0;

5.4) calculating the Shannon entropy from the state transition matrix T: $Entropy = \sum_{p,q} -t_{pq} \ln t_{pq}$;

5.5) updating the population state: $state_{last} = state_{now}$;

5.6) if $Entropy < \alpha$, the population state transitions are considered relatively deterministic, and count = count + 1;

5.7) if count < count_max, continuing the current stage and returning to step 5.1); otherwise, switching stages, i.e., stage = stage + 1: if stage < 5, returning to step 5); otherwise, the fourth-stage prediction process ends and the prediction result is output.
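Steps 5.3) and 5.4) can be illustrated with a minimal Python sketch. It follows the text as written, including the 1/m increment, and assumes the usual convention that zero entries contribute nothing to the sum (0 · ln 0 = 0), which the text does not state. The function names `transition_matrix` and `shannon_entropy` are illustrative only.

```python
import math


def transition_matrix(state_last, state_now, m):
    """Step 5.3: accumulate t_pq over every individual's move p -> q.

    States are 1-based labels in {1, ..., m}; the 1/m increment follows
    the text as written.
    """
    T = [[0.0] * m for _ in range(m)]
    for p, q in zip(state_last, state_now):
        T[p - 1][q - 1] += 1.0 / m
    return T


def shannon_entropy(T):
    """Step 5.4: Entropy = sum over p, q of -t_pq * ln(t_pq).

    Assumption: zero entries are skipped (0 * ln 0 treated as 0).
    """
    return sum(-t * math.log(t) for row in T for t in row if t > 0.0)


# Toy population: NP = 4 individuals, m = 2 states (illustrative values only).
state_last = [1, 1, 2, 2]
state_now = [1, 2, 2, 2]
T = transition_matrix(state_last, state_now, m=2)
entropy = shannon_entropy(T)
print(T, entropy)
```

In this toy case the individuals that started in state 1 split evenly between the two states, while those in state 2 all stay put, so only the first row of T contributes to the entropy.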

The technical conception of the invention is as follows: first, the Rosetta Abinitio protocol is used to explore the conformational search space, and potential native-state regions are identified by clustering the background points; next, prediction is carried out stage by stage within the framework of a population evolution algorithm, each generation of the population is related to the potential native-state regions, and the evolutionary state of the current population is indicated by classification; then, the state transition matrix between two successive generations of the population is calculated, and the Shannon entropy is used to measure how the population's states are changing; finally, the stage is switched according to the accumulated number of times the Shannon entropy falls below a threshold, and the last generation of the population is taken as the final prediction result.
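The clustering and classification ideas can be sketched in a few lines of Python. Because only pairwise distances are available, this sketch swaps in a simple k-medoids loop for the k-means step named in the method, so each cluster center is itself a background point; that substitution, the choice of the first m points as initial centers, and the 1-D toy "conformations" whose absolute difference stands in for RMSD are all assumptions made for illustration.

```python
def k_medoids(D, m, max_iter=100):
    """Cluster n points into m groups from a precomputed distance matrix D.

    Stand-in for the k-means step: medoids (actual points) act as centers.
    """
    n = len(D)
    medoids = list(range(m))  # arbitrary initial centers: the first m points

    def assign(centers):
        # Each point joins the cluster of its nearest center.
        return [min(range(m), key=lambda k: D[i][centers[k]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for k in range(m):
            members = [i for i in range(n) if labels[i] == k]
            if not members:
                new_medoids.append(medoids[k])  # keep an empty cluster's center
            else:
                # New center: the member with the smallest total distance to the rest.
                new_medoids.append(
                    min(members, key=lambda c: sum(D[c][j] for j in members))
                )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


def classify(dists_to_centers):
    """Return the 1-based index of the nearest cluster center (the state label)."""
    return 1 + min(range(len(dists_to_centers)), key=dists_to_centers.__getitem__)


# Toy background points on a line; |a - b| stands in for RMSD.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
D = [[abs(a - b) for b in points] for a in points]
medoids, labels = k_medoids(D, m=2)

# Classify a new toy conformation by its distance to each center.
state = classify([abs(10.5 - points[c]) for c in medoids])
print(medoids, labels, state)
```

In a real run the entries of D would be RMSD values between superimposed conformations, and the distances passed to `classify` would be RMSDs from an individual to each cluster center.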

The beneficial effects of the invention are as follows: on one hand, potential native-state regions are located by clustering the background points, which narrows the search space and reduces the computational cost; on the other hand, the evolution of the population is measured by the Shannon entropy in order to switch stages, so that the number of iterations in each stage is adjusted dynamically according to the size of the search space, improving prediction accuracy and robustness.

Drawings

FIG. 1 shows the distribution of conformational energy against RMSD from the native state, obtained when the multi-stage protein structure prediction method based on Shannon entropy is applied to protein 1ACF.

FIG. 2 shows the three-dimensional structure obtained when the multi-stage protein structure prediction method based on Shannon entropy is applied to protein 1ACF.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Referring to FIG. 1 and FIG. 2, a multi-stage protein structure prediction method based on Shannon entropy includes the following steps:

2) constructing a Markov state model by the following process:

5.1.1) performing fragment assembly on individual $C_i$ to obtain $C'_i$, and using the energy function of the current stage to evaluate the energies of the conformation before and after fragment assembly, $E_{stage}(C_i)$ and $E_{stage}(C'_i)$;

5.1.2) if $E_{stage}(C_i) > E_{stage}(C'_i)$, accepting the current fragment assembly, $C_i = C'_i$; otherwise, selecting with the Metropolis criterion: calculating $p = \exp(-(E_{stage}(C_i) - E_{stage}(C'_i)))$ and, if $p > rand(0,1)$, accepting the current fragment assembly, $C_i = C'_i$; otherwise, rejecting the current fragment assembly;

5.2) calculating the current population state: classifying each individual $C_i$, $i \in \{1, \ldots, NP\}$, in the population: calculating the RMSD distance between $C_i$ and the m cluster centers; if $C_i$ is nearest to the q-th cluster center, $q \in \{1, 2, \ldots, m\}$, the current state of the individual is $state'_i = q$; the state of the whole population is denoted $state_{now} = \{state'_1, state'_2, \ldots, state'_{NP}\}$, where $state_{now}$ denotes the current population state;

5.5) updating the population state: $state_{last} = state_{now}$;

5.6) if $Entropy < \alpha$, the population state transitions are considered relatively deterministic, and count = count + 1;
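The accept/reject rule of step 5.1.2) can be sketched as follows. The sketch assumes the conventional Metropolis form at unit temperature, in which a move that raises the energy is accepted with probability exp(-(E_new - E_old)); with the exponent exactly as printed in the step, p would never fall below 1 in the "otherwise" branch, so this sign convention is an assumption rather than a reading of the text. The names `metropolis_accept`, `evolve_generation`, `assemble` and `energy` are hypothetical placeholders for the fragment-assembly move and the stage energy function.

```python
import math
import random


def metropolis_accept(e_old, e_new, rng=random):
    """Decide whether to keep a trial conformation (step 5.1.2).

    Assumption: standard Metropolis criterion at unit temperature.
    """
    if e_new < e_old:
        return True  # downhill moves are always accepted
    p = math.exp(-(e_new - e_old))
    return p > rng.random()


def evolve_generation(population, assemble, energy, rng=random):
    """Apply one accept/reject trial to every individual in the population."""
    next_population = []
    for conf in population:
        trial = assemble(conf)
        keep_trial = metropolis_accept(energy(conf), energy(trial), rng)
        next_population.append(trial if keep_trial else conf)
    return next_population


# Toy example: a "conformation" is a number, energy is its square, and the
# placeholder move halves it, so every trial here lowers the energy.
toy_population = [3.0, -2.0]
toy_next = evolve_generation(toy_population, lambda x: x * 0.5, lambda x: x * x)
print(toy_next)
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when comparing stage-switching settings.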

In this embodiment, the α/β sheet protein 1ACF, with a sequence length of 125, is taken as an example; the multi-stage protein structure prediction method based on Shannon entropy includes the following steps:

2) constructing a Markov state model by the following process:

2.1) acquiring nstruct = 1000 background points: running the Rosetta Abinitio protocol nstruct times, and recording the conformation produced by each run as a background point;

2.3) clustering the nstruct background points with the k-means method according to the distance matrix D to obtain m = 8 cluster centers, which serve as the m Markov states;

3) initialization: setting the population size NP = 300, the current stage stage = 1, the Shannon entropy threshold $\alpha = 0.01$, and the maximum Shannon entropy accumulation count count_max = 50; running the current stage of Rosetta Abinitio NP times according to the input sequence to generate the initial conformation population $P = \{C_1, C_2, \ldots, C_{NP}\}$, where $C_{NP}$ denotes the NP-th individual;

4) calculating the current population state: classifying each individual $C_i$, $i \in \{1, \ldots, NP\}$, in the population: calculating the RMSD distance between $C_i$ and the m cluster centers; if $C_i$ is nearest to the p-th cluster center, the current state of the individual is $state_i = p$, $p \in \{1, \ldots, m\}$; the state of the whole population is denoted $state_{last} = \{state_1, state_2, \ldots, state_{NP}\}$, where $state_{last}$ denotes the population state of the previous generation; then stage = stage + 1;

5.1.2) if $E_{stage}(C_i) > E_{stage}(C'_i)$, accepting the current fragment assembly, $C_i = C'_i$; otherwise, selecting with the Metropolis criterion: calculating $p = \exp(-(E_{stage}(C_i) - E_{stage}(C'_i)))$ and, if $p > rand(0,1)$, accepting the current fragment assembly, $C_i = C'_i$; otherwise, rejecting the fragment assembly;

5.4) calculating the Shannon entropy from the state transition matrix T: $Entropy = \sum_{p,q} -t_{pq} \ln t_{pq}$;

5.5) updating the population state: $state_{last} = state_{now}$;

5.7) if count < count_max, continuing the current stage and returning to step 5.1); otherwise, switching stages, i.e., stage = stage + 1: if stage < 5, returning to step 5); otherwise, the fourth-stage prediction process ends and the prediction result is output.
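The stage-switching control flow of steps 4) to 5.7) can be sketched with the embodiment's settings ($\alpha$ = 0.01, count_max = 50) as defaults. The sketch assumes that stage 1 is consumed by initialization, so the loop runs stages 2 to 4; that reading, and the three callables `classify_population`, `evolve` and `entropy_of`, are assumptions introduced for illustration and are not defined in the text.

```python
def run_stages(population, classify_population, evolve, entropy_of,
               alpha=0.01, count_max=50, first_stage=2, last_stage=4):
    """Run each stage until the entropy has been below alpha count_max times.

    Hypothetical callables:
      classify_population(pop) -> list of state labels (steps 4 and 5.2)
      evolve(pop, stage)       -> next-generation population (step 5.1)
      entropy_of(last, now)    -> Shannon entropy of the transitions (5.3-5.4)
    """
    state_last = classify_population(population)          # step 4
    generations = {}
    stage = first_stage
    while stage <= last_stage:                             # step 5.7: stage < 5
        count = 0                                          # step 5
        generations[stage] = 0
        while count < count_max:                           # step 5.7
            population = evolve(population, stage)         # step 5.1
            state_now = classify_population(population)    # step 5.2
            entropy = entropy_of(state_last, state_now)    # steps 5.3-5.4
            state_last = state_now                         # step 5.5
            if entropy < alpha:                            # step 5.6
                count += 1
            generations[stage] += 1
        stage += 1
    return population, generations


# Toy run: the population never changes, so every generation counts as
# low-entropy and each stage lasts exactly count_max generations.
toy_pop, toy_generations = run_stages(
    [1, 2, 2],
    classify_population=lambda pop: list(pop),
    evolve=lambda pop, stage: pop,
    entropy_of=lambda last, now: 0.0 if last == now else 1.0,
    count_max=3,
)
print(toy_generations)
```

As written, a stage ends only once the entropy has dropped below the threshold count_max times, so a practical implementation would likely add a cap on the number of generations per stage; no such cap is described in the text.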

Using the alpha/beta sheet protein 1ACF with a sequence length of 125 as an example, the above method is used to obtain the near-native conformation of the protein, and the minimum root mean square deviation is

The predicted structure is shown in FIG. 2, and the distribution of conformational energy against RMSD from the native state during prediction is shown in FIG. 1.

The above describes the results obtained by the invention with protein 1ACF as an example and is not intended to limit the scope of the invention; various modifications and improvements can be made without departing from the scope of the invention.

Claims

1. A multi-stage protein structure prediction method based on Shannon entropy is characterized in that: the protein structure prediction method comprises the following steps:

2) constructing a Markov state model by the following process:

5.1) executing a prediction process of a corresponding stage on the population, wherein the process is as follows:

5.1.2) if $E_{stage}(C_i) > E_{stage}(C'_i)$, accepting the current fragment assembly, i.e., $C_i = C'_i$; otherwise, selecting with the Metropolis criterion: calculating $p = \exp(-(E_{stage}(C_i) - E_{stage}(C'_i)))$ and, if $p > rand(0,1)$, accepting the current fragment assembly, $C_i = C'_i$; otherwise, rejecting the fragment assembly;

5.3) obtaining the Markov state transition matrix T from the previous-generation population state and the current population state: for each conformation $C_i$, $i \in \{1, \ldots, NP\}$, the two successive states $state_i = p$ and $state'_i = q$ indicate a transition from state p to state q, so $t_{pq} = t_{pq} + 1/m$, where $t_{pq}$, the entry in row p and column q of T, represents the state transition frequency and has an initial value of 0;

5.5) updating the population state: $state_{last} = state_{now}$;