CN110706739A

CN110706739A - Protein conformation space sampling method based on multi-mode internal and external intersection

Info

Publication number: CN110706739A
Application number: CN201910788537.1A
Authority: CN
Inventors: 张贵军; 赵凯龙; 夏瑜豪; 刘俊; 彭春祥; 周晓根
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2019-08-26
Filing date: 2019-08-26
Publication date: 2020-01-17
Anticipated expiration: 2039-08-26
Also published as: CN110706739B

Abstract

A protein conformation space sampling method based on multi-modal intra-modal and extra-modal crossover introduces a crowding strategy and a greedy search strategy. Intra-modal crossover and greedy search help individuals converge toward local extrema, while extra-modal crossover and crowding help preserve population diversity. After the population is initialized, and following repeated fragment assembly, the individuals are divided into several modal sets. When intra-modal crossover is performed, the crowding strategy is applied to sample conformations; when extra-modal crossover is performed, the greedy search strategy is applied. Alternating the two strategies in this way better resolves the conflict between fast convergence and diversity preservation.

Description

Protein conformation space sampling method based on multi-mode internal and external intersection

Technical Field

The invention relates to the fields of bioinformatics and computer application, in particular to a protein conformation space sampling method based on multi-mode internal and external crossing.

Background

Protein structure prediction is an important part of functional genomics research and one of the most pressing open problems in bioinformatics. Proteins are essential to life: enzymes catalyze biochemical reactions, carrier proteins transport oxygen and nutrients, and antibody proteins recognize signals or participate in immune reactions. Function and structure are closely linked, because an amino acid sequence can perform its specific biological function only once it adopts a particular spatial structure. Research on protein structure prediction attempts to decipher this "second genetic code", that is, to find the relationship between amino acid sequence and protein structure. Beyond its theoretical significance for biology, protein structure prediction has important practical applications. The amino acid sequence alone is not enough to study the function of a protein or to uncover a pathogenic mechanism; the spatial structure must also be known. Drug design, for example, starts from the spatial structure of a protein: once the structure is known, molecular docking algorithms and computational techniques are used to design inhibitory molecules as candidate drugs that suppress the activity of certain enzymes or proteins.

In protein structure prediction, the force field model chosen is both complex and inaccurate, so the globally stable structure predicted by an algorithm may not match the experimentally measured structure of the target well. A multi-modal optimization algorithm is therefore needed to provide other high-quality locally stable structures of the protein.

Many practical optimization problems are multi-modal function optimization problems, for which several solutions usually have to be found, including the global optimum and local optima.

Therefore, it is difficult for the current protein structure prediction methods to effectively balance conformational diversity and convergence rate, and improvements are needed.

Disclosure of Invention

To overcome the inability of existing protein structure prediction methods to balance conformational diversity against convergence rate, the invention provides an algorithm based on intra-modal crossover and extra-modal crossover, and introduces a crowding strategy and a greedy search strategy. Intra-modal crossover and greedy search help individuals converge toward local extrema, while extra-modal crossover and crowding help preserve population diversity. After the population is initialized, and following repeated fragment assembly, it is divided into several modal sets. When intra-modal crossover is performed, the crowding strategy is applied to sample conformations; when extra-modal crossover is performed, the greedy search strategy is applied. Alternating the two strategies better resolves the conflict between fast convergence and diversity preservation.
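As a minimal sketch of this alternation, the Python snippet below lists which crossover mode and replacement strategy would run at each stage-two generation. The function name `crossover_schedule` is hypothetical; the odd/even rule is the one given in step 9) of the procedure.

```python
def crossover_schedule(G2):
    """Return (crossover mode, replacement strategy) for generations g2 = 1..G2."""
    schedule = []
    for g2 in range(1, G2 + 1):
        if g2 % 2 == 1:
            # odd generation: intra-modal crossover with crowding replacement
            schedule.append(("intra", "crowding"))
        else:
            # even generation: extra-modal crossover with greedy replacement
            schedule.append(("extra", "greedy"))
    return schedule
```

With an even number of stage-two generations, the two pairings are used equally often.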

The technical scheme adopted by the invention for solving the technical problems is as follows:

a method for spatial sampling of protein conformation based on multi-modal internal-external crossing, the method comprising the steps of:

1) inputting sequence information of a predicted protein, and reading the sequence length L;

2) setting parameters: population size N, numbers of iterations G₁ and G₂, crossover probability P_c, mutation factor F, and number of modes H;

3) initializing the population: iterating the first and second stages of Rosetta to generate an initial population P = {P₁, P₂, ..., P_N} of N individuals;

4) let g₁ = 1, g₁ ∈ {1, 2, ..., G₁};

5) performing mutation, crossover and selection operations on the population, as follows:

5.1) mutation operation: for each target individual P_i, randomly selecting from the population two mutually different individuals that also differ from P_i, denoted P_rand1 and P_rand2; the mutant individual P_i′ is generated as follows:

P_i′＝P_i+F(P_rand1-P_rand2)

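5.2) crossover operation: generating two random numbers r and j_rand, with r ∈ [0,1] and j_rand ∈ {1, 2, ..., L}; the crossover individual P_i″ is generated as follows:

$$P''_{i,j} = \begin{cases} P'_{i,j}, & \text{if } r \le P_c \text{ or } j = j_{rand} \\ P_{i,j}, & \text{otherwise} \end{cases}$$

where j ∈ {1, 2, ..., L} denotes the amino acid residue index;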

5.3) repeating steps 5.1) and 5.2) until all target individuals have been traversed, generating the offspring population P′;

5.4) selection operation: scoring the individuals in populations P and P′ with the energy function, sorting all individuals by energy from low to high, and selecting the N lowest-energy individuals to replace the original individuals in population P;

6) g₁ = g₁ + 1; if g₁ ≤ G₁, go to step 5);

7) generating the modes as follows:

7.1) calculating the similarity between every pair of individuals in the population P according to the following formula:

$$RMSE(P_i, P_j) = \sqrt{\frac{1}{L}\sum_{k=1}^{L}\left\lVert X_k^{i} - X_k^{j} \right\rVert^{2}}$$

where X_k^i and X_k^j respectively denote the three-dimensional coordinates of the k-th C_α atom in individuals P_i and P_j, L is the sequence length of the structure, and a smaller RMSE means the two individuals are more similar;

7.2) taking the similarity score between two individuals as the distance between them, clustering the population into H classes with a K-center clustering algorithm, and denoting the class center points as C_h, h ∈ {1, 2, ..., H};

8) let g₂ = 1, g₂ ∈ {1, 2, ..., G₂};

9) if g₂ is odd, executing step 9.1) to perform intra-modal crossover; otherwise, executing step 9.2) to perform extra-modal crossover:

9.1) intra-modal crossover based on the crowding strategy, as follows:

9.1.1) for each target individual P_i, randomly selecting two mutually different individuals P_rand1 and P_rand2 from the same class as P_i; generating two different random integers r₁ and r₂ in [1, L-2]; replacing the dihedral angles of residues r₁ to r₁+2 and of residues r₂ to r₂+2 of P_i with the dihedral angles of the corresponding residues of P_rand1 and P_rand2, respectively, to produce the crossover individual P_i*;

9.1.2) calculating the similarity between P_i* and every class center point C_h, and finding the class whose center point is nearest to P_i*; finding the individual in that class that is most similar to P_i*; calculating the energy of that individual and of P_i* with the Rosetta score3 energy function; if P_i* has the lower energy, replacing that individual with P_i* and updating the center point of the class;

9.2) extra-modal crossover based on the greedy search strategy, as follows:

9.2.1) for each target individual P_i, randomly selecting two mutually different individuals P_rand1 and P_rand2 from classes other than that of P_i; generating two different random integers r₁ and r₂ in [1, L-2]; replacing the dihedral angles of residues r₁ to r₁+2 and of residues r₂ to r₂+2 of P_i with the dihedral angles of the corresponding residues of P_rand1 and P_rand2, respectively, to produce the crossover individual P_i**;

9.2.2) calculating the similarity between the offspring individual P_i** and every class center point C_h, finding the class whose center point is nearest to P_i**, finding the highest-energy individual in that class, replacing it with P_i**, and updating the center point of the class;

10) g₂ = g₂ + 1; if g₂ ≤ G₂, go to step 9);

11) dividing all individuals again in the manner of step 7), clustering them into H classes, and outputting the center point of each class as the final prediction result.
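Steps 5.1) to 5.4) follow the pattern of one generation of differential evolution. The sketch below illustrates such a generation in plain Python under several assumptions that the text does not state: each individual is encoded as a list of (phi, psi) backbone dihedral pairs in degrees, angles are wrapped into [-180, 180), the crossover draws a fresh random number for every residue (the usual binomial rule), and `energy` is a placeholder callable standing in for the energy function. The names `wrap` and `de_generation` are hypothetical, and this is an illustration rather than the inventors' implementation.

```python
import random

def wrap(angle):
    """Wrap an angle in degrees into [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def de_generation(pop, energy, F=0.5, Pc=0.1, rng=random):
    """One pass of steps 5.1)-5.4).

    pop    : list of N individuals, each a list of L (phi, psi) tuples in degrees
             (a hypothetical encoding chosen for this sketch)
    energy : callable scoring one individual; lower is better
    """
    N, L = len(pop), len(pop[0])
    offspring = []
    for i, target in enumerate(pop):
        # 5.1) mutation: two distinct partners, both different from the target
        r1, r2 = rng.sample([k for k in range(N) if k != i], 2)
        mutant = [
            tuple(wrap(t + F * (a - b))
                  for t, a, b in zip(target[j], pop[r1][j], pop[r2][j]))
            for j in range(L)
        ]
        # 5.2) crossover (assumed binomial rule): residue j comes from the mutant
        # when a uniform draw is <= Pc or when j equals the forced index j_rand
        j_rand = rng.randrange(L)
        trial = [
            mutant[j] if (rng.random() <= Pc or j == j_rand) else target[j]
            for j in range(L)
        ]
        offspring.append(trial)
    # 5.4) selection: pool parents and offspring, keep the N lowest-energy ones
    pool = sorted(pop + offspring, key=energy)
    return pool[:N]
```

Because parents and offspring are pooled before truncation, the total energy of the surviving population can never exceed that of the parent population.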

The beneficial effects of the invention are as follows: by alternating intra-modal crossover and extra-modal crossover, together with the crowding strategy and the greedy search strategy, the global search capability and the local convergence speed of the algorithm are improved; convergence is accelerated, the diversity of the population is maintained, and prediction accuracy is improved.
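To make the difference between the two replacement rules concrete, the sketch below illustrates the dihedral window crossover of steps 9.1.1) and 9.2.1) and the replacements of steps 9.1.2) and 9.2.2). It rests on assumptions not fixed by the text: individuals are lists of (phi, psi) tuples, `dist` and `energy` are placeholder callables (the text uses C_α RMSE and the Rosetta score3 energy), the class center is updated by taking the medoid, and the greedy replacement is unconditional, as the step is worded. All function names are hypothetical.

```python
import random

def window_crossover(target, donor1, donor2, rng=random):
    """Copy two 3-residue dihedral windows from the donors into a copy of target.

    Residues are numbered 1..L as in the text, so residue r is list index r - 1.
    """
    L = len(target)
    r1, r2 = rng.sample(range(1, L - 1), 2)        # two different integers in [1, L-2]
    child = list(target)
    child[r1 - 1:r1 + 2] = donor1[r1 - 1:r1 + 2]   # residues r1 .. r1+2 from donor 1
    child[r2 - 1:r2 + 2] = donor2[r2 - 1:r2 + 2]   # residues r2 .. r2+2 from donor 2
    return child

def medoid(members, dist):
    """Member with the smallest total distance to the others (assumed centre update)."""
    return min(members, key=lambda a: sum(dist(a, b) for b in members))

def nearest_class(child, centers, dist):
    """Index of the class whose centre is closest to the child."""
    return min(range(len(centers)), key=lambda h: dist(child, centers[h]))

def crowding_replace(child, classes, centers, dist, energy):
    """Crowding: the child competes with its most similar member of the nearest class."""
    h = nearest_class(child, centers, dist)
    k = min(range(len(classes[h])), key=lambda m: dist(child, classes[h][m]))
    if energy(child) < energy(classes[h][k]):
        classes[h][k] = child
        centers[h] = medoid(classes[h], dist)
    return h

def greedy_replace(child, classes, centers, dist, energy):
    """Greedy: the child takes the place of the highest-energy member of the nearest class."""
    h = nearest_class(child, centers, dist)
    k = max(range(len(classes[h])), key=lambda m: energy(classes[h][m]))
    classes[h][k] = child
    centers[h] = medoid(classes[h], dist)
    return h
```

In this sketch the crowding rule only ever displaces the closest neighbour, so distinct regions of a class are left alone, whereas the greedy rule always removes the worst-scoring member of the class.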

Drawings

FIG. 1 is a three-dimensional structure diagram of protein 1C8C obtained by structure prediction based on a multi-modal internal-external crossing protein conformation space sampling method.

FIG. 2 is a flow chart of a protein conformation space sampling method based on multi-modal internal and external crossing.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Referring to FIG. 1 and FIG. 2, a protein conformation space sampling method based on multi-modal intra-modal and extra-modal crossover comprises steps 1) to 11) as set out in the Disclosure of Invention above.
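The mode generation of step 7) can be illustrated with a short sketch. It computes the RMSE between matching C_α coordinates without any prior superposition, and clusters with a greedy farthest-first heuristic for the k-center problem. The text names only a "K-center clustering algorithm" without specifying a variant, so that choice, like the function names `rmse` and `k_center`, is an assumption.

```python
import math

def rmse(a, b):
    """Root-mean-square distance between matching C-alpha atoms of two structures.

    a, b : lists of L (x, y, z) tuples; no superposition is performed here.
    """
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))

def k_center(structures, H):
    """Greedy farthest-first clustering into H classes.

    Returns (centers, labels): the indices of the H centre structures and, for
    each structure, the position in `centers` of its nearest centre.
    """
    centers = [0]                                    # arbitrary first centre
    nearest = [rmse(s, structures[0]) for s in structures]
    while len(centers) < H:
        # the structure farthest from all current centres becomes a new centre
        far = max(range(len(structures)), key=nearest.__getitem__)
        centers.append(far)
        nearest = [min(d, rmse(s, structures[far]))
                   for d, s in zip(nearest, structures)]
    labels = [min(range(H), key=lambda h: rmse(s, structures[centers[h]]))
              for s in structures]
    return centers, labels
```

The farthest-first rule tends to spread the initial class centers across the population, which suits the stated goal of keeping several distinct modes.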

In this embodiment, taking the protein 1C8C with the sequence length of 64 as an example, a method for spatial sampling of protein conformation based on multi-modal inside-outside crossing comprises the following steps:

1) inputting sequence information of the predicted protein 1C8C, and reading the sequence length L = 64;

2) setting parameters: population size N = 200, numbers of iterations G₁ = 100 and G₂ = 300, crossover probability P_c = 0.1, mutation factor F = 0.5, and number of modes H = 6;

3) to 11) carrying out steps 3) to 11) as set out in the Disclosure of Invention above with these parameter values.
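For reference, the parameter values of this embodiment can be gathered in one place. The sketch below is only a convenience container: the class name `SamplingParams` and the method `trial_counts` are hypothetical, and the offspring count assumes that every generation visits all N target individuals once.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SamplingParams:
    L: int = 64       # sequence length of 1C8C
    N: int = 200      # population size
    G1: int = 100     # stage-one generations
    G2: int = 300     # stage-two generations
    Pc: float = 0.1   # crossover probability
    F: float = 0.5    # mutation factor
    H: int = 6        # number of modes

    def trial_counts(self):
        """Offspring generated in stage one and stage two (assumes N per generation)."""
        return self.G1 * self.N, self.G2 * self.N

params = SamplingParams()
stage_one, stage_two = params.trial_counts()
```

Under that assumption the embodiment generates 20,000 offspring in stage one and 60,000 in stage two.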

Using the protein 1C8C with an amino acid sequence length of 64 as an example, the method is used to obtain the near-natural state individuals of the protein in six modes, and the predicted root mean square deviation of the protein is respectively

The prediction structure is shown in fig. 1.

While the foregoing describes preferred embodiments of the invention, the invention is evidently not limited to these embodiments and can be practiced with modifications without departing from its essential spirit.

Claims

1. A protein conformation space sampling method based on multi-modal intra-modal and extra-modal crossover, characterized in that the method comprises steps 1) to 11) as set out in the Disclosure of Invention above.