LU502739B1

LU502739B1 - A Prediction Method of Interaction Between Multi-Information and Residue Binding Energy Protein

Info

Publication number: LU502739B1
Application number: LU502739A
Authority: LU
Inventors: Fei Guo
Original assignee: Univ Central South
Priority date: 2022-08-31
Filing date: 2022-08-31
Publication date: 2024-02-29

Abstract

The invention relates to bioinformatics and provides a method for predicting protein-protein interactions from multivariate mutual information and residue binding energy. The method strengthens the contribution of useful information in the amino acid sequence to the prediction and effectively reduces the influence of useless noise information. Step (1): Classify the amino acids; Step (2): Define the feature representation; Step (3): Establish a feature frequency table; Step (4): Calculate the 2-tuple mutual information features; Step (5): Calculate the 3-tuple mutual information features; Step (6): Calculate the physical and chemical properties of the amino acids; Step (7): Calculate the amino acid contact matrix AAC; Step (8): Extract features from the amino acid sequence; Step (9): Carry out singular value decomposition; Step (10): Obtain the interaction between the two proteins. The method is mainly used to predict interactions between proteins.

Description

A Prediction Method of Interaction Between Multi-Information and Residue Binding Energy Protein

FIELD OF THE INVENTION

[0001] The invention relates to bioinformatics technology, belongs to the field of macromolecular structure prediction algorithms in proteomics, and particularly relates to a method for predicting protein interactions based on multivariate mutual information and residue binding energy.

BACKGROUND OF THE RELATED ART

[0002] Protein-protein interaction is at the core of many biological processes. Identifying the interactions between proteins is very important for clarifying protein function and for identifying biological processes in cells. Interaction information can help people better understand the pathogenesis of diseases, so that drugs can be designed more efficiently and accurately. In the past few years, a large number of computational techniques have reached the stage of large-scale analysis. Generally speaking, there are three main computational approaches to detecting interactions between proteins: methods based on evolutionary information, methods based on natural language processing, and methods based on amino acid sequence characteristics. Methods based on evolutionary information extract evolutionary information from multiple sequence alignments of homologous proteins and construct an evolutionary tree to analyze the relationship between protein functions. This approach needs a large amount of homologous protein data together with interaction labels for those proteins, so it is severely limited in large-scale computation. Methods based on natural language processing rely on widely used natural language processing techniques to mine useful information from the large number of known protein interactions recorded in the biological and medical literature. Because some information is missing from the literature, the prediction results may be incomplete. It is therefore particularly important to improve the prediction accuracy of protein interactions, and to make large-scale application practical, by using a multivariate mutual information feature extraction method based on the amino acid sequence together with a residue binding energy feature extraction method.

[0003] Feature extraction is the key technology of protein interaction prediction methods based on amino acid sequence information. It refers to defining a series of mapping functions through which a segment of the amino acid sequence of a protein is mapped into a list of feature values that represent the sequence. These values should include the useful features of the protein as comprehensively as possible and, at the same time, exclude noise information that would adversely affect the prediction results. Classical amino acid sequence feature extraction methods include autocovariance, conjoint triad, local protein sequence descriptors, multi-scale local feature descriptors, local phase quantization descriptors and matrix-based protein sequence representation. These methods represent amino acid sequences abstractly from different aspects, and their prediction results differ considerably. Therefore, how to design an effective feature extraction method that maps amino acid sequences abstractly, improves the distinguishability between sequences and reduces the interference of noise information on the prediction results has become the key problem of protein interaction prediction.

SUMMARY OF THE INVENTION

[0004] In order to overcome the shortcomings of the prior art, the invention aims to provide a method for predicting protein interactions based on multivariate mutual information and residue binding energy. The feature extraction functions used strengthen the contribution of useful information in the amino acid sequence to the prediction and effectively reduce the influence of useless noise information.

The protein interaction prediction method based on multivariate mutual information and residue binding energy includes the following steps:

[0005] Step (1): Amino acid category grouping: the 20 standard amino acids are divided into n functional groups according to their dipole polarity and volume, the n functional groups are respectively marked as C0, C1, C2, ..., Cn, and the original amino acid sequence is converted into a group category sequence according to the functional group category of each amino acid;

[0006] Step (2): Define the different types of 3-tuple and 2-tuple feature representations. The 3-tuple feature representations are "C0C0C0", "C0C0C1", ..., "CnCnCn"; the 2-tuple feature representations are "C0C0", "C0C1", ..., "CnCn".

[0007] Step (3): Count the number of 3-tuple features and 2-tuple features in the group category sequence, establish the feature frequency table, and calculate the frequency of n categories in the sequence using the frequency calculation function f (a) = (na + 1) / (L + 1);

[0008] Step (4): Calculate the mutual information features of the 2-tuples. The calculation formula is:

[0009] $$I(ab) = f(ab)\,\ln\frac{f(ab)}{f(a)\,f(b)}$$

[0010] Where f(ab) is the frequency of simultaneous occurrence of category ab in a binary group;

[0011] Step (5): Calculate the 3-tuple mutual information features. The calculation formula is:

[0012]-[0013] $$I(abc) = I(ab) + f(a|c)\,\ln f(a|c) - f(a|bc)\,\ln f(a|bc)$$

[0014] Where f(a|c) is the frequency of class a in all the 2-tuples containing class c, and f(a|bc) is the frequency of class a in all the 3-tuples containing classes b and c;

[0015] Get the first part mutual information feature value through the above five steps;

[0016] Step (6): Calculate the physical and chemical properties of amino acids;

[0017] Step (7): Through statistical analysis of a protein complex database, the amino acid contact matrix AAC is calculated from the residue pairing frequency:

[0018] [formula for AAC(i, j) illegible in the source text]

[0019] In which i and j represent two amino acids and $N_{i,j} = \sum n_{ij}$ is the contact number of i and j.

[0020] Calculate the substitution matrix SMR, with $\mathrm{SMR}_{i,l} = \mathrm{AAC}(i, A_l)$, where i = 1, ..., 20 is one of the twenty amino acid types, l = 1, ..., L is one of the L positions in the given protein sequence, and $A_l$ is the amino acid type at position l. Through this step, a 20 × L substitution matrix SMR is obtained;

[0021] Step (8): Use the histogram of oriented gradients (HOG) feature extraction algorithm to extract the features of the amino acid sequences;

[0022] Step (9): Singular value decomposition is applied to the transpose of the SMR matrix, through which 20 right singular vectors are obtained;

[0023] Step (10): Input the feature values obtained from steps (1) to (9) into a random forest model for prediction to obtain the interaction between the two proteins.

[0024] Step (6) Specific calculation steps are as follows:

[0025] Step (6.1): Calculate the Moreau-Broto autocorrelation feature values. The calculation formula is:

[0026] $$\mathrm{NMBA}(lag, p) = \frac{1}{L - lag} \sum_{l=1}^{L-lag} \left( X_{l,p} \times X_{l+lag,p} \right)$$

[0027] Where lag is the distance between residues, p is the p-th physical and chemical property of the above-mentioned natural amino acids, $X_{l,p}$ is the value of property p for the residue at position l, l is the position in the sequence, l = 1, 2, ..., L-lag, and lag = 1, 2, ..., lg. After being expressed by six physical and chemical properties, lg × 6 feature values are obtained.

[0028] Step (6.2): The obtained lg × 6 feature values are normalized;

[0029] Step (6.3): Count the frequency of 20 amino acids in the sequence.

[0030] Step (8) The specific calculation process is as follows:

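[0031] Step (8.1): Calculate the gradient values $G_h(i, l)$ and $G_v(i, l)$ in the horizontal and vertical directions. The calculation formulas are:

[0032] $$G_h(i,l) = \begin{cases} \mathrm{SMR}(i+1,l) - 0, & i = 1 \\ \mathrm{SMR}(i+1,l) - \mathrm{SMR}(i-1,l), & 1 < i < 20 \\ 0 - \mathrm{SMR}(i-1,l), & i = 20 \end{cases}$$

[0033] $$G_v(i,l) = \begin{cases} \mathrm{SMR}(i,l+1) - 0, & l = 1 \\ \mathrm{SMR}(i,l+1) - \mathrm{SMR}(i,l-1), & 1 < l < L \\ 0 - \mathrm{SMR}(i,l-1), & l = L \end{cases}$$

[0034] Step (8.2): Calculate the gradient amplitude: $$G(i,l) = \sqrt{G_h(i,l)^2 + G_v(i,l)^2}$$

[0035] Step (8.3): Calculate the gradient direction $\theta(i,l)$ as the inverse tangent of the ratio of the two gradient components;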

[0036] Step (8.4): Divide the gradient amplitude matrix and the gradient direction matrix into nine sub-matrices with the same size;

[0037] Step (8.5): The histogram of each gradient direction is counted, and the histogram size of each gradient direction is taken as a characteristic value.

[0038] Through the above steps, each sequence gets x eigenvalues, and the two sequences get 2x eigenvalues in total.

[0039] The invention has the characteristics and beneficial effects that:

[0040] The invention integrates the multivariate mutual information of the amino acid sequence with residue binding energy information. Compared with traditional sequence information, multivariate mutual information considers not only each amino acid together with its two adjacent amino acids, but also the mutual information among the components of each tuple. At the same time, the histogram of oriented gradients and singular value decomposition extract texture features from the protein matrix. These additional features help to predict protein interactions accurately, so when the method is used to analyze and predict protein-protein interactions, the accuracy of the prediction results is better than that of other existing methods. The method can not only accurately predict interactions between proteins, but also discover new interactions in protein interaction networks, which is of great significance for improving various protein interaction networks.

BRIEF DESCRIPTION OF THE DRAWINGS:

[0041] Figure 1 is a flow chart of the calculation process of the present invention;

[0042] Figure 2 shows the feature representation of 2-tuples and 3-tuples and the establishment of the frequency table;

[0043] Figure 3 shows the accuracy of the Moreau-Broto autocorrelation feature for different lg values.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0044] The invention is characterized in that it sequentially comprises the following steps:

[0045] Step (1): Classification of amino acids. The 20 standard amino acids are divided into 7 functional groups according to their dipole polarity and volume. These seven functional groups are respectively named C0, C1, C2, ..., C6. The original amino acid sequence is converted into a group category sequence according to the functional group category of each amino acid.

[0046] Step (2): Define the different types of 3-tuple and 2-tuple feature representations. The 3-tuple features are expressed as "C0C0C0", "C0C0C1", ..., "C6C6C6". The 2-tuple features are expressed as "C0C0", "C0C1", ..., "C6C6".
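The grouping and tuple definitions of steps (1) and (2) can be sketched as follows. The text does not list which amino acids belong to each of the seven groups; the assignment below is the seven-class grouping commonly used for conjoint-triad-style features and is an assumption here, as are all function and variable names. Tuples are treated as unordered, which is what gives 28 2-tuples and 84 3-tuples for seven categories.

```python
from itertools import combinations_with_replacement

# Assumed grouping (not given in the text): seven classes C0..C6.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: idx for idx, members in enumerate(GROUPS) for aa in members}


def to_group_sequence(sequence):
    """Map each residue to the index of its functional group (0..6)."""
    return [AA_TO_GROUP[aa] for aa in sequence.upper()]


# Unordered tuples over the 7 categories, e.g. (0, 0), (0, 1), ..., (6, 6).
TWO_TUPLES = list(combinations_with_replacement(range(len(GROUPS)), 2))
THREE_TUPLES = list(combinations_with_replacement(range(len(GROUPS)), 3))

if __name__ == "__main__":
    print(to_group_sequence("MKRDC"))          # group indices of a toy peptide
    print(len(TWO_TUPLES), len(THREE_TUPLES))  # number of 2-tuples and 3-tuples
```

Using index tuples instead of strings such as "C0C0C1" keeps the later frequency tables easy to look up; the string form can be recovered with `"".join(f"C{i}" for i in t)`.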

[0047] Step (3): Count the number of 3-tuple features and 2-tuple features in the group category sequence, and establish the feature frequency table, as shown in Figure 2. Use the frequency calculation function f(a) = (n_a + 1)/(L + 1) to calculate the frequency of each of the seven categories in the sequence, where n_a is the number of occurrences of category a and L is the length of the sequence.
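The smoothed frequency of step (3) and the tuple counts behind the frequency table can be sketched as follows. Reading tuples from consecutive positions and treating them as unordered are assumptions, and the function names are illustrative only.

```python
from collections import Counter


def category_frequencies(group_seq, n_categories=7):
    """Smoothed frequency f(a) = (n_a + 1) / (L + 1) for each category a."""
    counts = Counter(group_seq)
    length = len(group_seq)
    return [(counts[a] + 1) / (length + 1) for a in range(n_categories)]


def tuple_counts(group_seq, size):
    """Count unordered tuples formed by `size` consecutive categories."""
    windows = (tuple(sorted(group_seq[i:i + size]))
               for i in range(len(group_seq) - size + 1))
    return Counter(windows)


if __name__ == "__main__":
    seq = [2, 4, 4, 5, 6]                 # a toy group category sequence
    print(category_frequencies(seq))
    print(tuple_counts(seq, 2))
    print(tuple_counts(seq, 3))
```

The `+ 1` in the numerator keeps every category frequency strictly positive, so the logarithms in the later mutual information steps are always defined.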

[0048] Step (4): Calculate the 28 2-tuple mutual information features. The calculation formula is:

[0049] $$I(ab) = f(ab)\,\ln\frac{f(ab)}{f(a)\,f(b)}$$

[0050] Where f(ab) is the frequency of occurrence of the binary group ab.
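As a minimal sketch of step (4), assuming the 2-tuple mutual information has the form I(ab) = f(ab) ln( f(ab) / (f(a) f(b)) ) and that all frequencies passed in are strictly positive (for example because the same +1 smoothing is applied to tuple counts, which the text does not state):

```python
import math


def mutual_information_2(f_ab, f_a, f_b):
    """I(ab) = f(ab) * ln( f(ab) / (f(a) * f(b)) ); inputs must be > 0."""
    return f_ab * math.log(f_ab / (f_a * f_b))


if __name__ == "__main__":
    # When f(ab) equals f(a) * f(b), the two categories look independent
    # and the feature is zero.
    print(mutual_information_2(0.25, 0.5, 0.5))
    # Co-occurrence above the independence baseline gives a positive value.
    print(mutual_information_2(0.5, 0.5, 0.5))
```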

[0051] Step (5): Calculate the 84 3-tuple mutual information features. The calculation formula is:

[0052]-[0053] $$I(abc) = I(ab) + f(a|c)\,\ln f(a|c) - f(a|bc)\,\ln f(a|bc)$$

[0054] Where f(a|c) is the frequency of class a in all the 2-tuples containing class c, and f(a|bc) is the frequency of class a in all the 3-tuples containing classes b and c.
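The 3-tuple formula of step (5) can be sketched in the same way. The helper `conditional` estimates a conditional frequency as a ratio of joint frequencies, e.g. f(a|c) = f(ac) / f(c); that estimator and both function names are assumptions for illustration, not something the text specifies.

```python
import math


def conditional(f_joint, f_condition):
    """Assumed estimate of a conditional frequency, e.g. f(a|c) = f(ac) / f(c)."""
    return f_joint / f_condition


def mutual_information_3(i_ab, f_a_given_c, f_a_given_bc):
    """I(abc) = I(ab) + f(a|c) ln f(a|c) - f(a|bc) ln f(a|bc)."""
    return (i_ab
            + f_a_given_c * math.log(f_a_given_c)
            - f_a_given_bc * math.log(f_a_given_bc))


if __name__ == "__main__":
    f_a_given_c = conditional(0.2, 0.4)      # toy frequencies
    f_a_given_bc = conditional(0.05, 0.2)
    print(mutual_information_3(0.1, f_a_given_c, f_a_given_bc))
```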

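[0055] Through the above five steps, 238 mutual information feature values can be obtained: 7 category frequencies, 28 2-tuple features and 84 3-tuple features give 119 values per sequence, and 238 for the pair of sequences.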

[0056] Step (6): Calculate the physical and chemical properties of amino acids. Each amino acid sequence can get 200 eigenvalues, and a pair of amino acid sequences to predict the interaction can get 400 eigenvalues. The specific calculation method is as follows:

[0057] Step (6.1): Calculate the Moreau-Broto autocorrelation feature values. The calculation formula is:

[0058] $$\mathrm{NMBA}(lag, p) = \frac{1}{L - lag} \sum_{l=1}^{L-lag} \left( X_{l,p} \times X_{l+lag,p} \right)$$

[0059] Where lag is the distance between residues, p is the p-th physical and chemical property of the above-mentioned natural amino acids, $X_{l,p}$ is the value of property p for the residue at position l, l is the position in the sequence, l = 1, 2, ..., L-lag, and lag = 1, 2, ..., lg, where lg generally takes the value of 30. After being expressed by six physical and chemical properties, 30 × 6 = 180 feature values are obtained.

[0060] Step (6.2): The obtained 180 eigenvalues are normalized.

[0061] Step (6.3): Count the frequency of 20 amino acids in the sequence.
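Steps (6.1) and (6.3) can be sketched as below. The six physicochemical scales are not listed in the text, so `PLACEHOLDER_PROPS` contains made-up numbers whose only purpose is to make the sketch runnable; the normalization of step (6.2) is omitted because its method is not specified. The sequence must be longer than lg, otherwise the sum for the largest lag is empty.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def nmba(values, lag):
    """Normalized Moreau-Broto autocorrelation for one property at one lag.

    values[l] is the property value of the residue at position l.
    """
    n_terms = len(values) - lag
    return sum(values[l] * values[l + lag] for l in range(n_terms)) / n_terms


def autocorrelation_features(sequence, property_table, lg=30):
    """Return lg values per property; property_table maps residue -> values."""
    n_props = len(next(iter(property_table.values())))
    features = []
    for p in range(n_props):
        values = [property_table[aa][p] for aa in sequence]
        features.extend(nmba(values, lag) for lag in range(1, lg + 1))
    return features


def composition(sequence):
    """Frequency of each of the 20 amino acids in the sequence."""
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]


# Placeholder property table: NOT real physicochemical scales. Replace it
# with six measured properties per amino acid before any real use.
PLACEHOLDER_PROPS = {aa: [((i + 1) * (p + 2)) % 11 / 10.0 for p in range(6)]
                     for i, aa in enumerate(AMINO_ACIDS)}

if __name__ == "__main__":
    seq = "MKRDCAGVILFPYTSHNQWE" * 2     # toy sequence, length 40 > lg
    feats = autocorrelation_features(seq, PLACEHOLDER_PROPS) + composition(seq)
    print(len(feats))                     # 30 * 6 autocorrelation + 20 composition
```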

[0062] Step (7): Through statistical analysis of protein complex database, amino acid contact matrix AAC was calculated by using residue pairing frequency:

[0063] [formula for AAC(i, j) illegible in the source text]

[0064] In which i and j represent two kinds of amino acids and $N_{i,j} = \sum n_{ij}$ is the contact number of i and j.

[0065] Calculate the substitution matrix SMR, with $\mathrm{SMR}_{i,l} = \mathrm{AAC}(i, A_l)$, where i = 1, ..., 20 is one of the twenty amino acid types, l = 1, ..., L is one of the L positions in the given protein sequence, and $A_l$ is the amino acid type at position l. Through this step, a 20 × L substitution matrix SMR is obtained;
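Building the 20 × L substitution matrix from a given contact matrix is a column lookup, sketched below. The ordering of amino acids in `AMINO_ACIDS` and the identity matrix used as `PLACEHOLDER_AAC` are assumptions made only so the example runs; the real AAC values come from the protein complex statistics of step (7).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_smr(sequence, aac):
    """Return the 20 x L matrix with SMR[i][l] = AAC[i][index of A_l].

    aac is a 20 x 20 contact matrix indexed in AMINO_ACIDS order.
    """
    column_index = [AMINO_ACIDS.index(aa) for aa in sequence]
    return [[aac[i][j] for j in column_index] for i in range(len(AMINO_ACIDS))]


# Placeholder contact matrix (identity), not real contact statistics.
PLACEHOLDER_AAC = [[1.0 if i == j else 0.0 for j in range(20)] for i in range(20)]

if __name__ == "__main__":
    smr = build_smr("ACA", PLACEHOLDER_AAC)
    print(len(smr), len(smr[0]))   # 20 rows, one column per sequence position
```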

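[0066] Step (8): The histogram of oriented gradients (HOG) feature extraction algorithm is used to extract the features of amino acid sequences. The specific calculation process is as follows:

[0067] Step (8.1): Calculate the gradient values $G_h(i, l)$ and $G_v(i, l)$ in the horizontal and vertical directions. The calculation formulas are:

[0068] $$G_h(i,l) = \begin{cases} \mathrm{SMR}(i+1,l) - 0, & i = 1 \\ \mathrm{SMR}(i+1,l) - \mathrm{SMR}(i-1,l), & 1 < i < 20 \\ 0 - \mathrm{SMR}(i-1,l), & i = 20 \end{cases}$$

[0069] $$G_v(i,l) = \begin{cases} \mathrm{SMR}(i,l+1) - 0, & l = 1 \\ \mathrm{SMR}(i,l+1) - \mathrm{SMR}(i,l-1), & 1 < l < L \\ 0 - \mathrm{SMR}(i,l-1), & l = L \end{cases}$$

[0070] Step (8.2): Calculate the gradient amplitude: $$G(i,l) = \sqrt{G_h(i,l)^2 + G_v(i,l)^2}$$

[0071] Step (8.3): Calculate the gradient direction $\theta(i,l)$ as the inverse tangent of the ratio of the two gradient components.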

[0072] Step (8.4): The gradient amplitude matrix and gradient direction matrix are divided into nine sub-matrices with the same size.

[0073] Step (8.5): Count the histogram of gradient directions. The histogram value for each gradient direction is taken as a feature value.

[0074] Through the above steps, 81 feature values can be obtained from each sequence, and 162 feature values can be obtained from the two sequences.
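Steps (8.1) to (8.5) can be sketched as follows. Several details are assumptions rather than statements of the text: the nine sub-matrices are taken as a 3 × 3 grid, each histogram has 9 bins (consistent with 81 = 9 × 9 values per sequence), bins are weighted by gradient amplitude, directions are folded into [0, π), and the direction is computed with `atan2(G_v, G_h)`. When a dimension is not divisible by 3, the blocks are only approximately equal in size.

```python
import math


def gradients(smr):
    """Zero-padded differences in the first index (G_h) and in the position index (G_v)."""
    rows, cols = len(smr), len(smr[0])

    def at(i, l):
        # Entries outside the matrix count as 0, as in the boundary cases.
        return smr[i][l] if 0 <= i < rows and 0 <= l < cols else 0.0

    g_h = [[at(i + 1, l) - at(i - 1, l) for l in range(cols)] for i in range(rows)]
    g_v = [[at(i, l + 1) - at(i, l - 1) for l in range(cols)] for i in range(rows)]
    return g_h, g_v


def hog_features(smr, grid=3, bins=9):
    """Amplitude-weighted direction histograms over a grid x grid partition."""
    g_h, g_v = gradients(smr)
    rows, cols = len(smr), len(smr[0])
    features = []
    for block_row in range(grid):
        for block_col in range(grid):
            hist = [0.0] * bins
            # Integer boundaries so the blocks tile the matrix without overlap.
            for i in range(block_row * rows // grid, (block_row + 1) * rows // grid):
                for l in range(block_col * cols // grid, (block_col + 1) * cols // grid):
                    amplitude = math.hypot(g_h[i][l], g_v[i][l])
                    # Assumed convention: unsigned direction in [0, pi).
                    direction = math.atan2(g_v[i][l], g_h[i][l]) % math.pi
                    hist[min(int(direction / math.pi * bins), bins - 1)] += amplitude
            features.extend(hist)
    return features


if __name__ == "__main__":
    toy_smr = [[float((i * l) % 5) for l in range(30)] for i in range(20)]
    print(len(hog_features(toy_smr)))   # grid * grid * bins values
```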

[0075] Step (9): Singular value decomposition is applied to the transpose of the SMR matrix, which yields twenty right singular vectors. This step gives 800 feature values.
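The feature count of step (9) follows from the shape of the decomposition. As a sketch, assuming that the 20 right singular vectors are concatenated into one feature vector per sequence (the text does not spell out the flattening):

```latex
\mathrm{SMR}^{\mathsf T} = U \Sigma V^{\mathsf T}, \qquad
\mathrm{SMR}^{\mathsf T} \in \mathbb{R}^{L \times 20}, \quad
V = [\,v_1, \dots, v_{20}\,] \in \mathbb{R}^{20 \times 20}
\;\Longrightarrow\;
20 \times 20 = 400 \text{ values per sequence}, \quad
2 \times 400 = 800 \text{ values per pair.}
```

Because each right singular vector has 20 components regardless of L, this block has a fixed length for sequences of any length.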

[0076] Step (10): Through steps 1 to 9, a total of 238+400+162+800 = 1600 eigenvalues can be obtained. These eigenvalues are input into a random forest model for prediction, so as to get the interaction between two protein.
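Assembling the classifier input from the four feature blocks is a plain concatenation, sketched below. The block names and the function are hypothetical; only the block sizes come from the text, and the random forest itself is not shown.

```python
EXPECTED_BLOCKS = {            # per-pair sizes stated in the text
    "mutual_information": 238,
    "physicochemical": 400,
    "hog": 162,
    "svd": 800,
}


def assemble_pair_vector(blocks):
    """Concatenate the four per-pair feature blocks in a fixed order."""
    vector = []
    for name, size in EXPECTED_BLOCKS.items():
        if len(blocks[name]) != size:
            raise ValueError(f"{name}: expected {size} values, got {len(blocks[name])}")
        vector.extend(blocks[name])
    return vector


if __name__ == "__main__":
    dummy = {name: [0.0] * size for name, size in EXPECTED_BLOCKS.items()}
    print(len(assemble_pair_vector(dummy)))   # total length of the input vector
```

Checking each block length before concatenation catches a missing or truncated feature group early, which is otherwise hard to spot once the values are merged into one vector.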

[0077] According to the above calculation method, we used protein-protein interaction data sets that are generally recognized by 12 researchers, and analyzed the performance of our prediction method with a random forest model. The data sets include S.cerevisiae, H.pylori2918, human8161 and E.coli. At the same time, the method is tested and analyzed on three real protein interaction networks: the single-core network CD9, the multi-core network of the Ras-Raf-Mek-Erk-Elk-Srf metabolic pathway, and the cross network Wnt. On the S.cerevisiae data set, the accuracy of interaction prediction using binary mutual information, ternary mutual information and multivariate mutual information is 93.56%, 93.88% and 94.23% respectively. Using the combined multivariate mutual information for feature extraction therefore performs better than using either kind of feature alone. For the Moreau-Broto autocorrelation calculation, in order to find the best lg, nine different lg values were tested (lg = 5, 10, 15, 20, 25, 30, 35, 40, 45). Figure 3 shows the accuracy of the prediction results for each lg value. As can be seen from the curve in the figure, when lg increases from 5 to 30, the prediction accuracy increases; however, when lg increases from 30 to 45, the accuracy decreases. The best prediction accuracy, 92.76%, is obtained when lg is 30. The accuracy rates of the histogram of oriented gradients and singular value decomposition features, when each is used alone, are 93.86% and 92.93% respectively. In our method, the four kinds of feature extraction methods are integrated, and the prediction accuracy rate is 94.56%. The random forest classifier used in this method gives better prediction results than the SVM classifier. The random forest classifier is an ensemble model and can measure the importance of features, so the accuracy of the prediction results is improved by 2%.

[0078] This method has high accuracy when applied to the prediction of protein interaction networks. On the single-core network CD9, our method identifies 14 of 16 protein interactions, an accuracy rate of 87.50%. On the multi-core network of the Ras-Raf-Mek-Erk-Elk-Srf metabolic pathway, it correctly predicts 174 out of 189 protein interactions, an accuracy rate of 92.06%. The cross network of metabolic pathways related to Wnt is very important in signal transduction; our method finds 91 of its 96 interactions, an accuracy rate of 94.79%. Other existing methods achieve accuracies of 81.25%, 90.00% and 76.04% respectively on these three types of network structures, so our method has higher accuracy than the existing methods.
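The three network accuracies follow directly from the reported counts. A short check, with the dictionary and function names chosen only for illustration:

```python
def accuracy_percent(found, total):
    """Share of known interactions recovered, in percent (two decimals)."""
    return round(100.0 * found / total, 2)


# (interactions found, interactions in the network) as reported in the text
NETWORKS = {
    "CD9": (14, 16),
    "Ras-Raf-Mek-Erk-Elk-Srf": (174, 189),
    "Wnt": (91, 96),
}

if __name__ == "__main__":
    for name, (found, total) in NETWORKS.items():
        print(name, accuracy_percent(found, total))
```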

[0079] In proteomics, the biggest difficulty in predicting protein interactions is that the existing information is not clear enough, and the useful information is hidden in overly simple sequence information. To predict interactions accurately, direct sequence information is not enough; a good information extraction method is also needed to abstract richer and more useful interaction information and physical and chemical attribute information from the underlying sequence information. The main contribution of this invention is a general feature extraction method that extracts useful information from sequence information in order to predict protein interactions.

[0080] The basic idea of the invention is to extract various types of attribute information and predict the interaction through an effective classifier. First, the frequency information of the various amino acids and of their 2-tuples and 3-tuples in the sequence is calculated; then, on the basis of this frequency information, multivariate mutual information is further integrated and abstracted, mining the relationship between different amino acids and their tuples from simple sequence data. Secondly, the invention also fully considers the influence of the physical and chemical properties of amino acids on the interaction, and extracts residue binding energy information from the sequence to further improve the prediction accuracy.

[0081] The invention mainly comprises the following steps: The multivariate mutual information in the amino acid sequences is calculated, giving 238 mutual information feature values. The Moreau-Broto autocorrelation feature values are calculated and the frequencies of the 20 amino acids in the sequences are counted, giving 400 feature values. The amino acid contact matrix is calculated from the residue pairing frequency, and the substitution matrix is then calculated from it. Processing the substitution matrix with the histogram of oriented gradients gives 162 feature values, and singular value decomposition of the substitution matrix gives 800 feature values. The 1600 feature values are classified by a random forest classifier to judge whether the two proteins interact.

[0082] The calculation process of the invention is simple and easy to implement, and its requirements for hardware and computing resources are relatively low, so it is widely usable. The method can be implemented in C++ and MATLAB, and the prediction of thousands of samples can be completed in a short time on an ordinary computer with a 2.5 GHz 6-core CPU and 32 GB of memory. In order to balance performance against effect, the number of decision trees and the number of features available to each subtree of the random forest classifier are set to 500 and 400 respectively. By adjusting these parameters, the classification can also be made faster, so that the prediction can be performed more quickly.

Claims

1. A protein interaction prediction method based on multivariate mutual information and residue binding energy, characterized by comprising the following steps:

Step (1): Amino acid category grouping: the 20 standard amino acids are divided into n functional groups according to their dipole polarity and volume, the n functional groups are respectively marked as C0, C1, C2, ..., Cn, and the original amino acid sequence is converted into a group category sequence according to the functional group category of each amino acid;

Step (2): Define the different types of 3-tuple and 2-tuple feature representations. The 3-tuple feature representations are "C0C0C0", "C0C0C1", ..., "CnCnCn"; the 2-tuple feature representations are "C0C0", "C0C1", ..., "CnCn";

Step (3): Count the number of 3-tuple features and 2-tuple features in the group category sequence, establish a feature frequency table, and use the frequency calculation function f(a) = (n_a + 1)/(L + 1) to calculate the frequency of each of the n categories in the sequence;

Step (4): Calculate the mutual information features of the 2-tuples. The calculation formula is:

$$I(ab) = f(ab)\,\ln\frac{f(ab)}{f(a)\,f(b)}$$

Where f(ab) is the frequency of simultaneous occurrence of categories a and b in a 2-tuple;

Step (5): Calculate the 3-tuple mutual information features. The calculation formula is:

$$I(abc) = I(ab) + f(a|c)\,\ln f(a|c) - f(a|bc)\,\ln f(a|bc)$$

Where f(a|c) is the frequency of class a in all the 2-tuples containing class c, and f(a|bc) is the frequency of class a in all the 3-tuples containing classes b and c;

The first part, the mutual information feature values, is obtained through the above five steps;

Step (6): Calculate the physical and chemical properties of the amino acids;

Step (7): Through statistical analysis of a protein complex database, the amino acid contact matrix AAC is calculated from the residue pairing frequency:

[formula for AAC(i, j) illegible in the source text]

In which i and j represent two amino acids and $N_{i,j} = \sum n_{ij}$ is the contact number of i and j;

Calculate the substitution matrix SMR, with $\mathrm{SMR}_{i,l} = \mathrm{AAC}(i, A_l)$, where i = 1, ..., 20 is one of the twenty amino acid types, l = 1, ..., L is one of the L positions in the given protein sequence, and $A_l$ is the amino acid type at position l. Through this step, a 20 × L substitution matrix SMR is obtained;

Step (8): Use the histogram of oriented gradients (HOG) feature extraction algorithm to extract the features of the amino acid sequences;

Step (9): Singular value decomposition is applied to the transpose of the SMR matrix, through which 20 right singular vectors are obtained;

Step (10): The feature values obtained from steps (1) to (9) are input into a random forest model for prediction, so as to obtain the interaction between the two proteins.

2. The protein interaction prediction method based on multivariate mutual information and residue binding energy according to claim 1, characterized in that the specific calculation steps of step (6) are as follows:

Step (6.1): Calculate the Moreau-Broto autocorrelation feature values. The calculation formula is:

$$\mathrm{NMBA}(lag, p) = \frac{1}{L - lag} \sum_{l=1}^{L-lag} \left( X_{l,p} \times X_{l+lag,p} \right)$$

Where lag is the distance between residues, p is the p-th physical and chemical property of the above-mentioned natural amino acids, $X_{l,p}$ is the value of property p for the residue at position l, l is the position in the sequence, l = 1, 2, ..., L-lag, and lag = 1, 2, ..., lg. After being expressed by six physical and chemical properties, lg × 6 feature values are obtained;

Step (6.2): The obtained lg × 6 feature values are normalized;

Step (6.3): Count the frequency of the 20 amino acids in the sequence.

3. The protein interaction prediction method based on multivariate mutual information and residue binding energy according to claim 1, characterized in that the specific calculation process of step (8) is as follows:

Step (8.1): Calculate the gradient values $G_h(i, l)$ and $G_v(i, l)$ in the horizontal and vertical directions. The calculation formulas are:

$$G_h(i,l) = \begin{cases} \mathrm{SMR}(i+1,l) - 0, & i = 1 \\ \mathrm{SMR}(i+1,l) - \mathrm{SMR}(i-1,l), & 1 < i < 20 \\ 0 - \mathrm{SMR}(i-1,l), & i = 20 \end{cases}$$

$$G_v(i,l) = \begin{cases} \mathrm{SMR}(i,l+1) - 0, & l = 1 \\ \mathrm{SMR}(i,l+1) - \mathrm{SMR}(i,l-1), & 1 < l < L \\ 0 - \mathrm{SMR}(i,l-1), & l = L \end{cases}$$

Step (8.2): Calculate the gradient amplitude: $$G(i,l) = \sqrt{G_h(i,l)^2 + G_v(i,l)^2}$$

Step (8.3): Calculate the gradient direction $\theta(i,l)$ as the inverse tangent of the ratio of the two gradient components;

Step (8.4): Divide the gradient amplitude matrix and the gradient direction matrix into nine sub-matrices of the same size;

Step (8.5): Count the histogram of gradient directions, and take the histogram value for each gradient direction as a feature value.

Through the above steps, each sequence gets x feature values, and the two sequences get 2x feature values in total.