LU502739B1 - A Prediction Method of Interaction Between Multi-Information and Residue Binding Energy Protein - Google Patents
- Publication number
- LU502739B1
- Authority
- LU
- Luxembourg
- Prior art keywords
- calculate
- protein
- amino acid
- interaction
- sequence
- Prior art date
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
- G16B15/30—Drug targeting using structural data; Docking or binding prediction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Chemical & Material Sciences (AREA)
- Biophysics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- Crystallography & Structural Chemistry (AREA)
- Medical Informatics (AREA)
- Medicinal Chemistry (AREA)
- Artificial Intelligence (AREA)
- Pharmacology & Pharmacy (AREA)
- Bioinformatics & Computational Biology (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Investigating Or Analysing Biological Materials (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention relates to bioinformatics technology and provides a method for predicting protein-protein interactions from multivariate mutual information and residue binding energy. The method strengthens the contribution of useful information in the amino acid sequence to the prediction and effectively reduces the influence of useless noise information. Step (1): Classify the amino acids; Step (2): Define the feature representations; Step (3): Establish a feature frequency table; Step (4): Calculate the 2-tuple mutual information features; Step (5): Calculate the 3-tuple mutual information features; Step (6): Calculate the physical and chemical properties of the amino acids; Step (7): Calculate the amino acid contact matrix AAC; Step (8): Extract features of the amino acid sequence; Step (9): Carry out singular value decomposition; Step (10): Obtain the interaction between the two proteins. The method is mainly used to predict interactions between proteins.
Description
A Prediction Method of Interaction Between Multi-Information and Residue Binding Energy Protein
[0001] The invention relates to bioinformatics technology, belongs to the field of macromolecular structure prediction algorithms in proteomics, and particularly relates to a method for predicting protein interactions based on multivariate mutual information and residue binding energy.
[0002] Protein-protein interactions are at the core of many biological processes. Identifying them is essential for clarifying protein function and for understanding biological processes in cells. Interaction information also helps to explain the pathogenesis of diseases, so that drugs can be designed more efficiently and accurately. In the past few years, a large number of computational techniques have matured to the point of large-scale analysis. Generally speaking, there are three main classes of computational methods for detecting protein-protein interactions: methods based on evolutionary information, methods based on natural language processing, and methods based on amino acid sequence characteristics. Methods based on evolutionary information extract evolutionary information from multiple sequence alignments of homologous proteins and construct an evolutionary tree to analyze the relationships between protein functions. They need a large amount of homologous protein data together with interaction labels for these proteins, which greatly limits them in large-scale computation. Methods based on natural language processing rely on widely used natural language processing technology and mine useful information from the large number of known protein-protein interactions recorded in the biological and medical literature. Because some information is missing from the literature, their prediction results may be incomplete. It is therefore particularly important to improve the accuracy of protein-protein interaction prediction, and to make large-scale application of the method possible, by combining a multivariate mutual information feature extraction method based on the amino acid sequence with a residue binding energy feature extraction method.
[0003] Feature extraction is the key technology of protein interaction prediction methods based on amino acid sequence information. It defines a series of mapping functions through which a segment of the amino acid sequence of a protein is mapped to a list of feature values that represent the sequence. These values should capture the useful features of the protein as comprehensively as possible while excluding noise that would adversely affect the prediction results. Classical amino acid sequence feature extraction methods include autocovariance, conjoint triad, local protein sequence descriptors, multi-scale local feature descriptors, local phase quantization descriptors and matrix-based protein sequence representations. These methods abstract amino acid sequences from different aspects, and their prediction results differ considerably. How to design an effective feature extraction method that maps amino acid sequences abstractly, improves the distinguishability between sequences and reduces the interference of noise on the prediction results has therefore become the key problem for protein interaction prediction methods.
[0004] To overcome the shortcomings of the prior art, the invention aims to provide a method for predicting protein interactions based on multivariate mutual information and residue binding energy, whose feature extraction functions strengthen the role of useful information in the amino acid sequence during prediction and effectively reduce the influence of useless noise information.
The method for predicting protein interactions based on multivariate mutual information and residue binding energy includes the following steps:
[0005] Step (1): Amino acid category grouping: the 20 standard amino acids are divided into n functional groups according to their dipoles and volumes, and these n functional groups are marked as C0, C1, C2, ..., Cn respectively. The original amino acid sequence is converted into a group category sequence according to the functional group category of each amino acid;
[0006] Step (2): Define the different types of 3-tuple and 2-tuple feature representations. The 3-tuple features are represented as "C0C0C0", "C0C0C1", ..., "CnCnCn"; the 2-tuple features are represented as "C0C0", "C0C1", ..., "CnCn";
[0007] Step (3): Count the number of 3-tuple features and 2-tuple features in the group category sequence, establish the feature frequency table, and calculate the frequency of each of the n categories in the sequence using the frequency calculation function $f(a) = (n_a + 1)/(L + 1)$;
[0008] Step (4): Calculate the 2-tuple mutual information features. The calculation formula is:
[0009] $$I(ab) = f(ab)\,\ln\frac{f(ab)}{f(a)\,f(b)}$$
[0010] where f(ab) is the frequency with which categories a and b occur together in a 2-tuple;
[0011] Step (5): Calculate the 3-tuple mutual information features. The calculation formula is:
[0012]-[0013] $$I(abc) = I(ab) + f(a|c)\,\ln f(a|c) - f(a|bc)\,\ln f(a|bc)$$
[0014] where f(a|c) is the frequency of class a in all the tuples containing class c, and f(a|bc) is the frequency of class a in all the 3-tuples containing classes b and c;
[0015] The first part of the feature values, the mutual information features, is obtained through the above five steps;
[0016] Step (6): Calculate the physical and chemical properties of amino acids;
[0017] Step (7): Through statistical analysis of a protein complex database, the amino acid contact matrix AAC is calculated from residue pairing frequencies:
[0018] [formula for $AAC(i,j)$ not legible in the source text]
[0019] where i and j denote two amino acids and $N_{i,j} = \sum_{s} n_{ij}^{s}$ is the number of contacts between i and j;
[0020] Calculate the substitution matrix SMR, $SMR_{i,l} = AAC(i, A_l)$, where i = 1, ..., 20 is one of the twenty amino acid types, l = 1, ..., L is one of the L positions in the given protein sequence, and $A_l$ is the amino acid type at position l. Through this step, a 20×L substitution matrix SMR is obtained;
[0021] Step (8): Use the histogram of oriented gradients (HOG) feature extraction algorithm to extract features of the amino acid sequences;
[0022] Step (9): Apply singular value decomposition to the transpose of the SMR matrix, which yields 20 right singular vectors;
[0023] Step (10): Input the feature values obtained from steps 1 to 9 into a random forest model for prediction to obtain the interaction between the two proteins.
[0024] The specific calculation steps of Step (6) are as follows:
[0025] Step (6.1): Calculate the Moreau-Broto autocorrelation feature values. The calculation formula is:
[0026] $$NMBA(lag, p) = \frac{1}{L - lag}\sum_{l=1}^{L-lag} X_{l,p} \times X_{l+lag,p}$$
[0027] where lag is the distance between residues, p is the p-th physical and chemical property of the above-mentioned natural amino acids, $X_{l,p}$ is the value of property p for the residue at position l, l is the position in the sequence, l = 1, 2, ..., L-lag, and lag = 1, 2, ..., lg. With six physical and chemical properties, lg×6 feature values are obtained.
[0028] Step (6.2): The obtained lg×6 feature values are normalized;
[0029] Step (6.3): Count the frequencies of the 20 amino acids in the sequence.
[0030] The specific calculation process of Step (8) is as follows:
[0031] Step (8.1): Calculate the gradient values $G_h(i,l)$ and $G_v(i,l)$ in the horizontal and vertical directions. The calculation formulas are:
[0032] $$G_h(i,l) = \begin{cases} SMR(i+1,l) - SMR(i-1,l), & 1 < i < 20 \\ 0 - SMR(i-1,l), & i = 20 \\ SMR(i+1,l) - 0, & i = 1 \end{cases}$$
[0033] $$G_v(i,l) = \begin{cases} SMR(i,l+1) - SMR(i,l-1), & 1 < l < L \\ 0 - SMR(i,l-1), & l = L \\ SMR(i,l+1) - 0, & l = 1 \end{cases}$$
[0034] Step (8.2): Calculate the gradient magnitude: $$G(i,l) = \sqrt{G_h(i,l)^2 + G_v(i,l)^2}$$
[0035] Step (8.3): Calculate the gradient direction from $G_h(i,l)$ and $G_v(i,l)$;
[0036] Step (8.4): Divide the gradient magnitude matrix and the gradient direction matrix into nine sub-matrices of the same size;
[0037] Step (8.5): Compile the histogram over the gradient directions, and take the histogram value for each gradient direction as a feature value.
[0038] Through the above steps, each sequence yields x feature values, and the two sequences yield 2x feature values in total.
[0039] The characteristics and beneficial effects of the invention are as follows:
[0040] The invention integrates multivariate mutual information of the amino acid sequence with residue binding energy information. Compared with traditional sequence information, multivariate mutual information considers not only each amino acid together with its two adjacent amino acids, but also the mutual information among these components. At the same time, the gradient histogram and singular value decomposition extract texture features from the protein matrix. These additional sources of information help to predict protein-protein interactions accurately, so that when the method is used to analyze and predict protein-protein interactions, the accuracy of the results is better than that of other existing methods. The method can not only accurately predict interactions between proteins, but also discover new interactions in protein interaction networks, which is of great significance for improving such networks.
[0041] Figure 1 is a flow chart of the calculation process of the invention;
[0042] Figure 2 shows the feature representation of 2-tuples and 3-tuples and the establishment of the frequency table;
[0043] Figure 3 shows the accuracy of the Moreau-Broto autocorrelation feature for different lg values.
[0044] The invention comprises the following steps in sequence:
[0045] Step (1): Classification of amino acids. The 20 standard amino acids are divided into 7 functional groups according to their dipoles and volumes. These seven functional groups are named C0, C1, C2, ..., C6. The original amino acid sequence is converted into a group category sequence according to the functional group category of each amino acid.
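As a minimal sketch of Step (1), the Python snippet below maps a sequence to class indices. The text does not list which amino acids belong to which of the seven groups, so the `GROUPS` table is an assumption borrowed from the seven-class grouping commonly used in the conjoint triad literature; the names `GROUPS`, `AA_TO_CLASS` and `to_class_sequence` are illustrative only.

```python
# Hypothetical 7-class grouping (assumption: the text does not give the
# class members). Index k in this list plays the role of category Ck.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]

# Map each one-letter amino acid code to its class index 0..6.
AA_TO_CLASS = {aa: idx for idx, members in enumerate(GROUPS) for aa in members}


def to_class_sequence(sequence):
    """Convert an amino acid sequence into a list of class indices."""
    return [AA_TO_CLASS[aa] for aa in sequence.upper()]


print(to_class_sequence("MKVAC"))  # [2, 4, 0, 0, 6]
```

Any other grouping can be substituted by editing `GROUPS`, as long as all 20 standard amino acids are covered.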
[0046] Step (2): Define the different types of 3-tuple and 2-tuple feature representations. The 3-tuple features are expressed as "C0C0C0", "C0C0C1", ..., "C6C6C6". The 2-tuple features are expressed as "C0C0", "C0C1", ..., "C6C6".
[0047] Step (3): Count the number of 3-tuple features and 2-tuple features in the group category sequence, and establish the feature frequency table, as shown in Figure 2. Use the frequency calculation function $f(a) = (n_a + 1)/(L + 1)$ to calculate the frequency of each of the seven categories in the sequence, where $n_a$ is the number of occurrences of category a and L is the length of the sequence.
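The counting in Step (3) can be sketched as follows. This is an illustration under assumptions, not the patented implementation: tuples are taken over consecutive positions and counted without regard to order inside the tuple, which is the reading that gives 28 distinct 2-tuples and 84 distinct 3-tuples for seven classes. The helper names `tuple_counts` and `class_frequency` are hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement

N_CLASSES = 7  # categories 0..6


def tuple_counts(class_seq, k):
    """Count k-tuples of consecutive classes, ignoring order inside a tuple."""
    windows = (tuple(sorted(class_seq[i:i + k]))
               for i in range(len(class_seq) - k + 1))
    return Counter(windows)


def class_frequency(class_seq, a):
    """Smoothed frequency f(a) = (n_a + 1) / (L + 1) from Step (3)."""
    return (class_seq.count(a) + 1) / (len(class_seq) + 1)


seq = [2, 4, 0, 0, 6, 0, 4]
pairs = tuple_counts(seq, 2)
print(pairs[(0, 4)])            # 2: the windows (4, 0) and (0, 4)
print(class_frequency(seq, 0))  # (3 + 1) / (7 + 1) = 0.5
print(len(list(combinations_with_replacement(range(N_CLASSES), 2))))  # 28
print(len(list(combinations_with_replacement(range(N_CLASSES), 3))))  # 84
```

The add-one smoothing keeps every frequency strictly positive, so the logarithms in the later mutual information formulas are always defined.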
[0048] Step (4): Calculate the 28 2-tuple mutual information features. The calculation formula is:
[0049] $$I(ab) = f(ab)\,\ln\frac{f(ab)}{f(a)\,f(b)}$$
[0050] where f(ab) is the frequency of occurrence of the 2-tuple ab.
[0051] Step (5): Calculate the 84 3-tuple mutual information features. The calculation formula is:
[0052]-[0053] $$I(abc) = I(ab) + f(a|c)\,\ln f(a|c) - f(a|bc)\,\ln f(a|bc)$$
[0054] where f(a|c) is the frequency of class a in all the tuples containing class c, and f(a|bc) is the frequency of class a in all the 3-tuples containing classes b and c.
[0055] Through the above five steps, 238 mutual information feature values are obtained.
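A hedged sketch of Steps (4) and (5) in Python follows. Several details are assumptions rather than statements from the text: tuple frequencies reuse the add-one smoothing of f(a), the conditional frequencies are taken as f(a|c) = f(ac)/f(c) and f(a|bc) = f(abc)/f(bc), and each unordered tuple is evaluated once in sorted order. The function name `mutual_information_features` is hypothetical.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement


def mutual_information_features(class_seq, n_classes=7):
    """Return (I2, I3): dicts keyed by unordered class tuples."""
    L = len(class_seq)
    n1 = Counter(class_seq)
    n2 = Counter(tuple(sorted(class_seq[i:i + 2])) for i in range(L - 1))
    n3 = Counter(tuple(sorted(class_seq[i:i + 3])) for i in range(L - 2))

    def f(counter, key):
        # Assumption: tuples use the same add-one smoothing as f(a).
        return (counter[key] + 1) / (L + 1)

    def i2(a, b):
        # I(ab) = f(ab) * ln( f(ab) / (f(a) * f(b)) )
        fab = f(n2, tuple(sorted((a, b))))
        return fab * math.log(fab / (f(n1, a) * f(n1, b)))

    def i3(a, b, c):
        # Assumed conditionals: f(a|c) = f(ac)/f(c), f(a|bc) = f(abc)/f(bc).
        f_a_c = f(n2, tuple(sorted((a, c)))) / f(n1, c)
        f_a_bc = f(n3, tuple(sorted((a, b, c)))) / f(n2, tuple(sorted((b, c))))
        return i2(a, b) + f_a_c * math.log(f_a_c) - f_a_bc * math.log(f_a_bc)

    classes = range(n_classes)
    I2 = {t: i2(*t) for t in combinations_with_replacement(classes, 2)}
    I3 = {t: i3(*t) for t in combinations_with_replacement(classes, 3)}
    return I2, I3


I2, I3 = mutual_information_features([0, 1, 0, 1])
print(len(I2), len(I3))  # 28 84
```

Together with the seven single-class frequencies this gives 7 + 28 + 84 = 119 values per sequence, which matches the 238 values stated for a pair of sequences.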
[0056] Step (6): Calculate the physical and chemical properties of amino acids. Each amino acid sequence yields 200 feature values, and the pair of amino acid sequences whose interaction is to be predicted yields 400 feature values. The specific calculation method is as follows:
[0057] Step (6.1): Calculate the Moreau-Broto autocorrelation feature values. The calculation formula is:
[0058] $$NMBA(lag, p) = \frac{1}{L - lag}\sum_{l=1}^{L-lag} X_{l,p} \times X_{l+lag,p}$$
[0059] where lag is the distance between residues, p is the p-th physical and chemical property of the above-mentioned natural amino acids, $X_{l,p}$ is the value of property p for the residue at position l, l is the position in the sequence, l = 1, 2, ..., L-lag, and lag = 1, 2, ..., lg, where lg generally takes the value 30. With six physical and chemical properties, 30 × 6 = 180 feature values are obtained.
[0060] Step (6.2): The obtained 180 feature values are normalized.
[0061] Step (6.3): Count the frequencies of the 20 amino acids in the sequence.
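The autocorrelation of Step (6.1) can be illustrated with a short Python sketch. It assumes the standard normalized Moreau-Broto form, the average of X_l × X_(l+lag) over the L − lag available positions, and takes each property as a plain list of per-residue values. The six properties themselves are not listed in the text, so the example uses made-up numeric tracks; `moreau_broto` and `autocorrelation_features` are hypothetical names.

```python
def moreau_broto(values, lag):
    """Average of X_l * X_(l+lag) over the L - lag valid positions."""
    n = len(values) - lag
    if n <= 0:
        raise ValueError("sequence must be longer than lag")
    return sum(values[l] * values[l + lag] for l in range(n)) / n


def autocorrelation_features(property_tracks, max_lag=30):
    """One value per (property, lag) pair; 6 tracks x 30 lags gives 180."""
    return [moreau_broto(track, lag)
            for track in property_tracks
            for lag in range(1, max_lag + 1)]


print(moreau_broto([1.0, 2.0, 3.0, 4.0], 2))  # (1*3 + 2*4) / 2 = 5.5

# Six made-up property tracks for a 40-residue sequence (illustration only).
tracks = [[float(k + i) for i in range(40)] for k in range(6)]
print(len(autocorrelation_features(tracks)))  # 180
```

Adding the 20 amino acid frequencies of Step (6.3) to these 180 values gives the 200 features per sequence stated above. Note that a sequence must be longer than lg residues for every lag to be defined.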
[0062] Step (7): Through statistical analysis of a protein complex database, the amino acid contact matrix AAC is calculated from residue pairing frequencies:
[0063] [formula for $AAC(i,j)$ not legible in the source text]
[0064] where i and j denote two kinds of amino acids and $N_{i,j} = \sum_{s} n_{ij}^{s}$ is the number of contacts between i and j.
[0065] Calculate the substitution matrix SMR, $SMR_{i,l} = AAC(i, A_l)$, where i = 1, ..., 20 is one of the twenty amino acid types, l = 1, ..., L is one of the L positions in the given protein sequence, and $A_l$ is the amino acid type at position l. Through this step, a 20×L substitution matrix SMR is obtained;
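To make the construction of SMR concrete, here is a small sketch in Python. Column l of SMR is simply the AAC column belonging to the amino acid found at position l. The AAC values are not given in the text, so `toy_aac` is a placeholder with arbitrary numbers, and the row order in `AMINO_ACIDS` is likewise an assumption.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed row order of the 20 types
INDEX = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def build_smr(sequence, aac):
    """Return the 20 x L matrix with SMR[i][l] = AAC(i, A_l)."""
    columns = [INDEX[aa] for aa in sequence]
    return [[aac[i][j] for j in columns] for i in range(len(AMINO_ACIDS))]


# Placeholder AAC: symmetric toy numbers, NOT real contact energies.
toy_aac = [[float(i + j) for j in range(20)] for i in range(20)]

smr = build_smr("CAD", toy_aac)
print(len(smr), len(smr[0]))  # 20 3
print(smr[2][0])              # AAC(D, C) in the toy table: 2 + 1 = 3.0
```

Because the row count is fixed at 20 while the column count follows the sequence length, later steps (HOG blocks, SVD) are what turn this variable-width matrix into a fixed-length feature vector.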
[0066] Step (8): The histogram of oriented gradients (HOG) feature extraction algorithm is used to extract features of the amino acid sequences. The specific calculation process is as follows:
[0067] Step (8.1): Calculate the gradient values $G_h(i,l)$ and $G_v(i,l)$ in the horizontal and vertical directions. The calculation formulas are:
[0068] $$G_h(i,l) = \begin{cases} SMR(i+1,l) - SMR(i-1,l), & 1 < i < 20 \\ 0 - SMR(i-1,l), & i = 20 \\ SMR(i+1,l) - 0, & i = 1 \end{cases}$$
[0069] $$G_v(i,l) = \begin{cases} SMR(i,l+1) - SMR(i,l-1), & 1 < l < L \\ 0 - SMR(i,l-1), & l = L \\ SMR(i,l+1) - 0, & l = 1 \end{cases}$$
[0070] Step (8.2): Calculate the gradient magnitude: $$G(i,l) = \sqrt{G_h(i,l)^2 + G_v(i,l)^2}$$
[0071] Step (8.3): Calculate the gradient direction as the inverse tangent ($\tan^{-1}$) of the ratio of the two gradient components.
[0072] Step (8.4): The gradient magnitude matrix and the gradient direction matrix are divided into nine sub-matrices of the same size.
[0073] Step (8.5): Compile the histogram over the gradient directions. The histogram value for each gradient direction is taken as a feature value.
[0074] Through the above steps, 81 feature values are obtained from each sequence, and 162 feature values from the two sequences.
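A possible rendering of Step (8) in plain Python is sketched below. The text fixes the zero-padded differences of Step (8.1) and the total of 81 values per sequence; everything else here is an assumption made for illustration: a 3 × 3 grid of blocks, 9 direction bins per block (9 × 9 = 81), magnitude-weighted voting, the direction taken as `atan2(Gv, Gh)` folded into [0, π), and near-equal block sizes when 20 or L is not divisible by 3.

```python
import math


def gradients(smr):
    """Differences with zero padding outside the matrix (Step 8.1)."""
    rows, cols = len(smr), len(smr[0])

    def at(i, l):
        return smr[i][l] if 0 <= i < rows and 0 <= l < cols else 0.0

    gh = [[at(i + 1, l) - at(i - 1, l) for l in range(cols)] for i in range(rows)]
    gv = [[at(i, l + 1) - at(i, l - 1) for l in range(cols)] for i in range(rows)]
    return gh, gv


def hog_features(smr, grid=3, bins=9):
    """Magnitude-weighted direction histograms over a grid x grid partition."""
    rows, cols = len(smr), len(smr[0])
    gh, gv = gradients(smr)
    features = []
    for bi in range(grid):
        r0, r1 = bi * rows // grid, (bi + 1) * rows // grid
        for bl in range(grid):
            c0, c1 = bl * cols // grid, (bl + 1) * cols // grid
            hist = [0.0] * bins
            for i in range(r0, r1):
                for l in range(c0, c1):
                    magnitude = math.hypot(gh[i][l], gv[i][l])
                    # Assumed convention: unsigned direction in [0, pi).
                    angle = math.atan2(gv[i][l], gh[i][l]) % math.pi
                    hist[min(int(angle / math.pi * bins), bins - 1)] += magnitude
            features.extend(hist)
    return features


toy = [[float((i * l) % 5) for l in range(12)] for i in range(20)]
print(len(hog_features(toy)))  # 81
```

The output length is independent of the sequence length L, which is what allows sequences of different lengths to share one classifier input layout.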
[0075] Step (9): Singular value decomposition is applied to the transpose of the SMR matrix, which yields twenty right singular vectors. This step produces 800 feature values.
[0076] Step (10): Through steps 1 to 9, a total of 238 + 400 + 162 + 800 = 1600 feature values are obtained. These feature values are input into a random forest model for prediction to obtain the interaction between the two proteins.
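The stated totals can be cross-checked with a few lines of arithmetic. The decomposition below is one reading that reproduces the numbers in the text, not something the text spells out: 7 class frequencies plus 28 and 84 unordered tuples per sequence, 9 blocks × 9 bins for HOG, and 20 right singular vectors of 20 components each, all doubled for the two proteins of a pair.

```python
from math import comb

n_classes = 7
singles = n_classes               # one frequency per class
pairs = comb(n_classes + 1, 2)    # unordered 2-tuples with repetition: 28
triples = comb(n_classes + 2, 3)  # unordered 3-tuples with repetition: 84

mutual_info = 2 * (singles + pairs + triples)  # 2 * 119
physchem = 2 * (30 * 6 + 20)                   # 180 autocorrelations + 20 frequencies
hog = 2 * (9 * 9)                              # assumed 9 blocks x 9 bins
svd = 2 * (20 * 20)                            # assumed 20 vectors x 20 components

total = mutual_info + physchem + hog + svd
print(mutual_info, physchem, hog, svd, total)  # 238 400 162 800 1600
```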
[0077] Following the above calculation method, we used protein-protein interaction data sets that are generally recognized by 12 researchers and analyzed the performance of our prediction method with the random forest model. The data sets include S.cerevisiae, H.pylori2918, human8161 and E.coli. The method was also tested and analyzed on three real protein interaction networks: the single-core network CD9, the multi-core network of the Ras-Raf-Mek-Erk-Elk-Srf metabolic pathway, and the cross network Wnt. On the S.cerevisiae data set, the accuracy of interaction prediction using binary mutual information, ternary mutual information and multivariate mutual information is 93.56%, 93.88% and 94.23% respectively. Using the combined multivariate mutual information for feature extraction therefore performs better than using either kind of feature alone. For the Moreau-Broto autocorrelation calculation, nine different lg values were tested to find the best one (lg = 5, 10, 15, 20, 25, 30, 35, 40, 45). Figure 3 shows the accuracy of the prediction results for the different lg values. As the curve in the figure shows, the prediction accuracy increases as lg grows from 5 to 30 and decreases as lg grows from 30 to 45. The best prediction accuracy, 92.76%, is obtained when lg is 30. The histogram of oriented gradients and the singular value decomposition used in the method reach accuracies of 93.86% and 92.93% respectively when used alone. Our method integrates the four kinds of feature extraction and reaches a prediction accuracy of 94.56%. The random forest classifier used in this method gives better prediction results than the SVM classifier. The random forest classifier is an ensemble model and can assess the importance of features, so the accuracy of the prediction results is improved by 2%.
[0078] The method has high accuracy when applied to the prediction of protein interaction networks. On the single-core network CD9, our method identifies 14 of 16 protein interactions, an accuracy of 87.50%. On the multi-core network of the Ras-Raf-Mek-Erk-Elk-Srf metabolic pathway, it correctly predicts 174 of 189 protein interactions, an accuracy of 92.06%. The cross network of metabolic pathways related to Wnt is very important in signal transduction. Our method found 91 of its 96 interactions, an accuracy of 94.79%. Other existing methods currently achieve accuracies of 81.25%, 90.00% and 76.04% respectively on these three types of network structure, so our method is more accurate than the existing methods on all three.
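The three network accuracies follow directly from the reported counts, as this short check in Python shows (the dictionary layout and variable names are illustrative only):

```python
# (correctly predicted interactions, total interactions) per network
results = {
    "CD9": (14, 16),
    "Ras-Raf-Mek-Erk-Elk-Srf": (174, 189),
    "Wnt": (91, 96),
}

accuracy = {name: f"{100 * hit / total:.2f}%"
            for name, (hit, total) in results.items()}

for name, value in accuracy.items():
    print(name, value)
# CD9 87.50%
# Ras-Raf-Mek-Erk-Elk-Srf 92.06%
# Wnt 94.79%
```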
[0079] In proteomics, the biggest difficulty in predicting protein-protein interactions is that the available information is not explicit enough: the useful information is hidden in very simple sequence data. To predict interactions accurately, it is not enough to use the sequence information directly; a good information extraction method is also needed to abstract richer and more useful interaction information and physicochemical attribute information from the underlying sequence. The main contribution of this invention is the design of a general feature extraction method that extracts such useful information from sequence information for predicting protein-protein interactions.
[0080] The basic idea of the invention is to extract various types of attribute information and to predict the interaction with an effective classifier. First, the frequency information of the amino acid classes and of their 2-tuples and 3-tuples in the sequence is calculated; on the basis of this frequency information, multivariate mutual information is then integrated and abstracted, mining the relationships between different amino acids and their tuples from simple sequence data. Second, the invention also takes full account of the influence of the physical and chemical properties of amino acids on the interaction, and extracts residue binding energy information from the sequence to further improve the prediction accuracy.
[0081] The invention mainly comprises the following steps. The multivariate mutual information of the amino acid sequences is calculated, giving 238 mutual information feature values. The Moreau-Broto autocorrelation feature values are calculated and the frequencies of the 20 amino acids in the sequences are counted, giving 400 feature values. The amino acid contact matrix is calculated from the residue pairing frequencies, and the substitution matrix is then derived from it. Processing the substitution matrix with the histogram of oriented gradients gives 162 feature values, and singular value decomposition of the substitution matrix gives a further 800 feature values. The 1600 feature values are classified by a random forest classifier to judge whether the two proteins interact.
[0082] The calculation process of the invention is simple and easy to implement, and its demands on hardware and computing resources are modest, so it is widely usable. The method can be implemented in C++ and MATLAB, and predictions for thousands of samples can be completed in a short time on an ordinary computer with a 2.5 GHz 6-core CPU and 32 GB of memory. To balance computational cost against prediction quality, the number of decision trees and the number of features available to each subtree of the random forest classifier are set to 500 and 400 respectively. By adjusting these parameters, the classification can also be made faster, so that predictions are produced more quickly.
Claims (3)
1. A method for predicting protein interactions based on multivariate mutual information and residue binding energy, characterized by comprising the following steps:
Step (1): Amino acid category grouping: the 20 standard amino acids are divided into n functional groups according to their dipoles and volumes, and these n functional groups are marked as C0, C1, C2, ..., Cn respectively; the original amino acid sequence is converted into a group category sequence according to the functional group category of each amino acid;
Step (2): Define the different types of 3-tuple and 2-tuple feature representations; the 3-tuple features are represented as "C0C0C0", "C0C0C1", ..., "CnCnCn"; the 2-tuple features are represented as "C0C0", "C0C1", ..., "CnCn";
Step (3): Count the number of 3-tuple features and 2-tuple features in the group category sequence, establish a feature frequency table, and use the frequency calculation function $f(a) = (n_a + 1)/(L + 1)$ to calculate the frequency of each of the n categories in the sequence;
Step (4): Calculate the 2-tuple mutual information features; the calculation formula is:
$$I(ab) = f(ab)\,\ln\frac{f(ab)}{f(a)\,f(b)}$$
where f(ab) is the frequency with which categories a and b occur together in a 2-tuple;
Step (5): Calculate the 3-tuple mutual information features; the calculation formula is:
$$I(abc) = I(ab) + f(a|c)\,\ln f(a|c) - f(a|bc)\,\ln f(a|bc)$$
where f(a|c) is the frequency of class a in all the tuples containing class c, and f(a|bc) is the frequency of class a in all the 3-tuples containing classes b and c; the first part of the feature values, the mutual information features, is obtained through the above five steps;
Step (6): Calculate the physical and chemical properties of amino acids;
Step (7): Through statistical analysis of a protein complex database, the amino acid contact matrix AAC is calculated from residue pairing frequencies:
[formula for $AAC(i,j)$ not legible in the source text]
where i and j denote two amino acids and $N_{i,j} = \sum_{s} n_{ij}^{s}$ is the number of contacts between i and j; calculate the substitution matrix $SMR_{i,l} = AAC(i, A_l)$, where i = 1, ..., 20 is one of the twenty amino acid types, l = 1, ..., L is one of the L positions in the given protein sequence, and $A_l$ is the amino acid type at position l; through this step, a 20×L substitution matrix SMR is obtained;
Step (8): Use the histogram of oriented gradients (HOG) feature extraction algorithm to extract features of the amino acid sequences;
Step (9): Apply singular value decomposition to the transpose of the SMR matrix, which yields 20 right singular vectors;
Step (10): Input the feature values obtained from steps 1 to 9 into a random forest model for prediction to obtain the interaction between the two proteins.
2. The method for predicting protein interactions based on multivariate mutual information and residue binding energy according to claim 1, characterized in that the specific calculation steps of step (6) are as follows:
Step (6.1): Calculate the Moreau-Broto autocorrelation feature values; the calculation formula is:
$$NMBA(lag, p) = \frac{1}{L - lag}\sum_{l=1}^{L-lag} X_{l,p} \times X_{l+lag,p}$$
where lag is the distance between residues, p is the p-th physical and chemical property of the above-mentioned natural amino acids, l is the position in the sequence, l = 1, 2, ..., L-lag, and lag = 1, 2, ..., lg; with six physical and chemical properties, lg×6 feature values are obtained;
Step (6.2): The obtained lg×6 feature values are normalized;
Step (6.3): Count the frequencies of the 20 amino acids in the sequence.
3. The method for predicting protein interactions based on multivariate mutual information and residue binding energy according to claim 1, characterized in that the specific calculation process of step (8) is as follows:
Step (8.1): Calculate the gradient values $G_h(i,l)$ and $G_v(i,l)$ in the horizontal and vertical directions; the calculation formulas are:
$$G_h(i,l) = \begin{cases} SMR(i+1,l) - SMR(i-1,l), & 1 < i < 20 \\ 0 - SMR(i-1,l), & i = 20 \\ SMR(i+1,l) - 0, & i = 1 \end{cases}$$
$$G_v(i,l) = \begin{cases} SMR(i,l+1) - SMR(i,l-1), & 1 < l < L \\ 0 - SMR(i,l-1), & l = L \\ SMR(i,l+1) - 0, & l = 1 \end{cases}$$
Step (8.2): Calculate the gradient magnitude: $$G(i,l) = \sqrt{G_h(i,l)^2 + G_v(i,l)^2}$$
Step (8.3): Calculate the gradient direction from $G_h(i,l)$ and $G_v(i,l)$;
Step (8.4): Divide the gradient magnitude matrix and the gradient direction matrix into nine sub-matrices of the same size;
Step (8.5): Compile the histogram over the gradient directions, and take the histogram value for each gradient direction as a feature value.
Through the above steps, each sequence yields x feature values, and the two sequences yield 2x feature values in total.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
LU502739A LU502739B1 (en) | 2022-08-31 | 2022-08-31 | A Prediction Method of Interaction Between Multi-Information and Residue Binding Energy Protein |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
LU502739A LU502739B1 (en) | 2022-08-31 | 2022-08-31 | A Prediction Method of Interaction Between Multi-Information and Residue Binding Energy Protein |
Publications (1)
Publication Number | Publication Date |
---|---|
LU502739B1 (en) | 2024-02-29 |
Family
ID=90195306
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
LU502739A LU502739B1 (en) | 2022-08-31 | 2022-08-31 | A Prediction Method of Interaction Between Multi-Information and Residue Binding Energy Protein |
Country Status (1)
Country | Link |
---|---|
LU (1) | LU502739B1 (en) |
-
2022
- 2022-08-31 LU LU502739A patent/LU502739B1/en active IP Right Grant
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FG | Patent granted |
Effective date: 20240229 |