CN110245157A

CN110245157A - Data difference analysis method and system based on kernel density estimation

Info

Publication number: CN110245157A
Application number: CN201910471042.6A
Authority: CN
Inventors: 薛宁; 宁万山; 许浩东; 邓万锟; 郭亚萍
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2019-05-31
Filing date: 2019-05-31
Publication date: 2019-09-17
Anticipated expiration: 2039-05-31
Also published as: CN110245157B

Abstract

The invention discloses a data difference analysis method and system based on kernel density estimation, belonging to the field of data analysis. The method first establishes a data set in which the data undergo a change, then estimates the joint probability density of the data before and after the change by kernel density estimation. The optimal window width is selected by the maximum likelihood method: for each candidate window width, one point is taken from the data set at a time, a joint probability distribution is constructed from the remaining points in the data set, and the joint probability density value of the held-out point in that distribution is computed; the product of the resulting joint probability density values is the likelihood value, and the window width that maximizes the likelihood value is the optimal window width. Using the optimal window width, the joint probability density distribution of the data before and after the change is obtained by kernel density estimation, and the differences in the data are analyzed. The method obtains the significance of each data point without being constrained by the data distribution, and is used to find data that change significantly.

Description

Data difference analysis method and system based on kernel density estimation

Technical field

The present invention relates to the field of data analysis, and more particularly to a data difference analysis method and system based on kernel density estimation.

Background art

Data that change significantly are often critical. For example, proteomic profiling gives the expression level of each protein in an experimental group and a control group, and proteins with significantly different expression may play a key regulatory role in the process under study. Differential proteins are usually identified by fold change, on the assumption that the larger the fold change, the more significant the difference. In most cases, however, this assumption does not hold: a change from 1 to 2 and a change from 10 to 20 are both 2-fold changes, but this does not mean the two differences are equally significant. As another example, among amino acid mutations that affect the modification state of a protein, those that significantly change the modification state before and after mutation are often the more important ones. Omar et al. developed a method (MIMP) for predicting the effect of mutations on phosphorylation. However, the formula MIMP uses to calculate the joint probability is invalid for independent two-dimensional variables, and the method cannot calculate the statistical significance level of the effect of a mutation on phosphorylation. At present there is no good solution to these problems, so developing new methods to address them is essential. The present invention provides a data difference analysis method based on kernel density estimation that is statistically meaningful and applies regardless of how the data are distributed.

Summary of the invention

The present invention solves the technical problem that data difference analysis methods in the prior art are both constrained by the data distribution and lacking in statistical meaning. The invention obtains the joint probability density distribution of the data before and after the change by kernel density estimation, and then judges the significance of the data change by hypothesis testing. The method obtains the significance of each data point without being constrained by the data distribution, and is used to find data that change significantly.

According to a first aspect of the present invention, a data difference analysis method based on kernel density estimation is provided, comprising the following steps:

(1) Denote the number of data groups in the data set as n, where n is a positive integer. Each group of data contains a value before the change and the corresponding value after the change; denote the value before the change as x and the value after the change as y. Establish a coordinate system U with the data before the change as the abscissa and the data after the change as the ordinate. The coordinate point corresponding to any group of data is (x_i, y_i), where 1 ≤ i ≤ n;

(2) Estimate the joint probability density distribution of the data before and after the change using a kernel density estimation method based on a Gaussian kernel, using the formula [formula not reproduced in the source text], where h is the window width, n is the number of data groups in the data set, and f(x, y) is the probability density value at any point (x, y) in coordinate system U. Select the optimal h by the maximum likelihood method, specifically: for each candidate h, take one point at a time from the coordinate points of the data set in coordinate system U, construct a joint probability distribution from the remaining n-1 points, and compute the joint probability density value of the held-out point in that distribution, which yields n joint probability density values; the product of these n values is the likelihood value, and the h that maximizes the likelihood value is the optimal h. Substitute the optimal h into the formula and use all coordinate points of the data set in coordinate system U to construct the optimal joint probability distribution;

(3) Fix the value x of the data before the change and obtain, for that fixed x, the probability density distribution of the data y after the change within the optimal joint probability distribution of step (2). First, for the fixed x, establish a coordinate system U' with the values of y as the X' axis and the probability density at that fixed x under the optimal h as the Y' axis. Then, for any group of data (x_i, y_i) in the data set, obtain the probability density distribution of the post-change value y given x_i, and determine the direction and degree of change of (x_i, y_i) from the position of y_i on the X' axis of U', specifically: take a point on the X' axis of U' and draw through it a straight line perpendicular to the X' axis, which divides the area enclosed by the density curve and the X' axis into a left part and a right part; denote this point as y_0. If y_i is greater than y_0, the change of data point (x_i, y_i) is an up-regulation, and its significance P is the ratio of the area under the distribution where y > y_i to the total area enclosed by the density curve and the X' axis. If y_i is less than y_0, the change is a down-regulation, and its significance P is the ratio of the area under the distribution where y < y_i to the total area enclosed by the density curve and the X' axis. If y_i equals y_0, the data point (x_i, y_i) has not changed.
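Steps (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration and not the patented implementation: the formula of step (2) is not reproduced in the text, so the sketch assumes the standard isotropic two-dimensional Gaussian kernel with a single window width h. The function names (`kde_density`, `upper_tail_ratio`, `change_significance`) are hypothetical, and the reference point `y0` is passed in by the caller because the text does not state how it is chosen.

```python
import math

def kde_density(x, y, points, h):
    """Joint density f(x, y), ASSUMING an isotropic 2-D Gaussian kernel of width h."""
    n = len(points)
    total = 0.0
    for xj, yj in points:
        d2 = (x - xj) ** 2 + (y - yj) ** 2
        total += math.exp(-d2 / (2.0 * h * h))
    return total / (n * 2.0 * math.pi * h * h)

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def upper_tail_ratio(x_i, y_i, points, h):
    """Area of the slice f(x_i, y) lying above y_i, divided by the total area
    of that slice (integrated over the whole real line, an assumption)."""
    # Each data point contributes a Gaussian in y, weighted by its closeness in x.
    weights = [math.exp(-((x_i - xj) ** 2) / (2.0 * h * h)) for xj, _ in points]
    total = sum(weights)  # > 0 whenever x_i is itself one of the data points
    above = sum(w * (1.0 - normal_cdf((y_i - yj) / h))
                for w, (_, yj) in zip(weights, points))
    return above / total

def change_significance(x_i, y_i, points, h, y0):
    """Return (direction, P) for one data point, following step (3)."""
    upper = upper_tail_ratio(x_i, y_i, points, h)
    if y_i > y0:
        return "up", upper          # area where y > y_i over total area
    if y_i < y0:
        return "down", 1.0 - upper  # area where y < y_i over total area
    return "unchanged", None
```

Because the kernel is Gaussian, the area ratio has a closed form in terms of the normal cumulative distribution function, so no numerical integration is needed in this sketch.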

Preferably, in step (1), each group of data in the data set consists of the probability values that an amino acid site is modified before and after a mutation occurs in at least one amino acid around that site.

Preferably, in step (1), each group of data in the data set consists of the probability values that a lysine site is succinylated before and after a missense mutation occurs in at least one of the N amino acids on each side of the lysine site, where N is an integer with 0 < N ≤ 50.

Preferably, N satisfies 5 ≤ N ≤ 15.

Preferably, the data set of step (1) consists of the RNA levels or protein expression levels produced by cells before and after drug treatment.

Preferably, n is greater than or equal to 1000.

According to another aspect of the present invention, a data difference analysis system based on kernel density estimation is provided, comprising:

Data set establishing module: this module establishes the data set whose differences are to be analyzed. Denote the number of data groups in the data set as n, where n is a positive integer. Each group of data contains a value before the change and the corresponding value after the change; denote the value before the change as x and the value after the change as y. Establish a coordinate system U with the data before the change as the abscissa and the data after the change as the ordinate. The coordinate point corresponding to any group of data is (x_i, y_i), where 1 ≤ i ≤ n;

Optimal window width computing module: this module calculates the optimal window width h and obtains the optimal joint probability distribution. The joint probability density distribution of the data before and after the change is estimated using a kernel density estimation method based on a Gaussian kernel, using the formula [formula not reproduced in the source text], where h is the window width, n is the number of data groups in the data set, and f(x, y) is the probability density value at any point (x, y) in coordinate system U. The optimal h is selected by the maximum likelihood method, specifically: for each candidate h, take one point at a time from the coordinate points of the data set in coordinate system U, construct a joint probability distribution from the remaining n-1 points, and compute the joint probability density value of the held-out point in that distribution, which yields n joint probability density values; the product of these n values is the likelihood value, and the h that maximizes the likelihood value is the optimal h. Substitute the optimal h into the formula and use all coordinate points of the data set in coordinate system U to construct the optimal joint probability distribution;

Data difference analysis module: this module analyzes the differences in the data set before and after the change. Fix the value x of the data before the change and obtain, for that fixed x, the probability density distribution of the data y after the change within the optimal joint probability distribution. First, for the fixed x, establish a coordinate system U' with the values of y as the X' axis and the probability density at that fixed x under the optimal h as the Y' axis. Then, for any group of data (x_i, y_i) in the data set, obtain the probability density distribution of the post-change value y given x_i, and determine the direction and degree of change of (x_i, y_i) from the position of y_i on the X' axis of U', specifically: take a point on the X' axis of U' and draw through it a straight line perpendicular to the X' axis, which divides the area enclosed by the density curve and the X' axis into a left part and a right part; denote this point as y_0. If y_i is greater than y_0, the change of data point (x_i, y_i) is an up-regulation, and its significance P is the ratio of the area under the distribution where y > y_i to the total area enclosed by the density curve and the X' axis. If y_i is less than y_0, the change is a down-regulation, and its significance P is the ratio of the area under the distribution where y < y_i to the total area enclosed by the density curve and the X' axis. If y_i equals y_0, the data point (x_i, y_i) has not changed.

In general, compared with the prior art, the above technical solutions conceived by the present invention mainly have the following technical advantages:

(1) The invention discloses a data difference analysis method based on kernel density estimation. The method is statistically meaningful and applies regardless of how the data are distributed, without restrictive conditions, which helps people find the key items among data that change.

(2) The principle of the invention is as follows: a change from 1 to 2 and a change from 10 to 20 are both 2-fold changes, but this does not mean the two differences are equally significant. A change from 1 to 3, however, is more significant than a change from 1 to 2. Based on this principle, we estimate, for each data value before the change, the probability density distribution of the corresponding data after mutation, and use it to assess the significance of the difference before and after mutation.

(3) The value of h in the joint probability density formula of the invention affects the quality of the estimated joint probability density distribution. To obtain the best estimate, the invention selects the optimal h by the maximum likelihood method: for each candidate h (0 < h < 1), one point is taken from the data set at a time, a joint probability distribution is constructed from the remaining n-1 points in the data set, and the joint probability density value of the held-out point in that distribution is computed, giving n joint probability density values. The likelihood value is the product of these n values, and the h that maximizes it is the optimal h, because the probability density distribution under that h is the most likely to match the actual distribution.

(4) The invention fixes each pre-change data value x and obtains the probability density distribution of the post-change data value y given that x. Hypothesis testing is carried out using this distribution; data with P-value < 0.05 are generally regarded as significantly changed. An increase relative to the value before the change is regarded as an up-regulation; otherwise it is a down-regulation.
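The window width selection described in advantage (3) amounts to leave-one-out maximum likelihood. The sketch below is a hedged illustration, assuming the standard isotropic two-dimensional Gaussian kernel (the patent's own formula is not reproduced in the text) and a caller-supplied list of candidate h values. It sums logarithms instead of multiplying the n densities, which selects the same h but avoids numerical underflow; the function names are hypothetical.

```python
import math

def loo_log_likelihood(points, h):
    """Log of the product of leave-one-out densities for window width h.

    ASSUMES an isotropic 2-D Gaussian kernel; each point is scored against
    the density built from the other n-1 points.
    """
    n = len(points)
    log_norm = -math.log((n - 1) * 2.0 * math.pi * h * h)
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        s = 0.0
        for j, (xj, yj) in enumerate(points):
            if j != i:  # hold out point i
                d2 = (xi - xj) ** 2 + (yi - yj) ** 2
                s += math.exp(-d2 / (2.0 * h * h))
        if s == 0.0:
            # The held-out point has zero density under this h.
            return float("-inf")
        total += log_norm + math.log(s)
    return total

def select_bandwidth(points, candidates):
    """Return the candidate h with the largest leave-one-out likelihood."""
    return max(candidates, key=lambda h: loo_log_likelihood(points, h))
```

For example, `select_bandwidth(points, [k / 1000 for k in range(1, 1000)])` would scan 0 < h < 1 in steps of 0.001. Each evaluation costs on the order of n² kernel evaluations, so a pure-Python loop like this is only practical for small n; a vectorized or tree-based implementation would be needed for data sets of the size used in Embodiment 1.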

Brief description of the drawings

Fig. 1 is a flow chart of the method of the present invention.

Fig. 2 shows the enrichment of the 218 genes containing KsuMs in the cancer gene and drug target gene data sets.

Detailed description of embodiments

To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.

Embodiment 1

We applied the method of the invention to predict mutations that significantly affect existing succinylation sites. This helps to discover genes that influence cancer by altering the succinylation network, and provides deeper insight into disease biology and treatment development. For the analysis of the effect of mutations on succinylation, we collected 1,779,214 missense mutations from 11,659 tumor samples across 33 major cancer types/subtypes in the cancer genome database The Cancer Genome Atlas (TCGA). Of these, 63,693 missense mutations (KsuMs) occur in the vicinity of lysine sites (10 amino acids on each side). As shown in Fig. 1, we used a succinylation site prediction platform to compute a probability score for each of the 63,693 peptides containing KsuMs; the probability score reflects the degree of succinylation of the site. Then the Parzen window method based on a Gaussian kernel was used to estimate the joint probability density of the Bayesian posterior probabilities before and after mutation:

[formula not reproduced in the source text]

where h is the window width and n is the number of KsuMs; here, n = 63,693. The choice of h determines the quality of the probability density estimate. We select the optimal h by the maximum likelihood method: for each candidate h, one point is taken at a time, the joint probability density is estimated from the other n-1 points, and the probability density value of the held-out point is computed, finally giving n probability density values. The likelihood value is the product of these n probability density values:

f((x_1, y_1), (x_2, y_2), ..., (x_n, y_n) | h) = f((x_1, y_1) | h) × f((x_2, y_2) | h) × … × f((x_n, y_n) | h)

The h that maximizes the likelihood value is the optimal h; here the optimal h = 0.018.
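For orientation, the estimator and the leave-one-out likelihood can be written out as follows. The patent's own formula is not reproduced in the text, so the expressions below are the standard isotropic two-dimensional Gaussian kernel estimator and should be read as an assumption about its form, not as the claimed formula; the symbols $\hat f_h$, $\hat f_{h,-i}$, $L(h)$ and $h^{*}$ are introduced here for illustration.

```latex
\hat f_h(x, y) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2\pi h^{2}}
  \exp\!\left( -\frac{(x - x_i)^{2} + (y - y_i)^{2}}{2h^{2}} \right)

\hat f_{h,-i}(x_i, y_i) = \frac{1}{n-1} \sum_{j \neq i} \frac{1}{2\pi h^{2}}
  \exp\!\left( -\frac{(x_i - x_j)^{2} + (y_i - y_j)^{2}}{2h^{2}} \right)

L(h) = \prod_{i=1}^{n} \hat f_{h,-i}(x_i, y_i), \qquad
h^{*} = \arg\max_{h} L(h)
```

Here $\hat f_{h,-i}(x_i, y_i)$ plays the role of $f((x_i, y_i) \mid h)$ in the likelihood above: the density at point i estimated from the other n-1 points.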

Finally, the probability density distribution is shown in Fig. 2. Fixing x, we obtain the probability density distribution of y given that x, and carry out hypothesis testing with P-value < 0.05 as the threshold to obtain the KsuMs that significantly enhance or weaken succinylation after mutation. For up-regulation we require the posterior probability after mutation to be greater than 0.5, to ensure that the site is succinylated after mutation; for down-regulation we require the posterior probability before mutation to be greater than 0.5, to ensure that the site is succinylated before mutation. This finally yields 306 KsuMs that significantly weaken succinylation and 64 KsuMs that significantly enhance succinylation, located on 218 genes.
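The filtering rule in this paragraph can be expressed as a small function. This is only a sketch of the thresholds stated above (P-value < 0.05 and posterior probability > 0.5); the function name, argument names, and the returned labels are hypothetical, and the direction and P-value are assumed to have been computed beforehand as described in step (3).

```python
def classify_ksum(x_before, y_after, direction, p_value,
                  alpha=0.05, min_prob=0.5):
    """Label one KsuM using the thresholds given in Embodiment 1.

    x_before / y_after: predicted succinylation probability before / after mutation.
    direction: "up" or "down", as determined in step (3).
    """
    if p_value >= alpha:
        return "not significant"
    # Up-regulation: the site must be succinylated after the mutation.
    if direction == "up" and y_after > min_prob:
        return "enhanced"
    # Down-regulation: the site must be succinylated before the mutation.
    if direction == "down" and x_before > min_prob:
        return "weakened"
    # Significant change, but the site fails the 0.5 probability filter.
    return "filtered out"
```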

As shown in Fig. 2, the 218 genes were mapped to the 719 cancer genes in the Cancer Gene Census (CGC) database and to the 2,921 drug target genes in the drug target database DrugBank. Hypergeometric analysis showed significant enrichment in both data sets, with enrichment of 2.62-fold (P-value = 3.03E-04) and 4.15-fold (P-value = 1.20E-44), respectively. This implies that the 218 succinylation genes are strongly associated with cancer, and also indicates that our results are reliable.
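A hypergeometric enrichment test of the kind mentioned above can be sketched as follows. The text does not give the background gene count or the overlap counts, so the function takes them as arguments and the example call uses small made-up numbers purely to show the arithmetic; the function name is hypothetical.

```python
from math import comb

def hypergeom_enrichment(background, reference, query, overlap):
    """Fold enrichment and upper-tail hypergeometric P-value.

    background: total number of genes considered
    reference:  genes in the reference set (e.g. a cancer gene list)
    query:      genes in the query list
    overlap:    query genes that are also in the reference set
    """
    # Observed fraction of query genes in the reference set over the expected fraction.
    fold = (overlap / query) / (reference / background)
    # P(X >= overlap) when drawing `query` genes without replacement.
    denom = comb(background, query)
    p_value = sum(
        comb(reference, k) * comb(background - reference, query - k)
        for k in range(overlap, min(reference, query) + 1)
    ) / denom
    return fold, p_value

# Made-up toy numbers, not taken from the patent:
fold, p_value = hypergeom_enrichment(background=10, reference=5, query=2, overlap=2)
```

The binomial coefficients are computed with exact integers, so the P-value stays accurate even when it is extremely small.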

Those skilled in the art will readily appreciate that the foregoing describes only preferred embodiments of the present invention and is not intended to limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A data difference analysis method based on kernel density estimation, characterized in that it comprises the following steps:

(1) Denote the number of data groups in the data set as n, where n is a positive integer. Each group of data contains a value before the change and the corresponding value after the change; denote the value before the change as x and the value after the change as y. Establish a coordinate system U with the data before the change as the abscissa and the data after the change as the ordinate. The coordinate point corresponding to any group of data is (x_i, y_i), where 1 ≤ i ≤ n;

(2) Estimate the joint probability density distribution of the data before and after the change using a kernel density estimation method based on a Gaussian kernel, using the formula [formula not reproduced in the source text], where h is the window width, n is the number of data groups in the data set, and f(x, y) is the probability density value at any point (x, y) in coordinate system U. Select the optimal h by the maximum likelihood method, specifically: for each candidate h, take one point at a time from the coordinate points of the data set in coordinate system U, construct a joint probability distribution from the remaining n-1 points, and compute the joint probability density value of the held-out point in that distribution, which yields n joint probability density values; the product of these n values is the likelihood value, and the h that maximizes the likelihood value is the optimal h. Substitute the optimal h into the formula and use all coordinate points of the data set in coordinate system U to construct the optimal joint probability distribution;

(3) Fix the value x of the data before the change and obtain, for that fixed x, the probability density distribution of the data y after the change within the optimal joint probability distribution of step (2). First, for the fixed x, establish a coordinate system U' with the values of y as the X' axis and the probability density at that fixed x under the optimal h as the Y' axis. Then, for any group of data (x_i, y_i) in the data set, obtain the probability density distribution of the post-change value y given x_i, and determine the direction and degree of change of (x_i, y_i) from the position of y_i on the X' axis of U', specifically: take a point on the X' axis of U' and draw through it a straight line perpendicular to the X' axis, which divides the area enclosed by the density curve and the X' axis into a left part and a right part; denote this point as y_0. If y_i is greater than y_0, the change of data point (x_i, y_i) is an up-regulation, and its significance P is the ratio of the area under the distribution where y > y_i to the total area enclosed by the density curve and the X' axis. If y_i is less than y_0, the change is a down-regulation, and its significance P is the ratio of the area under the distribution where y < y_i to the total area enclosed by the density curve and the X' axis. If y_i equals y_0, the data point (x_i, y_i) has not changed.

2. The data difference analysis method based on kernel density estimation according to claim 1, characterized in that, in step (1), each group of data in the data set consists of the probability values that an amino acid site is modified before and after a mutation occurs in at least one amino acid around that site.

3. The data difference analysis method based on kernel density estimation according to claim 2, characterized in that, in step (1), each group of data in the data set consists of the probability values that a lysine site is succinylated before and after a missense mutation occurs in at least one of the N amino acids on each side of the lysine site, where N is an integer with 0 < N ≤ 50.

4. The data difference analysis method based on kernel density estimation according to claim 3, characterized in that N satisfies 5 ≤ N ≤ 15.

5. The data difference analysis method based on kernel density estimation according to claim 1, characterized in that the data set of step (1) consists of the RNA levels or protein expression levels produced by cells before and after drug treatment.

6. The data difference analysis method based on kernel density estimation according to claim 1, characterized in that n is greater than or equal to 1000.

7. A data difference analysis system based on kernel density estimation, characterized in that it comprises:

Data set establishing module: this module establishes the data set whose differences are to be analyzed. Denote the number of data groups in the data set as n, where n is a positive integer. Each group of data contains a value before the change and the corresponding value after the change; denote the value before the change as x and the value after the change as y. Establish a coordinate system U with the data before the change as the abscissa and the data after the change as the ordinate. The coordinate point corresponding to any group of data is (x_i, y_i), where 1 ≤ i ≤ n;

Optimal window width computing module: this module calculates the optimal window width h and obtains the optimal joint probability distribution. The joint probability density distribution of the data before and after the change is estimated using a kernel density estimation method based on a Gaussian kernel, using the formula [formula not reproduced in the source text], where h is the window width, n is the number of data groups in the data set, and f(x, y) is the probability density value at any point (x, y) in coordinate system U. The optimal h is selected by the maximum likelihood method, specifically: for each candidate h, take one point at a time from the coordinate points of the data set in coordinate system U, construct a joint probability distribution from the remaining n-1 points, and compute the joint probability density value of the held-out point in that distribution, which yields n joint probability density values; the product of these n values is the likelihood value, and the h that maximizes the likelihood value is the optimal h. Substitute the optimal h into the formula and use all coordinate points of the data set in coordinate system U to construct the optimal joint probability distribution;

Data difference analysis module: this module analyzes the differences in the data set before and after the change. Fix the value x of the data before the change and obtain, for that fixed x, the probability density distribution of the data y after the change within the optimal joint probability distribution. First, for the fixed x, establish a coordinate system U' with the values of y as the X' axis and the probability density at that fixed x under the optimal h as the Y' axis. Then, for any group of data (x_i, y_i) in the data set, obtain the probability density distribution of the post-change value y given x_i, and determine the direction and degree of change of (x_i, y_i) from the position of y_i on the X' axis of U', specifically: take a point on the X' axis of U' and draw through it a straight line perpendicular to the X' axis, which divides the area enclosed by the density curve and the X' axis into a left part and a right part; denote this point as y_0. If y_i is greater than y_0, the change of data point (x_i, y_i) is an up-regulation, and its significance P is the ratio of the area under the distribution where y > y_i to the total area enclosed by the density curve and the X' axis. If y_i is less than y_0, the change is a down-regulation, and its significance P is the ratio of the area under the distribution where y < y_i to the total area enclosed by the density curve and the X' axis. If y_i equals y_0, the data point (x_i, y_i) has not changed.