CN110245157A - Data difference analysis method and system based on probability density estimation - Google Patents

Publication number
CN110245157A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910471042.6A
Other languages
Chinese (zh)
Other versions
CN110245157B (en)
Inventor
薛宁
宁万山
许浩东
邓万锟
郭亚萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-05-31
Filing date
2019-05-31
Publication date
2019-09-17
Application filed by Huazhong University of Science and Technology
Priority to CN201910471042.6A
Publication of CN110245157A
Application granted
Publication of CN110245157B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Fuzzy Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Biochemistry (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Library & Information Science (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a data difference analysis method and system based on probability density estimation, belonging to the field of data analysis. The method first establishes a data set whose data have undergone a change, then estimates the joint probability density of the data before and after the change by probability density estimation. The optimal window width is selected by the maximum likelihood method: for each candidate window width, one point of the data set is taken at a time, a joint probability distribution is constructed from the remaining points of the data set, and the joint probability density value of the held-out point in that distribution is calculated; the product of the resulting joint probability density values is the likelihood value, and the window width that maximizes the likelihood value is the best window width. With the best window width, the joint probability density distribution of the data before and after the change is obtained by probability density estimation, and the differences in the data are analyzed. The method can obtain the significance of each data item without being limited by the data distribution, and is used to find significantly changed data.

Description

Data difference analysis method and system based on probability density estimation
Technical field
The present invention relates to the field of data analysis, and more particularly to a data difference analysis method and system based on probability density estimation.
Background
Data that change significantly are often of key importance. For example, with proteomics profiling technology we can obtain the expression level of each protein in an experimental group and a control group, and proteins whose expression differs significantly may play a key regulatory role in the process under study. Differential proteins are usually identified by fold change, on the assumption that a larger fold change means a more significant difference. In most cases, however, this assumption does not hold: a change from 1 to 2 and a change from 10 to 20 are both two-fold changes, but that does not mean their differences are equally significant. As another example, among amino acid mutations that affect the modification state of a protein, mutations that significantly change the modification state are often the more important ones. Omar et al. developed a method (MIMP) for predicting the effect of mutations on phosphorylation. However, the formula for the joint probability in MIMP is invalid for independent two-dimensional variables, and the method cannot compute the statistical significance level of the effect of a mutation on phosphorylation. At present there is no good solution to problems of this kind, so developing a new method to solve them is critical. The present invention provides a data difference analysis method based on probability density estimation; the method has statistical significance and applies no matter how the data are distributed.
Summary of the invention
The present invention solves the technical problem that data difference analysis methods in the prior art are both limited by the data distribution and lack statistical meaning. The invention obtains the joint probability density distribution of the data before and after the change by probability density estimation, and then judges the significance of the change by hypothesis testing. The method can obtain the significance of each data item without being limited by the data distribution, and is used to find significantly changed data.
According to a first aspect of the invention, a data difference analysis method based on probability density estimation is provided, comprising the following steps:
(1) Denote the number of groups of data in the data set as n, where n is a positive integer. Each group of data contains a value before the change and the corresponding value after the change; denote the value before the change as x and the value after the change as y. Establish a coordinate system U with the pre-change data as the abscissa and the post-change data as the ordinate; the coordinate point corresponding to any group of data is (x_i, y_i), where 1 ≤ i ≤ n;
(2) Estimate the joint probability density distribution of the data before and after the change using a probability density estimation method based on a Gaussian kernel. The formula used is: [formula not reproduced in this text], where h is the window width, n is the number of groups of data in the data set, and f(x, y) is the probability density value at any point (x, y) in coordinate system U. Select the optimal h by the maximum likelihood method, specifically: for each candidate h, take one of the coordinate points of the data set in U at a time, construct a joint probability distribution from the remaining n-1 points, and calculate the joint probability density value of the held-out point in that distribution, which yields n joint probability density values; the product of these n values is the likelihood value, and the h that maximizes the likelihood value is the best h. Substitute the best h into the formula, then use all coordinate points of the data set in U to construct the best joint probability distribution;
(3) Fix the value x before the change and obtain, for that fixed x, the probability density distribution of the value y after the change within the best joint probability distribution of step (2). First, for the fixed x, establish a coordinate system U' with the values of y as the X' axis and the probability density at that fixed x under the best h as the Y' axis. Then, for any group of data (x_i, y_i) in the data set, obtain the probability density distribution of the post-change value y given x_i, and determine the direction and degree of change of (x_i, y_i) from the position of y_i on the X' axis of U'. Specifically: take a point on the X' axis of U' and draw through it a straight line perpendicular to the X' axis; this line divides the area enclosed by the density curve and the X' axis into a left part and a right part; denote this point as y_0. If y_i is greater than y_0, the change of data point (x_i, y_i) is an up-regulation, and its significance P is the area of the distribution where y > y_i divided by the total area enclosed by the density curve and the X' axis. If y_i is less than y_0, the change of data point (x_i, y_i) is a down-regulation, and its significance P is the area of the distribution where y < y_i divided by the total area enclosed by the density curve and the X' axis. If y_i equals y_0, data point (x_i, y_i) has not changed.
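The window-width selection in step (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the formula itself is missing from this text, so the sketch assumes the textbook isotropic Gaussian Parzen estimator with a single window width h, and the names `kde2d`, `loo_log_likelihood` and `best_window_width` are hypothetical. It sums log densities rather than multiplying densities, which picks the same h while avoiding floating-point underflow when n is large.

```python
import math


def kde2d(x, y, points, h):
    """Gaussian-kernel Parzen estimate of the joint density at (x, y).

    Assumed form: mean of isotropic 2-D Gaussians with standard deviation h.
    """
    n = len(points)
    total = 0.0
    for xi, yi in points:
        sq_dist = (x - xi) ** 2 + (y - yi) ** 2
        total += math.exp(-sq_dist / (2.0 * h * h))
    return total / (n * 2.0 * math.pi * h * h)


def loo_log_likelihood(points, h):
    """Log of the likelihood value: each point is scored against the other n-1."""
    log_lik = 0.0
    for i, (x, y) in enumerate(points):
        rest = points[:i] + points[i + 1:]
        density = kde2d(x, y, rest, h)
        # Guard against log(0) when h is so small that the density underflows.
        log_lik += math.log(max(density, 1e-300))
    return log_lik


def best_window_width(points, candidates):
    """Return the candidate h with the largest leave-one-out likelihood."""
    return max(candidates, key=lambda h: loo_log_likelihood(points, h))
```

A call such as `best_window_width(points, [0.01, 0.02, 0.05, 0.1])` scans a hypothetical grid of candidates. The double loop costs on the order of n² kernel evaluations per candidate, so a data set of realistic size would need a vectorized or approximate variant.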
Preferably, any group of data in the data set of step (1) is the probability that an amino acid site is modified, before and after mutation of at least one amino acid around that site.
Preferably, any group of data in the data set of step (1) is the probability that a lysine site is succinylated, before and after a missense mutation of at least one amino acid among the N amino acids on each side of the lysine site; N is an integer with 0 < N ≤ 50.
Preferably, the value range of N is 5 ≤ N ≤ 15.
Preferably, the data set of step (1) is data on the level of RNA produced or protein expressed by cells before and after the cells are treated with a drug.
Preferably, n is greater than or equal to 1000.
According to another aspect of the invention, a data difference analysis system based on probability density estimation is provided, comprising:
Data set establishing module: this module is used to establish the data set whose differences are to be analyzed. Denote the number of groups of data in the data set as n, where n is a positive integer. Each group of data contains a value before the change and the corresponding value after the change; denote the value before the change as x and the value after the change as y. Establish a coordinate system U with the pre-change data as the abscissa and the post-change data as the ordinate; the coordinate point corresponding to any group of data is (x_i, y_i), where 1 ≤ i ≤ n;
Best window width computing module: this module is used to calculate the best window width h and obtain the best joint probability distribution. Estimate the joint probability density distribution of the data before and after the change using a probability density estimation method based on a Gaussian kernel. The formula used is: [formula not reproduced in this text], where h is the window width, n is the number of groups of data in the data set, and f(x, y) is the probability density value at any point (x, y) in coordinate system U. Select the optimal h by the maximum likelihood method, specifically: for each candidate h, take one of the coordinate points of the data set in U at a time, construct a joint probability distribution from the remaining n-1 points, and calculate the joint probability density value of the held-out point in that distribution, which yields n joint probability density values; the product of these n values is the likelihood value, and the h that maximizes the likelihood value is the best h. Substitute the best h into the formula, then use all coordinate points of the data set in U to construct the best joint probability distribution;
Data set difference analysis module: this module is used to analyze the difference in the data of the data set before and after the change. Fix the value x before the change and obtain, for that fixed x, the probability density distribution of the value y after the change within the best joint probability distribution of step (2). First, for the fixed x, establish a coordinate system U' with the values of y as the X' axis and the probability density at that fixed x under the best h as the Y' axis. Then, for any group of data (x_i, y_i) in the data set, obtain the probability density distribution of the post-change value y given x_i, and determine the direction and degree of change of (x_i, y_i) from the position of y_i on the X' axis of U'. Specifically: take a point on the X' axis of U' and draw through it a straight line perpendicular to the X' axis; this line divides the area enclosed by the density curve and the X' axis into a left part and a right part; denote this point as y_0. If y_i is greater than y_0, the change of data point (x_i, y_i) is an up-regulation, and its significance P is the area of the distribution where y > y_i divided by the total area enclosed by the density curve and the X' axis. If y_i is less than y_0, the change of data point (x_i, y_i) is a down-regulation, and its significance P is the area of the distribution where y < y_i divided by the total area enclosed by the density curve and the X' axis. If y_i equals y_0, data point (x_i, y_i) has not changed.
In general, compared with the prior art, the above technical solutions conceived by the present invention mainly have the following technical advantages:
(1) The invention discloses a data difference analysis method based on probability density estimation. The method has statistical significance, applies no matter how the data are distributed, imposes no distributional conditions, and helps people find key items in differentially changed data.
(2) The invention rests on the following observation: a change from 1 to 2 and a change from 10 to 20 are both two-fold changes, but that does not mean their differences are equally significant; a change from 1 to 3, however, is more significant than a change from 1 to 2. Based on this principle, we assess the significance of the difference before and after mutation by estimating, for each pre-change value, the probability density distribution of the corresponding post-mutation data.
(3) The value of h in the joint probability density formula determines the quality of the estimated joint probability density distribution. To obtain the best estimate, the invention selects the optimal h by the maximum likelihood method: for each candidate h (0 < h < 1), take one point of the data set at a time, construct a joint probability distribution from the remaining n-1 points, and calculate the joint probability density value of the held-out point in that distribution, which yields n joint probability density values. The likelihood value is the product of these n values, and the h that maximizes it is the best h, because the probability density distribution under that h is the one most likely to match the actual distribution.
(4) The invention fixes each pre-change value x and obtains the probability density distribution of the post-change value y given that x. A hypothesis test is carried out on this distribution; data with P-value < 0.05 are generally regarded as significantly changed. An increase relative to the value before the change is regarded as an up-regulation; otherwise it is a down-regulation.
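The conditional tail area described above can be sketched in closed form. The sketch assumes the joint density is an isotropic Gaussian-kernel estimate, in which case the density of y at a fixed x is a mixture of Gaussians centred on the y_i and weighted by how close each x_i is to x, so the area ratio reduces to a weighted sum of Gaussian tail probabilities. The function names are hypothetical, and because the text does not say how the reference point y_0 is chosen, it is passed in by the caller.

```python
import math


def upper_tail(x0, y_obs, points, h):
    """Fraction of the conditional density of y at X = x0 lying above y_obs."""
    num = 0.0
    den = 0.0
    for xi, yi in points:
        # Weight of this kernel at the fixed x (Gaussian in the x direction).
        w = math.exp(-((x0 - xi) ** 2) / (2.0 * h * h))
        # Area of the Gaussian N(yi, h^2) that lies above y_obs.
        tail = 0.5 * math.erfc((y_obs - yi) / (h * math.sqrt(2.0)))
        num += w * tail
        den += w
    if den == 0.0:
        raise ValueError("x0 is too far from every data point for this h")
    return num / den


def classify_change(x_obs, y_obs, y0, points, h):
    """Return (direction, P) for one data point; y0 is supplied by the caller."""
    if y_obs == y0:
        return "unchanged", None
    up = upper_tail(x_obs, y_obs, points, h)
    if y_obs > y0:
        return "up", up          # area where y > y_obs
    return "down", 1.0 - up      # area where y < y_obs
```

Dividing by the summed weights plays the role of dividing by the total area under the density curve, so the returned P is already the area ratio.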
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present invention.
Fig. 2 shows the enrichment of the 218 genes containing KsuMs in the cancer gene and drug target gene data sets.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and do not limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.
Embodiment 1
We applied the method of the invention to predict mutations that significantly affect existing succinylation sites. This helps to discover genes that influence cancer by altering the succinylation network, and provides a deeper understanding of disease biology and therapy development. For the analysis of the impact of mutations on succinylation, we integrated from The Cancer Genome Atlas (TCGA) 1,779,214 missense mutations across 33 major cancer types/subtypes and 11,659 tumor samples. Of these, 63,693 missense mutations (KsuMs) occur around lysine sites (10 amino acids on each side). As shown in Fig. 1, we used a succinylation site prediction platform to compute probability scores for the 63,693 peptides containing KsuMs; the probability score reflects the degree of succinylation of the site. We then used the Parzen window method based on a Gaussian kernel to estimate the joint probability density of the Bayesian posterior probabilities before and after mutation:
[formula not reproduced in this text]
where h is the window width and n is the number of KsuMs; here n = 63693. The choice of h determines the quality of the probability density estimate. We select the optimal h by the maximum likelihood method: for each candidate h, take one point at a time, estimate the joint probability density from the other n-1 points, and evaluate the probability density at the held-out point, finally obtaining n probability density values. The likelihood value is the product of the n probability density values: f((x_1, y_1), (x_2, y_2), ..., (x_n, y_n) | h) = f((x_1, y_1) | h) × f((x_2, y_2) | h) × … × f((x_n, y_n) | h). The h that maximizes the likelihood value is the best h; here the best h = 0.018.
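For reference, the estimator and likelihood described above can be written out as follows. The formula itself is missing from this text, so this is the textbook isotropic Gaussian Parzen window form with a single window width h, offered as an assumption rather than as the exact expression in the patent; the hat and the subscript -j (point j held out) are notation introduced here.

```latex
\hat f_h(x, y) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2\pi h^{2}}
  \exp\!\left( -\frac{(x - x_i)^{2} + (y - y_i)^{2}}{2h^{2}} \right)

\hat f_{h,-j}(x_j, y_j) = \frac{1}{n-1} \sum_{i \neq j} \frac{1}{2\pi h^{2}}
  \exp\!\left( -\frac{(x_j - x_i)^{2} + (y_j - y_i)^{2}}{2h^{2}} \right)

L(h) = \prod_{j=1}^{n} \hat f_{h,-j}(x_j, y_j), \qquad
h^{*} = \arg\max_{h} L(h)
```

Under this reading, each factor f((x_j, y_j) | h) in the product is the held-out density of point j, and the reported best h = 0.018 would be the maximizer h*.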
Finally, the probability density distribution is as shown in Fig. 2. Fixing x, we obtained the probability density distribution of y given that x, and carried out a hypothesis test with P-value < 0.05 as the threshold to identify the KsuMs that significantly strengthen or weaken succinylation after mutation. For up-regulated KsuMs we required the posterior probability after mutation to be greater than 0.5, to ensure that the site is succinylated after the mutation; for down-regulated KsuMs we required the posterior probability before mutation to be greater than 0.5, to ensure that the site is succinylated before the mutation. This finally yielded 306 KsuMs that significantly reduce succinylation and 64 KsuMs that significantly increase succinylation, located on 218 genes.
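The two thresholds in this paragraph amount to a simple filter, sketched below. The record layout `(x, y, direction, p_value)` and the function name are hypothetical; x and y stand for the predicted succinylation probabilities before and after mutation, and the direction labels are assumed to come from an earlier classification step.

```python
def select_significant(records, alpha=0.05, min_prob=0.5):
    """Keep significant changes whose site is succinylated on the relevant side.

    records: iterable of (x, y, direction, p_value) tuples, where direction
    is "up", "down" or "unchanged" and p_value may be None.
    """
    kept = []
    for x, y, direction, p_value in records:
        if p_value is None or p_value >= alpha:
            continue  # not significant at the chosen threshold
        if direction == "up" and y > min_prob:
            # Up-regulated: the site must be succinylated after mutation.
            kept.append((x, y, direction, p_value))
        elif direction == "down" and x > min_prob:
            # Down-regulated: the site must be succinylated before mutation.
            kept.append((x, y, direction, p_value))
    return kept
```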
As shown in Fig. 2, the 218 genes were mapped to the 719 cancer genes in the Cancer Gene Census (CGC) database and to the 2921 drug target genes in the drug target database DrugBank. Hypergeometric analysis found significant enrichment in both data sets, with enrichment of 2.62-fold (P-value = 3.03E-04) and 4.15-fold (P-value = 1.20E-44) respectively. This suggests that the 218 succinylation-related genes are strongly associated with cancer, and also indicates that our results are reliable.
Those skilled in the art will readily understand that the above is merely a preferred embodiment of the present invention and is not intended to limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.
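A hypergeometric enrichment test of the kind mentioned above can be sketched with the standard library alone. The function name and argument names are hypothetical, and the usage numbers below are toy values: the text does not give the background gene count or the overlap sizes, so the reported fold changes and P-values cannot be reproduced from it.

```python
import math


def hypergeom_enrichment(background, annotated, selected, overlap):
    """Return (fold enrichment, P(X >= overlap)) for a hypergeometric draw.

    background: total number of genes considered
    annotated:  genes in the reference set (e.g. a cancer gene list)
    selected:   genes in the list being tested
    overlap:    genes common to the selected list and the reference set
    """
    expected = selected * annotated / background
    fold = overlap / expected
    total = math.comb(background, selected)
    upper = min(annotated, selected)
    tail = sum(
        math.comb(annotated, k) * math.comb(background - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return fold, tail / total
```

With toy numbers, `hypergeom_enrichment(10, 4, 3, 2)` tests whether finding 2 reference genes among 3 selected genes out of a background of 10 (4 of them in the reference set) is more than chance would give.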

Claims (7)

1. A data difference analysis method based on probability density estimation, characterized by comprising the following steps:
(1) Denote the number of groups of data in the data set as n, where n is a positive integer. Each group of data contains a value before the change and the corresponding value after the change; denote the value before the change as x and the value after the change as y. Establish a coordinate system U with the pre-change data as the abscissa and the post-change data as the ordinate; the coordinate point corresponding to any group of data is (x_i, y_i), where 1 ≤ i ≤ n;
(2) Estimate the joint probability density distribution of the data before and after the change using a probability density estimation method based on a Gaussian kernel. The formula used is: [formula not reproduced in this text], where h is the window width, n is the number of groups of data in the data set, and f(x, y) is the probability density value at any point (x, y) in coordinate system U. Select the optimal h by the maximum likelihood method, specifically: for each candidate h, take one of the coordinate points of the data set in U at a time, construct a joint probability distribution from the remaining n-1 points, and calculate the joint probability density value of the held-out point in that distribution, which yields n joint probability density values; the product of these n values is the likelihood value, and the h that maximizes the likelihood value is the best h. Substitute the best h into the formula, then use all coordinate points of the data set in U to construct the best joint probability distribution;
(3) Fix the value x before the change and obtain, for that fixed x, the probability density distribution of the value y after the change within the best joint probability distribution of step (2). First, for the fixed x, establish a coordinate system U' with the values of y as the X' axis and the probability density at that fixed x under the best h as the Y' axis. Then, for any group of data (x_i, y_i) in the data set, obtain the probability density distribution of the post-change value y given x_i, and determine the direction and degree of change of (x_i, y_i) from the position of y_i on the X' axis of U'. Specifically: take a point on the X' axis of U' and draw through it a straight line perpendicular to the X' axis; this line divides the area enclosed by the density curve and the X' axis into a left part and a right part; denote this point as y_0. If y_i is greater than y_0, the change of data point (x_i, y_i) is an up-regulation, and its significance P is the area of the distribution where y > y_i divided by the total area enclosed by the density curve and the X' axis. If y_i is less than y_0, the change of data point (x_i, y_i) is a down-regulation, and its significance P is the area of the distribution where y < y_i divided by the total area enclosed by the density curve and the X' axis. If y_i equals y_0, data point (x_i, y_i) has not changed.
2. The data difference analysis method based on probability density estimation according to claim 1, characterized in that any group of data in the data set of step (1) is the probability that an amino acid site is modified, before and after mutation of at least one amino acid around that site.
3. The data difference analysis method based on probability density estimation according to claim 2, characterized in that any group of data in the data set of step (1) is the probability that a lysine site is succinylated, before and after a missense mutation of at least one amino acid among the N amino acids on each side of the lysine site; N is an integer with 0 < N ≤ 50.
4. The data difference analysis method based on probability density estimation according to claim 3, characterized in that the value range of N is 5 ≤ N ≤ 15.
5. The data difference analysis method based on probability density estimation according to claim 1, characterized in that the data set of step (1) is data on the level of RNA produced or protein expressed by cells before and after the cells are treated with a drug.
6. The data difference analysis method based on probability density estimation according to claim 1, characterized in that n is greater than or equal to 1000.
7. A data difference analysis system based on probability density estimation, characterized by comprising:
Data set establishing module: this module is used to establish the data set whose differences are to be analyzed. Denote the number of groups of data in the data set as n, where n is a positive integer. Each group of data contains a value before the change and the corresponding value after the change; denote the value before the change as x and the value after the change as y. Establish a coordinate system U with the pre-change data as the abscissa and the post-change data as the ordinate; the coordinate point corresponding to any group of data is (x_i, y_i), where 1 ≤ i ≤ n;
Best window width computing module: this module is used to calculate the best window width h and obtain the best joint probability distribution. Estimate the joint probability density distribution of the data before and after the change using a probability density estimation method based on a Gaussian kernel. The formula used is: [formula not reproduced in this text], where h is the window width, n is the number of groups of data in the data set, and f(x, y) is the probability density value at any point (x, y) in coordinate system U. Select the optimal h by the maximum likelihood method, specifically: for each candidate h, take one of the coordinate points of the data set in U at a time, construct a joint probability distribution from the remaining n-1 points, and calculate the joint probability density value of the held-out point in that distribution, which yields n joint probability density values; the product of these n values is the likelihood value, and the h that maximizes the likelihood value is the best h. Substitute the best h into the formula, then use all coordinate points of the data set in U to construct the best joint probability distribution;
Data set difference analysis module: this module is used to analyze the difference in the data of the data set before and after the change. Fix the value x before the change and obtain, for that fixed x, the probability density distribution of the value y after the change within the best joint probability distribution of step (2). First, for the fixed x, establish a coordinate system U' with the values of y as the X' axis and the probability density at that fixed x under the best h as the Y' axis. Then, for any group of data (x_i, y_i) in the data set, obtain the probability density distribution of the post-change value y given x_i, and determine the direction and degree of change of (x_i, y_i) from the position of y_i on the X' axis of U'. Specifically: take a point on the X' axis of U' and draw through it a straight line perpendicular to the X' axis; this line divides the area enclosed by the density curve and the X' axis into a left part and a right part; denote this point as y_0. If y_i is greater than y_0, the change of data point (x_i, y_i) is an up-regulation, and its significance P is the area of the distribution where y > y_i divided by the total area enclosed by the density curve and the X' axis. If y_i is less than y_0, the change of data point (x_i, y_i) is a down-regulation, and its significance P is the area of the distribution where y < y_i divided by the total area enclosed by the density curve and the X' axis. If y_i equals y_0, data point (x_i, y_i) has not changed.
CN201910471042.6A 2019-05-31 2019-05-31 Data difference analysis method and system based on probability density estimation Active CN110245157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910471042.6A CN110245157B (en) 2019-05-31 2019-05-31 Data difference analysis method and system based on probability density estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910471042.6A CN110245157B (en) 2019-05-31 2019-05-31 Data difference analysis method and system based on probability density estimation

Publications (2)

Publication Number Publication Date
CN110245157A true CN110245157A (en) 2019-09-17
CN110245157B CN110245157B (en) 2021-06-11

Family

ID=67885806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910471042.6A Active CN110245157B (en) 2019-05-31 2019-05-31 Data difference analysis method and system based on probability density estimation

Country Status (1)

Country Link
CN (1) CN110245157B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103207997A (en) * 2013-04-15 2013-07-17 Zhejiang Icare Vision Technology Co., Ltd. Kernel density estimation-based license plate character segmentation method
CN103776891A (en) * 2013-09-04 2014-05-07 Institute of Computing Technology, Chinese Academy of Sciences Method for detecting differentially-expressed protein
CN106533577A (en) * 2016-10-09 2017-03-22 Nanjing Tech University Non-Gaussian noise suppression method based on energy detection
US20170364664A1 (en) * 2014-02-25 2017-12-21 Flagship Biosciences, Inc. Method for stratifying and selecting candidates for receiving a specific therapeutic approach
CN108763872A (en) * 2018-04-25 2018-11-06 Huazhong University of Science and Technology Method for analyzing and predicting the effect of cancer mutations on LIR motif function
CN109815870A (en) * 2019-01-17 2019-05-28 Huazhong University of Science and Technology High-throughput functional gene screening method and system based on quantitative analysis of cell phenotype images

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103207997A (en) * 2013-04-15 2013-07-17 Zhejiang Icare Vision Technology Co., Ltd. Kernel density estimation-based license plate character segmentation method
CN103776891A (en) * 2013-09-04 2014-05-07 Institute of Computing Technology, Chinese Academy of Sciences Method for detecting differentially-expressed protein
US20170364664A1 (en) * 2014-02-25 2017-12-21 Flagship Biosciences, Inc. Method for stratifying and selecting candidates for receiving a specific therapeutic approach
CN106533577A (en) * 2016-10-09 2017-03-22 Nanjing Tech University Non-Gaussian noise suppression method based on energy detection
CN108763872A (en) * 2018-04-25 2018-11-06 Huazhong University of Science and Technology Method for analyzing and predicting the effect of cancer mutations on LIR motif function
CN109815870A (en) * 2019-01-17 2019-05-28 Huazhong University of Science and Technology High-throughput functional gene screening method and system based on quantitative analysis of cell phenotype images

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BABICH et al.: "Weighted parzen windows for pattern classification", IEEE Transactions on Pattern Analysis & Machine Intelligence *
TAHERZADEH et al.: "Predicting lysine-malonylation sites of proteins using sequence and predicted structural features", Journal of Computational Chemistry *
XU YANG et al.: "WERAM: on eukaryotic histone acetylation and methyl", Second Youth Science and Technology Forum of the Chinese Society of Biotechnology *

Also Published As

Publication number Publication date
CN110245157B (en) 2021-06-11

Similar Documents

Publication Publication Date Title
Miller et al. Characterizing spatial gene expression heterogeneity in spatially resolved single-cell transcriptomic data with nonuniform cellular densities
Chen et al. Convex clustering: An attractive alternative to hierarchical clustering
CN109637579B (en) Tensor random walk-based key protein identification method
Huff et al. Detecting positive selection from genome scans of linkage disequilibrium
Suresh et al. Recurrent neural network for genome sequencing for personalized cancer treatment in precision healthcare
Wang et al. Variational inference for coupled hidden markov models Applied to the Joint Detection of Copy Number Variations
Tran et al. A novel method for single-cell data imputation using subspace regression
Song et al. MiXcan: a framework for cell-type-aware transcriptome-wide association studies with an application to breast cancer
Huo et al. Bayesian latent hierarchical model for transcriptomic meta-analysis to detect biomarkers with clustered meta-patterns of differential expression signals
CN110245157A (en) Data difference analysis method and system based on probability density estimation
Li et al. Evolving spatial clusters of genomic regions from high-throughput chromatin conformation capture data
Nouira et al. Multitask group Lasso for Genome Wide association Studies in diverse populations
Cong et al. Big data driven oriented graph theory aided tagsnps selection for genetic precision therapy
Bhattacharya et al. Effects of gene–environment and gene–gene interactions in case-control studies: A novel Bayesian semiparametric approach
CN111785319A (en) Drug relocation method based on differential expression data
Li et al. A comparative study for identifying the chromosome-wide spatial clusters from high-throughput chromatin conformation capture data
Xia Sequence-based multiscale modeling for high-throughput chromosome conformation capture (Hi-C) data analysis
Joo Bayesian lasso: An extension for genome-wide association study
Boitard et al. Linkage disequilibrium interval mapping of quantitative trait loci
Liu et al. Inferring single-cell copy number profiles through cross-cell segmentation of read counts
Xing et al. High-dimensional sparse structured input-output models, with applications to gwas
He STATISTICAL METHODS TO STUDY TRANSPOSON SEQUENCING DATA: NONPARAMETRIC BAYESIAN MODELS WITH SAMPLING ALGORITHMS
Milite et al. Genotyping Copy Number Alterations from single-cell RNA sequencing
Kang et al. Haplotype assembly from weighted SNP fragments and related genotype information
Ingraham Probabilistic Models of Structure in Biological Sequences

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Xue Yu, Ning Wanshan, Xu Haodong, Deng Wankun, Guo Yaping
Inventor before: Xue Ning, Ning Wanshan, Xu Haodong, Deng Wankun, Guo Yaping
GR01 Patent grant