CN105938524A - Microorganism association network prediction method and apparatus - Google Patents

Microorganism association network prediction method and apparatus

Info

Publication number
CN105938524A
Authority
CN
China
Prior art keywords
microorganism
association
environmental factors
measured value
bayesian model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610266864.7A
Other languages
Chinese (zh)
Inventor
陈挺
陈宁
杨煜清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201610266864.7A
Publication of CN105938524A
Legal status: Pending


Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention provides a microorganism association network prediction method and apparatus. The method comprises obtaining abundances of various microorganisms in metagenome sequencing samples and an environmental factor measurement value corresponding to each metagenome sequencing sample; establishing a hierarchical Bayesian model according to a data generation process of each metagenome sequencing sample and the abundances of the various microorganisms and the corresponding environmental factor measurement values in each metagenome sequencing sample; learning the hierarchical Bayesian model by using a maximum posteriori estimation algorithm, and determining an objective function of the hierarchical Bayesian model; optimizing the objective function; and predicting association between the microorganisms and association between the microorganisms and the environmental factor measurement values by using the optimized hierarchical Bayesian model. Accuracy and practicability of prediction tasks of microorganism association and microorganism and environmental factor association are improved.

Description

Microorganism association network prediction method and apparatus
Technical field
The present invention relates to the technical field of bioinformatics, and in particular to a microorganism association network prediction method and apparatus.
Background art
Interactions among microorganisms and the effects of environmental factors on microorganisms are key research topics in microbiology. Because most microorganisms cannot be cultured in the laboratory, traditional culture-based methods for studying interactions face great difficulty. The development of metagenomic sequencing has made such studies possible: microbial samples are collected (16S rDNA is extracted) and sequenced at high throughput, and the associations among microorganisms, as well as the effects of environmental factors on them, are inferred from changes in the quantities of the microorganisms across samples. With the widespread use of metagenomic sequencing, samples from many environments have become available, such as soil, oceans, lakes, and the human gut. However, inferring interactions from changes in microbial quantities across samples remains very difficult. Association inference usually means computing the positive and negative correlations among microorganisms and between microorganisms and the environment, and then inferring the true interactions, such as mutualism, parasitism, and competition, from these correlations, which helps in understanding the dynamics of microbial communities.
Associations can be measured with different statistics, as long as those statistics capture a reasonable relationship. Existing association inference methods fall into two broad classes. The first class computes pairwise correlations, such as the Pearson and Spearman correlation coefficients, which give the correlation between two species. Local similarity analysis (LSA), proposed by Ruan et al. (2006), also computes pairwise correlations but requires time-series information. The second class computes more complex dependencies by fitting the relationship between one microorganism and the remaining microorganisms and environmental factors; most of these methods are regression-based. Pairwise methods are widely used by biologists because they are simple and fast, but they are not well suited to metagenomic sequencing data, for two main reasons. First, pairwise methods do not recover the true correlations, because the computation is subject to compositional bias: these methods ignore the nature of the data and apply techniques designed for unconstrained data. Specifically, because the total number of reads differs between sequenced samples, the counts are usually normalized to obtain the relative abundances of the OTUs. After normalization the data are no longer independent: a variable $x_i$ is no longer independent of the remaining variables $x_j$, whether or not they are truly associated,
$$\sum_i x_i = 1 \;\Rightarrow\; \sum_{j \neq i} \operatorname{cov}(x_i, x_j) = -\operatorname{Var}(x_i)$$
This compositional bias becomes more severe when dominant microorganisms are present in the environment, a situation that is in fact common among marine microorganisms. Algorithms that account for compositional bias are therefore needed. Second, high-throughput sequencing involves a series of processing steps, such as sample filtering, amplification, and sequencing on the instrument, each of which can cause the observed read counts to deviate from the original quantities of the microorganisms in the sample. An algorithm must account for this property as well.
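The constraint above can be checked numerically. The following is a minimal sketch, assuming hypothetical count data drawn independently for four OTUs; the function names are illustrative only. It shows that once each sample is normalized to sum to 1, the covariances of one OTU with all the others add up to the negative of its own variance, even though the raw counts were independent.

```python
import random

def covariance(a, b):
    """Population covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / n

def closure(counts):
    """Normalize one sample of counts so that its entries sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

rng = random.Random(0)
# Hypothetical data: 200 samples, 4 OTU counts drawn independently.
samples = [closure([rng.randint(1, 100) for _ in range(4)]) for _ in range(200)]
columns = list(zip(*samples))  # one tuple of relative abundances per OTU

i = 0
var_i = covariance(columns[i], columns[i])
cov_sum = sum(covariance(columns[i], columns[j]) for j in range(4) if j != i)
# cov_sum equals -var_i up to floating-point error.
print(var_i, cov_sum)
```

The negative covariances here are produced purely by the normalization, which is why a correlation computed on relative abundances cannot be read directly as a biological association.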
Many algorithms have been proposed to address compositional bias, such as CCREPE (Faust et al., 2012), SparCC (Friedman and Alm, 2012), SPIEC-EASI (Kurtz et al., 2015), and CCLasso (Fang et al., 2015). These algorithms handle compositional bias in different ways, but none of them accounts for the variance of the sequencing data itself. Moreover, although microorganisms are affected by environmental factors, these methods do not consider the regulation by environmental factors when estimating the associations among microorganisms. For example, if two OTUs are correlated only because both are regulated by the same environmental factor, the association between them is actually indirect; unless the environmental factor is taken into account, it cannot be distinguished from a direct association.
Summary of the invention
In view of the above problems, the present invention provides a microorganism association network prediction method and apparatus, to solve the problem in the prior art that association inference is inaccurate because of compositional bias and the variance of the sequencing data itself.
According to a first aspect of the present invention, a microorganism association network prediction method is provided, the method comprising:
obtaining the abundances of the various microorganisms in metagenomic sequencing samples, and the environmental factor measurement values corresponding to each metagenomic sequencing sample;
establishing a hierarchical Bayesian model according to the data generation process of each metagenomic sequencing sample and the abundances of the various microorganisms and the corresponding environmental factor measurement values in each metagenomic sequencing sample, the hierarchical Bayesian model describing the data generation process, the associations among microorganisms, and the associations between microorganisms and environmental factor measurement values;
learning the hierarchical Bayesian model with a maximum a posteriori (MAP) estimation algorithm, and determining the objective function of the hierarchical Bayesian model;
optimizing the objective function;
predicting the associations among microorganisms and the associations between microorganisms and environmental factor measurement values with the hierarchical Bayesian model obtained after the objective function is optimized.
Establishing the hierarchical Bayesian model according to the data generation process of each metagenomic sequencing sample and the abundances of the various microorganisms and the corresponding environmental factor measurement values in each metagenomic sequencing sample comprises:
simulating the sequencing process with the Dirichlet-Multinomial conjugate distribution, thereby performing a first modeling of the compositional bias and the data variance in the data generation process;
simulating the abundance variation process of the microorganisms with the Lognormal distribution, thereby performing a second modeling of the associations among microorganisms and the associations between microorganisms and environmental factor measurement values;
establishing a hierarchical Bayesian model based on Lognormal-Dirichlet-Multinomial according to the results of the first modeling and the second modeling.
Learning the hierarchical Bayesian model with the MAP estimation algorithm and determining the objective function of the hierarchical Bayesian model comprises:
learning the hierarchical Bayesian model with a MAP estimation algorithm augmented with an L1-norm penalty term, and determining the objective function of the hierarchical Bayesian model.
The objective function of the hierarchical Bayesian model includes a precision matrix representing the associations among microorganisms and a matrix representing the associations between microorganisms and environmental factor measurement values;
correspondingly, optimizing the objective function comprises:
iteratively optimizing the precision matrix representing the associations among microorganisms with the graphical lasso algorithm;
iteratively optimizing the matrix representing the associations between microorganisms and environmental factor measurement values with a proximal algorithm.
Before obtaining the abundances of the various microorganisms in the metagenomic sequencing samples and the environmental factor measurement values corresponding to each metagenomic sequencing sample, the method further comprises:
preprocessing the metagenomic sequencing samples.
According to a second aspect of the present invention, a microorganism association network prediction apparatus is provided, the apparatus comprising:
an obtaining unit, configured to obtain the abundances of the various microorganisms in metagenomic sequencing samples, and the environmental factor measurement values corresponding to each metagenomic sequencing sample;
a model establishing unit, configured to establish a hierarchical Bayesian model according to the data generation process of each metagenomic sequencing sample and the abundances of the various microorganisms and the corresponding environmental factor measurement values in each metagenomic sequencing sample, the hierarchical Bayesian model describing the data generation process, the associations among microorganisms, and the associations between microorganisms and environmental factor measurement values;
a model learning unit, configured to learn the hierarchical Bayesian model with a MAP estimation algorithm and determine the objective function of the hierarchical Bayesian model;
an optimizing unit, configured to optimize the objective function;
a predicting unit, configured to predict the associations among microorganisms and the associations between microorganisms and environmental factor measurement values with the hierarchical Bayesian model obtained after the objective function is optimized.
The model establishing unit is specifically configured to: simulate the sequencing process with the Dirichlet-Multinomial conjugate distribution, thereby performing a first modeling of the compositional bias and the data variance in the data generation process; simulate the abundance variation process of the microorganisms with the Lognormal distribution, thereby performing a second modeling of the associations among microorganisms and the associations between microorganisms and environmental factor measurement values; and establish a hierarchical Bayesian model based on Lognormal-Dirichlet-Multinomial according to the results of the first modeling and the second modeling.
The model learning unit is specifically configured to learn the hierarchical Bayesian model with a MAP estimation algorithm augmented with an L1-norm penalty term, and determine the objective function of the hierarchical Bayesian model.
The objective function of the hierarchical Bayesian model includes a precision matrix representing the associations among microorganisms and a matrix representing the associations between microorganisms and environmental factor measurement values;
correspondingly, the optimizing unit comprises:
a first optimizing module, configured to iteratively optimize the precision matrix representing the associations among microorganisms with the graphical lasso algorithm;
a second optimizing module, configured to iteratively optimize the matrix representing the associations between microorganisms and environmental factor measurement values with a proximal algorithm.
The apparatus further comprises:
a preprocessing unit, configured to preprocess the metagenomic sequencing samples before the abundances of the various microorganisms in the metagenomic sequencing samples and the environmental factor measurement values corresponding to each metagenomic sequencing sample are obtained.
The microorganism association network prediction method and apparatus provided by the present invention establish a hierarchical Bayesian model according to the generation process of metagenomic sequencing data and the abundances of the various microorganisms and the corresponding environmental factor measurement values in each metagenomic sequencing sample, and then learn and optimize this model. By using the optimized hierarchical Bayesian model to predict the associations among microorganisms and between microorganisms and environmental factor measurement values, the invention solves the problem of inaccurate association inference caused by compositional bias and the variance of the sequencing data itself, and accounts for the effects of both microorganisms and environmental factors. This significantly improves accuracy and practicability in the tasks of predicting microorganism associations and microorganism-environmental factor associations.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 is a flowchart of the microorganism association network prediction method proposed by one embodiment of the present invention;
Fig. 2 is a flowchart of the microorganism association network prediction method proposed by another embodiment of the present invention;
Fig. 3 is a schematic structural diagram of the microorganism association network prediction apparatus proposed by one embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the microorganism association network prediction apparatus proposed by another embodiment of the present invention.
Detailed description
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and are not to be construed as limiting the claims.
Fig. 1 shows a flowchart of the microorganism association network prediction method of an embodiment of the present invention. Referring to Fig. 1, the microorganism association network prediction method proposed by this embodiment specifically comprises the following steps:
Step S11: obtain the abundances of the various microorganisms in metagenomic sequencing samples, and the environmental factor measurement values corresponding to each metagenomic sequencing sample.
In practice, the metagenomic data are processed to obtain the abundances of the various microorganisms (OTUs) in each sample and the values of the environmental factors, and abnormal samples and features are removed to facilitate the subsequent model computation. For metagenomic 16S sequencing data, the samples must undergo quality control, read clustering, and annotation before the OTU abundances can be obtained; the environmental factors must be measured at the time of sampling.
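As an illustration of removing abnormal samples and features, the following is a minimal sketch that assumes two simple criteria not stated in the text: a sample is dropped when its total read count falls below a threshold, and an OTU is dropped when it is present in too small a fraction of the remaining samples. The function name, both thresholds, and the toy table are hypothetical.

```python
def filter_counts(counts, min_depth=1000, min_prevalence=0.1):
    """Drop low-depth samples, then OTUs present in too few remaining samples.

    counts: list of samples, each a list of per-OTU read counts.
    Returns (kept_sample_indices, kept_otu_indices, filtered_counts).
    """
    kept_samples = [i for i, row in enumerate(counts) if sum(row) >= min_depth]
    n = len(kept_samples)
    n_otus = len(counts[0]) if counts else 0
    kept_otus = [
        j for j in range(n_otus)
        if n > 0
        and sum(1 for i in kept_samples if counts[i][j] > 0) / n >= min_prevalence
    ]
    filtered = [[counts[i][j] for j in kept_otus] for i in kept_samples]
    return kept_samples, kept_otus, filtered

# Toy OTU table: 3 samples x 3 OTUs (hypothetical values).
table = [[900, 100, 0], [5, 3, 0], [700, 0, 400]]
samples, otus, filtered = filter_counts(table, min_depth=500, min_prevalence=0.6)
print(samples, otus, filtered)
```

In this toy table the second sample is removed for low depth, and the second and third OTUs are removed because each appears in only half of the remaining samples.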
Step S12: establish a hierarchical Bayesian model according to the data generation process of each metagenomic sequencing sample and the abundances of the various microorganisms and the corresponding environmental factor measurement values in each metagenomic sequencing sample, the hierarchical Bayesian model describing the data generation process, the associations among microorganisms, and the associations between microorganisms and environmental factor measurement values.
Step S13: learn the hierarchical Bayesian model with a MAP estimation algorithm, and determine the objective function of the hierarchical Bayesian model;
Step S14: optimize the objective function;
Step S15: predict the associations among microorganisms and the associations between microorganisms and environmental factor measurement values with the hierarchical Bayesian model obtained after the objective function is optimized.
The microorganism association network prediction method provided by this embodiment establishes a hierarchical Bayesian model according to the generation process of metagenomic sequencing data and the abundances of the various microorganisms and the corresponding environmental factor measurement values in each metagenomic sequencing sample, and then learns and optimizes this model. By using the optimized hierarchical Bayesian model to predict the associations among microorganisms and between microorganisms and environmental factor measurement values, the method solves the problem of inaccurate association inference caused by compositional bias and the variance of the sequencing data itself, and accounts for the effects of both microorganisms and environmental factors. This significantly improves accuracy and practicability in the tasks of predicting microorganism associations and microorganism-environmental factor associations.
In an optional embodiment of the present invention, step S12 specifically comprises the following steps, which are not shown in the drawings:
Step S121: simulate the sequencing process with the Dirichlet-Multinomial conjugate distribution, thereby performing a first modeling of the compositional bias and the data variance in the data generation process;
Step S122: simulate the abundance variation process of the microorganisms with the Lognormal distribution, thereby performing a second modeling of the associations among microorganisms and the associations between microorganisms and environmental factor measurement values;
Step S123: establish a hierarchical Bayesian model based on Lognormal-Dirichlet-Multinomial according to the results of the first modeling and the second modeling.
In practice, a hierarchical Bayesian model is established according to the characteristics of metagenomic sequencing data to describe the data generation process, the associations among microorganisms, and the associations between microorganisms and environmental factors. Two issues are addressed: the total number of reads differs between sequenced samples, so simple normalization introduces compositional bias; and the series of processing steps introduces error into the sequencing data. The present invention models both with the Dirichlet-Multinomial conjugate distribution. The read counts of the microorganisms obtained by sequencing are assumed to follow a multinomial distribution; they are affected by the total number of reads and by the relative proportion of each OTU, and the relative proportions are in turn assumed to follow a Dirichlet distribution. The relative abundance of an OTU is determined by the absolute abundance of the corresponding microorganism in its environment, and the absolute abundance is assumed to be affected by two factors: the associations among microorganisms, and the associations between microorganisms and environmental factors. The invention models this process with the Lognormal distribution. Finally, the associations among microorganisms and the associations between microorganisms and environmental factors are parameterized as two matrices, which serve as parameters of the Lognormal distribution.
The following specific embodiment illustrates how the hierarchical Bayesian model is established according to the data generation process of each metagenomic sequencing sample and the abundances of the various microorganisms and the corresponding environmental factor measurement values in each metagenomic sequencing sample.
Assume that $x_i$ is a P-dimensional vector consisting of the read counts of the P OTUs obtained by sequencing sample i, and that $m_i$ is a Q-dimensional vector consisting of the values of the Q environmental factors. $h_i$ corresponds to $x_i$ and denotes the relative abundances of the P OTUs in the sample; $\alpha_i$ denotes the absolute abundances of the P OTUs in the real environment; and $z_i$ is a latent variable. The absolute abundance is affected by two factors: the associations among microorganisms, represented by the precision matrix $\Theta$ (a P × P matrix), and the effects of the environmental factors, represented by the matrix $B$ (a Q × P matrix). $B_0$ is the baseline vector of the abundances of the P OTUs.
$$z_i \sim \mathrm{Gaussian}(B_0, \Theta^{-1})$$
$$\alpha_i = \exp(B^T m_i + z_i)$$
$$h_i \sim \mathrm{Dirichlet}(\alpha_i)$$
$$x_i \sim \mathrm{Multinomial}(h_i)$$
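The four lines of the generative process can be read as a sampling recipe. The following is a minimal sketch in plain Python, assuming that the Gaussian is parameterized by a covariance matrix `cov` (the inverse of the precision matrix) and that the sequencing depth `depth` is given; the function names and all numeric values are hypothetical and chosen only for illustration.

```python
import math
import random

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = sum(low[r][k] * low[c][k] for k in range(c))
            if r == c:
                low[r][c] = math.sqrt(a[r][r] - s)
            else:
                low[r][c] = (a[r][c] - s) / low[c][c]
    return low

def sample_counts(B, B0, cov, m, depth, rng):
    """Draw one sample (x, h, alpha) from the generative process.

    B: Q x P matrix, B0: length-P vector, cov: P x P covariance
    (inverse of the precision matrix), m: length-Q environmental vector.
    """
    P = len(B0)
    low = cholesky(cov)
    eps = [rng.gauss(0.0, 1.0) for _ in range(P)]
    # z ~ Gaussian(B0, cov)
    z = [B0[j] + sum(low[j][k] * eps[k] for k in range(j + 1)) for j in range(P)]
    # alpha = exp(B^T m + z)
    alpha = [math.exp(sum(B[q][j] * m[q] for q in range(len(m))) + z[j])
             for j in range(P)]
    # h ~ Dirichlet(alpha), drawn as normalized gamma variates
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    h = [v / total for v in g]
    # x ~ Multinomial(depth, h)
    x = [0] * P
    for j in rng.choices(range(P), weights=h, k=depth):
        x[j] += 1
    return x, h, alpha

rng = random.Random(1)
B = [[0.5, -0.5, 0.0]]   # one environmental factor, three OTUs
B0 = [1.0, 1.0, 1.0]
cov = [[1.0, 0.3, 0.0], [0.3, 1.0, 0.0], [0.0, 0.0, 1.0]]
x, h, alpha = sample_counts(B, B0, cov, m=[0.2], depth=1000, rng=rng)
print(x)
```

Simulated data of this kind are useful mainly for checking an inference procedure against known values of the two association matrices.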
Because the read counts $x_i$ of the microorganisms are obtained by sequencing, the PCR step during sequencing can be modeled with a multinomial distribution: when the total number of reads in a sample is fixed, the count of each OTU depends on its relative proportion, which is equivalent to sampling according to the relative proportions:
$$P(x_i \mid h_i) = \binom{s(x_i)}{x_{i1}, \ldots, x_{iP}} \prod_{j=1}^{P} h_{ij}^{x_{ij}} \qquad (1)$$
where $s(x_i) = \sum_{j=1}^{P} x_{ij}$ is the total number of reads in the i-th sample. $h_i$ holds the relative abundances of the P OTUs, with $\sum_{j=1}^{P} h_{ij} = 1$. The relative abundance of a microorganism is in fact computed from its absolute abundance, that is, its absolute quantity in the community. The present invention models the relationship between the relative abundance $h_i$ and the absolute abundance $\alpha_i$ with the Dirichlet distribution,
$$P(h_i \mid \alpha_i) = \frac{1}{T(\alpha_i)} \prod_{j=1}^{P} h_{ij}^{\alpha_{ij} - 1} \qquad (2)$$
where $T(\alpha_i) = \prod_{j=1}^{P} \Gamma(\alpha_{ij}) / \Gamma(s(\alpha_i))$ is the normalizing constant of the Dirichlet distribution, $s(\alpha_i) = \sum_{j=1}^{P} \alpha_{ij}$, and $\Gamma(\cdot)$ is the gamma function. By the conjugacy of the Dirichlet and multinomial distributions,
$$P(x_i \mid \alpha_i) = \int P(x_i \mid h_i)\, P(h_i \mid \alpha_i)\, dh_i = \binom{s(x_i)}{x_{i1}, \ldots, x_{iP}} \frac{T(\alpha_i + x_i)}{T(\alpha_i)} \qquad (3)$$
It follows that the variance of OTU j in the i-th sample is $\mathrm{Var}(x_{ij}) = s(x_i) \cdot C \cdot r_{ij} \cdot (1 - r_{ij})$, and the covariance between OTUs j and k is $\mathrm{Cov}(x_{ij}, x_{ik}) = -s(x_i) \cdot C \cdot r_{ij} \cdot r_{ik}$, where $C = (s(x_i) + s(\alpha_i)) / (1 + s(\alpha_i))$, $r_{ij} = \alpha_{ij} / s(\alpha_i)$, and $r_{ik} = \alpha_{ik} / s(\alpha_i)$. Under this conjugate distribution the counts of different microorganisms are negatively correlated, which corresponds to compositional bias. In addition, both the variance and the covariance depend on the absolute abundance $\alpha_i$, and $r_{ij}$ is the expected relative abundance determined by the absolute abundance, which matches the behavior of sequencing data.
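The marginal distribution in equation (3) and the stated moments can be checked by brute-force enumeration on a small case. The sketch below assumes that T is the Dirichlet normalizing constant (a product of gamma functions divided by the gamma function of the sum) and that C is the standard Dirichlet-multinomial overdispersion factor (s(x_i) + s(α_i)) / (1 + s(α_i)); the values of `alpha` and `n` are hypothetical.

```python
import math
from itertools import product

def log_multibeta(a):
    """log T(a), with T(a) = prod_j Gamma(a_j) / Gamma(sum_j a_j)."""
    return sum(math.lgamma(v) for v in a) - math.lgamma(sum(a))

def dm_log_pmf(x, alpha):
    """log P(x | alpha) as in equation (3)."""
    n = sum(x)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(v + 1) for v in x)
    post = [a + v for a, v in zip(alpha, x)]
    return log_coef + log_multibeta(post) - log_multibeta(alpha)

alpha = [0.7, 1.5, 2.3]   # hypothetical absolute abundances
n = 6                     # hypothetical total read count s(x_i)
support = [x for x in product(range(n + 1), repeat=3) if sum(x) == n]
probs = [math.exp(dm_log_pmf(x, alpha)) for x in support]

s_alpha = sum(alpha)
r = [a / s_alpha for a in alpha]
C = (n + s_alpha) / (1 + s_alpha)

total = sum(probs)  # should be 1
mean0 = sum(p * x[0] for p, x in zip(probs, support))
var0 = sum(p * (x[0] - mean0) ** 2 for p, x in zip(probs, support))
cov01 = sum(p * (x[0] - n * r[0]) * (x[1] - n * r[1])
            for p, x in zip(probs, support))
print(total, var0, n * C * r[0] * (1 - r[0]), cov01, -n * C * r[0] * r[1])
```

Because C exceeds 1 whenever the total read count exceeds 1, the counts are more variable than under a plain multinomial, which is how the model absorbs the extra variance of sequencing data.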
Further, the present invention assumes that the absolute abundance of the microorganisms follows a Lognormal distribution, which is widely used by biologists. To account for both the associations among microorganisms and the effects of environmental factors on microorganisms, the Lognormal distribution is adjusted as follows:
$$P(\alpha_i \mid B, B_0, \Theta, m_i) = \frac{1}{(2\pi)^{P/2}\, |\Theta|^{-1/2}} \exp\!\Big(-\frac{1}{2} (\log \alpha_i - \mu_i)^T\, \Theta\, (\log \alpha_i - \mu_i)\Big) \prod_{j=1}^{P} \frac{1}{\alpha_{ij}} \qquad (4)$$
where the mean is $\mu_i = B^T m_i + B_0$. The precision matrix $\Theta$ thus characterizes the associations among microorganisms, and the matrix $B$ represents the associations between microorganisms and environmental factors.
In an optional embodiment of the present invention, step S13 specifically comprises the following step, which is not shown in the drawings: learn the hierarchical Bayesian model with a MAP estimation algorithm augmented with an L1-norm penalty term, and determine the objective function of the hierarchical Bayesian model.
In practice, on the basis of the Lognormal-Dirichlet-Multinomial hierarchical Bayesian model, an L1-norm sparsity constraint is imposed on the associations among microorganisms and on the associations between microorganisms and environmental factors, and the parameters are estimated by maximum a posteriori estimation. The sparsity constraint is imposed on the associations to be inferred for two reasons. First, the number of microorganisms far exceeds the number of samples that can be obtained for a community, and a sparsity constraint counters overfitting. Second, sequencing data are noisy, and the sparsity term retains only the most significant associations, which improves accuracy. Because the hierarchical Bayesian model of the present invention contains latent variables, the latent variables are estimated jointly with the parameters during MAP estimation, which reduces the complexity of the optimization.
The following specific embodiment illustrates how the hierarchical Bayesian model is learned with the MAP estimation algorithm and how its objective function is determined.
The hierarchical Bayesian model is learned by MAP estimation with an L1-norm penalty term. Because the hierarchical model of the present invention contains latent variables, direct estimation is difficult, so MAP estimation is used and the latent variables are estimated together with the parameters. Let the OTU data from metagenomic sequencing be the matrix $X$ (an N × P matrix: N samples, P OTUs) and the environmental data be the matrix $M$ (an N × Q matrix: N samples, Q environmental factors). With $\Theta$ denoting the precision matrix, the posterior of the latent variable $Z$ is:
$$P(Z \mid X, M, B, B_0, \Theta) \propto P(X, Z \mid B, B_0, \Theta, M) \propto P(X \mid \alpha)\, P(\alpha \mid Z, B, B_0, M)\, P(Z \mid B_0, \Theta)$$
where $P(X \mid \alpha)$ is computed with equation (3), $\alpha_i$ is determined by $z_i$ through $\alpha_i = \exp(B^T m_i + z_i)$, and $P(z_i \mid B_0, \Theta)$ is a Gaussian distribution.
The L1 norm is used for two reasons. First, the model has many parameters and few samples, so the L1-norm penalty term is needed to prevent overfitting. Second, it simplifies feature selection by retaining only the most significant associations for later in-depth analysis. The final objective function of the present invention is
$$\min_{B, B_0, \Theta, Z}\; f(B, B_0, \Theta, Z) + \frac{\lambda_1}{2} \|\Theta\|_1 + \lambda \|B\|_1 \qquad (7)$$
where $f$ is the smooth part of the objective, in which $\tilde{\Gamma}(\cdot)$ denotes the log-gamma function.
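The explicit expression for f does not survive in this text. One form that is consistent with the model equations (1) to (4), with the gradients in equations (8) and (9), and with the subproblem in equation (11) is the negative log-posterior averaged over the N samples with constant terms dropped. It is offered here as an assumption, not as the patent's own formula:

$$f(B, B_0, \Theta, Z) = -\frac{1}{N} \sum_{i=1}^{N} \Big[ \tilde{\Gamma}(s(\alpha_i)) - \tilde{\Gamma}(s(\alpha_i) + s(x_i)) + \sum_{j=1}^{P} \big( \tilde{\Gamma}(\alpha_{ij} + x_{ij}) - \tilde{\Gamma}(\alpha_{ij}) \big) \Big] + \frac{1}{2N} \sum_{i=1}^{N} (z_i - B_0)^T\, \Theta\, (z_i - B_0) - \frac{1}{2} \log |\Theta|$$

with $\alpha_i = \exp(B^T m_i + z_i)$ and $s(\cdot)$ the sum of the entries of a vector. Under this assumption the bracketed term is the log of equation (3) without the multinomial coefficient, and the last two terms are the Gaussian log-density of $z_i$ with precision matrix $\Theta$. Multiplying the $\Theta$-dependent terms and the penalty $\frac{\lambda_1}{2}\|\Theta\|_1$ by 2 gives the form of equation (11), provided $S$ is taken to be $\frac{1}{N} \sum_{i} (z_i - B_0)(z_i - B_0)^T$.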
In an optional embodiment of the present invention, the objective function of the hierarchical Bayesian model includes a precision matrix representing the associations among microorganisms and a matrix representing the associations between microorganisms and environmental factor measurement values.
In this embodiment, step S14 specifically comprises the following steps, which are not shown in the drawings:
Step S141: iteratively optimize the precision matrix representing the associations among microorganisms with the graphical lasso algorithm;
Step S142: iteratively optimize the matrix representing the associations between microorganisms and environmental factor measurement values with a proximal algorithm.
Specifically, this embodiment optimizes the objective function iteratively with a proximal method and the graphical lasso method (Friedman et al., 2008). The precision matrix representing the associations among microorganisms is optimized with graphical lasso, which is very efficient. The matrix representing the associations between microorganisms and environmental factors is optimized with the proximal method: in each iteration the objective function is replaced by a second-order approximation, which is then minimized by coordinate descent.
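The role of the L1 penalty in a proximal update can be illustrated with the soft-thresholding operator. The sketch below deliberately swaps in a plain proximal gradient step, which is simpler than the second-order approximation with coordinate descent described above; the function names, step size, penalty weight, and gradient values are all hypothetical.

```python
def soft_threshold(value, threshold):
    """Proximal operator of threshold * |value| (scalar soft-thresholding)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def proximal_gradient_step(B, grad, step, lam):
    """One update B <- prox(B - step * grad) for the penalty lam * ||B||_1."""
    return [
        [soft_threshold(b - step * g, step * lam) for b, g in zip(b_row, g_row)]
        for b_row, g_row in zip(B, grad)
    ]

B = [[0.50, -0.02], [0.00, -0.80]]
grad = [[0.10, 0.05], [-0.01, -0.20]]   # hypothetical gradient of the smooth part
B_new = proximal_gradient_step(B, grad, step=0.5, lam=0.1)
print(B_new)
```

Entries whose gradient-updated magnitude falls below `step * lam` are set exactly to zero, which is how the penalty yields a sparse association matrix.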
The following specific embodiment illustrates how the objective function is optimized.
The present invention optimizes objective function (7) iteratively with the proximal method and the graphical lasso method. Because the unknown parameters of the model are $Z$, $B$, the precision matrix $\Theta$, and $B_0$, the invention adopts block-wise iteration: one parameter block is optimized at a time, and the blocks are updated alternately until convergence.
For the latent variable Z, the derivative of the objective function is:
$$\frac{\partial f}{\partial z_{ij}} = -\frac{1}{N}\Big(\tilde{G}'(\alpha_{ij} + x_{ij}) - \tilde{G}'\big(s(\alpha_i) + s(x_i)\big) - \tilde{G}'(\alpha_{ij}) + \tilde{G}'\big(s(\alpha_i)\big)\Big)\,\alpha_{ij} + \frac{1}{N}\,Q_{j:}\,(z_i - B_0) \tag{8}$$
Z can therefore be optimized using the L-BFGS quasi-Newton method.
For the matrix B, a proximal quasi-Newton algorithm is used for the iteration. The derivative with respect to B is:
$$\frac{\partial f(B)}{\partial B_{ij}} = -\frac{1}{N}\sum_{k=1}^{N}\Big(\tilde{G}'(\alpha_{ij} + x_{ij}) - \tilde{G}'\big(s(\alpha_i) + s(x_i)\big) - \tilde{G}'(\alpha_{ij}) + \tilde{G}'\big(s(\alpha_i)\big)\Big)\,\alpha_{kj}\,m_{kj} \tag{9}$$
The Hessian matrix is then approximated using first derivatives, which gives a second-order approximation of the objective function, to which the L1-norm constraint is added. For the vector $B_0$, the update rule is:
$$B_0 = \frac{1}{N}\sum_{i=1}^{N} z_i \tag{10}$$
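The L1-constrained update of B and the mean update of $B_0$ in equation (10) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: plain proximal gradient (ISTA) with soft-thresholding is swapped in for the proximal quasi-Newton step with coordinate descent described above, the smooth part is a hypothetical quadratic 0.5 * sum((b - target)^2) rather than the model's f, and all function names are illustrative.

```python
def soft_threshold(value: float, threshold: float) -> float:
    """Proximal operator of threshold * |.| (shrinks value towards zero)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def proximal_gradient_l1(target, lam, step=1.0, n_iter=50):
    """Minimise 0.5 * sum((b - target)^2) + lam * sum(|b|) by proximal gradient.

    Assumption: the quadratic term stands in for the smooth part of the
    objective; the patent uses a second-order approximation instead.
    """
    b = [0.0] * len(target)
    for _ in range(n_iter):
        grad = [bi - ti for bi, ti in zip(b, target)]  # gradient of smooth part
        b = [soft_threshold(bi - step * gi, step * lam) for bi, gi in zip(b, grad)]
    return b


def update_b0(z):
    """Equation (10): B0 is the mean of the N latent vectors z_i."""
    n = len(z)
    return [sum(row[j] for row in z) / n for j in range(len(z[0]))]


print(proximal_gradient_l1([3.0, -0.5, 1.0], lam=1.0))  # [2.0, 0.0, 0.0]
print(update_b0([[1.0, 2.0], [3.0, 6.0]]))              # [2.0, 4.0]
```

The soft-thresholding step is what sets small coefficients exactly to zero, which is how the L1 penalty produces a sparse association matrix.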
For the matrix Q, the objective function is:
$$\min_{Q}\; -\log|Q| + \operatorname{tr}(SQ) + \lambda_1\|Q\|_1 \tag{11}$$
This objective function can be solved efficiently by iteration with the graphical lasso algorithm.
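To make objective (11) concrete, the sketch below only evaluates it for a given precision matrix Q and matrix S; it does not implement the graphical lasso solver itself. It assumes that Q is symmetric positive definite and that the L1 norm sums the absolute values of all entries, diagonal included. The function names are illustrative.

```python
import math


def log_det_spd(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0.0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    # det(Q) = prod(diag(L))^2, so log det(Q) = 2 * sum(log(diag(L)))
    return 2.0 * sum(math.log(lower[i][i]) for i in range(n))


def glasso_objective(q, s, lam1):
    """Evaluate -log|Q| + tr(SQ) + lam1 * ||Q||_1 as in equation (11)."""
    n = len(q)
    trace_sq = sum(s[i][k] * q[k][i] for i in range(n) for k in range(n))
    l1_norm = sum(abs(q[i][j]) for i in range(n) for j in range(n))
    return -log_det_spd(q) + trace_sq + lam1 * l1_norm


identity = [[1.0, 0.0], [0.0, 1.0]]
print(glasso_objective(identity, identity, lam1=0.1))  # 0 + 2 + 0.1 * 2
```

A graphical lasso solver searches over positive-definite Q for the smallest value of this quantity; comparing objective values is a simple way to check that an iteration made progress.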
In summary, the core learning algorithm of the Lognormal-Dirichlet-Multinomial hierarchical Bayesian model is as follows:
The embodiment of the present invention estimates two parameters, the matrices Q and B, which are used to explain and predict the associations among microorganisms and the associations between microorganisms and environmental factors, respectively. The element $Q_{ij}$ of matrix Q represents the association between OTU-i and OTU-j: if $Q_{ij} = 0$, OTU-i and OTU-j are conditionally independent; otherwise they are conditionally dependent, and the association has a corresponding weight. For matrix B, $B_{ij}$ represents the association between OTU-j and environmental factor i, and its weight is $B_{ij}$. After the associations are estimated, their signs and absolute magnitudes can be combined with the microbial species information from the OTU annotations to suggest the real interactions between the parties, such as competition, symbiosis and predation.
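Reading an association network off an estimated precision matrix can be sketched as follows. The weight formula used here, the partial correlation $-Q_{ij}/\sqrt{Q_{ii}Q_{jj}}$, is an assumption: it is the usual way to weight edges of a Gaussian graphical model, but the text above does not reproduce the patent's own formula. The function name and the zero tolerance are likewise illustrative.

```python
import math


def association_edges(q, tol=1e-8):
    """Return {(i, j): weight} for every pair of OTUs with a non-zero Q_ij.

    Assumption: the weight is the partial correlation
    -Q_ij / sqrt(Q_ii * Q_jj); pairs with |Q_ij| <= tol are treated as
    conditionally independent and get no edge.
    """
    edges = {}
    p = len(q)
    for i in range(p):
        for j in range(i + 1, p):
            if abs(q[i][j]) > tol:
                edges[(i, j)] = -q[i][j] / math.sqrt(q[i][i] * q[j][j])
    return edges


q_example = [
    [2.0, -1.0, 0.0],
    [-1.0, 2.0, 0.0],
    [0.0, 0.0, 1.0],
]
print(association_edges(q_example))  # {(0, 1): 0.5}
```

In this example OTU-0 and OTU-1 are positively associated, while OTU-2 is conditionally independent of both; the sign and magnitude of each weight are what would be interpreted against the OTU annotations.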
Fig. 2 shows a flowchart of the microbial association network prediction method of an embodiment of the present invention. Referring to Fig. 2, in the microbial association network prediction method proposed by the embodiment of the present invention, before the abundance of the various microorganisms in the metagenomic sequencing samples and the environmental factor measurements corresponding to each metagenomic sequencing sample are obtained, the method further includes step S10:
Step S10: preprocess the metagenomic sequencing samples.
In practical applications, errors in sampling and sequencing can leave the values of some environmental factors unmeasured or clearly inconsistent with reality. The read counts of some OTUs can also fluctuate greatly, and the total number of reads can differ greatly between samples. All of these factors need to be taken into account, and the obtained data are preprocessed, for example by filtering. The resulting data serve as the input of the model of the present invention.
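A filtering step of this kind might look like the sketch below. The patent does not specify the filtering rules, so everything here is an assumption for illustration: the function name, the convention that a missing environmental value is `None`, and the thresholds `min_total_reads` and `min_prevalence`.

```python
def preprocess(counts, env, min_total_reads=1000, min_prevalence=0.2):
    """Filter samples and OTUs before model fitting (hypothetical rules).

    counts: list of per-sample lists of OTU read counts
    env:    list of per-sample lists of environmental values (None = missing)
    """
    # Keep samples that are deep enough and have no missing measurement.
    kept = [
        i for i in range(len(counts))
        if sum(counts[i]) >= min_total_reads
        and all(value is not None for value in env[i])
    ]
    kept_counts = [counts[i] for i in kept]
    kept_env = [env[i] for i in kept]

    # Keep OTUs observed in at least min_prevalence of the remaining samples.
    n_samples = len(kept_counts)
    n_otus = len(counts[0]) if counts else 0
    keep_otus = [
        j for j in range(n_otus)
        if n_samples > 0
        and sum(1 for row in kept_counts if row[j] > 0) / n_samples >= min_prevalence
    ]
    filtered = [[row[j] for j in keep_otus] for row in kept_counts]
    return filtered, kept_env, keep_otus


counts = [[900, 200, 0], [10, 5, 0], [500, 700, 0]]
env = [[7.0], [6.5], [None]]
print(preprocess(counts, env))  # ([[900, 200]], [[7.0]], [0, 1])
```

Here the second sample is dropped for low sequencing depth, the third for a missing measurement, and the third OTU because it is never observed.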
For brevity, the method embodiments are described as a series of combined actions, but those skilled in the art should understand that the embodiments of the present invention are not limited by the described order of actions, because according to the embodiments of the present invention some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.
Fig. 3 shows a schematic structural diagram of the microbial association network prediction apparatus of an embodiment of the present invention. Referring to Fig. 3, the microbial association network prediction apparatus provided by the embodiment of the present invention includes an acquiring unit 301, a model establishing unit 302, a model learning unit 303, an optimization unit 304 and a prediction unit 305, wherein: the acquiring unit 301 is configured to obtain the abundance of the various microorganisms in the metagenomic sequencing samples and the environmental factor measurements corresponding to each metagenomic sequencing sample; the model establishing unit 302 is configured to establish a hierarchical Bayesian model according to the data generation process of each metagenomic sequencing sample, the abundance of the various microorganisms in each metagenomic sequencing sample and the corresponding environmental factor measurements, the hierarchical Bayesian model being used to describe the data generation process, the associations among microorganisms, and the associations between microorganisms and environmental factor measurements; the model learning unit 303 is configured to learn the hierarchical Bayesian model using a maximum a posteriori (MAP) estimation algorithm and determine the objective function of the hierarchical Bayesian model; the optimization unit 304 is configured to optimize the objective function; and the prediction unit 305 is configured to use the hierarchical Bayesian model with the optimized objective function to predict the associations among microorganisms and the associations between microorganisms and environmental factor measurements.
In the microbial association network prediction apparatus provided by the present invention, the model establishing unit 302 establishes a hierarchical Bayesian model according to the generation process of the metagenomic sequencing data and the abundance of the various microorganisms in each metagenomic sequencing sample together with the corresponding environmental factor measurements; the model learning unit 303 and the optimization unit 304 learn and optimize this model; and the prediction unit 305 uses the optimized hierarchical Bayesian model to predict the associations among microorganisms and the associations between microorganisms and environmental factor measurements. This addresses the inaccurate association inference caused by compositional bias and by the variance of the sequencing data itself, takes the effects of both microorganisms and environmental factors into account, and significantly improves accuracy and practicality in the tasks of predicting microbe-microbe and microbe-environment associations.
In an optional embodiment of the present invention, the model establishing unit 302 is specifically configured to: simulate the sequencing process using the Dirichlet-Multinomial conjugate distribution, performing a first modeling of the compositional bias and data variance in the data generation process; simulate the abundance variation process of the microorganisms using the Lognormal distribution, performing a second modeling of the associations among microorganisms and the associations between microorganisms and environmental factor measurements; and establish a hierarchical Bayesian model based on Lognormal-Dirichlet-Multinomial according to the results of the first modeling and the second modeling.
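The two modeling layers named above can be illustrated by simulating the read counts of one sample. The sketch below is a simplified, assumption-laden illustration rather than the patent's model: the latent mean is assumed to be $B_0$ plus the environmental effect, the latent noise is independent across OTUs (a full model would draw correlated noise from the inverse of Q), and the function name, argument names and parameter values are hypothetical.

```python
import math
import random


def simulate_sample(m, b, b0, noise_sd, total_reads, rng):
    """Simulate OTU read counts for one sample.

    m:  environmental measurements (length E)
    b:  E x P matrix, b[e][j] = effect of factor e on OTU j
    b0: baseline latent abundance per OTU (length P)
    Assumption: independent Gaussian noise replaces the correlated noise
    that a precision matrix Q would induce.
    """
    p = len(b0)
    # Lognormal layer: z_j is Gaussian, so alpha_j = exp(z_j) is lognormal.
    z = [
        b0[j] + sum(b[e][j] * m[e] for e in range(len(m))) + rng.gauss(0.0, noise_sd)
        for j in range(p)
    ]
    alpha = [math.exp(value) for value in z]
    # Dirichlet layer: normalised gamma draws give relative abundances.
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    proportions = [g / total for g in gammas]
    # Multinomial layer: sequencing draws total_reads reads from the proportions.
    counts = [0] * p
    for otu in rng.choices(range(p), weights=proportions, k=total_reads):
        counts[otu] += 1
    return counts


rng = random.Random(0)
sample = simulate_sample(
    m=[0.5], b=[[1.0, -1.0, 0.0]], b0=[1.0, 0.5, 0.0],
    noise_sd=0.1, total_reads=1000, rng=rng,
)
print(sample, sum(sample))  # three OTU counts that sum to 1000
```

Because the counts always sum to the sequencing depth, they carry only relative information, which is the compositional bias the Dirichlet-Multinomial layer is meant to absorb.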
In an optional embodiment of the present invention, the model learning unit 303 is specifically configured to learn the hierarchical Bayesian model using a maximum a posteriori estimation algorithm with an added L1-norm penalty term, and to determine the objective function of the hierarchical Bayesian model.
In an optional embodiment of the present invention, the objective function of the hierarchical Bayesian model includes a precision matrix representing the associations among microorganisms and a matrix representing the associations between microorganisms and environmental factor measurements;
Correspondingly, the optimization unit 304 specifically includes a first optimization module and a second optimization module, wherein the first optimization module is configured to iteratively optimize the precision matrix representing the associations among microorganisms using the graphical lasso algorithm, and the second optimization module is configured to iteratively optimize the matrix representing the associations between microorganisms and environmental factor measurements using a proximal algorithm.
In an optional embodiment of the present invention, as shown in Fig. 4, the apparatus further includes a preprocessing unit 300, which is configured to preprocess the metagenomic sequencing samples before the abundance of the various microorganisms in the metagenomic sequencing samples and the environmental factor measurements corresponding to each metagenomic sequencing sample are obtained.
Since the apparatus embodiments are basically similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the description of the method embodiments.
In summary, the microbial association network prediction method and apparatus provided by the embodiments of the present invention establish a hierarchical Bayesian model according to the generation process of the metagenomic sequencing data and the abundance of the various microorganisms in each metagenomic sequencing sample together with the corresponding environmental factor measurements, learn and optimize this model, and use the optimized hierarchical Bayesian model to predict the associations among microorganisms and the associations between microorganisms and environmental factor measurements. This addresses the inaccurate association inference caused by compositional bias and by the variance of the sequencing data itself, takes the effects of both microorganisms and environmental factors into account, and significantly improves accuracy and practicality in the tasks of predicting microbe-microbe and microbe-environment associations.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention may be implemented in hardware, or in software together with the necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive or a portable hard disk) and which includes several instructions that cause a computer device (such as a personal computer, a server or a network device) to perform the methods described in the embodiments of the present invention.
Those skilled in the art will understand that the drawings are schematic diagrams of a preferred embodiment, and that the modules or processes in the drawings are not necessarily required for implementing the present invention.
Those skilled in the art will understand that the modules of the apparatus in an embodiment may be distributed in the apparatus of the embodiment as described, or may be changed accordingly and located in one or more apparatuses different from that of the present embodiment. The modules of the above embodiments may be combined into one module, or further split into multiple submodules.
The above are only some embodiments of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A microbial association network prediction method, characterized in that the method comprises:
obtaining the abundance of the various microorganisms in metagenomic sequencing samples and the environmental factor measurements corresponding to each metagenomic sequencing sample;
establishing a hierarchical Bayesian model according to the data generation process of each metagenomic sequencing sample, the abundance of the various microorganisms in each metagenomic sequencing sample and the corresponding environmental factor measurements, the hierarchical Bayesian model being used to describe the data generation process, the associations among microorganisms, and the associations between microorganisms and environmental factor measurements;
learning the hierarchical Bayesian model using a maximum a posteriori estimation algorithm, and determining the objective function of the hierarchical Bayesian model;
optimizing the objective function;
using the hierarchical Bayesian model with the optimized objective function to predict the associations among microorganisms and the associations between microorganisms and environmental factor measurements.
2. The method according to claim 1, characterized in that establishing the hierarchical Bayesian model according to the data generation process of each metagenomic sequencing sample, the abundance of the various microorganisms in each metagenomic sequencing sample and the corresponding environmental factor measurements comprises:
simulating the sequencing process using the Dirichlet-Multinomial conjugate distribution, and performing a first modeling of the compositional bias and data variance in the data generation process;
simulating the abundance variation process of the microorganisms using the Lognormal distribution, and performing a second modeling of the associations among microorganisms and the associations between microorganisms and environmental factor measurements;
establishing a hierarchical Bayesian model based on Lognormal-Dirichlet-Multinomial according to the results of the first modeling and the second modeling.
3. The method according to claim 1, characterized in that learning the hierarchical Bayesian model using a maximum a posteriori estimation algorithm and determining the objective function of the hierarchical Bayesian model comprises:
learning the hierarchical Bayesian model using a maximum a posteriori estimation algorithm with an added L1-norm penalty term, and determining the objective function of the hierarchical Bayesian model.
4. The method according to claim 1, characterized in that the objective function of the hierarchical Bayesian model includes a precision matrix representing the associations among microorganisms and a matrix representing the associations between microorganisms and environmental factor measurements;
correspondingly, optimizing the objective function comprises:
iteratively optimizing the precision matrix representing the associations among microorganisms using the graphical lasso algorithm;
iteratively optimizing the matrix representing the associations between microorganisms and environmental factor measurements using a proximal algorithm.
5. The method according to any one of claims 1-4, characterized in that, before obtaining the abundance of the various microorganisms in the metagenomic sequencing samples and the environmental factor measurements corresponding to each metagenomic sequencing sample, the method further comprises:
preprocessing the metagenomic sequencing samples.
6. A microbial association network prediction apparatus, characterized in that the apparatus comprises:
an acquiring unit, configured to obtain the abundance of the various microorganisms in metagenomic sequencing samples and the environmental factor measurements corresponding to each metagenomic sequencing sample;
a model establishing unit, configured to establish a hierarchical Bayesian model according to the data generation process of each metagenomic sequencing sample, the abundance of the various microorganisms in each metagenomic sequencing sample and the corresponding environmental factor measurements, the hierarchical Bayesian model being used to describe the data generation process, the associations among microorganisms, and the associations between microorganisms and environmental factor measurements;
a model learning unit, configured to learn the hierarchical Bayesian model using a maximum a posteriori estimation algorithm and determine the objective function of the hierarchical Bayesian model;
an optimization unit, configured to optimize the objective function;
a prediction unit, configured to use the hierarchical Bayesian model with the optimized objective function to predict the associations among microorganisms and the associations between microorganisms and environmental factor measurements.
7. The apparatus according to claim 6, characterized in that the model establishing unit is specifically configured to: simulate the sequencing process using the Dirichlet-Multinomial conjugate distribution, performing a first modeling of the compositional bias and data variance in the data generation process; simulate the abundance variation process of the microorganisms using the Lognormal distribution, performing a second modeling of the associations among microorganisms and the associations between microorganisms and environmental factor measurements; and establish a hierarchical Bayesian model based on Lognormal-Dirichlet-Multinomial according to the results of the first modeling and the second modeling.
8. The apparatus according to claim 6, characterized in that the model learning unit is specifically configured to learn the hierarchical Bayesian model using a maximum a posteriori estimation algorithm with an added L1-norm penalty term, and to determine the objective function of the hierarchical Bayesian model.
9. The apparatus according to claim 6, characterized in that the objective function of the hierarchical Bayesian model includes a precision matrix representing the associations among microorganisms and a matrix representing the associations between microorganisms and environmental factor measurements;
correspondingly, the optimization unit comprises:
a first optimization module, configured to iteratively optimize the precision matrix representing the associations among microorganisms using the graphical lasso algorithm;
a second optimization module, configured to iteratively optimize the matrix representing the associations between microorganisms and environmental factor measurements using a proximal algorithm.
10. The apparatus according to any one of claims 6-9, characterized in that the apparatus further comprises: a preprocessing unit, configured to preprocess the metagenomic sequencing samples before the abundance of the various microorganisms in the metagenomic sequencing samples and the environmental factor measurements corresponding to each metagenomic sequencing sample are obtained.
CN201610266864.7A 2016-04-26 2016-04-26 Microorganism association network prediction method and apparatus Pending CN105938524A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610266864.7A CN105938524A (en) 2016-04-26 2016-04-26 Microorganism association network prediction method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610266864.7A CN105938524A (en) 2016-04-26 2016-04-26 Microorganism association network prediction method and apparatus

Publications (1)

Publication Number Publication Date
CN105938524A true CN105938524A (en) 2016-09-14

Family

ID=57152673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610266864.7A Pending CN105938524A (en) 2016-04-26 2016-04-26 Microorganism association network prediction method and apparatus

Country Status (1)

Country Link
CN (1) CN105938524A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827915A (en) * 2019-10-09 2020-02-21 厦门极元科技有限公司 Method for carrying out geographical positioning on unknown sample through microorganism metagenome
CN111477267A (en) * 2020-03-06 2020-07-31 清华大学 Microorganism multi-association network computing method, device, equipment and storage medium
CN114944199A (en) * 2022-04-26 2022-08-26 北京邮电大学 Artificial intelligence based strain screening method and device
EP4109349A1 (en) * 2021-06-23 2022-12-28 Precision Biomonitoring Inc. Computer-implemented method for determining survey sampling parameters for environmental nucleic acid
WO2023280059A1 (en) * 2021-07-05 2023-01-12 中国科学院分子细胞科学卓越创新中心 Human health quantitative prediction system and method based on microbial interaction

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101477630A (en) * 2009-02-17 2009-07-08 吴俊� System and method for intelligent water treatment micro-organism machine vision identification
CN103268431A (en) * 2013-05-21 2013-08-28 中山大学 Cancer hypotype biomarker detecting system based on student t distribution
CN103942415A (en) * 2014-03-31 2014-07-23 中国人民解放军军事医学科学院卫生装备研究所 Automatic data analysis method of flow cytometer

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
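YUQING YANG et al.: "mLDM: a new hierarchical Bayesian statistical model for sparse microbial association discovery", bioRxiv *
卜洪震 et al.: "Analysis of microbial community diversity in different soil types of paddy fields in double-cropping rice areas", Acta Agronomica Sinica (《作物学报》) *
周桔 et al.: "Factors influencing soil microbial diversity and research methods: current status and prospects", Biodiversity Science (《生物多样性》) *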

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827915A (en) * 2019-10-09 2020-02-21 厦门极元科技有限公司 Method for carrying out geographical positioning on unknown sample through microorganism metagenome
CN111477267A (en) * 2020-03-06 2020-07-31 清华大学 Microorganism multi-association network computing method, device, equipment and storage medium
CN111477267B (en) * 2020-03-06 2022-05-03 清华大学 Microorganism multi-association network computing method, device, equipment and storage medium
EP4109349A1 (en) * 2021-06-23 2022-12-28 Precision Biomonitoring Inc. Computer-implemented method for determining survey sampling parameters for environmental nucleic acid
WO2023280059A1 (en) * 2021-07-05 2023-01-12 中国科学院分子细胞科学卓越创新中心 Human health quantitative prediction system and method based on microbial interaction
CN114944199A (en) * 2022-04-26 2022-08-26 北京邮电大学 Artificial intelligence based strain screening method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160914