CN102478562B - Method for screening ovarian cancer body fluid prognostic marker by L-EDA - Google Patents

Publication number
CN102478562B
Authority
CN
China
Prior art keywords
attribute
eda
ovarian cancer
algorithm
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201010558383.6A
Other languages
Chinese (zh)
Other versions
CN102478562A (en
Inventor
林晓惠
陈静
张洋
陈世礼
黄强
路鑫
许国旺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Dalian Institute of Chemical Physics of CAS
Original Assignee
Dalian University of Technology
Dalian Institute of Chemical Physics of CAS
Application filed by Dalian University of Technology and Dalian Institute of Chemical Physics of CAS
Priority to CN201010558383.6A
Publication of CN102478562A
Application granted
Publication of CN102478562B
Legal status: Expired - Fee Related

Landscapes

  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

The invention discloses a method for screening ovarian cancer prognostic markers from body fluid metabolome profiles using a modified estimation of distribution algorithm (L-EDA). A metabolome profile is obtained by analyzing body fluid metabolites with a liquid chromatograph-mass spectrometer; a probability distribution model is built to analyze the profile; and potential ovarian cancer prognostic markers are screened. Unlike traditional estimation of distribution algorithms, L-EDA limits the size of the candidate attribute subsets generated during the iterative search and provides a new update strategy for the probability distribution model, which makes the evaluation of attributes more accurate and reasonable and improves the execution efficiency of the algorithm. The attribute subset screened by L-EDA reflects the between-group characteristics of the metabolome profile data; a support vector machine (SVM) classification model built on it reaches an accuracy of 99.06% in cross-validation.

Description

Method for screening ovarian cancer body fluid prognostic markers using L-EDA
Technical field
The present invention relates to the fields of analytical chemistry, medicine, and pattern recognition. It is a new method that combines analytical chemistry with pattern recognition algorithms to perform metabolomic profiling of serum and to screen small-molecule metabolite markers for ovarian cancer prognosis; specifically, it is a method for screening ovarian cancer prognostic markers from body fluid metabolic profiles using an improved estimation of distribution algorithm (L-EDA).
Background art
Ovarian cancer, here epithelial ovarian cancer (EOC) (Ref. 1: Williams TI, Toups KL, Saggese DA, et al. J Proteome Res, 2007, 6 (8): 2936-2962), is one of the common gynecological cancers. Its mortality remains high and ranks first among the "three gynecological cancers" (ovarian cancer, cervical cancer, and endometrial cancer) (Ref. 2: Jacobs IJ, Menon U. Mol Cell Proteomics, 2004, 3 (4): 355-366). Because the ovaries lie deep in the pelvic cavity, the disease is concealed and hard to detect; its symptoms are atypical and early diagnostic markers are lacking, so by the time it is found and the diagnosis is confirmed it is often at a middle or late stage. The cause of ovarian cancer is still unclear; its onset may be related to age, fertility, blood type, psychological factors, environment, and so on. Because the embryonic development, tissue anatomy, and endocrine function of the ovary are complex, the tumors that affect it may be benign or malignant. The serum marker in common use, CA125 (cancer antigen 125), is indicative of epithelial ovarian tumors, but its sensitivity is low; in particular, its sensitivity for early diagnosis is only about 30% (Ref. 3: Rosen DG, Wang L, Atkinson JN, et al. Gynecol Oncol, 2005, 99 (2): 267-277). CA125 is not specific to the ovary; changes in it are easily confused with those caused by other cancers, benign pelvic tumors, gynecological inflammation, and so on (Ref. 4: An HJ, Miyamoto S, Lancaster KS, et al. Profiling of glycans in serum for the discovery of potential biomarkers for ovarian cancer. J Proteome Res, 2006, 5 (7): 1626-1635), which leads to misdiagnosis. To meet the steadily rising clinical requirements for sensitivity and specificity, the development of new serum tumor markers is imperative.
Metabolomics (Ref. 5: Nicholson, J. K.; Lindon, J. C.; Holmes, E. Xenobiotica 1999, 29, 1181-1189) is a recently developed approach for comprehensively investigating changes in the small-molecule metabolites of an organism after stimulation or perturbation. Discovering disease markers by metabolomics can be divided into the following parts: measuring the metabolite content of the studied subjects to obtain metabolome profiles; characterizing the measured profiles with chemometric methods; and obtaining, by some screening procedure, the important compounds whose content differs significantly within or between groups. Establishing a marker screening method that is robust and reliable, identifies markers accurately, and classifies and predicts accurately is therefore very important and is the key to this type of research.
In previous research, metabolite screening models have usually been built with methods based on statistical significance testing, such as the t-test and analysis of variance (ANOVA), and with multivariate statistical analysis algorithms, such as principal component analysis (PCA), partial least squares discriminant analysis (PLS-DA), and orthogonal partial least squares discriminant analysis (OPLS-DA). Methods based on significance testing usually assume that the collected samples follow a specific distribution. Because of the particular nature of the research problem, the sample set is very limited (usually only a few dozen samples), and significance analysis of data collected from so few samples is usually not reliable enough. When PCA is applied to complex metabolic profile data, the model explains little of the data (the R² value of the PCA model is small), for reasons such as not using the known sample class information and the presence of noise variables, and satisfactory results are often hard to obtain. Methods based on partial least squares (PLS) use the known sample class information during modeling, so the resulting model usually characterizes the metabolic profile of each class clearly. However, precisely because class information is used during modeling, the model may depend too heavily on the relationship between the current data and the classes, so that its predictive ability is not ideal (Q² is small) or it even overfits; screening potential metabolic markers on such a model is unreliable.
In pattern recognition, there are usually two classes of methods for describing the characteristics of a data set: attribute extraction and attribute selection. Attribute extraction algorithms usually build a specific model to discriminate the samples; while the model is built, the attributes are combined, and each combination is called a principal component. Important attributes are then screened according to their contribution to each principal component. PCA, PLS-DA, and OPLS-DA, mentioned above, are attribute extraction algorithms. Attribute selection algorithms instead evaluate an attribute or attribute subset as a whole and return the best-evaluated attribute or attribute subset as the result.
Estimation of distribution algorithms (EDAs) are a class of evolutionary algorithms (EAs) based on a probability distribution model. Compared with the classical evolutionary algorithm, the genetic algorithm, EDAs have many advantages, such as few parameters, intuitive meaning, search guided by a probabilistic model, and stable results, and they are widely used in bioinformatics to solve complex application problems (Ref. 6: Armañanzas, R.; Inza, I.; Santana, R.; et al. BioData Mining 2008, 1 (6)). As evolutionary algorithms, however, EDAs also suffer from shortcomings such as slow execution and high resource consumption.
The present invention addresses these shortcomings of estimation of distribution algorithms: it limits the number of attributes contained in the candidate attribute subsets produced while the algorithm runs, and it provides a new model update strategy. The improved algorithm is applied to the study of ovarian cancer prognostic markers; the markers it screens reflect only the prognostic state of ovarian cancer and are unrelated to the changes in metabolic state caused by radiotherapy and chemotherapy. The improved algorithm can screen out an attribute set that characterizes the features of the metabolome profile data, while improving execution efficiency and reducing the resources required to run the algorithm.
Summary of the invention
The present invention relates to a new method for screening ovarian cancer prognostic markers from body fluid metabolome profiles using an improved estimation of distribution algorithm (L-EDA). Based on metabolomic technology, the method measures the small-molecule metabolome profile of a body fluid and applies a pattern recognition algorithm to screen ovarian cancer prognostic markers. The method gives accurate screening results with a low error rate, computes quickly, is highly automated and highly integrated, is suitable for screening large numbers of samples, and can be widely used in fields such as chemistry and medicine.
To achieve the above object, the technical solution adopted by the present invention is as follows:
A liquid chromatography-mass spectrometry platform is used to analyze the metabolites of a body fluid (including blood, urine, etc.) and obtain metabolome profiles; a probability distribution model is built to screen important attributes; the metabolome profiles of ovarian cancer patients and healthy people are analyzed; and ovarian cancer prognostic markers are screened. The method comprises the following steps:
1) Collection and pretreatment of body fluid samples.
Samples are collected under identical sampling conditions from N1 healthy people, N2 ovarian cancer patients, N3 women without recurrence after ovarian cancer surgery, and N4 patients with recurrence after ovarian cancer surgery, and are stored in a -80 °C freezer immediately after collection.
For analysis, the samples are taken from the freezer and thawed at room temperature. To 180 μL of serum, 4 volumes (720 μL) of acetonitrile are added; the acetonitrile contains LEK and lysophosphatidylcholine (12:0) (Lyso PC (12:0)) as internal standards. The mixture is shaken thoroughly for 30 seconds and centrifuged at 15,000 g for 10 minutes at 4 °C, and the supernatant is freeze-dried. Before analysis, the residue is redissolved in 150 μL of water:acetonitrile (1/4, v/v); at this point the internal standard concentrations are 3 ng/μL for LEK and 3 ng/μL for Lyso PC (12:0).
2) Analysis of serum metabolites by liquid chromatography-mass spectrometry.
Chromatographic analysis uses an Agilent 1200 series Rapid Resolution Liquid Chromatography (RRLC) system with a 50 mm × 2.1 mm, 1.7 μm Waters BEH C18 column. The column temperature is kept at 50 °C and the flow rate is 0.35 mL/min. Mobile phase A is high-purity water containing 0.1% formic acid and 2% acetonitrile, and mobile phase B is acetonitrile. The gradient starts at 5% B, rises to 35% B at 4 minutes, changes to 80% B at 22 minutes, reaches 100% B at 24 minutes, and is held for 5 minutes, followed by 5 minutes of column equilibration. The autosampler is kept at 4 °C and the injection volume is 5 μL.
Mass spectrometric analysis uses an Agilent 6510 quadrupole time-of-flight mass spectrometer (Q-TOF MS, Agilent, USA). Data are acquired in positive ion mode. The capillary voltage is set to 4000 V, and the fragmentor and skimmer voltages are set to 230 V and 65 V, respectively. The drying gas flow is 11 L/min, the nebulizer pressure is set to 45 psig, and the temperature is 350 °C. A mixture of purine and hexakis phosphazine is used as the calibration solution to maintain the accuracy and stability of the mass measurement; in positive ion mode they produce ions with mass-to-charge ratios of 121.0508 and 922.0097, respectively. The acquisition range is m/z 80-1000 in centroid mode, and the acquisition rate is 500 ms.
3) Compound information is extracted from the acquired raw metabolome profile data with Molecular Feature Extraction (MFE, Agilent) software, and accurate molecular weights are calculated from the mass spectrometric data. Chromatographic peaks are then matched with GeneSpring (Agilent) software. The matched data are area-normalized to reduce systematic error. The 80% rule is then applied to reduce the influence of missing values on the data set: an ion is retained only if its value is greater than 1 in 80% of the samples of at least one class.
4) Screening ovarian cancer prognostic markers with the L-EDA algorithm. L-EDA restricts the size of the extracted attribute subsets to a relatively small range and, at the same time, provides a new update strategy for the probability distribution model. L-EDA comprises four main parts: building the probability distribution model, generating candidate attribute subsets, evaluating candidate attribute subsets, and updating the probability distribution model.
A. Building the probability distribution model
Let the total number of attributes be M. A probability distribution vector p[] of length M is built, in which each element takes a value in [0,1]. The element p[i] represents the probability that the ion represented by the i-th attribute is selected as a potential metabolic marker. Before the algorithm runs, no attribute has been evaluated, so every attribute is equally likely to be included in or excluded from the optimized attribute subset; each element of p[] is therefore initialized to 0.5.
B. Generating candidate attribute subsets
Suppose L-EDA fixes the size of a candidate attribute subset at G. A candidate subset containing G attributes is generated as follows. Let the candidate subset be S, initially the empty set. While S contains fewer than G attributes, an attribute i is chosen at random from the attributes not yet in S, and a random number n in [0,1] is generated; if n<p[i], attribute i is added to S; otherwise, the procedure moves on to the next iteration. In the end, a candidate set S containing G attributes has been generated with reference to the current probability distribution.
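Parts A and B can be illustrated with a minimal Python sketch. The function names `init_probabilities` and `sample_candidate_subset` are hypothetical and are not taken from the patent, and the guard against too few attributes with non-zero probability is an added assumption to keep the loop from running forever.

```python
import random

def init_probabilities(m):
    """Part A: every one of the m attributes starts with probability 0.5."""
    return [0.5] * m

def sample_candidate_subset(p, g, rng=random):
    """Part B: draw one candidate subset S containing exactly g attribute indices."""
    if not 0 < g <= len(p):
        raise ValueError("g must be between 1 and the number of attributes")
    # Assumption (not in the patent): refuse to sample when fewer than g
    # attributes can ever be accepted, otherwise the loop would not terminate.
    if sum(1 for v in p if v > 0) < g:
        raise ValueError("fewer than g attributes have non-zero probability")
    remaining = list(range(len(p)))   # attributes not yet in S
    subset = set()
    while len(subset) < g:
        i = rng.choice(remaining)     # random attribute not yet in S
        if rng.random() < p[i]:       # accept attribute i with probability p[i]
            subset.add(i)
            remaining.remove(i)
        # otherwise: proceed to the next iteration
    return subset

random.seed(0)
p = init_probabilities(20)
s = sample_candidate_subset(p, 5)
```

Because an attribute is only accepted when the random number falls below p[i], attributes with a higher current probability tend to appear in more candidate subsets.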
C. Evaluating candidate attribute subsets
An SVM classification model is built on each candidate attribute subset, and the 7-fold cross-validation accuracy of the SVM is used as the measure of the subset's quality. The higher the cross-validation accuracy, the better the evaluation of the candidate subset.
D. Updating the probability distribution model
L-EDA provides a new strategy for updating the probability distribution model. A certain proportion of the best-evaluated candidate subsets is selected from the candidate attribute subsets of the current round, and the new strategy uses the frequency with which a single attribute occurs in these best candidate subsets as the measure of the attribute's quality. Let f[i] be the frequency with which attribute i occurs in the best-evaluated candidate subsets, and let average be the mean occurrence frequency over all attributes. When the probability of attribute i is updated, formula (1) is used if f[i]>average; otherwise, formula (2) is used.
p[i]=(1-r)*p[i]+r*(1-p[i])*(f[i]-average) (1)
p[i]=(1-r)*p[i]+r*p[i]*(f[i]-average) (2)
Here r is the proportion that the probability distribution model learns from the current set of candidate subsets. Formulas (1) and (2) show that, when the probability distribution model is updated, only competitive attributes can be rewarded; all others are penalized.
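As a worked illustration of formulas (1) and (2), the sketch below applies the update to a small vector. It assumes that f[i] is the fraction of the best candidate subsets that contain attribute i; the patent only speaks of "frequency", so this normalization and the function name `update_probabilities` are assumptions.

```python
def update_probabilities(p, best_subsets, r):
    """Return a new probability vector computed with formulas (1) and (2)."""
    m = len(p)
    n_best = len(best_subsets)
    # Assumption: f[i] is the fraction of best subsets that contain attribute i.
    f = [sum(1 for s in best_subsets if i in s) / n_best for i in range(m)]
    average = sum(f) / m
    new_p = []
    for i in range(m):
        diff = f[i] - average
        if diff > 0:
            # formula (1): attribute occurs more often than average
            new_p.append((1 - r) * p[i] + r * (1 - p[i]) * diff)
        else:
            # formula (2): diff <= 0, so the second term is zero or negative
            new_p.append((1 - r) * p[i] + r * p[i] * diff)
    return new_p

# Four attributes, two best subsets, r = 0.3 as in the embodiment
new_p = update_probabilities([0.5, 0.5, 0.5, 0.5], [{0, 1}, {0, 2}], 0.3)
```

In this small example f = [1.0, 0.5, 0.5, 0.0] and average = 0.5, so attribute 0 is updated with formula (1) and the others with formula (2); attribute 0 ends with the largest value and attribute 3 with the smallest.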
The L-EDA algorithm proceeds as follows. First, the probability distribution vector is initialized as described in A. The iterative search then begins: according to the parameter settings, T candidate attribute subsets are generated as described in B; each candidate subset is evaluated as described in C; the best-evaluated candidate subsets are selected, and the probability distribution vector is updated as described in D. This completes the first round of the iterative search, and the next round begins. When the search reaches the predefined maximum number of rounds, the algorithm stops. Finally, the algorithm sorts all attributes in descending order of their values in the probability distribution vector and outputs the ranked attribute list.
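The flow above can be put together in a compact Python sketch. It is an illustration under stated assumptions, not the patented implementation: the evaluation step is passed in as a function `evaluate` (in the patent this is the 7-fold SVM cross-validation accuracy; here a toy score stands in for it), f[i] is taken to be the fraction of best subsets containing attribute i, and the names `l_eda` and `toy_score` are hypothetical.

```python
import random

def l_eda(m, evaluate, g, t, rounds, best_ratio, r, seed=0):
    """Rank m attributes; evaluate(subset) returns a score, higher is better."""
    rng = random.Random(seed)
    p = [0.5] * m                                   # A: initialization
    for _ in range(rounds):
        candidates = []
        for _ in range(t):                          # B: generate T subsets of size G
            s = set()
            while len(s) < g:
                i = rng.randrange(m)
                if i not in s and rng.random() < p[i]:
                    s.add(i)
            candidates.append(s)
        candidates.sort(key=evaluate, reverse=True) # C: evaluate and sort
        best = candidates[:max(1, int(best_ratio * t))]
        # D: update with formulas (1) and (2)
        f = [sum(i in s for s in best) / len(best) for i in range(m)]
        average = sum(f) / m
        for i in range(m):
            d = f[i] - average
            if d > 0:
                p[i] = (1 - r) * p[i] + r * (1 - p[i]) * d
            else:
                p[i] = (1 - r) * p[i] + r * p[i] * d
    # Output: attribute indices in descending order of probability
    return sorted(range(m), key=lambda i: p[i], reverse=True)

def toy_score(subset):
    # Toy stand-in for the SVM accuracy: attributes 0, 1, 2 are "informative".
    return len(subset & {0, 1, 2})

ranking = l_eda(m=20, evaluate=toy_score, g=5, t=40, rounds=5,
                best_ratio=0.2, r=0.3, seed=1)
```

On this toy problem the informative attributes would be expected to move toward the front of `ranking`, although the exact order depends on the random seed. Passing the evaluation in as a function keeps the search loop independent of the classifier that is used.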
From the meaning of the probability distribution vector in L-EDA, the attributes ranked at the top by L-EDA are those most relevant to the class information and with the strongest discriminating ability. Depending on specific needs, a certain proportion of the top-ranked attributes can be selected for analysis and study.
The effects of the present invention are:
1. Body fluid processing is completed in vitro; the steps are simple, easy to operate, and fast, and are suitable for processing and screening large numbers of samples.
2. The liquid chromatography-mass spectrometry method is highly repeatable and reliable. The analysis time per sample is short and the analytical throughput is high.
3. The algorithm is stable with respect to its parameter settings: under different parameter settings it yields consistent attribute rankings.
4. The algorithm evaluates attributes accurately: the attributes it ranks at the top reflect the characteristics of the sample set.
5. The algorithm executes efficiently, saves time, and is suitable for practical application.
Brief description of the drawings
Fig. 1 shows the serum metabolome profiles in the embodiment: (A) LC-MS total ion chromatogram of serum from a healthy woman; (B) LC-MS total ion chromatogram of serum from an ovarian cancer patient; (C) LC-MS total ion chromatogram of serum from a woman without recurrence after ovarian cancer surgery; (D) LC-MS total ion chromatogram of serum from a patient with recurrence after ovarian cancer surgery.
Fig. 2 is the curve of 7-fold cross-validation accuracy for the top-ranked attribute sets of L-EDA.
Fig. 3 compares the running times of L-EDA and traditional EDA.
Fig. 4 is the score plot of the PLS-DA model built on the original metabolome profile data.
Fig. 5 is the score plot of the PLS-DA model built on the top 20% of attributes of the metabolome profile data as ranked by L-EDA.
Embodiment
The embodiment of the present invention is described in detail below with reference to the accompanying drawings. The embodiment is implemented under the guidance of the technical solution of the present invention, but the scope of protection of the present invention is not limited to the following embodiment, which serves only as an example of the invention and not as a limitation. Various changes and improvements can be made to the invention without departing from its spirit and scope, and all such changes and improvements fall within the scope of protection of the invention.
Embodiment: screening of ovarian cancer prognostic markers based on serum metabolome profiles.
(1) Collection and pretreatment of human serum samples.
Before collection, the participants signed informed consent forms. Samples were collected under identical sampling conditions, and the collected serum samples were stored in a -80 °C freezer immediately. In this embodiment, a total of 106 human plasma samples were collected: 24 from healthy women, 21 from ovarian cancer patients, 36 from patients with recurrence after ovarian cancer surgery, and 25 from women without recurrence after ovarian cancer surgery.
Before metabolite analysis, the plasma samples were taken from the ultra-low-temperature freezer and thawed at room temperature. After thawing, they were mixed by shaking for 30 seconds.
For each of the 106 plasma samples, 4 volumes (720 μL) of acetonitrile were added to 180 μL of serum; the acetonitrile contained LEK and Lyso PC (12:0) as internal standards. The mixture was shaken thoroughly for 30 seconds and centrifuged at 15,000 g for 10 minutes at 4 °C, and the supernatant was freeze-dried. Before analysis, the residue was redissolved in 150 μL of water:acetonitrile (1/4, v/v). At this point the internal standard concentrations were 3 ng/μL for LEK and 3 ng/μL for Lyso PC (12:0).
(2) Analysis of serum metabolites by liquid chromatography-mass spectrometry.
Chromatographic analysis used an Agilent 1200 series Rapid Resolution Liquid Chromatography (RRLC) system with a 50 mm × 2.1 mm, 1.7 μm Waters BEH C18 column. The column temperature was kept at 50 °C and the flow rate was 0.35 mL/min. Mobile phase A was high-purity water containing 0.1% (v/v) formic acid and 2% (v/v) acetonitrile, and mobile phase B was acetonitrile. The gradient started at 5% B, rose to 35% B at 4 minutes, changed to 80% B at 22 minutes, reached 100% B at 24 minutes, and was held for 5 minutes, followed by 5 minutes of column equilibration. The autosampler was kept at 4 °C and the injection volume was 5 μL.
Mass spectrometric analysis used an Agilent 6510 quadrupole time-of-flight mass spectrometer (Q-TOF MS, Agilent, USA). Data were acquired in positive ion mode. The capillary voltage was set to 4000 V, and the fragmentor and skimmer voltages were set to 230 V and 65 V, respectively. The drying gas flow was 11 L/min, the nebulizer pressure was set to 45 psig, and the temperature was 350 °C. A mixture of purine and hexakis phosphazine was used as the calibration solution to maintain the accuracy and stability of the mass measurement; in positive ion mode they produce ions with mass-to-charge ratios of 121.0508 and 922.0097, respectively. The acquisition range was m/z 80-1000 in centroid mode, and the acquisition rate was 500 ms.
(3) Generation of the metabolome profile data.
Compound information was extracted from the acquired raw metabolome profile data with Molecular Feature Extraction (MFE, Agilent) software, and accurate molecular weights were calculated. Chromatographic peaks were then matched with GeneSpring (Agilent) software. The mass-to-charge ratio window was set to 0.01 and the retention time window to 0.2 min.
The matched data were area-normalized to reduce systematic error; after normalization, the sum of all peak areas in each sample equals 10000. The 80% rule was then applied to reduce the influence of missing values on the data set: an ion is retained only if its value is greater than 1 in 80% of the samples of at least one class.
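The two preprocessing steps in (3) can be sketched in Python as follows. The function names are hypothetical, and reading "80% of the samples" as "at least 80%" is an assumption about the rule rather than something the text states.

```python
def normalize_peak_areas(sample, total=10000.0):
    """Scale one sample's peak areas so that they sum to `total`."""
    s = sum(sample)
    if s == 0:
        raise ValueError("sample has no signal to normalize")
    return [total * a / s for a in sample]

def passes_80_percent_rule(values_by_class, threshold=1.0, fraction=0.8):
    """Keep an ion if, in at least one class, its value exceeds `threshold`
    in at least `fraction` of that class's samples (assumed reading of the rule).

    values_by_class maps a class label to the ion's values in that class.
    """
    for values in values_by_class.values():
        if values and sum(v > threshold for v in values) / len(values) >= fraction:
            return True
    return False

normalized = normalize_peak_areas([1.0, 1.0, 2.0])
kept = passes_80_percent_rule({
    "healthy": [2.0, 3.0, 4.0, 5.0, 0.5, 6.0],   # 5 of 6 samples above 1
    "cancer": [0.0, 0.0, 0.0, 0.0, 2.0, 0.0],
})
dropped = passes_80_percent_rule({"healthy": [2.0, 0.0, 0.0, 0.0, 0.0]})
```

Here the class labels and values are made-up illustration data; the first ion is kept because one class satisfies the rule, and the second is dropped.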
(4) The metabolome profile data are standardized. The candidate data preprocessing methods are centering, autoscaling, Pareto scaling, range scaling, or no preprocessing; this embodiment uses Pareto scaling. The parameters of the L-EDA algorithm are set as follows: the maximum number of iterative search rounds is 100; the number of candidate attribute subsets generated per round takes values in {400, 700, 1000}; the number of attributes in each candidate attribute subset takes values in {40, 70, 100}; the proportion of best-evaluated candidate subsets selected in each round is 0.2; and the proportion that the probability distribution model learns from the current set of best candidate subsets is 0.3.
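For reference, Pareto scaling in its usual form mean-centers each variable and divides by the square root of its standard deviation. The patent does not spell out the formula, so the sketch below follows that common definition; the use of the sample standard deviation (n - 1 in the denominator) and the function name `pareto_scale` are assumptions.

```python
import math

def pareto_scale(column):
    """Pareto-scale one variable (one ion across all samples)."""
    n = len(column)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = sum(column) / n
    # Assumption: sample standard deviation with n - 1 in the denominator
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    if sd == 0:
        return [0.0] * n              # constant variable: nothing to scale
    return [(x - mean) / math.sqrt(sd) for x in column]

scaled = pareto_scale([0.0, 4.0, 8.0])   # mean 4, sd 4, divisor sqrt(4) = 2
```

Compared with autoscaling, which divides by the standard deviation itself, this keeps more of the original magnitude differences between large and small peaks.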
(5) The probability distribution vector is built and initialized according to the number of attributes. The set of candidate attribute subsets for the first round of search is generated from the number of candidate subsets per round and the number of attributes per subset set in (4), together with the current probability distribution vector.
(6) A support vector machine is used to build classification models; 7-fold cross-validation is performed on each candidate attribute subset, the cross-validation accuracy is recorded, and the candidate attribute subsets are sorted from high to low cross-validation accuracy. For cross-validation, the samples are divided into 7 subsets. Each time, 1 subset is held out, an SVM model is built on the samples of the remaining 6 subsets, and the held-out subset is used as the validation set to check the classification accuracy. This process is repeated until every subset has served as the prediction set at least once, and the overall cross-validation accuracy is then calculated.
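The cross-validation procedure in (6) can be sketched as below. To keep the example free of external libraries, a simple nearest-centroid classifier stands in for the SVM used in the patent; this substitution, the random fold assignment, and all function names are assumptions made for illustration.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle sample indices 0..n-1 and split them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nearest_centroid_predict(train_x, train_y, x):
    """Stand-in classifier (NOT an SVM): assign x to the closest class centroid."""
    centroids = {}
    for label in set(train_y):
        rows = [row for row, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

def cross_validation_accuracy(x, y, k=7, seed=0, predict=nearest_centroid_predict):
    """Hold out each fold once, train on the rest, return the overall accuracy."""
    correct = 0
    for fold in k_fold_indices(len(x), k, seed):
        held_out = set(fold)
        train_x = [x[i] for i in range(len(x)) if i not in held_out]
        train_y = [y[i] for i in range(len(x)) if i not in held_out]
        correct += sum(predict(train_x, train_y, x[i]) == y[i] for i in fold)
    return correct / len(x)

# Two well-separated toy classes, 7 samples each
x = [[0.01 * i] for i in range(7)] + [[10.0 + 0.01 * i] for i in range(7)]
y = [0] * 7 + [1] * 7
accuracy = cross_validation_accuracy(x, y, k=7)
```

To evaluate one candidate attribute subset in this way, each row of `x` would contain only the columns of the attributes in that subset. With 106 samples and 7 folds the folds cannot all be the same size, so some hold 15 samples and others 16.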
(7) The set of best candidate subsets is selected according to the proportion of best candidate subsets set in (4). The frequency with which each attribute occurs in this set of best candidate subsets is counted, and the average occurrence frequency over attributes is calculated.
(8) Using the statistics from (7), the probability distribution vector is updated with the proposed update strategy to obtain a new probability distribution vector.
(9) The set of candidate attribute subsets for the next round of search is generated from the number of candidate subsets per round and the number of attributes per subset set in (4), together with the new probability distribution vector.
(10) Steps (6) to (9) are repeated until the maximum number of iterative search rounds set in (4) is reached. The result is the list of attributes ranked by L-EDA.
(11) In this experiment, the top 20% of attributes (78 attributes) in the ranked list output by L-EDA were selected for analysis. Table 1 gives, for different parameter settings, the 7-fold cross-validation accuracy obtained with a support vector machine on the top 20% attribute set ranked by L-EDA; the results show that under different parameters the attributes screened by L-EDA classify the samples accurately. Table 2 gives, for different parameter settings, the percentage of overlapping genes-related (POGR) values between the top 20% attribute sets ranked by L-EDA (Ref. 7: Zhang M.; Zhang L.; Zou J.; et al. Bioinformatics 2009, 25: 1662-1668). Under different parameters, the attributes screened by L-EDA are highly similar (the larger the POGR value, the more similar the two attribute sets). Tables 1 and 2 together show that L-EDA is stable with respect to two parameters: the number of candidate attribute subsets generated per round and the number of attributes per candidate subset. Because L-EDA is insensitive to these parameters, the subsequent analysis uses the top 20% attribute subset selected by L-EDA with the two parameters set to 700 and 70, respectively.
(12) Fig. 2 shows the curve of 7-fold SVM cross-validation accuracy as the top-ranked attributes are selected in turn according to the L-EDA ranking. The accuracy rises rapidly while the number of attributes is small (fewer than 10), which shows that the highest-ranked attributes have very strong discriminating ability. As the number of attributes increases further, the accuracy stays at a very high level with little fluctuation, which shows that the top-ranked attributes are all relevant to the class information and reflect the characteristics of the sample set.
(13) Evaluating candidate attribute subsets is the most time-consuming part of an estimation of distribution algorithm: the more attributes a candidate subset contains, the longer the classification algorithm needs to build a model. L-EDA therefore limits the capacity of the candidate attribute subsets, which improves execution efficiency. Fig. 3 compares the time consumption of the traditional estimation of distribution algorithm and L-EDA. As Fig. 3 shows, L-EDA saves approximately 50% to 65% of the time relative to traditional EDA. Note that the L-EDA running times in Fig. 3 were measured with the number of attributes per candidate subset set to 70; if this parameter is set to a smaller value, L-EDA can be expected to save even more time.
(14) Fig. 4 and Fig. 5 show the score plots obtained by PLS-DA modeling, with the multivariate statistical analysis tool SIMCA (soft independent modeling of class analogy), of the metabolome profile data before and after L-EDA screening. In Fig. 4, PLS-DA does not separate the postoperative non-recurrence group from the postoperative recurrence group and the ovarian cancer group, which indicates that the difference between postoperative recurrence and non-recurrence is masked. Moreover, as a supervised learning method, PLS-DA may overfit the data, making the model untrustworthy. A permutation test with 200 permutations on the PLS-DA model gave an R² intercept of 0.419 and a Q² intercept of -0.678. According to previous work (Ref. 8: L Eriksson; E.J.; N Kettaneh-Wold; et al. Umetrics 2001), the R² intercept should be less than 0.4 and the Q² intercept should be less than 0.05; the permutation test parameters indicate that the PLS-DA model depends too heavily on the current data and class information and that overfitting has occurred. In the PLS-DA model built on the variables extracted by L-EDA, by contrast, the samples of the postoperative non-recurrence group show a clear tendency toward the normal group (Fig. 5) and differ from the postoperative recurrence group and the cancer group. The permutation test confirmed that this model is not overfitted. This shows that the attributes found by L-EDA reflect well whether the disease recurs after surgery, and these attributes can be analyzed as potential prognostic markers.
(15) For clinical use, a potential marker must differ significantly between classes. A Wilcoxon rank test (p < 0.05) was applied to the 78 attributes selected by L-EDA; 6 attributes (5 metabolites) met the requirement of a significant difference between classes (p < 0.05). Table 3 gives the details of these 5 metabolites.
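The significance filter in (15) can be illustrated with a small two-sample Wilcoxon rank-sum test. The sketch below is a simplified version written for this example: it uses average ranks for ties and a normal approximation without tie or continuity correction, so its p-values are approximate for small groups. The function name and the example values are assumptions, not data from the patent.

```python
import math

def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p). Ties receive average ranks; no tie correction
    is applied to the variance, so this is only a sketch.
    """
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    n1, n2 = len(a), len(b)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2 + 1            # average 1-based rank of the tie run
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)  # rank sum of a
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p

# Hypothetical peak areas of one attribute in two groups.
z, p = rank_sum_test([1.0, 1.2, 0.9, 1.1, 1.3], [2.0, 2.4, 1.9, 2.2, 2.1])
keep_attribute = p < 0.05
```

An attribute would be kept as a candidate marker only when `keep_attribute` is true for the group comparison of interest.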
(16) Using the 5 metabolites obtained in (15), support vector machine (SVM) classification models were built for two comparisons: the postoperative recurrence group versus the postoperative non-recurrence group, and the "disease-free" group (normal group and non-recurrence group) versus the "diseased" group (recurrence group and ovarian cancer group). With 7-fold cross-validation, the accuracies were 86.9% and 88.7%, respectively. A permutation test with 200 permutations was carried out on each of the two models, giving R² and Q² intercepts of -0.601 and -1.079 for the first model and -0.729 and -1.172 for the second. It can be concluded that the markers selected by this method discriminate well between the groups, that the models are reliable, and that the method has application prospects.
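The 7-fold cross-validation used in (16) can be sketched as follows. To keep the example self-contained, a nearest-centroid classifier is used here as a stand-in for the support vector machine of the patent; the function names and the fold assignment scheme are likewise assumptions made for illustration.

```python
import random

def nearest_centroid_fit(X, y):
    """Return {label: centroid}. Stand-in classifier, not an SVM."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, v in enumerate(row):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def nearest_centroid_predict(centroids, row):
    """Predict the label whose centroid is closest (squared Euclidean)."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, centroids[lab])),
    )

def cross_val_accuracy(X, y, k=7, seed=0):
    """k-fold cross-validation accuracy over shuffled sample indices."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k roughly equal folds
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = nearest_centroid_fit([X[i] for i in train],
                                     [y[i] for i in train])
        correct += sum(nearest_centroid_predict(model, X[i]) == y[i]
                       for i in fold)
    return correct / len(X)
```

Each sample is predicted exactly once by a model that never saw it during training, which is what makes the reported accuracy an estimate of performance on new samples.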
Table 1. Cross-validation accuracy of the top 20% of attributes ranked by L-EDA under different parameter settings

Accuracy (a)    G (b) = 40    G = 70    G = 100
T (c) = 400     0.953         0.991     0.991
T = 700         0.972         0.991     0.981
T = 1000        0.981         0.981     0.991

a: 7-fold cross-validation accuracy with a support vector machine as the classifier.
b: G is the number of attributes contained in each candidate attribute subset.
c: T is the number of candidate attribute subsets generated in each round.
Table 2. POGR values between the top-20% attribute sets ranked by L-EDA under different parameter settings
a: POGR value between different parameter combinations.
b: T is the number of candidate attribute subsets generated in each round, and G is the number of attributes contained in each candidate attribute subset.
Table 3. Potential ovarian cancer prognostic markers
a: verified with standard samples.

Claims (1)

1. A method for screening ovarian cancer body fluid prognostic markers using L-EDA, characterized in that:
1) Collection and pretreatment of body fluid samples: take body fluid samples, stored at ≤ -60 °C, from healthy women, ovarian cancer patients, women without recurrence after ovarian cancer surgery, and patients with recurrence after ovarian cancer surgery; remove the samples from the freezer and thaw them at room temperature, add 3-5 volumes of acetonitrile, shake thoroughly for 10-40 seconds, then centrifuge at 10000-20000 g for 5-20 minutes at 4-8 °C, take the supernatant and freeze-dry it; before analysis, redissolve in a mixed solution of acetonitrile/water = 1/4 (v/v);
2) Analysis of the metabolites in serum by liquid chromatography-mass spectrometry: the chromatographic column is a 50 mm × 2.1 mm, 1.7 μm Waters BEH C18 column; the column temperature is kept at 35-60 °C and the flow rate is 0.3-0.4 mL/min; mobile phase A is high-purity water containing 0.1-1% formic acid and 0.1-5% acetonitrile by volume, and mobile phase B is acetonitrile; gradient elution is used, starting at 5% B, rising to 35% B at 4 minutes, changing to 80% B at 22 minutes and reaching 100% B at 24 minutes, which is held for 5 minutes and followed by 5 minutes of column equilibration; the autosampler is kept at 4-8 °C and the injection volume is 1-10 μL; mass spectrometry is performed on an Agilent 6510 quadrupole time-of-flight mass spectrometer (Q-TOF MS, Agilent, USA); mass spectral data are acquired in positive ion mode over a mass-to-charge ratio range of 80-1000;
3) Preliminary extraction and screening of the raw metabolomics profile data: compound information is extracted from the acquired raw metabolomics profile data with Agilent's Molecular Feature Extraction software, and accurate molecular weights are calculated from the mass spectral data; chromatographic peaks are then matched with Agilent's GeneSpring software; the matched data are area-normalized to reduce systematic error; the 80% rule is then applied to reduce the effect of missing values on the data set, that is, an ion is retained if its value is greater than 1 in 80% of the samples of any one class;
4) Screening of ovarian cancer prognostic markers with the L-EDA algorithm: the L-EDA algorithm is run on the metabolomics profile obtained above to analyze the metabolomics data, where each attribute in the algorithm corresponds to one metabolite in the metabolomics profile;
The specific method for building the model and the L-EDA screening steps are:
1. Extract candidate attribute subsets: two, or three or more, attribute sets are extracted by iteration, and each attribute set is one candidate attribute subset; L-EDA fixes the number of attributes in every candidate attribute subset at G, where G is a positive integer, generally 5-20% of the total number of attributes;
2. Update the probability distribution model: set the parameters of the L-EDA algorithm; in each round, the proportion of the best-evaluated candidate subsets that is selected is set to 0.1-0.3, and the rate at which the probability distribution model learns from the current set of best candidate subsets is set to 0.2-0.4; the mean of the frequencies with which the attributes occur in the set of best candidate attribute subsets represents the average performance of all attributes;
3. Build the probability distribution model and repeat the following steps until the search has been carried out a predefined number of times: extract a set of candidate attribute subsets by the method of step 1, evaluate each candidate attribute subset, update the probability distribution model by the method of step 2, and enter the next round of the search; when the algorithm finishes, it outputs a ranked list of all attributes;
4. Screen the set of potential ovarian cancer prognostic markers: from the ranked list of all attributes finally output by step 3, analyze the attributes ranked in the top 10-30%; if an attribute meets the requirement of a significant difference between the groups of the metabolomics data in the Wilcoxon rank test, identify the metabolite corresponding to that attribute; finally, take that metabolite as a potential ovarian cancer prognostic marker.
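The search loop described in the screening steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the probability model is a simple per-attribute weight, the update rule is a linear blend of the old weight and the attribute's frequency among the best subsets, subsets of exactly G attributes are drawn by weighted sampling without replacement, and `evaluate` is a hypothetical callable standing in for the classifier-based evaluation of a candidate subset. All function and parameter names are invented for the example.

```python
import random

def weighted_sample(probs, g, rng):
    """Draw g distinct attribute indices, favoring high-probability ones
    (key = u ** (1 / w) trick for weighted sampling without replacement)."""
    keys = [(rng.random() ** (1.0 / p), i) for i, p in enumerate(probs)]
    keys.sort(reverse=True)
    return [i for _, i in keys[:g]]

def l_eda(n_attrs, evaluate, g, t, rounds,
          select_ratio=0.2, learn_rate=0.3, seed=0):
    """Return attribute indices ranked from most to least promising.

    g: attributes per candidate subset (fixed size, the "L" limit)
    t: candidate subsets generated per round
    select_ratio: share of best subsets kept per round (0.1-0.3 in the text)
    learn_rate: learning rate of the probability model (0.2-0.4 in the text)
    evaluate: hypothetical callable, subset -> score (higher is better)
    """
    rng = random.Random(seed)
    probs = [0.5] * n_attrs                  # initial probability model
    n_best = max(1, int(t * select_ratio))
    for _ in range(rounds):
        subsets = [weighted_sample(probs, g, rng) for _ in range(t)]
        subsets.sort(key=evaluate, reverse=True)
        best = subsets[:n_best]
        for a in range(n_attrs):
            freq = sum(a in s for s in best) / n_best
            # Assumed update rule: blend old weight with observed frequency.
            probs[a] = (1 - learn_rate) * probs[a] + learn_rate * freq
    return sorted(range(n_attrs), key=lambda a: probs[a], reverse=True)

# Toy run: attributes 0, 1 and 2 are the only informative ones.
informative = {0, 1, 2}
ranking = l_eda(30, lambda s: len(informative & set(s)),
                g=5, t=100, rounds=40)
```

In a real run, `evaluate` would train and cross-validate a classifier on the attributes of the subset, and the top 10-30% of `ranking` would then be passed to the significance test of step 4.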
CN201010558383.6A 2010-11-25 2010-11-25 Method for screening ovarian cancer body fluid prognostic marker by L-EDA Expired - Fee Related CN102478562B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010558383.6A CN102478562B (en) 2010-11-25 2010-11-25 Method for screening ovarian cancer body fluid prognostic marker by L-EDA

Publications (2)

Publication Number Publication Date
CN102478562A CN102478562A (en) 2012-05-30
CN102478562B CN102478562B (en) 2014-07-23

Family

ID=46091274

Country Status (1)

Country Link
CN (1) CN102478562B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111595980B (en) * 2020-06-21 2022-06-10 山东省海洋生物研究院 Identification method of vibrio strains

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101769910A (en) * 2008-12-30 2010-07-07 中国科学院大连化学物理研究所 Method for screening malignant ovarian tumor markers from blood serum metabolic profiling
CN101832977A (en) * 2009-03-09 2010-09-15 复旦大学附属妇产科医院 Ovarian tumor serum marker

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080086272A1 (en) * 2004-09-09 2008-04-10 Universite De Liege Quai Van Beneden, 25 Identification and use of biomarkers for the diagnosis and the prognosis of inflammatory diseases
US20120004854A1 (en) * 2008-05-28 2012-01-05 Georgia Tech Research Corporation Metabolic biomarkers for ovarian cancer and methods of use thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Jing Chen et al. "Application of L-EDA in metabonomics data handling: global metabolite profiling and potential biomarker discovery of epithelial ovarian cancer prognosis". Metabolomics, 2011-02-17; full text *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140723

Termination date: 20211125
