WO2014019180A1 - Method and *** for determining abnormal state biomarkers - Google Patents

Method and *** for determining abnormal state biomarkers

Info

Publication number
WO2014019180A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequencing
nucleic acid
gene
relative abundance
abnormal state
Prior art date
Application number
PCT/CN2012/079524
Other languages
English (en)
French (fr)
Inventor
李胜辉
覃俊杰
朱剑锋
张东亚
揭著业
王俊
汪建
杨焕明
Original Assignee
深圳华大基因研究院
深圳华大基因科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因研究院, 深圳华大基因科技有限公司
Priority to PCT/CN2012/079524 priority Critical patent/WO2014019180A1/zh
Priority to CN201280075072.1A priority patent/CN104603283B/zh
Priority to US13/640,448 priority patent/US20150376697A1/en
Priority to PCT/CN2012/080479 priority patent/WO2014019267A1/en
Publication of WO2014019180A1 publication Critical patent/WO2014019180A1/zh
Priority to HK15108222.6A priority patent/HK1207670A1/zh

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00: Oligonucleotides characterized by their use
    • C12Q2600/112: Disease subtyping, staging or classification

Definitions

  • the invention relates to the field of biotechnology.
  • the present invention relates to methods and systems for determining abnormal state biomarkers.
  • Background art
  • Metagenomics, also known as environmental genomics, ecogenomics, or community genomics, is the discipline that directly studies the sum of the genomes contained in microbial communities under natural conditions (including culturable and non-culturable bacteria, fungi, and viruses).
  • Handelsman of the Department of Plant Pathology at the University of Wisconsin first proposed the concept of "metagenomics" when studying soil microbes.
  • Traditional microbial research is limited by the techniques of microbial isolation and pure culture, whereas metagenomics is a new microbiological research method that takes the microbial community in a specific environment as its object and studies microbial diversity, population structure, evolutionary relationships, functional activity, mutual cooperation, and relationships with the environment.
  • the basic research strategies for metagenomics research include: extraction and purification of large fragments of environmental genomic DNA, library construction, target gene screening, and/or large-scale sequencing analysis.
  • the metagenomic library contains both culturable and non-culturable microbial genes and genomes, which clone the total DNA in a natural environment into culturable host cells, thus avoiding the problem of microbial isolation culture.
  • through large-scale sequencing combined with bioinformatics tools, it is possible to discover a large number of previously unknown microbial genes or new gene clusters, which is of great significance for understanding the composition, evolutionary history and metabolic characteristics of microbial communities and for mining new genes with potential applications.
  • the present invention aims to solve at least one of the technical problems existing in the prior art. To this end, the present invention proposes a method and system for efficiently determining abnormal state biomarkers.
  • the invention proposes a method of determining an abnormal state biomarker.
  • the method comprises the steps of: performing nucleic acid sequencing on a nucleic acid sample from a first object and a nucleic acid sample from a second object to obtain a first sequencing result and a second sequencing result, respectively, each consisting of a plurality of sequencing sequences, wherein the first object has the abnormal state, the second object does not have the abnormal state, the nucleic acid sample from the first object and the nucleic acid sample from the second object are isolated from the same type of sample, and the first object and the second object belong to the same species; and determining a marker associated with the abnormal state based on a difference between the first sequencing result and the second sequencing result.
  • with the method of this embodiment of the present invention, a marker associated with an abnormal state can be efficiently determined by sequencing nucleic acid samples of the two objects and comparing the results.
  • the above method of determining an abnormal state biomarker may further have the following additional technical features:
  • the abnormal state is a disease.
  • the disease is at least one selected from the group consisting of a neoplastic disease, an immunological disease, a hereditary disease, and a metabolic disease.
  • the abnormal state is diabetes.
  • the first object and the second object are human.
  • the nucleic acid sample from the first object and the nucleic acid sample from the second object are separated from the excrement of the first object and the second object, respectively.
  • At least one of the nucleic acid sample from the first object and the nucleic acid sample from the second object is nucleic acid sequenced using a second generation sequencing technique or a third generation sequencing technique.
  • the nucleic acid sequencing is performed using at least one selected from the group consisting of Hiseq2000, SOLiD, 454, and single molecule sequencing devices.
  • determining the biomarker of the abnormal state based on a difference between the first sequencing result and the second sequencing result further comprises: aligning the sequencing sequences constituting the first sequencing result and the second sequencing result with a reference gene set; based on the alignment result, determining the relative abundance of each gene in the nucleic acid samples from the first object and the second object, respectively; statistically testing the relative abundance of each gene in the nucleic acid samples from the first object and the second object; and determining that a gene having a significant difference in relative abundance between the nucleic acid samples from the first object and the second object is a genetic marker of the abnormal state.
  • before the sequencing sequences constituting the first sequencing result and the second sequencing result are aligned with a reference gene set, the method further comprises filtering the sequencing results to remove contamination, wherein the contamination is at least one selected from the group consisting of: contaminant contamination, low quality sequences, and host genome contamination sequences.
  • the sequencing sequences constituting the first sequencing result and the second sequencing result are aligned with a reference gene set using at least one selected from the group consisting of SOAP2 and MAQ; optionally, the reference gene set is a non-redundant gene set of a human intestinal microbial community.
  • the method further includes: performing assembly and gene prediction on the sequencing sequences constituting the first sequencing result and the second sequencing result to obtain genes, wherein a gene that cannot be aligned with the reference gene set is a new gene; and adding the determined new genes to the reference gene set.
  • the species classification is performed by aligning each gene of the reference gene set with an IMG database.
  • each gene in the reference gene set is aligned with the IMG database using BLASTP, and the species classification level of the gene is determined from results with an E-value of less than 10^-10.
  • the functional annotation is performed by aligning each of the reference gene sets with at least one of eggNOG and KEGG.
  • each gene in the reference gene set is aligned with the IMG database using BLASTP, and the function of the gene is determined from results with an E-value of less than 10^-10.
  • the relative abundance is a species relative abundance and a functional relative abundance, and the reference gene set comprises species information and functional annotations of the genes, wherein determining the biomarker of the abnormal state based on the difference between the first sequencing result and the second sequencing result further comprises: aligning the sequencing sequences constituting the first sequencing result and the second sequencing result with the reference gene set; based on the alignment result, determining the species relative abundance and the functional relative abundance of each gene in the nucleic acid samples from the first object and the second object, respectively; performing a statistical test on the species relative abundance and the functional relative abundance of each gene in the nucleic acid samples from the first object and the second object; and determining the species and the functions that differ significantly in relative abundance between the nucleic acid samples from the first object and the second object as species markers and functional markers of the abnormal state, respectively.
  • the statistical test is at least one selected from the group consisting of Student's t-test and the Wilcoxon rank-sum test.
  • the method further comprises filtering to remove samples that are significantly affected by apparent factors, preferably by enterotype analysis and at least one test selected from Fisher's exact test and the Mantel-Haenszel test.
  • the method further comprises performing cluster analysis and deep assembly on the obtained genetic markers to construct the genome of an organism related to the abnormal state.
  • the step of verifying the biomarker is further included.
  • the invention also provides a system for determining an abnormal state biomarker.
  • the system comprises: a sequencing device adapted to perform nucleic acid sequencing on a nucleic acid sample from a first object and a nucleic acid sample from a second object to obtain a first sequencing result and a second sequencing result, respectively, each consisting of a plurality of sequencing sequences, wherein the first object has the abnormal state, the second object does not have the abnormal state, the nucleic acid sample from the first object and the nucleic acid sample from the second object are isolated from the same type of sample, and the first object and the second object belong to the same species; and an analysis device coupled to the sequencing device, which receives the first sequencing result and the second sequencing result from the sequencing device and is adapted to determine a marker associated with the abnormal state based on the difference between the first sequencing result and the second sequencing result.
  • the system for determining an abnormal state biomarker may also have the following additional technical features:
  • a nucleic acid sample separation device coupled to the sequencing device and adapted to isolate a nucleic acid sample from an object, optionally adapted to isolate the nucleic acid sample from excrement of the object.
  • the sequencing device is a second generation sequencing platform or a third generation sequencing platform.
  • the sequencing device is at least one selected from the group consisting of Hiseq2000, SOLiD, 454, and single molecule sequencing devices.
  • the analyzing device further comprises:
  • an aligning unit adapted to align the sequencing sequences constituting the first sequencing result and the second sequencing result with a reference gene set;
  • a relative abundance determining unit connected to the aligning unit and adapted to determine, based on the alignment result, the relative abundance of each gene in the nucleic acid samples from the first object and the second object, respectively;
  • a test unit coupled to the relative abundance determining unit and adapted to statistically test the relative abundance of each gene in the nucleic acid samples from the first object and the second object;
  • a marker determining unit adapted to determine, based on the statistical test result, a gene having a significant difference in relative abundance between the nucleic acid samples from the first object and the second object as a genetic marker of the abnormal state.
  • the analyzing device further comprises:
  • a filtering unit coupled to the aligning unit and adapted to filter the sequencing results, before the sequencing sequences constituting the first sequencing result and the second sequencing result are aligned with the reference gene set, to remove contamination, wherein the contamination is at least one selected from the group consisting of: contaminant contamination, low quality sequences, and host genome contamination sequences.
  • the aligning unit compares the sequencing sequence constituting the first sequencing result and the second sequencing result with a reference gene set by using at least one selected from the group consisting of SOAP2 and MAQ
  • the reference gene set is a non-redundant gene set of a human intestinal microflora.
  • the relative abundance is a species relative abundance and a functional relative abundance of each gene
  • the reference gene set includes species information and functional annotations of the genes
  • the relative abundance determining unit is adapted to determine the species relative abundance and the functional relative abundance of each gene in the nucleic acid samples from the first object and the second object, respectively, based on the alignment result;
  • the test unit is adapted to perform statistical tests on the species relative abundance and the functional relative abundance of each gene in the nucleic acid samples from the first object and the second object;
  • the marker determining unit is adapted to determine a species marker and a functional marker of the abnormal state based on the species and functions having significant differences in relative abundance between the nucleic acid samples from the first object and the second object.
  • the test unit is adapted to perform at least one statistical test selected from the group consisting of Student's t-test and the Wilcoxon rank-sum test.
  • the system further comprises a genome assembly device adapted to perform cluster analysis and deep assembly on the obtained genetic markers to construct the genome of an organism related to the abnormal state.
  • the method for determining an abnormal state-related biomarker according to embodiments of the present invention can use high-throughput sequencing technology to perform association analysis between the metagenome and diseases and to search for disease-related biomarkers, which greatly increases throughput and greatly reduces cost. It can be applied to large cohorts and makes full use of the information in known reference gene sets, so that the results are more reproducible and more credible. Using multiple association statistical tests greatly reduces the false positives caused by fluctuation in relative abundance estimates while preserving the power of the tests, so the linkage between a marker and the target trait can be determined directly, and the association analysis is highly reliable and accurate.
  • FIG. 1 is a flow chart showing a method of determining an abnormal state biomarker in an embodiment of the present invention
  • FIG. 2 shows a schematic flow diagram of a method of determining an abnormal state biomarker in accordance with another embodiment of the present invention
  • FIG. 3 shows a schematic diagram of a system for determining an abnormal state biomarker according to an embodiment of the present invention
  • FIGS. 4-6 show schematic flow diagrams of methods for determining an abnormal state biomarker according to Embodiments 3, 4 and 5 of the present invention.
  • Figure 7 shows the detection error rate distribution for relative abundance characteristics with different sequencing amounts, in accordance with an embodiment of the present invention.
  • the X-axis represents the amount of sequencing of the sample, which is defined as the number of paired-end sequencing data
  • the Y-axis represents the relative abundance of the gene.
  • the 99% confidence interval (CI) of the relative abundance is estimated, and the detection error rate is defined as the ratio of the confidence interval width to the relative abundance itself.
  • the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, features defined by "first" or "second" may explicitly or implicitly include one or more of the features. Further, in the description of the present invention, "multiple" means two or more unless otherwise stated.
  • a first aspect of the invention provides a method of determining an abnormal state biomarker.
  • the method for determining an abnormal state biomarker includes the following steps:
  • nucleic acid samples from a first subject and nucleic acid samples from a second subject are subjected to nucleic acid sequencing to obtain first sequencing results and second sequencing results composed of a plurality of sequencing sequences, respectively.
  • the first object and the second object have different states: specifically, the first object has the abnormal state, the second object does not have the abnormal state, the nucleic acid sample from the first object and the nucleic acid sample from the second object are isolated from the same type of sample, and the first object and the second object belong to the same species.
  • the marker related to the abnormal state may be determined based on a difference between the first sequencing result and the second sequencing result.
  • because the nucleic acid samples of the first object and the second object are extracted from the same type of sample, the difference between the first sequencing result and the second sequencing result can reflect the abnormal state biomarker.
  • abnormal state as used herein shall be understood broadly and may refer to any state in which an object (organism) differs from a normal state, either as a physiological anomaly or as a psychological anomaly.
  • the abnormal state is a disease.
  • the type of disease that can be studied using the method of the present invention is not particularly limited.
  • the disease is at least one selected from the group consisting of a neoplastic disease, an immunological disease, a hereditary disease, and a metabolic disease.
  • the abnormal state is diabetes.
  • the scope of the term "object” as used herein is not particularly limited and may be any organism.
  • the first object and the second object are human.
  • the first object may be a patient suffering from a specific disease
  • the second object may be a healthy person.
  • the number of the first object and the second object is not particularly limited and may be plural, whereby the reliability of the determined biomarker can be further improved.
  • the source of the nucleic acid sample is not particularly limited, as long as the sources of the nucleic acid samples of the first and second objects are the same. According to one embodiment of the invention, the nucleic acid sample of the first object and the nucleic acid sample of the second object are isolated from the excrement of the first object and the second object, respectively. Thereby, the nucleic acid information of the intestinal microorganisms can be effectively determined, so that the relationship between the specific disease of the object and the intestinal flora can be effectively determined.
  • the means for sequencing the nucleic acid sample is not particularly limited.
  • nucleic acid sequencing of at least one of a nucleic acid sample from the first subject and a nucleic acid sample from the second subject is performed using a second generation sequencing technique or a third generation sequencing technique.
  • the nucleic acid sequencing is performed using at least one selected from the group consisting of Hiseq2000, SOLiD, 454, and single molecule sequencing devices. Thereby, the high-throughput, deep sequencing characteristics of these sequencing devices can be utilized, which improves the subsequent analysis of the sequencing data, especially the precision and accuracy of the statistical tests.
  • the first sequencing result and the second sequencing result can be analyzed by any method.
  • biomarkers can be determined by the following methods with reference to Figure 2:
  • the sequencing sequences constituting the first sequencing result and the second sequencing result are compared with the reference gene set.
  • the type of the reference gene set is not particularly limited and may be a database of any known sequence, and for example, a known human intestinal microflora non-redundant gene set may be employed.
  • the step of filtering the sequencing result to remove the contamination is further included.
  • the contamination may be at least one selected from the group consisting of: contaminant contamination, low quality sequences and host genome contamination sequences.
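The filtering step above can be illustrated with a minimal sketch. It assumes reads are available as (identifier, sequence, quality string) tuples, that qualities use the Phred+33 encoding, and that host-derived reads have already been flagged by a separate alignment against the host genome; the function names, the quality cutoff of 20 and the `host_ids` parameter are illustrative assumptions, not part of the described method.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (assumes Phred+33 encoding)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def filter_reads(reads, host_ids=frozenset(), min_mean_q=20.0):
    """Keep reads that are neither low quality nor flagged as host-derived.

    reads: iterable of (read_id, sequence, quality_string) tuples.
    host_ids: hypothetical set of read ids that aligned to the host genome.
    """
    kept = []
    for read_id, seq, qual in reads:
        if not seq or len(seq) != len(qual):
            continue  # malformed record
        if read_id in host_ids:
            continue  # host genome contamination
        if mean_phred(qual) < min_mean_q:
            continue  # low quality sequence
        kept.append((read_id, seq, qual))
    return kept
```

Only reads that pass every check would then be passed on to the alignment against the reference gene set.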
  • the sequencing sequence can be aligned to a reference gene set using any known tool.
  • the sequencing sequence constituting the first sequencing result and the second sequencing result may be aligned with a reference gene set using at least one selected from the group consisting of SOAP2 and MAQ. Thereby, the efficiency of the alignment can be further improved, thereby improving the efficiency of determining the biomarker.
  • the relative abundance of each gene in the nucleic acid samples from the first object and the second object is determined, respectively.
  • from the alignment, a correspondence between the sequencing sequences and the genes of the reference gene set can be established.
  • the relative abundance of genes in the nucleic acid sample can be determined by comparison of the results, according to conventional statistical analysis.
  • a statistical test is performed on the accuracy of the relative abundance, preferably using a Poisson distribution. Specifically, the method of Audic and Claverie (1997) (Audic, S.
  • represents the relative abundance calculated from the sequencing data.
  • the inventors varied the relative abundance from 0 to 1e-5 and the sequencing amount from 0 to 40 million in order to calculate the 99% confidence interval and further evaluate the detection error rate. The result is shown in Fig. 7.
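The relationship between sequencing amount and detection error rate can be illustrated with a small sketch. It assumes the read count of a gene follows a Poisson distribution with mean equal to relative abundance times sequencing amount, and it uses a plain central Poisson interval rather than the Audic and Claverie procedure cited in the text; the function names are illustrative.

```python
import math


def poisson_interval(lam: float, level: float = 0.99) -> tuple[int, int]:
    """Central interval of read counts for X ~ Poisson(lam)."""
    if lam <= 0:
        raise ValueError("expected count must be positive")
    tail = (1.0 - level) / 2.0
    cdf, k, lower = 0.0, 0, None
    while True:
        # pmf evaluated in log space to avoid overflow for large k
        cdf += math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
        if lower is None and cdf >= tail:
            lower = k
        if cdf >= 1.0 - tail:
            return lower, k
        k += 1


def detection_error_rate(abundance: float, n_reads: int, level: float = 0.99) -> float:
    """Width of the confidence interval divided by the relative abundance itself."""
    lam = abundance * n_reads  # expected number of reads hitting the gene
    lower, upper = poisson_interval(lam, level)
    # Interval width in abundance units is (upper - lower) / n_reads;
    # dividing by the abundance gives (upper - lower) / lam.
    return (upper - lower) / lam
```

Under these assumptions the error rate falls as the sequencing amount grows, because the interval width scales roughly with the square root of the expected count while the count itself scales linearly.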
  • biomarker as used herein is to be understood broadly to include any detectable biological indicator capable of reflecting an abnormal state, and may be a genetic marker, a species marker, and a functional marker.
  • the sequencing data may be assembled and subjected to gene prediction to obtain genes, wherein a gene that cannot be aligned with the reference gene set is a new gene; the determined new genes are added to the reference gene set, thereby increasing the capacity of the reference gene set and improving the efficiency of determining the biomarker.
  • species classification can be performed by aligning each gene in the reference gene set with an IMG database.
  • each gene in the reference gene set is aligned with the IMG database using BLASTP, and the species classification level of the gene is determined from results with an E-value of less than 10^-10. Thereby, the species classification of the gene can be efficiently determined.
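A minimal sketch of the E-value filtering step follows. It assumes the BLASTP results are available in the standard 12-column tabular layout (query, subject, identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score). The cutoff is a parameter; the default of 1e-10 is only an assumed reading of the threshold in the text, and choosing the best hit by bit score is likewise an illustrative choice.

```python
def best_hits(blast_tab_lines, max_evalue=1e-10):
    """Return {query: (subject, evalue, bitscore)} for the best hit below the cutoff.

    blast_tab_lines: iterable of tab-separated lines in 12-column tabular format.
    max_evalue: assumed cutoff; hits with E-value >= max_evalue are discarded.
    """
    best = {}
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue >= max_evalue:
            continue
        # keep the hit with the highest bit score for each query gene
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

The subject identifier of each retained hit could then be looked up in the database's taxonomy or function tables to assign a classification to the query gene.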
  • functional annotation of each gene in the reference gene set is performed by aligning the gene with at least one of eggNOG and KEGG.
  • each gene in the reference gene set is aligned with the IMG database using BLASTP, wherein the function of the gene is determined based on results whose E-value is less than a set threshold. Thereby, the functional classification of the gene can be efficiently determined.
  • by classifying the species information and functional annotations of the genes, the species relative abundance and the functional relative abundance can be further determined, so that species markers and functional markers of the abnormal state can be determined.
  • the relative abundance is a species relative abundance and a functional relative abundance of the genes, and the reference gene set comprises species information and functional annotations of the genes, wherein determining the biomarker of the abnormal state based on the difference between the first sequencing result and the second sequencing result further comprises: aligning the sequencing sequences constituting the first sequencing result and the second sequencing result with the reference gene set; based on the alignment result, determining the species relative abundance and the functional relative abundance of each gene in the nucleic acid samples from the first object and the second object, respectively; performing a statistical test on the species relative abundance and the functional relative abundance of each gene in the nucleic acid samples from the first object and the second object; and determining the species and the functions that differ significantly in relative abundance between the nucleic acid samples from the first object and the second object as species markers and functional markers, respectively.
  • the method for determining the functional relative abundance and the species relative abundance is not particularly limited. According to an embodiment of the present invention, the relative abundances of genes from the same species, and of genes with the same functional annotation, may be combined using a statistic such as the sum, mean or median to determine the species relative abundance and the functional relative abundance. According to one embodiment of the invention, the relative abundance of each gene can be calculated according to the following formula:
  • a_i is the relative abundance of gene i in the sample
  • x_i is the number of times gene i is detected in the sample.
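The formula itself did not survive in this text, so the sketch below should be read only as one plausible form, not as the formula of the invention. It assumes a length-normalised definition in which the count of each gene is divided by the gene length and then scaled so that all genes sum to one; the gene lengths, the function names and the use of a sum for aggregation are all assumptions introduced for illustration.

```python
from collections import defaultdict


def gene_relative_abundance(counts, lengths):
    """Length-normalised relative abundance (assumed form, see lead-in).

    counts: {gene: number of times the gene is detected in the sample}
    lengths: {gene: gene length in bases} (assumed input, not defined in the text)
    """
    copies = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(copies.values())
    if total == 0:
        raise ValueError("no gene was detected in the sample")
    return {gene: value / total for gene, value in copies.items()}


def aggregate_abundance(abundance, annotation):
    """Sum gene abundances that share a species or function label."""
    grouped = defaultdict(float)
    for gene, value in abundance.items():
        label = annotation.get(gene)
        if label is not None:  # genes without an annotation are skipped
            grouped[label] += value
    return dict(grouped)
```

Passing a gene-to-species mapping to `aggregate_abundance` would give a species relative abundance, and a gene-to-function mapping would give a functional relative abundance; a mean or median could be substituted for the sum.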
  • a method of statistically testing the relative abundance of a gene, the relative abundance of a species, and the relative abundance of a function is not particularly limited.
  • the statistical test may be at least one selected from the group consisting of Student's t-test and the Wilcoxon rank-sum test.
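As an illustration of the statistical test step, the following sketch implements a two-sided Wilcoxon rank-sum test on the relative abundances of one gene in the two groups. It assumes the normal approximation with a tie correction and no continuity correction, which is reasonable only for moderately large groups; the function name is illustrative.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected).

    x, y: relative abundances of one gene in the two groups of samples.
    Returns (rank sum of x, approximate p-value).
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must contain at least one sample")
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = n1 + n2
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            ranks[k] = average_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mean = n1 * (n + 1) / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return w, 1.0  # all values identical, no evidence of a difference
    z = (w - mean) / math.sqrt(var)
    return w, math.erfc(abs(z) / math.sqrt(2))
```

Genes whose p-value falls below a chosen significance level would be candidate markers; when many genes are tested, some correction for multiple testing would normally be applied as well.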
  • filtering is preferably performed by enterotype analysis and at least one test selected from Fisher's exact test and the Mantel-Haenszel test.
  • Normal human intestinal microflora can be divided into three distinct types (enterotypes), and the enterotype classification is not affected by apparent factors such as age and gender. Further research indicates that the enterotype is also not affected by chronic metabolic diseases such as obesity.
  • because enterotype may act as a stratification factor that makes the associated biomarkers hard to recognize, it is necessary to assign each sample to an enterotype and perform a population stratification test to remove the influence of enterotype.
  • Based on genus-level relative abundance data, the distance between samples (JSD distance) is calculated and the samples are clustered by the PAM algorithm. At the same time, the results are verified by clustering the functional relative abundance profile data by the same method.
  • JSD distance: Jensen-Shannon distance
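The distance computation can be sketched as follows, assuming that JSD here denotes the Jensen-Shannon divergence between two relative abundance profiles and that the distance is taken as its square root; the exact definition used in the source is not recoverable from this text, and the clustering step (PAM) is not shown.

```python
import math


def jsd_distance(p, q):
    """Square root of the Jensen-Shannon divergence between two profiles.

    p, q: non-negative abundance vectors over the same genera, in the same order.
    Both are normalised to sum to 1 before the divergence is computed.
    """
    if len(p) != len(q):
        raise ValueError("profiles must have the same length")
    sum_p, sum_q = sum(p), sum(q)
    p = [v / sum_p for v in p]
    q = [v / sum_q for v in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; zero entries of a contribute nothing
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(0.0, divergence))
```

A matrix of pairwise distances computed this way could then be handed to a k-medoids (PAM) implementation to assign each sample to an enterotype.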
  • Fisher's exact test or the Mantel-Haenszel test can be used to determine whether the samples are significantly enriched in a certain enterotype. If they are, the enrichment can be eliminated by removing samples or by correcting with the PCA method, so that the remaining samples are no longer enriched. On the other hand, in ordinary association studies, imperfect experimental design may leave the samples affected by age, gender and similar factors; the significance of this influence can be assessed with the Mantel-Haenszel test, and samples significantly affected by these factors are removed.
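The enrichment check can be illustrated with a two-sided Fisher's exact test on a 2x2 table, for example cases and controls inside and outside one enterotype. This is a minimal sketch using the hypergeometric distribution directly; the table layout and the function name are illustrative assumptions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Example layout (assumed): rows = case / control,
    columns = in the enterotype / not in the enterotype.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # sum the probabilities of all tables at least as extreme as the observed one
    p_value = sum(
        prob(x) for x in range(low, high + 1) if prob(x) <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)
```

A small p-value would suggest that one group is over-represented in the enterotype, in which case the stratification would need to be corrected before the association analysis.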
  • cluster analysis and deep assembly of the obtained genetic marker may be further included to construct a related biological genome of the abnormal state.
  • many genes may be derived from related species of a lower order of magnitude. Most members of many communities, such as the human intestinal microflora, have not been successfully isolated and sequenced. Only by clustering these genes can the corresponding disease-associated microbial genomes be reconstructed cluster by cluster to obtain more information on these microbial species.
  • Gene clustering can be performed using known clustering software. After the clustering results are obtained, the sequencing data corresponding to each cluster can be retrieved from the original sequencing pool by alignment (for example, SOAP2 can be used), and deep assembly can be performed on the retrieved data (for example, the assembly software may be SOAPdenovo) to obtain the genomic sequence of the microbial species.
  • the genome of the microbial species can be reconstructed as completely as possible by further iterative alignment and deep assembly, which greatly improves the assembly results. After multiple iterations, an assembly that no longer improves can be used as the draft genome of the resulting microbial species.
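The iterate-until-no-improvement logic can be sketched generically. The three callables below are hypothetical stand-ins: `recruit_reads` for the alignment step that pulls matching reads from the sequencing pool, `assemble` for the deep assembly step, and `score` for whatever assembly quality measure is chosen. None of them corresponds to a real command-line interface.

```python
def iterative_assembly(seed_contigs, recruit_reads, assemble, score, max_rounds=10):
    """Repeat read recruitment and assembly until the score stops improving.

    seed_contigs: initial sequences of a gene cluster.
    recruit_reads, assemble, score: hypothetical callables supplied by the caller.
    """
    best, best_score = seed_contigs, score(seed_contigs)
    for _ in range(max_rounds):
        reads = recruit_reads(best)   # align the pool against the current contigs
        candidate = assemble(reads)   # deep assembly of the recruited reads
        candidate_score = score(candidate)
        if candidate_score <= best_score:
            break                     # no further improvement: stop iterating
        best, best_score = candidate, candidate_score
    return best
```

The assembly returned when the loop stops plays the role of the draft genome described above.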
  • the step of verifying the biomarker is further included.
  • the effectiveness and reliability of the association between the biomarker and an abnormal state such as a disease such as diabetes can be further improved.
  • the invention also provides a system for determining an abnormal state biomarker.
  • the system 1000 includes: a sequencing device 100 and an analysis device 200, in accordance with an embodiment of the present invention.
  • the sequencing device 100 is adapted to perform nucleic acid sequencing on a nucleic acid sample from a first object and a nucleic acid sample from a second object to obtain a first sequencing result and a second sequencing result, respectively, each consisting of a plurality of sequencing sequences.
  • the analyzing device 200 is connected to the sequencing device 100, so that it can receive the first sequencing result and the second sequencing result from the sequencing device 100, and is adapted to determine the marker associated with the abnormal state based on the difference between the first sequencing result and the second sequencing result.
  • with this system, the method of determining an abnormal state biomarker described above can be effectively carried out, whereby an abnormal state marker can be efficiently determined.
  • the system 1000 for determining an abnormal state biomarker may further include a nucleic acid sample separation device 300 coupled to the sequencing device 100 for isolating a nucleic acid sample from an object, for example from the excrement of the object, so that the sequencing device can be provided with a nucleic acid sample for sequencing.
  • the method and apparatus that can be used for sequencing according to embodiments of the present invention are not particularly limited.
  • a second generation sequencing platform or a third generation sequencing platform can be employed.
  • the sequencing device is at least one selected from the group consisting of Hiseq2000, SOLiD, 454, and single molecule sequencing devices.
  • The analysis device 200 further includes: an alignment unit 201, a relative abundance determining unit 202, a testing unit 203, and a marker determining unit 204.
  • The alignment unit 201 is adapted to align the sequencing reads constituting the first and second sequencing results against a reference gene set. The relative abundance determining unit 202 is connected to the alignment unit 201 and is adapted to determine, based on the alignment results, the relative abundance of each gene in the nucleic acid samples from the first subject and the second subject. The testing unit 203 is connected to the relative abundance determining unit 202 and is adapted to statistically test the relative abundance of each gene in the nucleic acid samples from the first and second subjects. The marker determining unit 204 is adapted to determine, based on the statistical test results, that genes whose relative abundance differs significantly between the nucleic acid samples from the first and second subjects are gene markers of the abnormal state.
  • The analysis device 200 may further include a filtering unit 205, connected to the alignment unit 201 and adapted to filter the sequencing results to remove contamination before the sequencing reads constituting the first and second sequencing results are aligned against the reference gene set, wherein the contamination is at least one selected from the group consisting of: adapter contamination, low-quality sequences, and host genome contamination sequences.
  • The alignment unit 201 aligns the sequencing reads constituting the first and second sequencing results against the reference gene set using at least one selected from the group consisting of SOAP2 and MAQ.
  • Optionally, a reference gene set can be stored in the alignment unit, for example a non-redundant gene set of the human gut microbiota. The alignment efficiency can thereby be improved.
  • The relative abundance comprises the species relative abundance and the functional relative abundance of each gene, and the reference gene set contains species information and functional annotation for its genes. The relative abundance determining unit is adapted to determine, based on the alignment results, the species relative abundance and functional relative abundance of each gene in the nucleic acid samples from the first and second subjects; the testing unit is adapted to statistically test these species and functional relative abundances; and the marker determining unit is adapted to determine the species markers and functional markers of the abnormal state from the species and functions whose relative abundance differs significantly between the nucleic acid samples from the first and second subjects. The species markers and functional markers of the abnormal state can thereby be determined efficiently.
  • Unless otherwise stated, the technical means used in the examples are conventional means well known to those skilled in the art; reference can be made to Molecular Cloning: A Laboratory Manual (third edition) or to the relevant product documentation. The reagents and products used are all commercially available.
  • Processes and methods not described in detail are conventional methods well known in the art.
  • The source and trade name of each reagent, and its components where they need to be listed, are indicated at its first occurrence; unless otherwise specified, the same reagent used thereafter is the same as first described.
  • The test set included 32 DO samples, 39 DL samples, 37 NO samples, and 37 NL samples; the remaining 199 samples were used as the validation set, comprising 73 DO samples, 26 DL samples, 62 NO samples, and 38 NL samples (see Table 1).
  • Treatment of stool samples: place the collected stool samples into sterilized fecal collection tubes, transport them to the storage site on dry ice or in liquid nitrogen, and store them in a -80 °C freezer.
  • A 350 bp sequencing library was constructed from the extracted DNA samples and sequenced according to the operating instructions provided by Illumina, the manufacturer of the Illumina Genome Analyzer sequencing platform.
  • The libraries of the 145 first-stage samples were sequenced on the Illumina Genome Analyzer sequencing platform, producing 4,636,045,336 reads, or 383.08 Gb of raw data.
  • Metagenomic biomarkers are chiefly tied to the species and functions of genes, so the sequencing reads must first be assembled and genes predicted, and the predicted genes merged, with redundancy removed, into the non-redundant reference gene set (Junjie Qin, Ruiqiang Li, Jeroen Raes, et al. (2010) A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 464:59-65).
  • The known non-redundant gene set of the human gut microbiota from the above reference, that is, the previously constructed non-redundant gene set of 3.3M genes from the gut microbiota of Europeans, is used as the reference gene set. For samples from non-European populations, a new gene set must be constructed from the samples and added to the original 3.3M European gut gene set.
  • The updated gene set contains 4,267,985 predicted genes, of which 1,090,889 are newly added.
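  • The relative abundance of gene i is computed as $a_i = \dfrac{x_i / L_i}{\sum_j x_j / L_j}$, where:
  • $a_i$ is the relative abundance of gene i in the sample;
  • $L_i$ is the length of gene i;
  • $x_i$ is the number of times gene i is detected in the sample (the number of reads aligned to it);
  • the ratio $x_i / L_i$ represents the copy number of gene i in the sequencing data from the sample.
  • The relative abundance profiles of species and functions are obtained by summing the relative abundances of the genes belonging to each species or functional unit.
  • To assess the accuracy of the estimate, the relative abundance recomputed from a second read count for the same gene is compared with $a_i$. The inventors set $a_i$ to range from 0 to 1e-5 and the total number of reads N to range from 0 to 40 million, calculated the 99% confidence interval of the estimated relative abundance, and further evaluated the detection error rate. The result is shown in Fig. 7.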
  • the updated gene set contains 4,267,985 non-redundant genes that can be divided into 6,313 KOs and 45,683 OGs (including 7,042 new gene families).
  • Genes, KOs, or OGs appearing in fewer than 6 of the 145 first-stage samples were removed first.
  • Using the annotation of the 4,267,985 genes, the relative abundances of the genes belonging to the same KO were summed, and the total was taken as the abundance of that KO in the sample, producing the KO profiles of the 145 samples. The OG profiles were constructed in the same way as the KO profiles.
  • The relative distance between samples (JSD distance) is calculated according to the following formula, and the enterotype of each sample is obtained by clustering with the PAM algorithm (Arumugam, M. et al. Enterotypes of the human gut microbiome. Nature 473, 174-180, doi:10.1038/nature09944 (2011), incorporated herein by reference):
  • P(i) and Q(i) are the relative abundances of gene i in samples P and Q, respectively.
  • MLG: Metagenomic Linkage Group.
  • An MLG refers to a group of genetic material in the metagenome that is probably physically linked as a unit rather than distributed independently.
  • LGT: lateral gene transfer.
  • Step 1: Take the original set of T2D-associated gene markers as the starting sub-clusters. Note that when establishing the gene profile, the inventors constructed gene linkage groups to reduce the complexity of the statistical analysis; all genes from one gene linkage group are therefore treated as one sub-cluster.
  • Step 2: Use the Chameleon algorithm (Karypis, G. & Kumar, V. Chameleon: hierarchical clustering using dynamic modeling. Computer 32, 68-75 (1999), incorporated herein by reference), which applies dynamic modeling to both interconnectivity and closeness, to combine sub-clusters that exhibit a minimum similarity >0.4.
  • The similarity here is defined as the product of interconnectivity and closeness. The resulting clusters are called semi-clusters.
  • Step 3: To further merge the semi-clusters established in step 2, first update the similarity between every pair of semi-clusters, then perform taxonomic assignment for each semi-cluster. Finally, merge two or more semi-clusters into an MLG if they satisfy both of the following requirements: a) the similarity between the semi-clusters is >0.2; b) all of these semi-clusters are assigned to the same taxonomic lineage.
  • The taxonomic assignment of an MLG is determined by the following principles: 1) if more than 90% of the genes in the MLG can be mapped to a reference genome at a threshold of 95% identity at the nucleotide level, the MLG is considered to originate from that known bacterial species; 2) if more than 80% of the genes in the MLG can be mapped to a reference genome at a threshold of 85% identity at both the nucleotide and protein levels, the MLG is considered to originate from the same genus as that known bacterial species; 3) if 16S sequences can be identified in the assembly of the MLG, phylogenetic analysis is performed with RDP-Classifier (bootstrap value > 0.80) (Wang, Q., Garrity, GM, Tiedje, JM & Cole, J.
  • Step 1: Take the genes of the MLG as seeds, identify the samples in which the seeds are most abundant, and select from those samples the paired-end reads that map to the seeds (including pairs in which only one end maps).
  • The coverage of these paired-end reads is at least 50X, drawn from no more than 5 samples; coverage is calculated by dividing the total size of the selected reads by the total length of the seeds.
  • Step 2: Perform a de novo assembly of the selected reads using SOAPdenovo with the same parameters as were used to construct the gene set.
  • Step 3: To identify and remove misassembled contigs that may be caused by contaminating reads, apply a composition-based binning method. Contigs whose GC content and sequencing depth differ from those of the other contigs in the assembly are removed, because they may have been assembled incorrectly for various reasons.
  • Step 4: Take the resulting assembly from step 3 and repeat from step 2 until the assembly no longer improves significantly (specifically, until the total contig length increases by less than 5%).
  • The performance of the MLG identification method was evaluated by the following steps: 1) in the gene quantification results, rarely occurring genes (those appearing in fewer than 6 samples) were filtered out first; 2) based on the taxonomic assignments in the updated gene set, a group of gut bacterial strains was identified, the criterion being that a strain contains 1,000 to 5,000 uniquely matched genes at a similarity threshold of 95%. At this step, redundant strains within one species were removed manually and genes that could be matched to multiple species were discarded. In the end, 130,065 genes from 50 bacterial species were identified as the test group for evaluating the MLG method; 3) the standard MLG method described above was applied to the test group. For each MLG, the inventors calculated the percentage of genes that were not derived from the major species as an accuracy measure (i.e., % of genes, see Table 7).
  • Example 1 and Example 2 were repeated, and the 199 second-stage validation samples were sequenced to obtain sequencing data.
  • Using the same association statistical tests as in Example 3, the relative abundance data of the genes, species, and functions of the validation samples were tested, and the significance P-values were adjusted for multiple testing with the stringent Bonferroni correction.
  • The gene markers and functional markers obtained in this way are markers significantly associated with the disease. Gene markers were clustered with known clustering software to obtain species markers, and a Student's t-test was performed on the relative abundance profiles of the species markers to calculate P-values.
  • The markers identified in Example 3 remained significantly associated with the disease and are summarized in Tables 2-1, 2-2, and 3 below. eggNOG and KEGG are functional annotation databases in which the corresponding gene family can be looked up by its identifier.
  • Functional marker; enrichment group (direction) a; P-value (first stage); P-value (second stage)
  • K03324 0 8.79E-20 1.51E-05
  • a: 1 indicates enrichment in the type 2 diabetes group, which is a harmful marker; 0 indicates enrichment in the control group, which is a beneficial marker.
  • T2D-5 1 4.21047E-05 1.97056E-06
  • T2D-7 1 0.000601047 0.000279527
  • T2D-90 1 0.000704982 0.001710744
  • MLG MetagenomicLinkage Group, a candidate species.
  • d: 1 indicates enrichment in the type II diabetes group, which is a harmful marker
  • 0 indicates enrichment in the control group, which is a beneficial marker
  • The cutoff is determined as follows: the relative abundances of the gene are sorted in ascending order, and each value in turn is taken as a candidate cutoff. Sensitivity and specificity are calculated for each candidate, and the candidate with the largest sum of sensitivity and specificity is taken as the final optimal cutoff. For a beneficial gene, a sample whose relative abundance is below the cutoff is diagnosed as type II diabetes; for a harmful gene, a sample whose relative abundance is above the cutoff is diagnosed as type II diabetes. The results are shown in Table 4-1.
  • Sensitivity, also called the true positive rate, is the probability that an actual patient is diagnosed as a patient, that is, the probability that a person with the disease tests positive.
  • Specificity, also called the true negative rate, is the probability that a person who is actually disease-free is diagnosed as a non-patient, that is, the probability that a person without the disease tests negative.
  • The relative abundances of the 7 harmful functional markers and 8 beneficial functional markers selected from the 344 samples were used as risk values, and the area under the ROC (receiver-operating characteristic) curve, AUC, was estimated (Michael J. Pencina, Ralph B. D'Agostino Sr, Ralph B. D'Agostino Jr, et al. Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Statistics in Medicine, 2008, 27(2): 157-172) to evaluate the ability of the functional markers to diagnose type 2 diabetes; the larger the AUC, the higher the diagnostic ability. For each functional marker, a diagnostic cutoff is determined such that the sum of the sensitivity and specificity of the diagnosis is highest at that cutoff.
  • The relative abundances of the functional marker are sorted in ascending order, and each value in turn is taken as a candidate cutoff. Sensitivity and specificity are calculated for each candidate, and the candidate with the largest sum of sensitivity and specificity is taken as the final optimal cutoff.
  • A direction equal to 1 means that the functional marker is harmful; a direction equal to 0 means that it is beneficial.
  • For a beneficial functional marker, a sample whose relative abundance is below the cutoff is diagnosed as type II diabetes; for a harmful functional marker, a sample whose relative abundance is above the cutoff is diagnosed as type II diabetes. See Table 4-2.
  • Sensitivity, also called the true positive rate, is the probability that an actual patient is diagnosed as a patient, that is, the probability that a person with the disease tests positive.
  • Specificity, also called the true negative rate, is the probability that a person who is actually disease-free is diagnosed as a non-patient, that is, the probability that a person without the disease tests negative.
  • The relative abundances of the 27 harmful MLGs and 20 beneficial MLGs selected from the 344 samples were used as risk values.
  • The area under the ROC (receiver-operating characteristic) curve was estimated to evaluate the ability of the MLGs to diagnose type II diabetes. For each MLG, a diagnostic cutoff is determined such that the sum of the sensitivity and specificity of the diagnosis is highest at that cutoff.
  • The cutoff is determined as follows: the relative abundances of the MLG are sorted in ascending order, and each value in turn is taken as a candidate cutoff. Sensitivity and specificity are calculated for each candidate, and the candidate with the largest sum of sensitivity and specificity is taken as the final optimal cutoff.
  • For a beneficial MLG, a sample whose relative abundance is below the cutoff is diagnosed as type II diabetes; for a harmful MLG, a sample whose relative abundance is above the cutoff is diagnosed as type II diabetes.
  • The results are summarized in the following table.
  • Sensitivity, also called the true positive rate, is the probability that an actual patient is diagnosed as a patient, that is, the probability that a person with the disease tests positive.
  • Specificity, also called the true negative rate, is the probability that a person who is actually disease-free is diagnosed as a non-patient, that is, the probability that a person without the disease tests negative.
  • T2D-11 1 0.103658 0.618 0.541176 0.66092
  • T2D-12 1 0.006279 0.654 0.564706 0.689655
  • T2D-139 1 1.553228 0.617 0.5 0.701149
  • T2D-14 1 0.010063 0.652 0.764706 0.505747
  • T2D-15 1 0.00508 0.589 0.670588 0.494253
  • T2D-170 1 0.032845 0.616 0.417647 0.804598
  • T2D-1 0.098314 0.526 0.076471 0.977011
  • T2D-2 1 0.0072 0.586 0.388235 0.816092
  • T2D-6 1 0.089696 0.526 0.094118 0.982759
  • T2D-73 1 0.107684 0.6 0.311765 0.885057
  • T2D-79 1 0.150142 0.572 0.594118 0.563218
  • T2D-80 1 0.003178 0.655 0.682353 0.586207
  • T2D-90 1 0.009561 0.62 0.447059 0.758621
  • T2D-9 1 0.008346 0.62 0.570588 0.637931
  • Con-101 0 0.011503 0.672 0.717647 0.58046
  • the sequencing data of the corresponding species were searched from the original metagenomics sequencing pool using SOAP2, and the sequencing data was deeply assembled by SOAPdenovo to obtain the genome sequence of the bacterium.
  • the genome of the microbial species can be reconstructed as much as possible by further iterative alignment and deep assembly, and the assembly results are greatly improved. After multiple iterations, the assembly results that are no longer improved are taken as the final genome sketch of the microbial species, as shown in Table 6.
  • Species identification was performed on the assembled genome sketches by 16S region identification and genome-wide identification.
  • the species classification (level) information is shown in Table 7.
  • Table 7 Species classification (level) information enrichment group MLG number number of genes species classification (level) gene % e similarity f
  • Control group enriched: Con-133 1 555 Erysipelotrichaceae (family) 77.88
  • T2D-154 Akkermansia muciniphila 1.52 (1.05, 2.19)
  • Type II diabetes: T2D-5 Clostridium hathewayi 23.1 (2.08, 256.6)
  • T2D-7 Eggerthella lenta 1.57 (0.95, 2.58)
  • T2D-9 Unclassified 1.02 (0.83, 1.27)
  • T2D-170 Unclassified 1.85 (0.96, 3.57)
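The cutoff rule and the AUC evaluation described in the list above (take each observed relative abundance as a candidate cutoff, keep the one that maximizes sensitivity plus specificity, and read the direction as harmful or beneficial) can be sketched as follows. This is a minimal illustration, not the inventors' code: the function names `best_cutoff` and `auc` and the toy data are assumptions, and the AUC is computed by pairwise comparison, which is one standard way to obtain the area under the ROC curve.

```python
def best_cutoff(abundance, is_case, harmful=True):
    """Return (cutoff, sensitivity, specificity) maximizing sensitivity + specificity.

    harmful=True : a sample is called diseased when abundance > cutoff.
    harmful=False: a sample is called diseased when abundance < cutoff.
    """
    n_case = sum(is_case)
    n_ctrl = len(is_case) - n_case
    best = None
    # Candidates are the observed values, sorted from small to large.
    for cutoff in sorted(set(abundance)):
        if harmful:
            predicted = [a > cutoff for a in abundance]
        else:
            predicted = [a < cutoff for a in abundance]
        true_pos = sum(p and c for p, c in zip(predicted, is_case))
        true_neg = sum((not p) and (not c) for p, c in zip(predicted, is_case))
        sensitivity = true_pos / n_case
        specificity = true_neg / n_ctrl
        if best is None or sensitivity + specificity > best[1] + best[2]:
            best = (cutoff, sensitivity, specificity)
    return best


def auc(abundance, is_case, harmful=True):
    """Area under the ROC curve via pairwise case/control comparison."""
    # For a beneficial marker, lower abundance means higher risk, so flip the sign.
    scores = abundance if harmful else [-a for a in abundance]
    cases = [s for s, c in zip(scores, is_case) if c]
    ctrls = [s for s, c in zip(scores, is_case) if not c]
    wins = sum((x > y) + 0.5 * (x == y) for x in cases for y in ctrls)
    return wins / (len(cases) * len(ctrls))


# Toy data (hypothetical): three controls followed by three cases.
abund = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8]
labels = [False, False, False, True, True, True]
cutoff, sens, spec = best_cutoff(abund, labels, harmful=True)
```

The strict inequalities mirror the wording "less than" and "greater than" in the text; ties at the cutoff are therefore never called diseased.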

Abstract

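The present invention provides a method and a system for determining biomarkers of an abnormal state. The method and system comprise: performing nucleic acid sequencing on a nucleic acid sample from a first subject and a nucleic acid sample from a second subject, so as to obtain a first sequencing result and a second sequencing result, each consisting of a plurality of sequencing reads, wherein the first subject has the abnormal state, the second subject does not have the abnormal state, the two nucleic acid samples are isolated from the same type of specimen, and the first and second subjects belong to the same species; and determining markers associated with the abnormal state based on the difference between the first sequencing result and the second sequencing result.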

Description

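Method and System for Determining Biomarkers of an Abnormal State

Priority Information

None.

Technical Field

The present invention relates to the field of biotechnology. Specifically, the present invention relates to a method and a system for determining biomarkers of an abnormal state.

Background

Metagenomics, also known as environmental genomics, ecogenomics, or community genomics, is a discipline that directly studies microbial communities in their natural state (the sum of the genomes of culturable and unculturable bacteria, fungi, viruses, and so on). In 1998, Handelsman et al. of the Department of Plant Pathology at the University of Wisconsin first proposed the concept of "metagenomics" while studying soil microorganisms. Traditional microbiology is limited by the techniques for isolating and pure-culturing microorganisms, whereas metagenomics is a new approach that takes the microbial community of a specific environment as its object and aims to study microbial diversity, population structure, evolutionary relationships, functional activity, cooperative interactions, and relationships with the environment. The basic strategy of metagenomic research includes: extraction and purification of large-fragment environmental genomic DNA, library construction, screening for target genes, and/or large-scale sequencing analysis. A metagenomic library contains the genes and genomes of both culturable and unculturable microorganisms; cloning the total DNA of a natural environment into culturable host cells bypasses the difficulty of isolating and culturing the microorganisms. With large-scale sequence analysis combined with bioinformatics tools, large numbers of previously inaccessible novel microbial genes or gene clusters can be discovered, which is of great significance for understanding microbiota composition, evolutionary history, and metabolic characteristics, and for mining new genes with application potential.

However, current metagenomic research still needs improvement.

Summary of the Invention

The present invention aims to solve at least one of the technical problems in the prior art. To this end, the present invention provides a method and a system capable of effectively determining biomarkers of an abnormal state.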
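According to a first aspect, the present invention provides a method for determining biomarkers of an abnormal state. According to embodiments of the present invention, the method includes the following steps: performing nucleic acid sequencing on a nucleic acid sample from a first subject and a nucleic acid sample from a second subject, so as to obtain a first sequencing result and a second sequencing result, each consisting of a plurality of sequencing reads, wherein the first subject has the abnormal state, the second subject does not have the abnormal state, the two nucleic acid samples are isolated from the same type of specimen, and the first and second subjects belong to the same species; and determining markers associated with the abnormal state based on the difference between the first sequencing result and the second sequencing result. By sequencing and comparing the nucleic acid samples of the two kinds of subjects, the method according to embodiments of the present invention can effectively determine markers associated with the abnormal state.
According to embodiments of the present invention, the above method may further have the following additional technical features:
According to one embodiment of the present invention, the abnormal state is a disease.
According to one embodiment of the present invention, the disease is at least one selected from neoplastic diseases, immune diseases, genetic diseases, and metabolic diseases.
According to one embodiment of the present invention, the abnormal state is diabetes.
According to one embodiment of the present invention, the first subject and the second subject are humans.
According to one embodiment of the present invention, the nucleic acid sample from the first subject and the nucleic acid sample from the second subject are isolated from the excrement of the first subject and of the second subject, respectively.
According to one embodiment of the present invention, at least one of the nucleic acid sample from the first subject and the nucleic acid sample from the second subject is sequenced using second-generation or third-generation sequencing technology.
According to one embodiment of the present invention, the nucleic acid sequencing is performed using at least one selected from Hiseq2000, SOLiD, 454, and single-molecule sequencing devices.
According to one embodiment of the present invention, determining the biomarkers of the abnormal state based on the difference between the first and second sequencing results further includes: aligning the sequencing reads constituting the first and second sequencing results against a reference gene set; determining, based on the alignment results, the relative abundance of each gene in the nucleic acid samples from the first and second subjects; statistically testing the relative abundance of each gene in the nucleic acid samples from the first and second subjects; and determining that genes whose relative abundance differs significantly between the nucleic acid samples from the first and second subjects are gene markers of the abnormal state.
According to one embodiment of the present invention, before the sequencing reads constituting the first and second sequencing results are aligned against the reference gene set, the method further includes a step of filtering the sequencing results to remove contamination, wherein the contamination is at least one selected from: adapter contamination, low-quality sequences, and host genome contamination sequences.
According to one embodiment of the present invention, the sequencing reads constituting the first and second sequencing results are aligned against the reference gene set using at least one selected from SOAP2 and MAQ; optionally, the reference gene set is a non-redundant gene set of the human gut microbiota.
According to one embodiment of the present invention, the method further includes: assembling the sequencing reads constituting the first and second sequencing results and performing gene prediction to obtain genes, wherein genes that cannot be aligned to the reference gene set are new genes; and adding the new genes so determined to the reference gene set.
According to one embodiment of the present invention, the taxonomic assignment is performed by aligning each gene in the reference gene set against the IMG database.
According to one embodiment of the present invention, each gene in the reference gene set is aligned against the IMG database using BLASTP, and the taxonomic level of the gene is determined from hits with an E-value below $10^{-10}$.
According to one embodiment of the present invention, the functional annotation is performed by aligning each gene in the reference gene set against at least one of eggNOG and KEGG.
According to one embodiment of the present invention, each gene in the reference gene set is aligned against the IMG database using BLASTP, and the function of the gene is determined from hits with an E-value below $10^{-10}$.
According to one embodiment of the present invention, the relative abundance comprises the species relative abundance and functional relative abundance of each gene, and the reference gene set contains species information and functional annotation for its genes, wherein determining the biomarkers of the abnormal state based on the difference between the first and second sequencing results further includes: aligning the sequencing reads constituting the first and second sequencing results against the reference gene set; determining, based on the alignment results, the species relative abundance and functional relative abundance of each gene in the nucleic acid samples from the first and second subjects; statistically testing these species and functional relative abundances; and determining that the species and functions whose relative abundance differs significantly between the nucleic acid samples from the first and second subjects are, respectively, species markers and functional markers of the abnormal state.
According to one embodiment of the present invention, the statistical test is at least one selected from the Student's t-test and the Wilcoxon rank-sum test.
According to one embodiment of the present invention, the method further includes filtering out samples that are significantly affected by phenotypic factors, preferably by enterotype analysis and at least one test selected from Fisher's exact test and the Mantel-Haenszel test.
According to one embodiment of the present invention, the method further includes performing cluster analysis and deep assembly on the gene markers obtained, so as to construct the genomes of the organisms associated with the abnormal state.
According to one embodiment of the present invention, the method further includes a step of validating the biomarkers.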
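According to a further aspect, the present invention also provides a system for determining biomarkers of an abnormal state. According to embodiments of the present invention, the system includes: a sequencing device adapted to perform nucleic acid sequencing on a nucleic acid sample from a first subject and a nucleic acid sample from a second subject, so as to obtain a first sequencing result and a second sequencing result, each consisting of a plurality of sequencing reads, wherein the first subject has the abnormal state, the second subject does not have the abnormal state, the two nucleic acid samples are isolated from the same type of specimen, and the first and second subjects belong to the same species; and an analysis device connected to the sequencing device, which receives the first and second sequencing results from the sequencing device and is adapted to determine markers associated with the abnormal state based on the difference between the first and second sequencing results. With this system, the method for determining biomarkers of an abnormal state according to embodiments of the present invention can be carried out effectively, whereby abnormal state markers can be determined effectively.
According to embodiments of the present invention, the system may further have the following additional technical features:
According to one embodiment of the present invention, the system further includes a nucleic acid sample separation device connected to the sequencing device and adapted to isolate a nucleic acid sample from a subject, optionally from the excrement of the subject.
According to one embodiment of the present invention, the sequencing device is a second-generation or third-generation sequencing platform.
According to one embodiment of the present invention, the sequencing device is at least one selected from Hiseq2000, SOLiD, 454, and single-molecule sequencing devices.
According to one embodiment of the present invention, the analysis device further includes:
an alignment unit adapted to align the sequencing reads constituting the first and second sequencing results against a reference gene set;
a relative abundance determining unit connected to the alignment unit and adapted to determine, based on the alignment results, the relative abundance of each gene in the nucleic acid samples from the first and second subjects;
a testing unit connected to the relative abundance determining unit and adapted to statistically test the relative abundance of each gene in the nucleic acid samples from the first and second subjects; and
a marker determining unit adapted to determine, based on the statistical test results, that genes whose relative abundance differs significantly between the nucleic acid samples from the first and second subjects are gene markers of the abnormal state.
According to one embodiment of the present invention, the analysis device further includes:
a filtering unit connected to the alignment unit and adapted to filter the sequencing results to remove contamination before the sequencing reads constituting the first and second sequencing results are aligned against the reference gene set, wherein the contamination is at least one selected from: adapter contamination, low-quality sequences, and host genome contamination sequences.
According to one embodiment of the present invention, the alignment unit aligns the sequencing reads constituting the first and second sequencing results against the reference gene set using at least one selected from SOAP2 and MAQ; optionally, the reference gene set is a non-redundant gene set of the human gut microbiota.
According to one embodiment of the present invention, the relative abundance comprises the species relative abundance and functional relative abundance of each gene, and the reference gene set contains species information and functional annotation for its genes,
wherein the relative abundance determining unit is adapted to determine, based on the alignment results, the species relative abundance and functional relative abundance of each gene in the nucleic acid samples from the first and second subjects;
the testing unit is adapted to statistically test the species relative abundance and functional relative abundance of each gene in the nucleic acid samples from the first and second subjects; and
the marker determining unit is adapted to determine the species markers and functional markers of the abnormal state from the species and functions whose relative abundance differs significantly between the nucleic acid samples from the first and second subjects.
According to one embodiment of the present invention, the testing unit is adapted to perform at least one statistical test selected from the Student's t-test and the Wilcoxon rank-sum test.
According to one embodiment of the present invention, the system further includes a genome assembly device adapted to perform cluster analysis and deep assembly on the gene markers obtained, so as to construct the genomes of the organisms associated with the abnormal state.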
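With the method for determining abnormal-state-associated biomarkers according to embodiments of the present invention (also referred to as MGWAS, a two-stage case-control Metagenome-Wide Association Study), association analysis between the metagenome and disease can be performed on the basis of high-throughput sequencing to find disease-associated biomarkers. Throughput is greatly increased and cost greatly reduced, so that large cohorts can be studied. The data available in known reference gene sets are fully exploited, which improves the reproducibility and credibility of the results. The use of multiple association statistical tests greatly reduces the false positives caused by fluctuations in the relative abundance estimates while preserving the power of the tests. The linkage between a marker and the target trait can be determined directly, and the association analysis is highly reliable and accurate.
Additional aspects and advantages of the present invention will be given in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.

Brief Description of the Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the drawings, in which:
Figure 1 shows a schematic flowchart of a method for determining biomarkers of an abnormal state according to one embodiment of the present invention;
Figure 2 shows a schematic flowchart of a method for determining biomarkers of an abnormal state according to another embodiment of the present invention;
Figure 3 shows a schematic diagram of a system for determining biomarkers of an abnormal state according to one embodiment of the present invention;
Figures 4-6 show schematic flowcharts of the methods for determining biomarkers of an abnormal state according to Examples 3, 4, and 5 of the present invention;
Figure 7 shows the distribution of the detection error rate of the relative abundance profile at different sequencing amounts, according to an embodiment of the present invention. In Figure 7, the X axis represents the sequencing amount of the sample, defined as the number of paired-end reads, and the Y axis represents the relative abundance of a gene. The 99% confidence interval (CI) of the relative abundance is estimated, and the detection error rate is defined as the ratio of the width of the confidence interval to the relative abundance itself. The detection error rate is transformed and normalized and then used to color all points; a darker color indicates a higher detection error rate. In addition, two indifference curves are added: detection error rates falling to the upper right of the two curves are below 1X and 10X, respectively.

Detailed Description of the Invention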
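Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting it.
Note that the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include one or more of that feature. Further, in the description of the present invention, "a plurality of" means two or more unless otherwise stated.
Method for determining biomarkers of an abnormal state
A first aspect of the present invention provides a method for determining biomarkers of an abnormal state.
Referring to Figure 1, the method includes the following steps:
First, nucleic acid sequencing is performed on a nucleic acid sample from a first subject and a nucleic acid sample from a second subject, so as to obtain a first sequencing result and a second sequencing result, each consisting of a plurality of sequencing reads. According to embodiments of the present invention, the first and second subjects are in different states: specifically, the first subject has the abnormal state and the second subject does not, the two nucleic acid samples are isolated from the same type of specimen, and the first and second subjects belong to the same species.
Next, once the first and second sequencing results have been obtained from the first and second subjects, markers associated with the abnormal state can be determined based on the difference between the first and second sequencing results. Because the first and second subjects belong to the same species and the nucleic acid samples are extracted from the same type of specimen, the difference between the first and second sequencing results can reflect the biomarkers of the abnormal state.
The term "abnormal state" as used herein should be understood broadly: it can refer to any state of a subject (organism) that differs from the normal state, whether a physiological or a psychological abnormality. According to one embodiment of the present invention, the abnormal state is a disease. According to embodiments of the present invention, the types of disease that can be studied with the method of the present invention are not particularly limited. According to one embodiment of the present invention, the disease is at least one selected from neoplastic diseases, immune diseases, genetic diseases, and metabolic diseases. According to a specific example of the present invention, the abnormal state is diabetes. The method of the present invention can thus effectively determine biomarkers of a specific disease in a specific species. According to embodiments of the present invention, the scope of the term "subject" as used herein is not particularly limited and may be any organism. According to one embodiment of the present invention, the first subject and the second subject are humans. Thus, according to embodiments of the present invention, the first subject may be a patient with a specific disease and the second subject may be a healthy person. In addition, according to embodiments of the present invention, the numbers of first and second subjects are not particularly limited and may each be more than one, which further increases the credibility of the biomarkers determined.
According to embodiments of the present invention, the source of the nucleic acid samples is not particularly limited, provided that the nucleic acid samples of the first and second subjects come from the same source. According to one embodiment of the present invention, the nucleic acid samples of the first and second subjects are isolated from the excrement of the first and second subjects, respectively. The nucleic acid information of the gut microbes can thereby be determined effectively, and hence the relationship between a specific disease of the subject and the gut flora.
According to embodiments of the present invention, the means of sequencing the nucleic acid samples is not particularly limited. According to one embodiment of the present invention, at least one of the nucleic acid sample from the first subject and the nucleic acid sample from the second subject is sequenced using second-generation or third-generation sequencing technology. According to a specific example of the present invention, the nucleic acid sequencing is performed using at least one selected from Hiseq2000, SOLiD, 454, and single-molecule sequencing devices. The high throughput and deep sequencing of these devices can thereby be exploited, which improves the precision and accuracy of the subsequent analysis of the sequencing data, in particular of the statistical tests.
According to embodiments of the present invention, once the sequencing results have been obtained, the first and second sequencing results can be analyzed by any method. According to some embodiments of the present invention, and referring to Figure 2, the biomarkers can be determined as follows: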
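First, the sequencing reads constituting the first and second sequencing results are aligned against a reference gene set. According to embodiments of the present invention, the type of reference gene set is not particularly limited and may be any database of known sequences; for example, the known non-redundant gene set of the human gut microbiota can be used. According to one embodiment of the present invention, before the sequencing reads constituting the first and second sequencing results are aligned against the reference gene set, the method further includes a step of filtering the sequencing results to remove contamination. According to embodiments of the present invention, the contamination may be at least one selected from: adapter contamination, low-quality sequences, and host genome contamination sequences. This further improves the efficiency of the alignment and hence of determining the biomarkers. According to embodiments of the present invention, any known tool can be used to align the reads against the reference gene set. According to one embodiment of the present invention, at least one selected from SOAP2 and MAQ can be used for this alignment, which further improves the efficiency of the alignment and hence of determining the biomarkers.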
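Next, based on the alignment results, the relative abundance of each gene in the nucleic acid samples from the first and second subjects is determined. Aligning the reads against the reference gene set establishes a correspondence between reads and the genes in the reference gene set, so that for a given gene in a nucleic acid sample, the number of reads corresponding to it effectively reflects the relative abundance of that gene. The relative abundance of the genes in the nucleic acid sample can therefore be determined from the alignment results by conventional statistical analysis. According to embodiments of the present invention, after the relative abundance has been obtained, its accuracy is assessed statistically, preferably using the Poisson distribution. Specifically, the method of Audic and Claverie (1997) (Audic, S. & Claverie, J. M. The significance of digital gene expression profiles. Genome Res 7, 986-995 (1997), incorporated herein by reference) can be used to evaluate the theoretical accuracy of the relative abundance estimate. Suppose that $x_i$ reads are obtained from gene $i$; since they make up only a small fraction of all reads of the sample, the distribution of $x_i$ is approximated by a Poisson distribution. Let $N$ denote the total number of reads in the sample. Assuming that all genes have the same length, the relative abundance $a_i$ of gene $i$ can simply be written as $a_i = x_i / N$. The inventors can then evaluate, according to the following formula, the expected probability of obtaining a given number of reads from the same gene $i$,
where the re-estimated value denotes the relative abundance calculated from that number of reads. Using this formula, the inventors set $a_i$ to range from 0 to 1e-5 and $N$ to range from 0 to 40 million, calculated the 99% confidence interval of the estimated relative abundance, and further evaluated the detection error rate; the results are shown in Figure 7.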
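The following sketch illustrates how a detection error rate of the kind plotted in Figure 7 could be computed under the stated Poisson assumption. It is not the inventors' formula, which did not survive in this text: here the read count for a gene with true relative abundance `a` in a sample of `n_reads` reads is modelled as Poisson with mean `a * n_reads`, the 99% interval is taken from the Poisson quantiles, and the error rate is the interval width divided by `a`, as the figure description defines it. The function names are hypothetical.

```python
import math


def poisson_quantile(lam, p):
    """Smallest k with P(X <= k) >= p for X ~ Poisson(lam)."""
    if lam <= 0:
        return 0
    k = 0
    log_pmf = -lam                 # log P(X = 0)
    cdf = math.exp(log_pmf)
    while cdf < p:
        k += 1
        # log P(X = k) = log P(X = k - 1) + log(lam / k)
        log_pmf += math.log(lam) - math.log(k)
        cdf += math.exp(log_pmf)
    return k


def detection_error_rate(a, n_reads, level=0.99):
    """Return ((ci_low, ci_high), error_rate) for relative abundance `a`.

    The interval is for the abundance re-estimated from a Poisson-distributed
    read count; error_rate = interval width / a.
    """
    lam = a * n_reads
    tail = (1.0 - level) / 2.0
    low = poisson_quantile(lam, tail) / n_reads
    high = poisson_quantile(lam, 1.0 - tail) / n_reads
    return (low, high), (high - low) / a


# Deeper sequencing narrows the interval for the same abundance.
_, err_shallow = detection_error_rate(1e-5, 1_000_000)
_, err_deep = detection_error_rate(1e-5, 40_000_000)
```

Working in log space for the probability mass keeps the running sum stable for the larger means that arise at 40 million reads.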
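Finally, once the relative abundance of each gene in the nucleic acid samples has been determined, the relative abundance of each gene in the nucleic acid samples from the first and second subjects is tested statistically. This shows whether any gene differs significantly in relative abundance between the samples of the first and second subjects; if so, that gene is judged to be a biomarker of the abnormal state, namely a gene marker.
According to embodiments of the present invention, the term "biomarker" as used herein should be understood broadly: it includes any detectable biological indicator that can reflect the abnormal state, and may be a gene marker, a species marker, or a functional marker.
In addition, according to embodiments of the present invention, the sequencing data can be assembled and genes predicted to obtain genes; genes that cannot be aligned to the reference gene set are new genes, and the new genes so determined are added to the reference gene set. This enlarges the reference gene set and thereby improves the efficiency of determining the biomarkers. According to one embodiment of the present invention, taxonomic assignment can be performed by aligning each gene in the reference gene set against the IMG database. According to one embodiment of the present invention, each gene in the reference gene set is aligned against the IMG database using BLASTP, and the taxonomic level of the gene is determined from hits with an E-value below $10^{-10}$. The taxonomic assignment of the genes can thereby be determined effectively. According to one embodiment of the present invention, functional annotation of each gene in the reference gene set is performed by aligning the gene against at least one of eggNOG and KEGG. According to one embodiment of the present invention, each gene in the reference gene set is aligned against the IMG database using BLASTP, and the function of the gene is determined from hits with an E-value below the threshold. The functional classification of the genes can thereby be determined effectively.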
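In addition, a known reference gene set usually contains species information and functional annotation for its genes. Once the gene relative abundances have been determined, the genes can therefore be grouped by species and by functional annotation to determine the species relative abundance and the functional relative abundance, from which the species markers and functional markers of the abnormal state can be determined. Thus, according to one embodiment of the present invention, the relative abundance comprises the species relative abundance and functional relative abundance of each gene, and the reference gene set contains species information and functional annotation, wherein determining the biomarkers of the abnormal state based on the difference between the first and second sequencing results further includes: aligning the sequencing reads constituting the first and second sequencing results against the reference gene set; determining, based on the alignment results, the species relative abundance and functional relative abundance of each gene in the nucleic acid samples from the first and second subjects; statistically testing these species and functional relative abundances; and determining that the species and functions whose relative abundance differs significantly between the nucleic acid samples from the first and second subjects are, respectively, species markers and functional markers of the abnormal state. The way in which the functional and species relative abundances are determined is not particularly limited. According to embodiments of the present invention, they can be obtained by computing a summary statistic, for example the sum, mean, or median, over the gene relative abundances of the genes from the same species and of the genes with the same functional annotation. According to one embodiment of the present invention, the relative abundance of each gene can be calculated according to the following formula: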
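$$a_i = \frac{x_i / L_i}{\sum_j x_j / L_j}$$
where
$a_i$ is the relative abundance of gene i in the sample;
$L_i$ is the length of gene i;
$x_i$ is the number of times gene i is detected in the sample.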
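As a worked illustration of the definitions above, the sketch below computes length-normalized gene relative abundances from read counts and then sums them per functional or taxonomic unit, which is the summation option the description mentions. The function names, the dictionary-based data layout, and the toy gene and KO identifiers are assumptions made for the example.

```python
from collections import defaultdict


def gene_relative_abundance(read_counts, gene_lengths):
    """a_i = (x_i / L_i) / sum_j (x_j / L_j) for every gene i."""
    # Copy number of each gene: reads detected divided by gene length.
    copy_number = {g: read_counts[g] / gene_lengths[g] for g in read_counts}
    total = sum(copy_number.values())
    if total == 0:
        raise ValueError("no reads were assigned to any gene")
    return {g: c / total for g, c in copy_number.items()}


def summed_profile(abundance, annotation):
    """Sum gene relative abundances per annotated unit (e.g. KO, OG, species)."""
    profile = defaultdict(float)
    for gene, value in abundance.items():
        unit = annotation.get(gene)
        if unit is not None:       # genes without annotation are skipped
            profile[unit] += value
    return dict(profile)


# Toy input (hypothetical identifiers).
counts = {"g1": 100, "g2": 300, "g3": 100}
lengths = {"g1": 1000, "g2": 3000, "g3": 500}
ko_of = {"g1": "K1", "g2": "K1", "g3": "K2"}

abundance = gene_relative_abundance(counts, lengths)
ko_profile = summed_profile(abundance, ko_of)
```

Dividing by gene length before normalizing keeps long genes from appearing more abundant merely because they collect more reads.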
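According to embodiments of the present invention, the method used to statistically test the gene, species, and functional relative abundances is not particularly limited. According to one embodiment of the present invention, the statistical test may be at least one selected from the Student's t-test and the Wilcoxon rank-sum test.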
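For illustration, a two-sided Wilcoxon rank-sum test on the relative abundances of one gene in the two groups can be sketched with the standard library alone. This is a simplified version under stated assumptions: it uses the normal approximation, assigns average ranks to ties, and applies neither a tie correction to the variance nor a continuity correction, so it is a sketch rather than a substitute for a statistics package. The function name `rank_sum_test` is hypothetical.

```python
import math
from statistics import NormalDist


def rank_sum_test(first, second):
    """Two-sided Wilcoxon rank-sum test; returns (z, p_value)."""
    combined = sorted(
        (value, group)
        for group, values in enumerate((first, second))
        for value in values
    )
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1   # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n1, n2 = len(first), len(second)
    rank_sum = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum - mean) / sd
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Toy abundances (hypothetical): cases versus controls for one gene.
z, p = rank_sum_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
```

When many genes are tested at once, the resulting P-values still need a multiple-testing adjustment such as the Bonferroni correction used in the validation example.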
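In addition, according to one embodiment of the present invention, the method further includes filtering out samples that are significantly affected by phenotypic factors, which further safeguards the validity and reliability of the biomarkers obtained. When the microbial flora in the excrement (which may be feces) of multiple first and second subjects is studied, the filtering is preferably performed by enterotype analysis and at least one test selected from Fisher's exact test and the Mantel-Haenszel test. The normal human gut microbiota can be divided into three distinct types (enterotypes). Enterotype is not affected by phenotypic factors such as age and sex, and further studies have shown that it is not affected by chronic metabolic diseases such as obesity either. In ordinary association analyses between disease and gut microbes, a population stratification factor such as enterotype can therefore make the associated biomarkers hard to identify, so the samples need to be assigned to enterotypes and tested for population stratification to remove the influence of enterotype. Enterotypes are assigned on the basis of the genus-level relative abundance profiles: the relative distance between samples (the JSD distance) is computed and the samples are clustered with the PAM algorithm; the result is verified by clustering the functional relative abundance profiles in the same way. Those skilled in the art will understand that other algorithms can also be used (for example the Hclust algorithm, the neighbor-joining method, or the maximum likelihood method). Once the enterotypes of the samples have been obtained, Fisher's exact test or the Mantel-Haenszel test can be used to judge whether the samples are significantly enriched in a particular enterotype. If they are, the enrichment can be eliminated by removing the samples concerned or by correcting with PCA. Moreover, in ordinary association studies, imperfect experimental design may leave the samples affected by age, sex, and similar factors; the Mantel-Haenszel test can be used to check whether their influence is significant, and samples significantly affected by these factors are removed from the results.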
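The sample distance used for enterotyping can be illustrated as follows. The sketch computes the Jensen-Shannon divergence between two abundance profiles and returns its square root as the distance, which is the convention used in the enterotyping literature cited in this document; the natural logarithm and the function names are assumptions, since the exact formula is not reproduced in this text. The clustering step (PAM) is not shown.

```python
import math


def _kl_divergence(p, m):
    """Kullback-Leibler divergence sum p_i * log(p_i / m_i), skipping p_i = 0."""
    return sum(pi * math.log(pi / mi) for pi, mi in zip(p, m) if pi > 0)


def jsd_distance(profile_p, profile_q):
    """Square root of the Jensen-Shannon divergence between two profiles."""
    total_p, total_q = sum(profile_p), sum(profile_q)
    p = [v / total_p for v in profile_p]      # renormalize to sum to 1
    q = [v / total_q for v in profile_q]
    m = [(a + b) / 2 for a, b in zip(p, q)]   # midpoint distribution
    jsd = 0.5 * _kl_divergence(p, m) + 0.5 * _kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))           # guard against tiny negative rounding


# Toy genus-level profiles (hypothetical values).
sample_a = [0.6, 0.3, 0.1]
sample_b = [0.1, 0.3, 0.6]
distance = jsd_distance(sample_a, sample_b)
```

Because the midpoint distribution is positive wherever either profile is positive, the distance stays finite even when a genus is absent from one sample, which is one practical reason to prefer it over a plain Kullback-Leibler divergence.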
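After the gene markers have been obtained, according to embodiments of the present invention, the method may further include performing cluster analysis and deep assembly on them, so as to construct the genomes of the organisms associated with the abnormal state. In general, many of the gene markers obtained may come from a much smaller number of related species. For many sample types, such as the human gut microbiota, most species have not been successfully isolated and sequenced; only by clustering these genes and reconstructing the corresponding disease-associated microbial genomes from the clusters can more information about these microbial species be obtained. Gene clustering can be performed with known clustering software. Once the clusters are obtained, the sequencing reads belonging to the corresponding microbial species can be retrieved from the original read pool by read alignment (for example, with SOAP2), and deep assembly of those reads (the assembly software is SOAPdenovo) yields the genome sequence of the microbial species. Further rounds of iterative alignment and deep assembly reconstruct the genome of the microbial species as completely as possible and greatly improve the assembly. After several iterations, the assembly that no longer improves can be taken as the final draft genome of the microbial species.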
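The iteration described above stops when the assembly no longer improves; the worked example elsewhere in this document makes that concrete as a total contig length that grows by less than 5% between rounds. The control flow can be sketched as below. The callable `assemble_once` is purely hypothetical: it stands in for one round of read recruitment (for example with SOAP2) followed by assembly (for example with SOAPdenovo), neither of which is invoked here.

```python
def iterate_assembly(assemble_once, seeds, min_gain=0.05, max_rounds=20):
    """Repeat recruit-and-assemble rounds until total contig length plateaus.

    assemble_once: hypothetical callable mapping a list of sequences (seeds or
                   contigs) to the list of contigs from one more round.
    min_gain:      stop when the relative gain in total length falls below this.
    """
    contigs = assemble_once(seeds)
    for _ in range(max_rounds):
        new_contigs = assemble_once(contigs)
        old_length = sum(len(c) for c in contigs)
        new_length = sum(len(c) for c in new_contigs)
        if old_length > 0 and (new_length - old_length) / old_length < min_gain:
            return new_contigs      # no significant improvement: final draft
        contigs = new_contigs
    return contigs


# Stand-in assembler for demonstration only: total length grows 1000 -> 1500
# -> 1560 -> 1570 across successive rounds.
_lengths = iter([1000, 1500, 1560, 1570])


def fake_assembler(sequences):
    return ["A" * next(_lengths)]


draft = iterate_assembly(fake_assembler, ["ACGT"])
```

The `max_rounds` cap is an added safeguard for the sketch, not something the description specifies.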
根据本发明的一个实施例, 进一步包括对所述生物标记物进行验证的步骤。 由此, 可 以进一步提高生物标记物与异常状态例如疾病诸如糖尿病之间关联的有效性和可靠性。
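System for determining biomarkers of an abnormal state
According to a further aspect, the present invention also provides a system for determining biomarkers of an abnormal state. Referring to FIG. 3, according to embodiments of the invention the system 1000 comprises a sequencing device 100 and an analysis device 200. The sequencing device 100 is adapted to perform nucleic acid sequencing on a nucleic acid sample from a first subject and a nucleic acid sample from a second subject so as to obtain a first sequencing result and a second sequencing result, each made up of multiple sequencing reads, wherein the first subject has the abnormal state, the second subject does not, the two nucleic acid samples are isolated from the same type of specimen, and the first and second subjects belong to the same species. The analysis device 200 is connected to the sequencing device 100 so that it can receive the first and second sequencing results from the sequencing device 100, and is adapted to determine markers associated with the abnormal state based on the difference between the first and second sequencing results. The system 1000 can thus be used to carry out effectively the method for determining biomarkers of an abnormal state according to embodiments of the invention, and hence to determine markers of the abnormal state effectively.
According to one embodiment, the system 1000 may further comprise a nucleic acid sample isolation device 300 connected to the sequencing device 100 and used to isolate nucleic acid samples from a subject, optionally from the subject's excreta, so as to supply the sequencing device with nucleic acid samples. The methods and equipment that can be used for sequencing are not particularly limited; a second-generation or third-generation sequencing platform may be used. According to one embodiment, the sequencing device is at least one selected from Hiseq2000, SOLiD, 454, and a single-molecule sequencing device. Such sequencing technologies reach a high sequencing depth at each site and considerably improve detection sensitivity and accuracy, so the high-throughput, deep-sequencing characteristics of these devices can be exploited to make the analysis of nucleic acid samples more efficient and to improve the precision and accuracy of the subsequent data analysis, in particular the statistical tests.
Referring to FIG. 4, according to one embodiment the analysis device 200 further comprises an alignment unit 201, a relative abundance determination unit 202, a test unit 203, and a marker determination unit 204. The alignment unit 201 is adapted to align the sequencing reads making up the first and second sequencing results to a reference gene set. The relative abundance determination unit 202 is connected to the alignment unit 201 and is adapted to determine, based on the alignment, the relative abundance of each gene in the nucleic acid samples from the first and second subjects. The test unit 203 is connected to the relative abundance determination unit 202 and is adapted to perform a statistical test on those relative abundances. The marker determination unit 204 is adapted to determine, based on the test result, the genes whose relative abundance differs significantly between the nucleic acid samples from the first and second subjects to be gene markers of the abnormal state. According to one embodiment, the test unit is adapted to perform at least one statistical test selected from the Student's t-test and the Wilcoxon rank-sum test.
According to one embodiment, the analysis device 200 may further comprise a filter unit 205, which is connected to the alignment unit 201 and is adapted to filter the sequencing results to remove contamination before the sequencing reads making up the first and second sequencing results are aligned to the reference gene set, the contamination being at least one selected from adapter contamination, low-quality sequences, and host genome contamination. According to one embodiment, the alignment unit 201 uses at least one of SOAP2 and MAQ to align the reads to the reference gene set. Optionally, the reference gene set, for example a non-redundant gene set of the human gut microbial community, may be stored in the alignment unit, which improves alignment efficiency.
According to one embodiment, the relative abundance is the species relative abundance and function relative abundance of the genes, and the reference gene set contains gene taxonomic information and functional annotation. The relative abundance determination unit is adapted to determine, based on the alignment, the species and function relative abundances of the genes in the nucleic acid samples from the first and second subjects; the test unit is adapted to perform a statistical test on those species and function relative abundances; and the marker determination unit is adapted to determine the species markers and functional markers of the abnormal state from the species and functions whose relative abundance differs significantly between the nucleic acid samples from the first and second subjects. Species markers and functional markers of the abnormal state can thus be determined effectively.
With the system 1000 according to embodiments of the invention, the method described above for determining biomarkers of an abnormal state can be carried out effectively. The advantages of the method have been described in detail above and are not repeated here. Those skilled in the art will understand that the features and advantages described above for the method also apply to the system and, for brevity, are not described again.
The invention is illustrated below with reference to specific examples. These examples are illustrative only and are not to be construed as limiting the invention.
Unless otherwise indicated, the techniques used in the examples are conventional techniques well known to those skilled in the art and can be carried out with reference to Molecular Cloning: A Laboratory Manual, third edition, or the relevant product instructions; the reagents and products used are all commercially available. Procedures and methods not described in detail are conventional methods well known in the art. The source and trade name of each reagent, and its composition where this needs to be listed, are given at first mention; unless otherwise stated, the same reagent used subsequently is as first described.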
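Example 1: Sample collection
The type II diabetes patients were from Peking University Shenzhen Hospital, China, and type II diabetes was diagnosed according to the criteria published by the WHO in 1999. Fecal samples were collected from the matched volunteers. For the 3 days before sampling the volunteers were asked to eat a light diet and avoid high-fat food, and for the 5 days before sampling they were asked not to consume yogurt or other lactic-acid products or prebiotics. During collection, care was taken not to mix urine into the fecal sample and to minimize contamination from the body and exposure to air. In total, 99 normal obese (NO), 75 normal lean (NL), 105 diabetic obese (DO), and 65 diabetic lean (DL) samples were collected. Of these, 145 samples were used as the test set (32 DO, 39 DL, 37 NO, and 37 NL), and the remaining 199 samples were used as the validation set (73 DO, 26 DL, 62 NO, and 38 NL); see Table 1.
Table 1. Sample collection statistics
Sample group   Diabetes   Obese   Samples (stage I)   Samples (stage II)
DO             Y          Y       32                  73
DL             Y          N       39                  26
NO             N          Y       37                  62
NL             N          N       37                  38

Example 2: Nucleic acid extraction and sequencing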
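2.1 Processing of fecal samples: each fecal sample was placed in a sterilized stool collection tube, frozen on dry ice or in liquid nitrogen for transport to the storage site, and then stored in a freezer at -80 °C.
2.2 Extraction of nucleic acid samples (DNA)
DNA was extracted from each fecal sample separately.
2.3 Library construction and sequencing:
Following the instructions of the instrument manufacturer, Illumina, 350 bp sequencing libraries were constructed from the extracted DNA and sequenced on the Illumina Genome Analyzer IIx platform.
The libraries of the 145 stage I samples were sequenced on the Illumina Genome Analyzer IIx platform, yielding 4,636,045,336 reads, that is, 383.08 Gb of raw data.
Type II diabetes associated biomarkers were identified following the workflow shown in FIGS. 4-6. The main steps are outlined below: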
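Example 3: Identification of biomarkers
3.1 Basic processing of the sequencing data
The sequencing data of the 145 stage I samples were filtered to remove adapter contamination, low-quality sequences, and host genome contamination, leaving 378.4 Gb of high-quality data.
3.2 Obtaining the microbiome gene set
Metagenomic biomarkers consist mainly of genes and the species and functions that correspond to them. The sequencing reads therefore first have to be assembled, genes predicted, and redundancy removed in order to build a non-redundant reference gene set (Junjie Qin, Ruiqiang Li, Jeroen Raes, et al. (2010) A human gut microbial gene catalogue established by metagenomic sequencing. Nature, 464:59-65). For studies of the human gut microbial community, the published non-redundant gene set of that reference, namely the existing 3.3M European human gut microbial gene set, was used as the reference gene set. Because the samples here were taken from non-Europeans, a new gene set had to be built from the samples and added to the original 3.3M European gut gene set. The updated gene set contains 4,267,985 predicted genes, of which 1,090,889 are newly added.
3.3 Taxonomic assignment of genes
Each gene was aligned against the IMG database (v3.4) using BLASTP. Genus-level assignments were taken from alignments with more than 85% similarity and more than 80% coverage (Arumugam, M. et al. Enterotypes of the human gut microbiome. Nature 473, 174-180, doi:10.1038/nature09944 (2011), incorporated herein by reference), and phylum-level assignments from alignments with more than 65% similarity. In the updated gene set, 21.3% of the genes were assigned at genus level, covering 26.4-90.6% (mean 61.2%) of the sequencing data of the samples; the remaining genes come from unknown species that have not yet been identified.
3.4 Functional annotation of genes
Genes were aligned against the eggNOG (v3) and KEGG (59.0) databases using BLASTP, hits of high similarity (E-value < 10^-10) were retained, and each gene was assigned to the database entry of its hit. Genes with no hit were aligned against one another, high-similarity hits were retained, and the gene clusters obtained by clustering were treated as newly discovered functional units.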
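3.5 Construction of species and function relative abundance profiles
3.5.1 The sequencing reads were aligned to the non-redundant reference gene set with SOAP2, and the relative abundance of each gene was calculated with the following formula:
[Formula image: Figure imgf000015_0001]
where
a_i is the relative abundance of gene i in the sample;
L_i is the length of gene i;
x_i is the number of times gene i is detected in the sample;
b_i is the copy number of gene i in the sequencing data from the sample.
Because the taxonomic assignment and functional annotation of the genes are known, the species and function relative abundance profiles are obtained from the gene relative abundance profile by summing the relative abundances of the genes under each species and each functional unit.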
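3.5.2 The theoretical accuracy of the relative abundance estimate was assessed with the method of Audic and Claverie (1997) (Audic, S. & Claverie, J. M. The significance of digital gene expression profiles. Genome Res 7, 986-995 (1997), incorporated herein by reference). Suppose x_i reads are obtained from gene i. Since these make up only a small fraction of all the reads of the sample, the distribution of x_i is approximated by a Poisson distribution.
Let N denote the total number of reads in the sample, so that N = Σ_i x_i. If all genes are assumed to have the same length, the relative abundance a_i of gene i can be written simply as a_i = x_i / N. The inventors can then proceed according to the following formula [the remainder of this sentence is missing in the source]:
[Formula image: Figure imgf000015_0002]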
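where [symbol illegible in the source] denotes the relative abundance calculated from [symbol illegible in the source] reads. Using this formula, the inventors set a_i to 0 to 1e-5 and N to 0 to 40 million in order to compute the 99% confidence interval of the calculated relative abundance and to further estimate the detection error rate; the results are shown in FIG. 7.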
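3.5.3 Construction of gene, KO (KEGG Orthologue), and OG (orthologue group in eggNOG) profiles
The updated gene set contains 4,267,985 non-redundant genes, which can be assigned to 6,313 KOs and 45,683 OGs (including 7,042 new gene families). Genes, KOs, and OGs present in fewer than 6 of the 145 stage I samples were removed first. To reduce the complexity of the statistical analysis, highly correlated gene pairs were identified when building the gene profile, and the genes were then clustered with a straightforward hierarchical clustering algorithm. An edge was assigned between any two genes whose Pearson correlation coefficient was > 0.9. Two clusters A and B were not merged if the total number of edges between A and B was smaller than |A|*|B|/3, where |A| and |B| are the sizes of A and B (the number of genes each contains). Only the longest gene of each gene linkage group was chosen to represent the group, giving 1,138,151 genes in total. These 1,138,151 genes and their relative abundances in the 145 stage I samples were used to build the gene profile for the association analysis.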
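The edge rule and the merge criterion above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the toy profiles, and the reading of "edges" as a count of gene pairs with Pearson r > 0.9 are assumptions of this sketch, not the inventors' implementation.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation coefficient of two equally long abundance vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (sd_u * sd_v) if sd_u and sd_v else 0.0


def may_merge(cluster_a, cluster_b, profiles, r_min=0.9):
    """Clusters may merge unless the edge count is below |A| * |B| / 3."""
    edges = sum(
        1
        for gene_a in cluster_a
        for gene_b in cluster_b
        if pearson(profiles[gene_a], profiles[gene_b]) > r_min
    )
    return edges >= len(cluster_a) * len(cluster_b) / 3


def representative(cluster, gene_lengths):
    """Pick the longest gene to represent a gene linkage group."""
    return max(cluster, key=lambda gene: gene_lengths[gene])


# Hypothetical abundance profiles of three genes across four samples
profiles = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8.1], "g3": [4, 1, 3, 2]}
```

Here g1 and g2 are almost perfectly correlated and would be linked, whereas g1 and g3 are negatively correlated and stay apart.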
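For the KO profile, the annotation of the original 4,267,985 genes was used to sum the relative abundances of the genes belonging to the same KO; the summed relative abundance was taken as the content of that KO in the sample, giving the KO profile of the 145 samples. The OG profile was built in the same way as the KO profile.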
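Summing gene abundances per KO is a plain group-by. A minimal sketch follows; the function name and the gene identifiers are hypothetical, and the two KO identifiers are used only as example labels.

```python
from collections import defaultdict


def functional_profile(gene_abundance, gene_to_unit):
    """Sum gene relative abundances per functional unit (for example per KO)."""
    profile = defaultdict(float)
    for gene, value in gene_abundance.items():
        unit = gene_to_unit.get(gene)
        if unit is not None:  # genes without annotation are skipped
            profile[unit] += value
    return dict(profile)


# Hypothetical sample: three genes, two of them annotated to the same KO
sample = {"g1": 0.2, "g2": 0.3, "g3": 0.5}
annotation = {"g1": "K00162", "g2": "K00162", "g3": "K05396"}
ko_profile = functional_profile(sample, annotation)
```

The same function gives an OG or species profile when `gene_to_unit` maps genes to OGs or to taxa instead.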
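3.6 Enterotype assignment
Based on the genus-level relative abundance profile, the distance between samples (JSD distance) was calculated with the formulas below, and the samples were clustered into enterotypes with the PAM algorithm (Arumugam, M. et al. Enterotypes of the human gut microbiome. Nature 473, 174-180, doi:10.1038/nature09944 (2011), incorporated herein by reference):
$$\mathrm{JSD}(P \| Q) = \tfrac{1}{2} D(P \| M) + \tfrac{1}{2} D(Q \| M)$$
$$M = \tfrac{1}{2}(P + Q)$$
$$D(P \| M) = \sum_i P(i) \ln \frac{P(i)}{M(i)}$$
where P(i) and Q(i) are the relative abundances of gene i in samples P and Q, respectively.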
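The result was verified by clustering the function relative abundance profile in the same way. Once the enterotypes of the samples were known, Fisher's exact test and the Mantel-Haenszel test were used to check whether the samples were significantly enriched in any enterotype.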
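The JSD distance between two samples can be computed directly from its definition as a mixture of two Kullback-Leibler divergences. The sketch below covers only the distance; the PAM clustering step is not shown, and the function names and the two toy profiles are assumptions of this example.

```python
from math import log


def kl_divergence(p, m):
    """D(P || M) = sum_i P(i) * ln(P(i) / M(i)); terms with P(i) = 0 contribute 0."""
    return sum(pi * log(pi / mi) for pi, mi in zip(p, m) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two relative abundance profiles."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical profiles of two samples over three genera
sample_p = [0.5, 0.5, 0.0]
sample_q = [0.0, 0.5, 0.5]
distance = jsd(sample_p, sample_q)
```

For these two profiles the divergence equals half of ln 2; identical profiles give 0, and with the natural logarithm the value can never exceed ln 2.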
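3.7 Analysis of the effect of environmental factors on the gut microbiota
Permutational multivariate analysis of variance (PERMANOVA; McArdle, B. H. & Anderson, M. J. Fitting Multivariate Models to Community Data: A Comment on Distance-Based Redundancy Analysis. Ecology 82, 290-297 (2001), incorporated herein by reference) was used to estimate the effect of each environmental factor (age, sex, T2D, BMI, and enterotype) on the gut microbiota. The inventors ran 10,000 permutations (Zapala, M. A. & Schork, N. J. Multivariate regression analysis of distance matrices for testing associations between gene expression patterns and related variables. Proceedings of the National Academy of Sciences of the United States of America 103, 19430-19435, doi:10.1073/pnas.0609333103 (2006), incorporated herein by reference); a factor with p < 0.05 was considered to affect the gut microbiota. Where the overall test was significant (see the table below), the inventors went on to screen for markers associated with the abnormal state.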
Figure imgf000017_0001
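3.8 Population stratification analysis
Population stratification means that the distribution of the samples is affected by age, sex, BMI, enterotype, and the like. To correct for the effect of known and unknown confounders on the selection of biomarkers associated with the abnormal state, the inventors corrected the gene profile data with the principal components obtained by the EIGENSTRAT method (Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nature genetics 38, 904-909, doi:10.1038/ng1847 (2006), incorporated herein by reference). Because each gene may be related to several factors, the inventors modified the EIGENSTRAT method by replacing each original principal component with the residual of its regression on the abnormal state. The number of principal components used for the correction was determined by the Tracy-Widom test (P < 0.05).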
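3.9 Association analysis / screening of gene and functional markers
The relative abundance data of genes and functions were tested with the Student's t-test and the Wilcoxon rank-sum test to screen for diabetes-associated gene markers and functional markers. The gene markers were clustered with known clustering software to obtain species markers (MLGs), and the relative abundance profile of the species markers was tested with the Student's t-test to calculate P values. The results are shown in Tables 2 and 3 below.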
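As an illustration of the rank-sum screening step, the sketch below computes a two-sided Wilcoxon rank-sum P value with the normal approximation. It is a simplified stand-in rather than the inventors' code: it omits the tie correction of the variance and the continuity correction, and the function names and toy abundances are assumptions of this example.

```python
from statistics import NormalDist


def rank_with_ties(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def rank_sum_p_value(cases, controls):
    """Two-sided rank-sum P value (normal approximation, no tie correction)."""
    n1, n2 = len(cases), len(controls)
    ranks = rank_with_ties(list(cases) + list(controls))
    w = sum(ranks[:n1])                      # rank sum of the case group
    mean = n1 * (n1 + n2 + 1) / 2            # expected rank sum under H0
    sd = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (w - mean) / sd
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical abundances of one gene in cases and controls
p_separated = rank_sum_p_value([5.1, 6.3, 7.0, 8.2, 9.4], [1.0, 1.9, 2.5, 3.3, 4.8])
p_identical = rank_sum_p_value([1, 2, 3], [1, 2, 3])
```

A gene would be kept as a candidate marker when its P value falls below the chosen significance threshold; for small groups an exact test is preferable to this approximation.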
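To organize and analyze the large volume of metagenomic data structurally and to reduce the amount of information for taxonomic description, the general concept of the MLG (Metagenomic Linkage Group, also called a candidate species) is introduced in place of the concept of a metagenomic species. An MLG here is a group of genetic material in a metagenome that is probably linked as a unit rather than independently distributed. The analysis then does not have to pin down the exact microbial species present in the metagenome, which matters because a large number of the organisms are unknown and lateral gene transfer (LGT) between bacteria is frequent. An MLG is defined as a group of genes that co-occur across individual samples and have consistent abundance and taxonomic assignment.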
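3.10 Identification of MLGs
3.10.1 Clustering method used to identify MLGs
To identify MLGs from the T2D-associated gene markers, the inventors proceeded as follows:
Step 1: The original set of T2D-associated gene markers was taken as the initial subclusters of genes. Note that gene linkage groups were built when the gene profile was constructed, in order to reduce the complexity of the statistical analysis; all genes from a gene linkage group were therefore treated as one subcluster.
Step 2: The Chameleon algorithm (Karypis, G. & Kumar, V. Chameleon: hierarchical clustering using dynamic modeling. Computer 32, 68-75 (1999), incorporated herein by reference), which uses dynamic modeling and is based on interconnectivity and closeness, was used to combine subclusters with a minimum similarity > 0.4. Similarity here is defined as the product of interconnectivity and closeness. The resulting clusters are called semi-clusters.
Step 3: The semi-clusters built in step 2 were merged. The similarity between every pair of semi-clusters was first updated, and each semi-cluster was then given a taxonomic assignment. Finally, two or more semi-clusters were merged into an MLG if they met both of the following requirements: a) the similarity between the semi-clusters was > 0.2; b) all of the semi-clusters were assigned to the same taxonomy lineage.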
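3.10.2 Taxonomic assignment of MLGs
All genes of an MLG were aligned at the nucleotide level (BLASTN) to the reference microbial genomes (IMG database, v3.4) and at the protein level (BLASTP) to the NCBI-nr database. Alignments were filtered by e-value (< 1×10^-10 at the nucleotide level, < 1×10^[exponent missing in the source] at the protein level) and by alignment coverage (> 70% of the query sequence). From the alignment to the reference microbial genomes, each MLG has a number of candidate species, which were ranked by the proportion of the MLG's genes they account for and by average similarity. The taxonomic assignment of an MLG was then decided as follows: 1) if more than 90% of the genes of the MLG map to a reference genome at a nucleotide-level threshold of 95%, the MLG is considered to originate from that known bacterial species; 2) if more than 80% of the genes of the MLG map to a reference genome at a threshold of 85% at both the nucleotide and protein levels, the MLG is considered to originate from the same genus as that known bacterial species; 3) if a 16S sequence can be identified in the MLG assembly, phylogenetic analysis is performed with RDP-Classifier (bootstrap value > 0.80) (Wang, Q., Garrity, G. M., Tiedje, J. M. & Cole, J. R. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 73, 5261-5267, doi:10.1128/AEM.00062-07 (2007), incorporated herein by reference), and the MLG is given a taxonomic assignment if the phylotype from the 16S sequence agrees with that from the genes.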
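Rules 1) and 2) amount to two threshold checks on the share of an MLG's genes that map to one reference genome. The sketch below is a simplification under stated assumptions: it looks only at nucleotide-level identity, ignores the protein-level condition of rule 2) and the 16S check of rule 3), and its function name and inputs are hypothetical.

```python
def assign_mlg_level(nt_identities, n_genes):
    """Classify an MLG from the percent identities of its mapped genes.

    nt_identities: best-hit identity (in %) to one reference genome for
    every gene that mapped at all; unmapped genes are simply absent.
    n_genes: total number of genes in the MLG.
    """
    share_at_95 = sum(1 for pct in nt_identities if pct >= 95.0) / n_genes
    share_at_85 = sum(1 for pct in nt_identities if pct >= 85.0) / n_genes
    if share_at_95 > 0.90:      # rule 1: same species as the reference
        return "species"
    if share_at_85 > 0.80:      # rule 2: same genus as the reference
        return "genus"
    return "unclassified"


# Hypothetical MLGs with 100 genes each
level_a = assign_mlg_level([99.0] * 95 + [88.0] * 3, 100)
level_b = assign_mlg_level([90.0] * 85, 100)
level_c = assign_mlg_level([90.0] * 50, 100)
```

The stricter rule is checked first, so an MLG that satisfies both rules is reported at species level.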
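3.10.3 Deep assembly of MLGs
To reconstruct the potential bacterial genomes, the inventors designed an additional procedure for the deep assembly of each MLG, consisting of four steps:
Step 1: The genes of the MLG were taken as seeds. The samples containing the seeds at the highest abundance were identified, and paired-end reads that mapped to the seeds (including pairs of which only one end mapped) were selected from those samples. The lower limit for the coverage of these paired-end reads was 50X from no more than 5 samples; coverage was calculated by dividing the total amount of selected sequencing data by the total length of the seeds.
Step 2: De novo assembly was carried out with SOAPdenovo, using the parameters that were used for building the gene set.
Step 3: To identify and remove misassembled contigs that might result from contaminating data, a composition-based binning method was applied. Contigs whose GC content and sequencing depth differed from those of the other contigs of the assembly were removed, since they may have been wrongly assembled for various reasons.
Step 4: The final assembly was obtained from step 3; step 2 was repeated until the assembly no longer improved appreciably (specifically, until the increase in total contig length fell below 5%).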
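The stopping rule of step 4 can be written as a small loop. In this sketch the assembler is replaced by a caller-supplied function that returns the total contig length after each round; that callback, the function name, and the simulated lengths are assumptions of the example, and no real SOAPdenovo invocation is shown.

```python
def iterate_until_converged(run_round, max_rounds=20, min_gain=0.05):
    """Repeat assembly rounds until total contig length grows by less than min_gain.

    run_round(n) must return the total contig length after round n.
    Returns (final_total_length, number_of_rounds_run).
    """
    previous = run_round(1)
    for round_no in range(2, max_rounds + 1):
        current = run_round(round_no)
        gain = (current - previous) / previous
        previous = current
        if gain < min_gain:
            return current, round_no
    return previous, max_rounds


# Simulated total contig lengths (bp) after rounds 1 to 4
simulated_lengths = [1_000_000, 1_300_000, 1_400_000, 1_420_000]
result = iterate_until_converged(lambda n: simulated_lengths[n - 1])
```

With these simulated values the gains are 30%, about 7.7%, and about 1.4%, so the loop stops after the fourth round.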
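3.11 MLG-based analysis
3.11.1 Validity of the MLG method:
The performance of the MLG identification method was assessed as follows: 1) rarely occurring genes (present in fewer than 6 samples) were first filtered out of the quantified gene results; 2) based on the taxonomic assignments in the updated gene set, a set of gut bacterial species was identified, each required to have 1,000 to 5,000 uniquely mapped genes at a similarity threshold of 95%. At this step, redundant strains within a species were removed manually and genes that mapped to more than one species were discarded. In the end, 130,065 genes from 50 bacterial species were identified as the test set for assessing the MLG method; 3) the standard MLG method described above was applied to the test set. For each MLG, the inventors calculated the percentage of genes that did not come from the major species, as a measure of accuracy (gene %, see Table 7).
3.11.2 Relative abundance of MLGs
The relative abundance of an MLG in each sample was estimated from the relative abundance values of its genes. For a given MLG, the genes whose relative abundance lay within 5% of the highest and of the lowest value were first discarded, and the remaining values were fitted to a Poisson distribution. The estimated mean of the Poisson distribution was taken as the relative abundance of the MLG. The MLG profile of all samples was obtained in this way and used in the analyses that follow.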
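One possible reading of this estimate is a trimmed mean: drop the top 5% and bottom 5% of the genes by abundance and average the rest, since the maximum-likelihood estimate of a Poisson mean is the sample mean. The sketch below follows that reading; the interpretation of the 5% rule, the function name, and the toy values are assumptions and may differ from the inventors' procedure.

```python
def mlg_relative_abundance(gene_abundances, trim_percent=5):
    """Trimmed mean of gene abundances as a stand-in for the fitted Poisson mean."""
    values = sorted(gene_abundances)
    k = len(values) * trim_percent // 100   # genes dropped at each end
    kept = values[k:len(values) - k]
    if not kept:
        raise ValueError("no genes left after trimming")
    return sum(kept) / len(kept)


# Hypothetical MLG with 20 genes, one very low and one very high outlier
toy_values = [0.0] + [10.0] * 18 + [1000.0]
estimate = mlg_relative_abundance(toy_values)
```

Trimming keeps a single mis-mapped or highly conserved gene from dominating the estimate for the whole group.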
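Example 4: Validation of the biomarkers
4.1 Basic processing of the sequencing data
Examples 1 and 2 were repeated to sequence the 199 stage II validation samples and obtain sequencing data.
The sequencing data were then processed as in Example 3 to obtain gene relative abundance profiles and species and function relative abundance profiles.
4.2 Association analysis / validation of gene and functional markers
The relative abundance data of the genes, species, and functions of the validation samples were tested with the same association tests as in Example 3, and the P values were subjected to a strict Bonferroni multiple-testing correction; the gene markers and functional markers that passed validation are the markers significantly associated with the disease. The gene markers were clustered with known clustering software to obtain species markers, and the relative abundance profile of the species markers was tested with the Student's t-test to calculate P values. The markers identified in Example 3 remained significantly associated with the disease and are summarized in Tables 2-1, 2-2, and 3 below. eggNOG and KEGG are functional annotation databases in which the corresponding gene family can be looked up by its identifier.
Table 2-1. Gene markers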
Figure imgf000020_0001
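Table 2-2. Functional markers
Functional marker   Enrichment group a (direction)   P-value (stage I)   P-value (stage II)
COG0229 1 1.82E-19 1.26E-05
K01251 1 6.10E-20 3.91E-05
K00162 1 3.87E-10 2.06E-05
K05396 1 3.92E-11 6.33E-06
K07315 1 2.97E-12 1.45E-07
COG0499 1 2.61E-18 0.000168
NOG134456 1 2.62E-14 0.000137
COG0659 0 7.88E-20 3.62E-08
K03321 0 1.09E-22 6.22E-08
K14652 0 7.60E-20 2.92E-07
NOG303876 0 1.50E-16 2.86E-05
K05339 0 1.42E-17 3.49E-07
COG1283 0 1.72E-15 5.34E-06
COG1266 0 1.70E-11 6.77E-06
K03324 0 8.79E-20 1.51E-05
a: 1 indicates enrichment in the type II diabetes group (harmful marker); 0 indicates enrichment in the control group (beneficial marker).
b: Under the hypothesis that the type II diabetes group and the control group do not differ overall, the P value (P < 0.05 indicates a significant difference) is the probability that a sample drawn from the hypothesized population gives a test statistic at least as extreme as the one obtained from the observed sample.
Table 3. Species markers
MLG c ID   Enrichment group d (direction)   P-value (stage I)   P-value (stage II)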
T2D-154 1 0.001347368 0.000254046
T2D-140 1 0.000397275 0.002849677
T2D-139 1 0.001328967 0.000211459
T2D-11 1 4.16065E-08 7.58308E-05
T2D-5 1 4.21047E-05 1.97056E-06
T2D-80 1 0.000129893 1.40862E-05
T2D-57 1 4.00759E-07 2.20525E-05
T2D-15 1 4.74327E-05 0.00029675
T2D-1 1 0.000601047 0.003604634
T2D-7 1 0.000601047 0.000279527
T2D-137 1 6.70507E-07 0.001204531
T2D-165 1 0.009634384 0.00166131
T2D-12 1 4.51685E-06 8.04282E-08
T2D-8 1 7.08451E-10 9.94749E-06
T2D-93 1 0.000208898 0.002040004
T2D-62 1 7.62983E-06 0.000688358
T2D-2 1 3.14293E-05 0.001850999
T2D-6 1 0.000202468 0.002073171
T2D-9 1 3.03578E-05 0.000117763
T2D-14 1 4.16065E-08 7.44243E-07
T2D-16 1 7.44638E-09 2.21532E-06
T2D-30 1 0.000140727 0.004548142
T2D-37 1 0.008582927 7.65392E-05
T2D-73 1 2.54217E-06 0.002296161
T2D-79 1 0.000511522 0.001924895
T2D-90 1 0.000704982 0.001710744
T2D-170 1 0.000665393 0.000421786
Con-107 0 1.12113E-07 0.001826862
Con-112 0 0.006389079 0.00019943
Con-129 0 0.003274757 0.001001054
Con-166 0 3.79947E-05 0.000193721
Con-121 0 6.10793E-05 4.89846E-06
Con-113 0 0.000284629 0.000972347
Con-120 0 0.000190164 0.000540535
Con-130 0 0.013361656 0.001837279
Con-131 0 0.000898899 0.001737676
Con-133 0 3.42674E-05 0.001474928
Con-109 0 0.013510306 0.000167496
Con-101 0 0.000136295 2.7876E-05
Con-104 0 9.0896E-07 4.32913E-05
Con-122 0 0.000415525 0.001694336
Con-142 0 1.14239E-05 0.001163884
Con-144 0 0.003671368 0.001951713
Con-148 0 0.014915281 0.004688126
Con-152 0 0.002630298 0.003828386
Con-155 0 0.000566927 0.007671607
Con-180 0 0.013068685 0.00275283
c: MLG: Metagenomic Linkage Group, a candidate species.
d: 1 indicates enrichment in the type II diabetes group (harmful marker); 0 indicates enrichment in the control group (beneficial marker).
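4.3 Predictive analysis of the biomarkers for disease
4.3.1 Predictive analysis of the gene markers for disease
The relative abundances of 10 harmful genes and 10 beneficial genes in the 344 samples were used as risk scores to estimate the area under the ROC (receiver-operating characteristic) curve, AUC (Michael J. Pencina, Ralph B. D'Agostino Sr, Ralph B. D'Agostino Jr, et al. Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Statistics in Medicine, 2008, 27(2): 157-172), and so to assess how well each gene diagnoses type II diabetes; the larger the AUC, the better the diagnostic ability. For each gene, a diagnostic cutoff was chosen at which the sum of sensitivity and specificity is highest.
In brief, the cutoff was determined as follows: the relative abundances of the gene were sorted in ascending order and each value was taken in turn as a candidate cutoff; sensitivity and specificity were calculated at each candidate, and the candidate with the largest sum of sensitivity and specificity was taken as the final, optimal cutoff. For a beneficial gene, a relative abundance below the cutoff is diagnosed as type II diabetes; for a harmful gene, a relative abundance above the cutoff is diagnosed as type II diabetes. The results are shown in Table 4-1.
Sensitivity, also called the true positive rate, is the probability that an actual patient is diagnosed as a patient by the marker, that is, the probability that a patient tests positive. Specificity, also called the true negative rate, is the probability that a person without the disease is diagnosed as a non-patient by the marker, that is, the probability that a non-patient tests negative.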
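The cutoff search and the AUC described above can be sketched as follows. This is an illustration under stated assumptions: labels are coded 1 for type II diabetes and 0 for control, ties between candidate cutoffs are resolved in favour of the smallest value, the AUC is computed as the share of correctly ordered case-control pairs, and all names and toy values are hypothetical.

```python
def best_cutoff(values, labels, harmful=True):
    """Return (cutoff, sensitivity, specificity) maximising sensitivity + specificity."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for cutoff in sorted(set(values)):
        # Harmful marker: above the cutoff means disease; beneficial: below it
        predicted = [(v > cutoff) if harmful else (v < cutoff) for v in values]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        tn = sum(1 for p, y in zip(predicted, labels) if not p and y == 0)
        sensitivity, specificity = tp / n_pos, tn / n_neg
        if best is None or sensitivity + specificity > best[1] + best[2]:
            best = (cutoff, sensitivity, specificity)
    return best


def auc(values, labels, harmful=True):
    """AUC as the fraction of case-control pairs ranked in the expected order."""
    cases = [v for v, y in zip(values, labels) if y == 1]
    controls = [v for v, y in zip(values, labels) if y == 0]
    high, low = (cases, controls) if harmful else (controls, cases)
    wins = sum(1.0 if h > l else 0.5 if h == l else 0.0 for h in high for l in low)
    return wins / (len(high) * len(low))


# Hypothetical relative abundances of a harmful marker in eight samples
abundances = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
status = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff_result = best_cutoff(abundances, status)
auc_result = auc(abundances, status)
```

For the toy data the search settles on a cutoff of 0.3 with sensitivity 1.0 and specificity 0.75, and 15 of the 16 case-control pairs are ordered correctly.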
Table 4-1. AUC and cutoff of the gene markers
Gene marker ID   Enrichment group a (direction)   cutoff   AUC   Sensitivity   Specificity   Sequence No.
52049 0 6.44E-08 0.685 0.564706 0.752874 1
66281 0 3.63E-08 0.684 0.576471 0.741379 2
86279 0 3.06E-07 0.683 0.688235 0.626437 3
337304 1 [missing] 0.658 0.647059 0.632184 4
1224005 1 7.85E-08 0.666 0.611765 0.666667 5
1238449 0 [missing] 0.683 0.770588 0.568966 6
2005309 1 9.78E-08 0.657 0.635294 0.649425 7
2060779 0 4.15E-07 0.687 0.682353 0.666667 8
2370529 1 1.02E-07 0.66 0.641176 0.614943 9
2581190 1 3.10E-07 0.663 0.482353 0.804598 10
2746171 1 5.30E-08 0.659 0.705882 0.563218 11
3182475 1 1.25E-06 0.658 0.388235 0.873563 12
3247820 1 6.78E-08 0.662 0.605882 0.695402 13
3250057 1 3.09E-08 0.666 0.705882 0.568966 14
3253773 1 9.35E-08 0.657 0.682353 0.62069 15
3646621 0 4.60E-07 0.68 0.747059 0.563218 16
3793132 0 6.97E-07 0.681 0.723529 0.568966 17
3815768 0 8.22E-08 0.68 0.541176 0.775862 18
4097912 0 1.57E-07 0.689 0.652941 0.695402 19
4136092 0 1.20E-07 0.688 0.658824 0.683908 20
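4.3.2 Predictive analysis of the functional markers for disease
The relative abundances of the 7 selected harmful functional markers and 8 selected beneficial functional markers in the 344 samples were used as risk scores to estimate the area under the ROC (receiver-operating characteristic) curve, AUC (Michael J. Pencina, Ralph B. D'Agostino Sr, Ralph B. D'Agostino Jr, et al. Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Statistics in Medicine, 2008, 27(2): 157-172), and so to assess how well each functional marker diagnoses type II diabetes; the larger the AUC, the better the diagnostic ability. For each functional marker, a diagnostic cutoff was chosen at which the sum of sensitivity and specificity is highest.
Determination of the cutoff: the relative abundances of the functional marker were sorted in ascending order and each value was taken in turn as a candidate cutoff; sensitivity and specificity were calculated at each candidate, and the candidate with the largest sum of sensitivity and specificity was taken as the final, optimal cutoff. A direction of 1 means that the functional marker is harmful, and a direction of 0 means that it is beneficial. For a beneficial functional marker, a relative abundance below the cutoff is diagnosed as type II diabetes; for a harmful functional marker, a relative abundance above the cutoff is diagnosed as type II diabetes. See Table 4-2.
Sensitivity, also called the true positive rate, is the probability that an actual patient is diagnosed as a patient by the marker, that is, the probability that a patient tests positive. Specificity, also called the true negative rate, is the probability that a person without the disease is diagnosed as a non-patient by the marker, that is, the probability that a non-patient tests negative.
Table 4-2. AUC and cutoff of the functional markers
Marker   Enrichment group (direction)   cutoff   AUC   Sensitivity   Specificity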
COG0229 1 5.03E-05 0.695 0.694118 0.643678
K01251 1 5.84E-05 0.686 0.758824 0.517241
K00162 1 1.01E-05 0.68 0.647059 0.655172
K05396 1 2.33E-05 0.678 0.511765 0.798851
K07315 1 5.39E-05 0.678 0.552941 0.747126
COG0499 1 5.06E-05 0.674 0.782353 0.488506
NOG134456 1 2.71E-06 0.674 0.441176 0.833333
COG0659 0 0.000206 0.715 0.688235 0.678161
K03321 0 0.000215 0.715 0.711765 0.66092
K14652 0 0.000208 0.703 0.511765 0.827586
NOG303876 0 2.37E-05 0.693 0.629412 0.706897
K05339 0 0.000193 0.688 0.641176 0.649425
COG1283 0 0.000369 0.679 0.670588 0.591954
COG1266 0 0.000283 0.677 0.788235 0.494253
K03324 0 0.000414 0.675 0.735294 0.528736
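4.3.3 Predictive analysis of the species markers for disease
The relative abundances of the 27 selected harmful MLGs and 20 selected beneficial MLGs in the 344 samples were used as risk scores to estimate the area under the ROC (receiver-operating characteristic) curve, AUC, and so to assess how well each MLG diagnoses type II diabetes. For each MLG, a diagnostic cutoff was chosen at which the sum of sensitivity and specificity is highest.
The cutoff was determined as follows: the relative abundances of the MLG were sorted in ascending order and each value was taken in turn as a candidate cutoff; sensitivity and specificity were calculated at each candidate, and the candidate with the largest sum of sensitivity and specificity was taken as the final, optimal cutoff. For a beneficial MLG, a relative abundance below the cutoff is diagnosed as type II diabetes; for a harmful MLG, a relative abundance above the cutoff is diagnosed as type II diabetes. The results are summarized in the table below.
Sensitivity, also called the true positive rate, is the probability that an actual patient is diagnosed as a patient by the marker, that is, the probability that a patient tests positive. Specificity, also called the true negative rate, is the probability that a person without the disease is diagnosed as a non-patient by the marker, that is, the probability that a non-patient tests negative.
AUC and cutoff of the species markers
MLG ID   Enrichment group (direction)   cutoff   AUC   Sensitivity   Specificity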
T2D-11 1 0.103658 0.618 0.541176 0.66092
T2D-12 1 0.006279 0.654 0.564706 0.689655
T2D-137 1 0.498151 0.585 0.423529 0.729885
T2D-139 1 1.553228 0.617 0.5 0.701149
T2D-140 1 0.49045 0.571 0.423529 0.735632
T2D-14 1 0.010063 0.652 0.764706 0.505747
T2D-154 1 8.95E-05 0.604 0.411765 0.798851
T2D-15 1 0.00508 0.589 0.670588 0.494253
T2D-165 1 0.032528 0.6 0.488235 0.701149
T2D-16 1 0.003242 0.634 0.6 0.626437
T2D-170 1 0.032845 0.616 0.417647 0.804598
T2D-1 1 0.098314 0.526 0.076471 0.977011
T2D-2 1 0.0072 0.586 0.388235 0.816092
T2D-30 1 0.001567 0.54 0.147059 0.936782
T2D-37 1 0.099862 0.591 0.411765 0.770115
T2D-57 1 0.015788 0.647 0.523529 0.701149
T2D-5 1 0.000673 0.651 0.688235 0.563218
T2D-62 1 0.274395 0.624 0.417647 0.793103
T2D-6 1 0.089696 0.526 0.094118 0.982759
T2D-73 1 0.107684 0.6 0.311765 0.885057
T2D-79 1 0.150142 0.572 0.594118 0.563218
T2D-7 1 0.046154 0.604 0.523529 0.655172
T2D-80 1 0.003178 0.655 0.682353 0.586207
T2D-8 1 0.007389 0.622 0.641176 0.58046
T2D-90 1 0.009561 0.62 0.447059 0.758621
T2D-93 1 0.034981 0.563 0.417647 0.718391
T2D-9 1 0.008346 0.62 0.570588 0.637931
Con-101 0 0.011503 0.672 0.717647 0.58046
Con-104 0 0.156174 0.668 0.658824 0.632184
Con-107 0 0.34953 0.656 0.652941 0.637931
Con-109 0 0.001797 0.641 0.423529 0.816092
Con-112 0 0.059392 0.606 0.529412 0.632184
Con-113 0 0.36604 0.646 0.641176 0.614943
Con-120 0 1.686662 0.62 0.705882 0.5
Con-121 0 0.06585 0.67 0.688235 0.568966
Con-122 0 0.003649 0.602 0.723529 0.448276
Con-129 0 0.663083 0.618 0.658824 0.557471
Con-130 0 0.403354 0.604 0.664706 0.54023
Con-131 0 0.639878 0.643 0.6 0.62069
Con-133 0 0.419924 0.627 0.717647 0.505747
Con-142 0 0.180048 0.625 0.529412 0.655172
Con-144 0 0.082044 0.613 0.564706 0.649425
Con-148 0 0.689789 0.605 0.758824 0.408046
Con-152 0 0.222946 0.598 0.705882 0.494253
Con-155 0 0.001098 0.575 0.811765 0.321839
Con-166 0 0.001912 0.67 0.5 0.781609
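Con-180 0 7.74E-05 0.599 0.694118 0.494253

Example 5: Reconstruction of disease-associated microbial genomes
5.1 Deep assembly
After the species markers had been obtained, SOAP2 was used to pull the sequencing data belonging to the corresponding microbial species out of the original metagenomic read pool, and these data were deep-assembled with SOAPdenovo to give the genome sequence of that species. Further rounds of alignment and deep assembly reconstruct the genome of the species as completely as possible and greatly improve the assembly. After several iterations, the assembly that no longer improved was taken as the final draft genome of the microbial species; see Table 6.
Table 6. Assembly results of the markers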
Figure imgf000025_0001
Figure imgf000026_0001
5.2 Identification of the microbial genomes
The assembled draft genomes were taxonomically identified by 16S region identification and whole-genome identification; the taxonomic assignment (level) is given in Table 7.
Table 7. Taxonomic assignment (level)
MLG ID   Number of genes   Taxonomic assignment (level)   Gene % e   Similarity f
Enriched in the type II diabetes group:
T2D-154 337 Akkermansia muciniphila 97.92 98.17±0.09
T2D-140 148 Bacteroides intestinalis 89.19 98.20±0.15
T2D-139 3,386 Bacteroides sp. 20_3 94.60 99.29±0.01
T2D-11 5,113 Clostridium bolteae 96.87 99.39±0.02
T2D-5 2,378 Clostridium hathewayi 96.93 99.31±0.03
T2D-80 2,381 Clostridium ramosum 95.38 99.81±0.01
T2D-57 821 Clostridium sp. HGF2 97.69 99.59±0.03
T2D-15 2,492 Clostridium symbiosum 95.63 99.58±0.01
T2D-1 949 Desulfovibrio sp. 3_1_syn3 93.78 98.04±0.08
T2D-7 1,056 Eggerthella lenta 94.22 99.63±0.03
T2D-137 425 Escherichia coli 70.35 99.01±0.08
T2D-165 131 Alistipes (genus) 89.31
T2D-12 364 Clostridium (genus) 79.40
T2D-8 5,272 Clostridium (genus) 65.35
T2D-93 1,590 Parabacteroides (genus) 60.69
T2D-62 2,584 Subdoligranulum (genus) 93.81
T2D-2 2,430 Lachnospiraceae (family) 95.43
T2D-6 1,305 Unclassified 96.55
T2D-9 105 Unclassified 67.62
T2D-14 392 Unclassified 74.74
T2D-16 222 Unclassified 72.07
T2D-30 430 Unclassified 98.84
T2D-37 251 Unclassified 92.03
T2D-73 565 Unclassified 96.81
T2D-79 1,632 Unclassified 86.89
T2D-90 130 Unclassified 99.23
T2D-170 114 Unclassified 95.61
Enriched in the control group:
Con-107 1,677 Clostridiales sp. SS3/4 97.02 97.95±0.06
Con-112 232 Eubacterium rectale 90.52 97.56±0.12
Con-129 1,440 Faecalibacterium prausnitzii 96.74 98.18±0.04
Con-166 273 Haemophilus parainfluenzae 95.24 94.81±0.17
Con-121 3,507 Roseburia intestinalis 92.19 98.90±0.03
Con-113 345 Roseburia inulinivorans 94.20 98.21±0.11
Con-120 116 Eubacterium (genus) 55.17
Con-130 670 Faecalibacterium (genus) 51.94
Con-131 202 Faecalibacterium (genus) 77.23
Con-133 1,555 Erysipelotrichaceae (family) 77.88
Con-109 378 Clostridiales (order) 74.87
Con-101 1,762 Unclassified 85.70
Con-104 916 Unclassified 67.58
Con-122 1,999 Unclassified 80.24
Con-142 673 Unclassified 95.39
Con-144 162 Unclassified 96.91
Con-148 481 Unclassified 82.95
Con-152 945 Unclassified 81.16
Con-155 228 Unclassified 89.47
Con-180 528 Unclassified 86.55
e: The percentage of the MLG's genes that belong to the species closest to it.
f: Average similarity to the closest species.

Example 6: Odds ratios of the species markers
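To validate the species markers further, the odds ratio of each species marker in the 344 samples above was calculated; see Table 8. The results show that the association of the species is strong (all odds ratios are greater than 1; the larger the odds ratio, the more strongly the species marker is enriched in the samples of its group).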
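The text does not state how the odds ratios and their confidence intervals were estimated. Purely as an illustration of the quantity involved, the sketch below computes the textbook odds ratio of a 2×2 table with a Wald 95% confidence interval on the log scale; the dichotomisation into marker-high and marker-low samples, the function name, and the counts are all assumptions of this example.

```python
from math import exp, log, sqrt


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: cases with high marker abundance     b: cases with low abundance
    c: controls with high marker abundance  d: controls with low abundance
    """
    odds_ratio = (a * d) / (b * c)
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # standard error of ln(OR)
    lower = exp(log(odds_ratio) - z * se_log)
    upper = exp(log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts for one marker
estimate_or, ci_low, ci_high = odds_ratio_ci(60, 40, 40, 60)
```

An interval whose lower bound stays above 1, as with these toy counts, indicates an association that is unlikely to be due to chance alone.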
Table 8. Odds ratios of the species markers
MLG ID   Taxonomic assignment (level)   Odds ratio (95% confidence interval)
Enriched in the type II diabetes group:
T2D-154 Akkermansia muciniphila 1.52 (1.05, 2.19)
T2D-140 Bacteroides intestinalis 1.50 (1.15, 1.97)
T2D-139 Bacteroides sp. 20_3 1.66 (1.26, 2.20)
T2D-11 Clostridium bolteae 5.89 (1.39, 25.0)
T2D-5 Clostridium hathewayi 23.1 (2.08, 256.6)
T2D-80 Clostridium ramosum 1.68 (0.97, 2.89)
T2D-57 Clostridium sp. HGF2 2.62 (1.14, 6.03)
T2D-15 Clostridium symbiosum 1.13 (0.88, 1.44)
T2D-1 Desulfovibrio sp. 3_1_syn3 1.41 (0.93, 2.13)
T2D-7 Eggerthella lenta 1.57 (0.95, 2.58)
T2D-137 Escherichia coli 1.72 (1.16, 2.57)
T2D-165 Alistipes (genus) 1.46 (1.07, 1.99)
T2D-12 Clostridium (genus) 2.22 (1.12, 4.40)
T2D-8 Clostridium (genus) 1.12 (0.86, 1.45)
T2D-93 Parabacteroides (genus) 1.84 (1.03, 3.29)
T2D-62 Subdoligranulum (genus) 2.41 (1.43, 4.08)
T2D-2 Lachnospiraceae (family) 4.06 (1.28, 12.9)
T2D-6 Unclassified 3.70 (1.18, 11.7)
T2D-9 Unclassified 1.02 (0.83, 1.27)
T2D-14 Unclassified 9.61 (1.93, 47.8)
T2D-16 Unclassified 1.17 (0.87, 1.56)
T2D-30 Unclassified 1.27 (0.94, 1.73)
T2D-37 Unclassified 1.68 (1.27, 2.22)
T2D-73 Unclassified 1.89 (1.26, 2.83)
T2D-79 Unclassified 1.28 (0.97, 1.68)
T2D-90 Unclassified 2.01 (1.29, 3.13)
T2D-170 Unclassified 1.85 (0.96, 3.57)
Enriched in the control group:
Con-107 Clostridiales sp. SS3/4 1.44 (1.13, 1.84)
Con-112 Eubacterium rectale 1.51 (1.13, 2.03)
Con-129 Faecalibacterium prausnitzii 1.55 (1.19, 2.00)
Con-166 Haemophilus parainfluenzae 1.25 (0.93, 1.69)
Con-121 Roseburia intestinalis 3.10 (1.92, 5.03)
Con-113 Roseburia inulinivorans 1.45 (1.11, 1.89)
Con-120 Eubacterium (genus) 1.55 (1.17, 2.06)
Con-130 Faecalibacterium (genus) 1.59 (1.21, 2.08)
Con-131 Faecalibacterium (genus) 1.58 (1.16, 2.15)
Con-133 Erysipelotrichaceae (family) 1.52 (1.15, 2.01)
Con-109 Clostridiales (order) 1.41 (1.09, 1.83)
Con-101 Unclassified 1.56 (1.00, 2.43)
Con-104 Unclassified 1.96 (1.33, 2.89)
Con-122 Unclassified 1.97 (1.16, 3.34)
Con-142 Unclassified 1.38 (1.03, 1.83)
Con-144 Unclassified 1.38 (1.09, 1.74)
Con-148 Unclassified 2.10 (1.31, 3.36)
Con-152 Unclassified 1.53 (1.17, 2.00)
Con-155 Unclassified 1.72 (1.18, 2.50)
Con-180 Unclassified 1.64 (1.15, 2.32)

Table 9. Sequences of the gene markers

Although specific embodiments of the invention have been described in detail, those skilled in the art will understand that, in light of all the teachings disclosed, various modifications and substitutions can be made to those details, and all such changes fall within the scope of protection of the invention. The full scope of the invention is given by the appended claims and any equivalents thereof.
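In this specification, descriptions that refer to the terms "one embodiment", "some embodiments", "illustrative embodiment", "example", "specific example", or "some examples" mean that a particular feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. Such references do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Claims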
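1. A method for determining a biomarker of an abnormal state, characterized by comprising the following steps:
performing nucleic acid sequencing on a nucleic acid sample from a first subject and a nucleic acid sample from a second subject to obtain a first sequencing result and a second sequencing result, each made up of multiple sequencing reads, wherein the first subject has the abnormal state, the second subject does not have the abnormal state, the nucleic acid sample from the first subject and the nucleic acid sample from the second subject are isolated from the same type of specimen, and the first subject and the second subject belong to the same species; and
determining a marker associated with the abnormal state based on the difference between the first sequencing result and the second sequencing result.
2. The method according to claim 1, characterized in that the abnormal state is a disease.
3. The method according to claim 1 or 2, characterized in that the disease is at least one selected from neoplastic diseases, immune diseases, genetic diseases, and metabolic diseases.
4. The method according to claim 1 or 2, characterized in that the abnormal state is diabetes.
5. The method according to any preceding claim, characterized in that the first subject and the second subject are human.
6. The method according to any preceding claim, characterized in that the nucleic acid sample from the first subject and the nucleic acid sample from the second subject are isolated from the excreta of the first subject and of the second subject, respectively.
7. The method according to any preceding claim, characterized in that at least one of the nucleic acid sample from the first subject and the nucleic acid sample from the second subject is sequenced using second-generation or third-generation sequencing technology.
8. The method according to any preceding claim, characterized in that the nucleic acid sequencing is performed using at least one selected from Hiseq2000, SOLiD, 454, and a single-molecule sequencing device.
9. The method according to any preceding claim, characterized in that determining the biomarker of the abnormal state based on the difference between the first sequencing result and the second sequencing result further comprises:
aligning the sequencing reads making up the first sequencing result and the second sequencing result to a reference gene set;
determining, based on the alignment, the relative abundance of each gene in the nucleic acid samples from the first subject and the second subject, respectively;
performing a statistical test on the relative abundance of each gene in the nucleic acid samples from the first subject and the second subject; and
determining genes whose relative abundance differs significantly between the nucleic acid samples from the first subject and the second subject to be gene markers of the abnormal state.
10. The method according to claim 9, characterized by further comprising, before the sequencing reads making up the first sequencing result and the second sequencing result are aligned to the reference gene set, a step of filtering the sequencing results to remove contamination, wherein the contamination is at least one selected from adapter contamination, low-quality sequences, and host genome contamination.
11. The method according to claim 9 or 10, characterized in that at least one selected from SOAP2 and MAQ is used to align the sequencing reads making up the first sequencing result and the second sequencing result to the reference gene set, optionally wherein the reference gene set is a non-redundant gene set of the human gut microbial community.
12. The method according to claim 9, characterized by further comprising:
assembling the sequencing reads making up the first sequencing result and the second sequencing result and performing gene prediction to obtain genes, genes that cannot be aligned to the reference gene set being new genes; and adding the new genes so determined to the reference gene set.
13. The method according to claim 12, characterized in that the taxonomic assignment is performed by aligning each gene of the reference gene set against the IMG database.
14. The method according to claim 13, characterized in that each gene of the reference gene set is aligned against the IMG database using BLASTP, and the taxonomic level of the new gene is determined from hits with an E-value below 10^-10.
15. The method according to claim 12, characterized in that the functional annotation is performed by aligning each gene of the reference gene set against at least one of eggNOG and KEGG.
16. The method according to claim 15, characterized in that each gene of the reference gene set is aligned against the IMG database using BLASTP, and the function of the new gene is determined from hits with an E-value below 10^-10.
17. The method according to claim 9, characterized in that the relative abundance is the species relative abundance and function relative abundance of the genes, and the reference gene set contains gene taxonomic information and functional annotation,
wherein
determining the biomarker of the abnormal state based on the difference between the first sequencing result and the second sequencing result further comprises:
aligning the sequencing reads making up the first sequencing result and the second sequencing result to the reference gene set;
determining, based on the alignment, the species relative abundance and function relative abundance of the genes in the nucleic acid samples from the first subject and the second subject, respectively;
performing a statistical test on the species relative abundance and function relative abundance of the genes in the nucleic acid samples from the first subject and the second subject; and
determining the species and functions whose relative abundance differs significantly between the nucleic acid samples from the first subject and the second subject to be, respectively, species markers and functional markers of the abnormal state,
optionally, after the relative abundance has been obtained, statistically testing the accuracy of the relative abundance, preferably using a Poisson distribution.
18. The method according to any one of claims 9-17, characterized in that the statistical test is at least one selected from the Student's t-test and the Wilcoxon rank-sum test.
19. The method according to claims 9-17, characterized by further comprising filtering out samples significantly affected by phenotypic factors, preferably by enterotype analysis and at least one test selected from Fisher's exact test and the Mantel-Haenszel test,
optionally further comprising a population stratification analysis in which the gene profile data are corrected, preferably with the principal components obtained by the EIGENSTRAT method.
20. The method according to any one of claims 9-17, characterized by further comprising clustering the gene markers obtained and performing deep assembly in order to reconstruct the genomes of organisms associated with the abnormal state.
21. The method according to any preceding claim, characterized by further comprising a step of validating the biomarker.
22. A system for determining a biomarker of an abnormal state, characterized by comprising:
a sequencing device adapted to perform nucleic acid sequencing on a nucleic acid sample from a first subject and a nucleic acid sample from a second subject to obtain a first sequencing result and a second sequencing result, each made up of multiple sequencing reads, wherein the first subject has the abnormal state, the second subject does not have the abnormal state, the nucleic acid sample from the first subject and the nucleic acid sample from the second subject are isolated from the same type of specimen, and the first subject and the second subject belong to the same species; and
an analysis device connected to the sequencing device, which receives the first sequencing result and the second sequencing result from the sequencing device and is adapted to determine a marker associated with the abnormal state based on the difference between the first sequencing result and the second sequencing result.
23. The system according to claim 22, characterized by further comprising:
a nucleic acid sample isolation device connected to the sequencing device and adapted to isolate a nucleic acid sample from a subject, optionally from the subject's excreta.
24. The system according to claim 23, characterized in that the sequencing device is a second-generation sequencing platform or a third-generation sequencing platform.
25. The method according to any one of claims 22-24, characterized in that the sequencing device is at least one selected from Hiseq2000, SOLiD, 454, and a single-molecule sequencing device.
26. The method according to any one of claims 22-25, characterized in that the analysis device further comprises:
an alignment unit adapted to align the sequencing reads making up the first sequencing result and the second sequencing result to a reference gene set;
a relative abundance determination unit connected to the alignment unit and adapted to determine, based on the alignment, the relative abundance of each gene in the nucleic acid samples from the first subject and the second subject, respectively;
a test unit connected to the relative abundance determination unit and adapted to perform a statistical test on the relative abundance of each gene in the nucleic acid samples from the first subject and the second subject; and
a marker determination unit adapted to determine, based on the result of the statistical test, genes whose relative abundance differs significantly between the nucleic acid samples from the first subject and the second subject to be gene markers of the abnormal state.
27. The system according to claim 26, characterized in that the analysis device further comprises:
a filter unit connected to the alignment unit and adapted to filter the sequencing results to remove contamination before the sequencing reads making up the first sequencing result and the second sequencing result are aligned to the reference gene set, wherein the contamination is at least one selected from adapter contamination, low-quality sequences, and host genome contamination.
28. The system according to claim 26 or 27, characterized in that the alignment unit uses at least one selected from SOAP2 and MAQ to align the sequencing reads making up the first sequencing result and the second sequencing result to the reference gene set, optionally wherein the reference gene set is a non-redundant gene set of the human gut microbial community.
29. The system according to any one of claims 26-28, characterized in that the relative abundance is the species relative abundance and function relative abundance of the genes, and the reference gene set contains gene taxonomic information and functional annotation,
wherein the relative abundance determination unit is adapted to determine, based on the alignment, the species relative abundance and function relative abundance of the genes in the nucleic acid samples from the first subject and the second subject, respectively;
the test unit is adapted to perform a statistical test on the species relative abundance and function relative abundance of the genes in the nucleic acid samples from the first subject and the second subject; and
the marker determination unit is adapted to determine the species markers and functional markers of the abnormal state from the species and functions whose relative abundance differs significantly between the nucleic acid samples from the first subject and the second subject.
30. The system according to any one of claims 26-29, characterized in that the test unit is adapted to perform at least one statistical test selected from the Student's t-test and the Wilcoxon rank-sum test.
31. The system according to any one of claims 26-30, characterized by further comprising a genome assembly device adapted to cluster the gene markers obtained and perform deep assembly in order to reconstruct the genomes of organisms associated with the abnormal state.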
PCT/CN2012/079524 2012-08-01 2012-08-01 确定异常状态生物标记物的方法及*** WO2014019180A1 (zh)

Priority Applications (5)

Application Number Priority Date Filing Date Title
PCT/CN2012/079524 WO2014019180A1 (zh) 2012-08-01 2012-08-01 确定异常状态生物标记物的方法及***
CN201280075072.1A CN104603283B (zh) 2012-08-01 2012-08-22 确定异常状态相关生物标志物的方法及***
US13/640,448 US20150376697A1 (en) 2012-08-01 2012-08-22 Method and system to determine biomarkers related to abnormal condition
PCT/CN2012/080479 WO2014019267A1 (en) 2012-08-01 2012-08-22 Method and system to determine biomarkers related to abnormal condition
HK15108222.6A HK1207670A1 (zh) 2012-08-01 2015-08-25 確定異常狀態相關生物標誌物的方法及系統

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2012/079524 WO2014019180A1 (zh) 2012-08-01 2012-08-01 确定异常状态生物标记物的方法及***

Publications (1)

Publication Number Publication Date
WO2014019180A1 true WO2014019180A1 (zh) 2014-02-06

Family

ID=50027105

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2012/079524 WO2014019180A1 (zh) 2012-08-01 2012-08-01 确定异常状态生物标记物的方法及***
PCT/CN2012/080479 WO2014019267A1 (en) 2012-08-01 2012-08-22 Method and system to determine biomarkers related to abnormal condition

Family Applications After (1)

Application Number Title Priority Date Filing Date
PCT/CN2012/080479 WO2014019267A1 (en) 2012-08-01 2012-08-22 Method and system to determine biomarkers related to abnormal condition

Country Status (3)

Country Link
US (1) US20150376697A1 (zh)
HK (1) HK1207670A1 (zh)
WO (2) WO2014019180A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105420375A (zh) * 2015-12-24 2016-03-23 北京大学 一种环境微生物基因组草图的构建方法

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150211053A1 (en) * 2012-08-01 2015-07-30 Bgi-Shenzhen Biomarkers for diabetes and usages thereof
CN105209918B (zh) * 2013-05-09 2017-09-29 宝洁公司 生物标记鉴定方法和***
CA2963013C (en) 2014-09-30 2022-10-04 Bgi Shenzhen Biomarkers for rheumatoid arthritis and usage thereof
CN105825076B (zh) * 2015-01-08 2018-12-14 杭州天译基因科技有限公司 消除常染色体内和染色体间gc偏好的方法及检测***
WO2016141516A1 (zh) * 2015-03-06 2016-09-15 深圳华大基因研究院 获取子代特异性序列、检测子代新突变的方法和装置
US20180030403A1 (en) 2016-07-28 2018-02-01 Bobban Subhadra Devices, systems and methods for the production of humanized gut commensal microbiota
CN111445949A (zh) * 2020-03-27 2020-07-24 武汉古奥基因科技有限公司 利用纳米孔测序数据的高原多倍体鱼类基因组注释方法
CN112071366B (zh) * 2020-10-13 2024-02-27 南开大学 一种基于二代测序技术的宏基因组数据分析方法
CN113409321B (zh) * 2021-06-09 2023-10-27 西安电子科技大学 一种基于像素分类和距离回归的细胞核图像分割方法
CN113793647A (zh) * 2021-09-17 2021-12-14 艾德范思(北京)医学检验实验室有限公司 一种基于二代测序宏基因组数据分析装置及方法
CN116230078B (zh) * 2023-05-08 2023-07-07 瑞因迈拓科技(广州)有限公司 一种de novo评估组装基因组污染度的方法

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101914628A (zh) * 2010-09-02 2010-12-15 深圳华大基因科技有限公司 检测基因组目标区域多态性位点的方法及 ***
CN102061526A (zh) * 2010-11-23 2011-05-18 深圳华大基因科技有限公司 一种DNA文库及其制备方法、以及一种检测SNPs的方法和装置

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8068990B2 (en) * 2003-03-25 2011-11-29 Hologic, Inc. Diagnosis of intra-uterine infection by proteomic analysis of cervical-vaginal fluids
CA2791647A1 (en) * 2010-03-01 2011-09-09 Institut National De La Recherche Agronomique Method of diagnostic of inflammatory bowel diseases

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DANA WILLNER ET AL.: "Metagenomic Analysis of Respiratory Tract DNA Viral Communities in Cystic Fibrosis and Non-Cystic Fibrosis Individuals, art e7370", PLOS ONE, vol. 4, no. 10, October 2009 (2009-10-01) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
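CN105420375A (zh) * 2015-12-24 2016-03-23 北京大学 Method for constructing draft genomes of environmental microorganisms
CN105420375B (zh) * 2015-12-24 2020-01-21 北京大学 Method for constructing draft genomes of environmental microorganisms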

Also Published As

Publication number Publication date
HK1207670A1 (zh) 2016-02-05
WO2014019267A1 (en) 2014-02-06
US20150376697A1 (en) 2015-12-31

Similar Documents

Publication Publication Date Title
WO2014019180A1 (zh) Method for determining biomarkers of an abnormal state and ***
CN105368944B (zh) Biomarkers for detecting disease and uses thereof
Grumaz et al. Rapid next-generation sequencing–based diagnostics of bacteremia in septic patients
Qin et al. Alterations of the human gut microbiome in liver cirrhosis
Hiergeist et al. Multicenter quality assessment of 16S ribosomal DNA-sequencing for microbiome analyses reveals high inter-center variability
CN104603283B (zh) Method and system for determining biomarkers related to an abnormal state
US20150211053A1 (en) Biomarkers for diabetes and usages thereof
US10526659B2 (en) Biomarkers for colorectal cancer
US20150242565A1 (en) Method and device for analyzing microbial community composition
CN107217089B (zh) Method and device for determining the state of an individual
Kishikawa et al. A metagenome-wide association study of gut microbiome in patients with multiple sclerosis revealed novel disease pathology
WO2016050110A1 (en) Biomarkers for rheumatoid arthritis and usage thereof
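CN112119167A (zh) Biomarkers for depression and uses thereof
CN110904213A (zh) Gut microbiota-based biomarker for ulcerative colitis and application thereof
CN107217088B (zh) Microbial markers for ankylosing spondylitis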
WO2016008954A1 (en) Gut bacterial species in hepatic diseases
CN111500705A (zh) IgAN gut microbiota markers, IgAN metabolite markers, and applications thereof
WO2023098152A1 (zh) Method for constructing a microbial gene database and ***
EP3374523A1 (en) Biomarkers for prospective determination of risk for development of active tuberculosis
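WO2017156739A1 (zh) Isolated nucleic acid and application thereof
CN113913490A (zh) Marker microorganisms for non-alcoholic fatty liver disease and application thereof
CN105671177B (zh) Ankylosing spondylitis markers and application thereof
CN107217086B (zh) Disease markers and application thereof
CN115331737A (zh) Method for analyzing pathogenic bacteria in gut microbiota and quantifying regional characteristics of the microbiota
WO2017156764A1 (zh) Isolated nucleic acid and application thereof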

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 12882325

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205N DATED 26/06/2015)

122 Ep: pct application non-entry in european phase

Ref document number: 12882325

Country of ref document: EP

Kind code of ref document: A1