CN102369531B - Method for selecting statistically validated candidate genes - Google Patents

Method for selecting statistically validated candidate genes

Info

Publication number
CN102369531B
Authority
CN
China
Prior art keywords
mark
qtl
population
gene
subpopulation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201080015488.5A
Other languages
Chinese (zh)
Other versions
CN102369531A (en)
Inventor
V.K.基肖尔
郭志刚
李珉
王道龙
L.A.谷迪尔莱兹罗贾斯
J.D.V.克拉克
J.拜勒姆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Syngenta Participations AG
Original Assignee
Syngenta Participations AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Syngenta Participations AG
Publication of CN102369531A
Application granted
Publication of CN102369531B

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • A HUMAN NECESSITIES
    • A01 AGRICULTURE; FORESTRY; ANIMAL HUSBANDRY; HUNTING; TRAPPING; FISHING
    • A01H NEW PLANTS OR NON-TRANSGENIC PROCESSES FOR OBTAINING THEM; PLANT REPRODUCTION BY TISSUE CULTURE TECHNIQUES
    • A01H1/00 Processes for modifying genotypes; Plants characterised by associated natural traits
    • A01H1/02 Methods or apparatus for hybridisation; Artificial pollination; Fertility
    • A HUMAN NECESSITIES
    • A01 AGRICULTURE; FORESTRY; ANIMAL HUSBANDRY; HUNTING; TRAPPING; FISHING
    • A01H NEW PLANTS OR NON-TRANSGENIC PROCESSES FOR OBTAINING THEM; PLANT REPRODUCTION BY TISSUE CULTURE TECHNIQUES
    • A01H1/00 Processes for modifying genotypes; Plants characterised by associated natural traits
    • A01H1/04 Processes of selection involving genotypic or phenotypic markers; Methods of using phenotypic markers for selection
    • A01H1/045 Processes of selection involving genotypic or phenotypic markers; Methods of using phenotypic markers for selection using molecular markers
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/40 Population genetics; Linkage disequilibrium

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Genetics & Genomics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Environmental Sciences (AREA)
  • Botany (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Developmental Biology & Embryology (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Ecology (AREA)
  • Physiology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Provided herein are methods for evaluating associations between candidate genes and a trait of interest in a population. The methods include a combination of genome-wide association analysis and one or both of nested association mapping (NAM), and expression QTL analysis (eQTL). Markers are selected or prioritized if they are shown to be positively-correlated with a trait of interest using GWA and a combination of one or both of NAM and eQTL. Also provided are models for evaluating the association between a candidate marker and a trait in a nested population of organisms. These methods include single marker regression and multiple marker regression models. Markers identified using the methods of the invention can be used in marker assisted breeding and selection, as genetic markers for constructing linkage maps, for gene discovery, for identifying genes contributing to a trait of interest, and for generating transgenic organisms having a desired trait.

Description

Method for selecting statistically validated candidate genes
Field of the invention
The present invention relates to molecular genetics, and in particular to methods for evaluating associations between genetic markers and phenotypes in a population.
Background of the invention
Multiple experimental paradigms have been developed to identify and analyze quantitative trait loci (QTL) (see, e.g., Jansen (1996) Trends Plant Sci 1:89). A quantitative trait locus (QTL) is a region of the genome that codes for one or more proteins and that explains a significant proportion of the variability of a given phenotype that may be controlled by multiple genes. Most published reports on QTL mapping in crop species have been based on the use of a bi-parental cross. Typically, these paradigms involve crossing one or more parental pairs, which can be, for example, a single pair derived from two inbred strains, or multiple related or unrelated parents of different inbred strains or lines, each of which exhibits different characteristics relative to the phenotypic trait of interest. Typically, this experimental protocol involves deriving 100 to 300 segregating progeny from a single cross of two divergent inbred lines (e.g., selected to maximize the phenotypic and molecular marker differences between the lines). The parents and the segregating progeny are genotyped for multiple marker loci and evaluated for one to several quantitative traits (e.g., disease resistance). QTL are then identified as significant statistical associations between genotypic values and phenotypic variability among the segregating progeny.
Numerous statistical methods for determining whether markers are genetically linked to a QTL (or to another marker) are known to those of ordinary skill in the art and include, for example, standard linear models such as ANOVA or regression mapping (Haley and Knott (1992) Heredity 69:315), and maximum likelihood methods such as expectation-maximization algorithms (e.g., Lander and Botstein (1989) Genetics 121:185-199; Jansen (1992) Theor. Appl. Genet. 85:252-260; Jansen (1993) Biometrics 49:227-231; Jansen (1994) In J.W. van Ooijen and J. Jansen (eds.), Biometrics in Plant breeding: applications of molecular markers, pp. 116-124, CPRO-DLO, Netherlands; Jansen (1996) Genetics 142:305-311; and Jansen and Stam (1994) Genetics 136:1447-1455). Exemplary statistical methods include single point marker analysis, interval mapping (Lander and Botstein (1989) Genetics 121:185), composite interval mapping, penalized regression analysis, complex pedigree analysis, MCMC analysis, MQM analysis (Jansen (1994) Genetics 138:871), HAPLO-IM+ analysis, HAPLO-MQM analysis, HAPLO-MQM+ analysis, Bayesian MCMC, ridge regression, identity-by-descent analysis, and Haseman-Elston regression.
Dissection of complex traits in many species has essentially relied on two main approaches, linkage analysis and association mapping (Andersson and Georges 2004, Nat. Rev. Genet. 5:202-212; Flint et al. 2005, Nat. Rev. Genet. 6:271-286; Hirschhorn and Daly 2005, Nat. Rev. Genet. 6:95-108). While linkage analysis methods using designed mapping populations have long been in use (Doerge 2002, Nat. Rev. Genet. 3:43-52), methods for association mapping with population-based samples have been developed more recently to overcome hidden population structure or cryptic relatedness within the collected samples (Falush et al. 2003, Genetics 164:1567-1587; Yu et al. 2006, Nat. Genet. 38:203-208). Statistical methods for joint linkage and linkage disequilibrium mapping strategies have been studied for natural populations (Wu and Zeng 2001, Genetics 157:899-909; Wu et al. 2002, Genetics 160:779-792) and examined on crosses of inbred strains with heterogeneous stock (Mott and Flint 2002, Genetics 160:1609-1618). For general complex pedigrees, candidate gene polymorphisms have been identified by fine mapping that combines linkage and linkage disequilibrium information at previously mapped QTL regions (Meuwissen et al. 2002, Genetics 161:373-379; Blott et al. 2003, Genetics 163:253-266). Previous studies of genetic designs with multiple crosses have demonstrated improved power and mapping resolution over a single population (Rebai and Goffinet 1993, Genet. Res. 75:243-247; Xu 1998, Genetics 148:517-524; Rebai and Goffinet 2000, Genet. Res. 75:243-247; Yi and Xu 2002, Genetica 114:217-230; Jansen et al. 2003, Crop Sci. 43:829-834; Li et al. 2005, Genetics 169:1699-1709; Verhoeven et al. 2006, Heredity 96:139-149). However, these studies mainly exploited the linkage information of the multiple crosses.
In humans, the use of genetics to identify genes and pathways associated with traits has followed a very standard paradigm. First, a genome-wide linkage study is carried out using hundreds of genetic markers in family-based data to identify broad regions linked to the trait. The result of this standard type of linkage analysis is the identification of regions controlling the trait, which narrows attention from 30,000+ genes to perhaps as few as 500 to 1000 genes in the specific region of the genome linked to the trait. However, the regions identified by linkage analysis are still far too broad to identify candidate genes associated with the trait. Such linkage studies are therefore typically followed by fine mapping of the linked region using higher-density markers in the linkage region, increasing the number of families in the analysis, and identifying alternative populations for study. These efforts narrow attention further to a smaller genomic region, on the order of 100 genes in the specific region linked to the trait. Even with the more narrowly defined linkage region, the number of genes to be validated is still too large. Research at this stage therefore focuses on identifying candidate genes based on the putative function of the known or predicted genes in the region and the possible relevance of that function to the trait. This approach is problematic because it is limited to what is currently known about the genes. Often, such knowledge is limited and subject to interpretation. As a result, researchers are often led astray and do not identify the genes affecting the trait.
Summary of the invention
The present invention encompasses evaluating or validating associations between candidate genes and a trait of interest in a population. The methods of the invention comprise a unique combination of genome-wide association (GWA) analysis with one or both of nested association mapping (NAM) and expression QTL (eQTL) analysis, in order to analyze and prioritize candidate markers for further implementation or use. Markers are selected if they are shown to be positively correlated with the trait of interest using GWA in combination with one or both of NAM and eQTL.
Further provided are novel regression models for nested association mapping. These methods comprise single marker regression (SMR) and multiple marker regression (MMR) models. In some embodiments, non-informative genotypes are removed before the association between the trait value and the marker genotype is evaluated using the SMR model. In other embodiments, stepwise regression is used to select marker cofactors for inclusion in the MMR model. In various aspects of the invention, a marker is considered for further validation if an association is detected using SMR, MMR, or both.
Markers identified, selected, or validated using the methods of the invention can be used in marker-assisted breeding and selection, as genetic markers for constructing genetic linkage maps, to isolate genomic DNA sequence surrounding a gene-coding or non-coding DNA sequence, to identify genes contributing to the trait of interest, and to generate transgenic organisms having a desired trait.
Brief description of the drawings
Fig. 1 depicts an exemplary flow chart of the steps involved in GWA.
Fig. 2 depicts an exemplary flow chart of the steps involved in NAM.
Fig. 3 depicts a flow chart of the steps involved in combining GWA and NAM to select and prioritize candidate markers for downstream use.
Fig. 4 is an exemplary illustration of selection and prioritization based on overlapping markers identified using NAM (top panel) and GWA (bottom panel).
Fig. 5 shows histograms of three ethanol-related traits for the 600 inbred lines in the inbred panel. The phenotypic data fit a normal distribution well.
Detailed description of the invention
Overview
Estimating the position and effect of quantitative trait loci (QTL) is of central importance for marker-assisted selection. To date, this has been accomplished by classical QTL mapping methods (Lander and Botstein (1989) Genetics 121:185-199). The necessary experiments require phenotyping and genotyping of large mapping populations and are therefore extremely cost- and time-intensive (Parisseaux and Bernardo (2004) Theor Appl Genet 109:508-514).
Described herein is a method for discovering or validating an association between one or more candidate genes and a phenotypic trait of interest. In various embodiments of the invention, markers are selected, validated, or prioritized for downstream use by comparing the positively correlated markers identified using genome-wide association analysis (GWA) with the positively correlated markers identified using other association models such as nested association mapping (NAM) and/or expression QTL (eQTL) analysis. The positively correlated markers identified using GWA and one or both of NAM and eQTL analysis are placed on the physical genetic map of the species under study. Markers identified both by the GWA method and by one or both of NAM and eQTL (i.e., "overlapping" markers) are prioritized for further use. The methods disclosed herein thus help prioritize candidate markers for selection and implementation in downstream processes, increasing the chance of success in developing diagnostic markers for marker-assisted breeding and product development.
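The overlap step can be illustrated with a minimal sketch. The `Hit` record, the `prioritize` function, and the `window_bp` tolerance are hypothetical names introduced here for illustration only; the text does not specify how close two hits must lie on the physical map to count as overlapping, so that tolerance is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """A positively correlated marker placed on the physical map (hypothetical record)."""
    marker: str
    chromosome: str
    position_bp: int


def prioritize(gwa_hits, nam_hits=(), eqtl_hits=(), window_bp=0):
    """Return GWA hits that overlap a NAM and/or eQTL hit, most-supported first.

    Two hits "overlap" when they lie on the same chromosome within window_bp
    base pairs of each other (an assumed criterion, not taken from the text).
    """
    def overlaps(hit, others):
        return any(
            other.chromosome == hit.chromosome
            and abs(other.position_bp - hit.position_bp) <= window_bp
            for other in others
        )

    ranked = []
    for hit in gwa_hits:
        support = []
        if overlaps(hit, nam_hits):
            support.append("NAM")
        if overlaps(hit, eqtl_hits):
            support.append("eQTL")
        if support:  # GWA alone is not enough for prioritization
            ranked.append((hit, tuple(support)))
    # Stable sort: hits confirmed by both methods come before single-method hits.
    ranked.sort(key=lambda item: len(item[1]), reverse=True)
    return ranked
```

A GWA hit with no NAM or eQTL counterpart is dropped from the returned list, which mirrors the requirement that a marker be detected by GWA and by at least one of the other analyses.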
Further provided herein are novel methods for nested association mapping (NAM). NAM is a method for evaluating an association between a candidate marker and a trait of interest in a nested population of organisms. The methods comprise novel single regression models and multiple regression models for evaluating the association between a candidate gene and a trait of interest in a nested population.
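As a rough illustration of what a single-marker regression does, the sketch below fits an ordinary least-squares line of trait value on one marker's genotype code. It is not the specific regression model of the invention, whose terms are not given in this passage. The genotype coding (0 and 2 for the two homozygous classes, `None` for a missing or non-informative call) and the function name are assumptions.

```python
import math


def single_marker_regression(genotypes, phenotypes):
    """Regress trait values on one marker's genotype codes by ordinary least squares.

    Returns (slope, intercept, t_statistic). Observations whose genotype is
    None are dropped before fitting (assumed handling of uninformative calls).
    """
    pairs = [(g, y) for g, y in zip(genotypes, phenotypes) if g is not None]
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least three informative observations")
    mean_g = sum(g for g, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((g - mean_g) ** 2 for g, _ in pairs)
    if sxx == 0:
        raise ValueError("marker is monomorphic in this sample")
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in pairs)
    slope = sxy / sxx                      # estimated marker effect per genotype unit
    intercept = mean_y - slope * mean_g
    sse = sum((y - (intercept + slope * g)) ** 2 for g, y in pairs)
    std_err = math.sqrt(sse / (n - 2) / sxx)
    if std_err == 0:
        # Perfect fit: the t statistic is unbounded unless the slope is zero.
        t_stat = 0.0 if slope == 0 else math.copysign(math.inf, slope)
    else:
        t_stat = slope / std_err
    return slope, intercept, t_stat


slope, intercept, t_stat = single_marker_regression(
    [0, 0, 2, 2, None, 0, 2], [10, 11, 14, 15, 12, 9, 13]
)
```

A positive slope with a large t statistic would suggest a positive correlation between the marker allele and the trait; turning the statistic into a significance call requires a t distribution with n - 2 degrees of freedom, which is left to a statistics package.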
For the purposes of the present invention, a "candidate gene" refers to a gene or genetic element that is tested for an association between that gene and a trait of interest. The candidate gene may be an ortholog of a gene known or suspected to be associated with the trait of interest in a different species. As used herein, the term "associated with," in connection with the relationship between a genetic marker (SNP, haplotype, insertion/deletion, tandem repeat, etc.) and a phenotype, refers to a statistically significant dependence of marker frequency on a quantitative scale or qualitative gradation of the phenotype. A marker is "positively" correlated with a trait when it is linked to the trait and when the presence of the marker indicates that the desired trait or trait form will occur in an organism comprising the marker. A marker is negatively correlated with a trait when it is linked to the trait and when the presence of the marker indicates that the desired trait or trait form will not occur in an organism comprising the marker. For the purposes of the present invention, the term "marker" refers to any genetic element that is tested for association with a trait of interest, and does not necessarily mean that the marker is positively or negatively correlated with the trait of interest.
Thus, a marker is associated with a trait of interest when the marker genotype and the trait phenotype are found together in the progeny of an organism more often than if the marker genotype and the trait phenotype segregated separately. The phrase "phenotypic trait" refers to the appearance or other characteristic of an organism (e.g., a plant or animal) resulting from the interaction of its genome with the environment. The term "phenotype" refers to any visible, detectable, or otherwise measurable property of an organism. The term "genotype" refers to the genetic constitution of an organism. This may be considered in total, or with respect to the alleles of a single gene (i.e., at a given genetic locus).
In some embodiments, the markers are candidate genes or genetic elements directly attributable to the phenotypic trait. For example, a genetic element directly attributable to starch accumulation in a plant may be a gene directly involved in plant starch metabolism. Alternatively, the marker may be found within a genetic locus associated with the phenotypic trait of interest. A "locus" is a chromosomal region where a polymorphic nucleic acid, trait determinant, gene, or marker is located. Thus, for example, a "gene locus" is a specific chromosomal location in the genome of a species where a specific gene can be found. In various embodiments, the markers identified or validated using the methods disclosed herein may be associated with a quantitative trait locus (QTL). The term "quantitative trait locus" or "QTL" refers to a polymorphic genetic locus with at least two alleles that differentially affect the expression of a phenotypic trait in at least one genetic background.
In some aspects, the candidate genes identified or validated using the methods described herein are linked or closely linked to a QTL marker. The phrase "closely linked," as used in this application, means that recombination between two linked loci occurs with a frequency of equal to or less than about 10% (i.e., the loci are separated on a genetic map by not more than 10 cM). In other words, closely linked loci co-segregate at least 90% of the time. Marker loci are especially useful in the present invention when they demonstrate a significant probability of co-segregation (linkage) with a desired trait. In some aspects, these markers can be termed linked QTL markers.
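The 10% threshold can be made concrete with a short sketch. The function names are hypothetical, and the example estimates the recombination frequency simply as the share of recombinant progeny, which is an illustrative assumption rather than a method stated in the text.

```python
def recombination_fraction(recombinant_progeny, total_progeny):
    """Estimate recombination frequency as recombinants / total progeny scored."""
    if total_progeny <= 0 or not 0 <= recombinant_progeny <= total_progeny:
        raise ValueError("counts must satisfy 0 <= recombinants <= total, total > 0")
    return recombinant_progeny / total_progeny


def is_closely_linked(fraction, threshold=0.10):
    """Apply the 'equal to or less than about 10%' criterion from the text."""
    return fraction <= threshold


def cosegregation_rate(fraction):
    """Share of meioses in which the two loci are inherited together."""
    return 1.0 - fraction


rf = recombination_fraction(8, 100)   # 8 recombinants among 100 progeny
closely_linked = is_closely_linked(rf)
```

With 8 recombinants in 100 progeny the estimated frequency is 8%, so the two loci meet the stated criterion and co-segregate about 92% of the time.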
The methods disclosed herein incorporate a number of statistical tests and models that may not be explicitly described herein. Detailed descriptions of standard statistical tests can be found in basic statistics textbooks, such as Dixon, W.J. et al., Introduction to Statistical Analysis, New York, McGraw-Hill (1969) or Steel, R.G.D. et al., Principles and Procedures of Statistics: with Special Reference to the Biological Sciences, New York, McGraw-Hill (1960). A number of software programs for statistical analysis are also known to those of ordinary skill in the art.
Populations of interest
The methods of the invention comprise identifying or validating candidate markers by performing genome-wide association analysis (GWA) on a population of organisms (e.g., plants or animals), and comparing any positively correlated markers from the GWA analysis with markers determined, using one or both of nested association mapping (NAM) and expression QTL (eQTL) analysis, to be positively correlated with the trait of interest in organisms of the same species. When a candidate marker shows a positive correlation in the GWA analysis and in at least one other association analysis method (e.g., at least one of eQTL analysis, NAM, or AEA), the candidate marker is prioritized for further use or implementation (e.g., marker-assisted breeding, transgenic plant development, etc.). It is not necessary to use the same mapping population for each analysis, so long as the populations used for all of the studies consist of organisms of the same species.
Most published reports on QTL mapping in crop species have been based on the use of bi-parental crosses (Lynch and Walsh (1997) Genetics and Analysis of Quantitative Traits, Sinauer Associates, Sunderland). Typically, this experimental protocol involves deriving 100 to 300 segregating progeny from a single cross of two divergent inbred lines (e.g., selected to maximize the phenotypic and molecular marker differences between the lines). The segregating progeny are genotyped for multiple marker loci and evaluated for one to several quantitative traits in several environments. QTL are then identified as significant statistical associations between genotypic values and phenotypic variability among the segregating progeny.
The methods disclosed herein are useful for discovering or validating marker:trait associations in any population. The term "population" or "population of organisms" indicates a group of organisms of the same species, for example, a group from which samples are taken for evaluation and/or from which individual members are selected for breeding purposes. The population members in which the markers are assessed need not be identical to the population members ultimately selected for breeding to obtain progeny (e.g., progeny used for subsequent cycles of analysis). Although the methods disclosed herein are exemplified and described primarily using plant populations, they are equally applicable to animal populations, e.g., humans and non-human animals such as laboratory animals, domesticated livestock, companion animals, etc.
In embodiments of the invention, the population of organisms (e.g., a plant population) comprises or consists of populations resulting from crosses between one or more founder lines and a single common parent line. In various embodiments, the single common parent line is a tester line. The phrase "tester line" refers to a line that is unrelated to and genetically different from the set of lines to which it is crossed. Using a tester in a sexual cross allows one of ordinary skill in the art to determine the association of the phenotypic trait with expression of quantitative trait loci in a hybrid combination. The phrase "hybrid combination" refers to the process of crossing a single tester to multiple lines. The purpose of producing such crosses is to evaluate the ability of the lines to produce desirable phenotypes in hybrid progeny derived from the line by the tester cross.
The progeny of the crosses between the founder lines and the tester line undergo multiple rounds of "selfing" to generate a population that segregates in a Mendelian fashion for all genes. This mapping population is referred to herein as a "nested population" and is useful for practicing particular embodiments of the nested association mapping (NAM) methods of the invention (e.g., the novel NAM methods described herein). Recombinant inbred lines (RILs) (genetically related lines; usually > F5, developed from continuously selfing F2 lines towards homozygosity) can be used as a mapping population. Because all loci are homozygous or nearly so, information obtained from dominant markers can be maximized by using RILs. Under conditions of tight linkage (i.e., about < 10% recombination), dominant and co-dominant markers evaluated in RIL populations provide more information per individual than either marker type in backcross populations (Reiter et al., Proc. Natl. Acad. Sci. (U.S.A.) 89:1477-1481 (1992)).
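Why lines beyond F5 are "homozygous or nearly so" follows from standard Mendelian expectations: each round of selfing halves the expected share of heterozygous loci. The sketch below assumes an F1 that is heterozygous at every locus where its two inbred parents differ, no selection, and no genotyping error; the function name is hypothetical.

```python
def expected_heterozygosity(generation):
    """Expected fraction of segregating loci still heterozygous in generation F_n.

    Assumes the F1 is heterozygous at every locus where the inbred parents
    differ and that every later generation is produced by selfing.
    """
    if generation < 1:
        raise ValueError("generation must be 1 (F1) or greater")
    return 0.5 ** (generation - 1)


for n in (2, 5, 6):
    het = expected_heterozygosity(n)
    print(f"F{n}: {het:.4f} heterozygous, {1 - het:.4f} homozygous")
```

Under these assumptions an F5 line is expected to be homozygous at 93.75% of the loci that segregated in the cross, and an F6 line at about 96.9%.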
In the context of the present invention, the term "crossed" or "cross" means the fusion of gametes via pollination to produce progeny (e.g., cells, seeds, or plants). The term encompasses both sexual crosses (the pollination of one plant by another, or the fertilization of one gamete by another) and selfing (self-pollination, e.g., when the pollen and ovule are from the same plant). The phrase "hybrid" refers to organisms that result from a cross between genetically different individuals. The phrase "inbred" refers to organisms derived from a cross between genetically related individuals. In the context of the present invention, the term "line" refers to a family of related plants derived by self-pollinating an inbred plant. The term "progeny" refers to the descendants of a particular organism (e.g., a self-crossing plant) or pair of organisms (e.g., through a sexual cross). The descendants can be, for example, of the F1, the F2, or any subsequent generation.
The methods disclosed herein further encompass crosses between a tester line and an elite line. An "elite line" or "elite strain" is an agronomically superior line that has resulted from many cycles of breeding and selection for superior agronomic performance. In contrast, an "exotic strain" or "exotic germplasm" is a strain or germplasm derived from an organism not belonging to an available elite line or strain of germplasm. Numerous elite lines are available and known to those of ordinary skill in the art of breeding. An "elite population" is an assortment of elite individuals or lines that can be used to represent the state of the art in terms of agronomically superior genotypes of a given species. Similarly, an "elite germplasm" or elite strain of germplasm is an agronomically superior germplasm, typically derived from and/or capable of giving rise to an organism with superior agronomic performance. The term "germplasm" refers to genetic material of or from an individual (e.g., a plant or animal), a group of individuals (e.g., a plant line, variety, or family), or a clone derived from a line, variety, species, or culture. The germplasm can be part of an organism or cell, or can be separate from the organism or cell. In general, germplasm provides genetic material with a specific molecular makeup that provides a physical foundation for some or all of the hereditary qualities of an organism or cell culture.
In some cases, a population can include parental organisms as well as one or more progeny derived from the parental organisms. In some cases, a plant population is derived from a single bi-parental cross, e.g., a population of progeny of a cross between two parents. Alternatively, a population includes members derived from two or more crosses involving the same or different parents. The population may consist of recombinant inbred lines, backcross lines, testcross lines, and the like.
In various embodiments, the population is a plant population consisting of early-stage breeding material. By "early-stage" breeding material it is intended that the plants are in the F2 to F3 generation. The advantages of using early-stage breeding material are that the quantity of breeding material available is large; phenotypic data are available for the breeding lines; and the mapping results can directly aid selection. In the early stages of breeding, multiple lines are measured at multiple locations.
Because early breeding stages involve evaluating a large number of progeny derived from multiple crosses, they provide the phenotypic data necessary to identify and validate markers for a wide range of agronomic traits. By integrating marker analysis into an existing breeding program, the power, precision, and accuracy associated with a large number of progeny can be obtained. Furthermore, inferences about marker associations can be drawn across the breeding program rather than being limited to a sample of progeny from a single cross.
Backcross populations (e.g., generated from a cross between a successful variety (the recurrent parent) and another variety (the donor parent) carrying a trait not present in the former) can be used as a mapping population. A series of backcrosses to the recurrent parent can be made to recover most of its desirable traits. A population is thus created consisting of individuals that are nearly like the recurrent parent, but in which each individual carries varying amounts, or a mosaic, of genomic regions from the donor parent. Backcross populations can be useful for mapping dominant markers if all loci in the recurrent parent are homozygous and the donor and recurrent parents have contrasting polymorphic marker alleles (Reiter et al., Proc. Natl. Acad. Sci. (U.S.A.) 89:1477-1481 (1992)). Information obtained from backcross populations using either codominant or dominant markers is less than that obtained from F2 populations because one, rather than two, recombinant gametes are sampled per plant. Backcross populations, however, are more informative (at low marker saturation) than RILs, because the distance between linked loci increases in RIL populations (i.e., about 0.15% recombination). Increased recombination can be beneficial for resolving tight linkages, but may be undesirable in the construction of maps with low marker saturation.
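The statement that backcross individuals are "nearly like the recurrent parent" can be quantified with the usual expectation that each backcross halves the average donor share of the genome. The sketch below assumes no selection for or against donor segments; the function name is hypothetical, and individual plants will vary around this average.

```python
def expected_donor_fraction(n_backcrosses):
    """Average fraction of the genome inherited from the donor parent.

    n_backcrosses = 0 is the F1 (half donor); each backcross to the
    recurrent parent halves the expected donor contribution.
    """
    if n_backcrosses < 0:
        raise ValueError("number of backcrosses cannot be negative")
    return 0.5 ** (n_backcrosses + 1)


for n in range(4):
    label = "F1" if n == 0 else f"BC{n}"
    print(f"{label}: {expected_donor_fraction(n):.4f} donor genome on average")
```

Under this assumption the average donor share falls from 50% in the F1 to 6.25% after three backcrosses.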
In another embodiment, the population consists of inbred plants that are classified into pedigrees according to common parents. A "pedigree structure" defines the relationship between a descendant and each ancestor that gave rise to that descendant. A pedigree structure can span one or more generations, describing relationships between the descendant and its parents, grandparents, great-grandparents, etc.
In yet another embodiment, existing mapping populations can be used to identify or validate markers. For example, the mapping population described in Yu et al. (2008) Genetics 178:539-551 (incorporated herein by reference in its entirety) is particularly useful for the NAM approach. Other published or privately held mapping populations may be suitable for the methods disclosed herein.
The methods of the present invention are applicable to virtually any plant population or species. Preferred plants include agronomically and horticulturally important species, including, for example: crops producing edible flowers, such as cauliflower (Brassica oleracea), artichoke (Cynara scolymus), and safflower (Carthamus, e.g. tinctorius); fruits, such as apple (Malus, e.g. domesticus), banana (Musa, e.g. acuminata), berries (such as the currant, Ribes, e.g. rubrum), cherries (such as the sweet cherry, Prunus, e.g. avium), cucumber (Cucumis, e.g. sativus), grape (Vitis, e.g. vinifera), lemon (Citrus limon), melon (Cucumis melo), nuts (such as the walnut, Juglans, e.g. regia; peanut, Arachis hypogaea), orange (Citrus, e.g. maxima), peach (Prunus, e.g. persica), pear (Pyrus, e.g. communis), pepper (Solanum, e.g. capsicum), plum (Prunus, e.g. domestica), strawberry (Fragaria, e.g. moschata), and tomato (Lycopersicon, e.g. esculentum); leaves, such as alfalfa (Medicago, e.g. sativa), sugar cane (Saccharum), cabbages (such as Brassica oleracea), endive (Cichorium, e.g. endivia), leek (Allium, e.g. porrum), lettuce (Lactuca, e.g. sativa), spinach (Spinacia, e.g. oleracea), and tobacco (Nicotiana, e.g. tabacum); roots, such as arrowroot (Maranta, e.g. arundinacea), beet (Beta, e.g. vulgaris), carrot (Daucus, e.g. carota), cassava (Manihot, e.g. esculenta), turnip (Brassica, e.g. rapa), radish (Raphanus, e.g. sativus), yam (Dioscorea, e.g. esculenta), and sweet potato (Ipomoea batatas); seeds, such as bean (Phaseolus, e.g. vulgaris), pea (Pisum, e.g. sativum), soybean (Glycine, e.g. max), wheat (Triticum, e.g. aestivum), barley (Hordeum, e.g. vulgare), corn (Zea, e.g. mays), and rice (Oryza, e.g. sativa); grasses, such as Miscanthus grass (Miscanthus, e.g. giganteus) and switchgrass (Panicum, e.g. virgatum); trees, such as poplar (Populus, e.g. tremula) and pine (Pinus); shrubs, such as cotton (e.g. Gossypium hirsutum); and tubers, such as kohlrabi (Brassica, e.g. oleracea) and potato (Solanum, e.g. tuberosum); and the like. The variety associated with any given population can be a transgenic variety, a non-transgenic variety, or any genetically modified variety. Alternatively, plants of a given species occurring naturally in the wild can also be used.
Genetic marker
Although specific DNA sequences that encode proteins are generally well conserved across a species, other regions of DNA (typically non-coding) tend to accumulate polymorphisms and are therefore variable between individuals of the same species. Such regions provide the basis for numerous molecular genetic markers.
In the methods disclosed herein, after one or more populations have been generated or selected, genotype values for a plurality of markers are obtained for a plurality of members of the population. A genotype value corresponds to a quantitative or qualitative measure of the genetic marker. The term "marker" refers to an identifiable DNA sequence that is variable (polymorphic) among different individuals within a population and that facilitates the study of the inheritance of a trait or a gene. A marker at the DNA sequence level is linked to a specific chromosomal location unique to an individual's genotype and is inherited in a predictable manner.
The genetic marker is typically a DNA sequence that has a specific location on a chromosome and that can be measured in a laboratory. The term "genetic marker" can also be used to refer to, for example, a cDNA and/or an mRNA encoded by a genomic sequence, as well as to that genomic sequence. To be useful, a marker must have two or more alleles or variants. Markers can be either direct, that is, located within the gene or locus of interest, or indirect, that is, closely linked with the gene or locus of interest (presumably because they are located close to, but not inside, the gene or locus of interest). Moreover, markers can also include sequences that either do or do not modify the amino acid sequence encoded by the gene in which they are located.
In general, any differentially inherited polymorphic trait (including nucleic acid polymorphisms) that segregates among progeny is a potential marker. The term "polymorphism" refers to the presence in a population of two or more allelic variants. The terms "allele", "allelic" and "marker variant" refer to the variation present at a defined position within a marker or specific marker sequence: in the case of a SNP, this is the actual nucleotide present; for an SSR, it is the number of repeat sequences; for a peptide sequence, it is the actual amino acid present; in the case of a marker haplotype, it is the combination of two or more individual marker variants in a specific combination. An "associated allele" refers to an allele at a polymorphic locus that is associated with a particular phenotype of interest. Such allelic variants include sequence variation at a single base, for example a single nucleotide polymorphism (SNP). A polymorphism can be a single nucleotide difference present at a locus, or can be an insertion or deletion of one, a few or many consecutive nucleotides. It will be recognized that, while the methods of the invention are exemplified primarily by the detection of SNPs, these methods or others known in the art can similarly be used to identify other types of polymorphisms, which typically involve more than one nucleotide.
Genomic variability can be of any origin, for example, insertions, deletions, duplications, repetitive elements, point mutations, recombination events, or the presence and sequence of transposable elements. The marker can be measured directly as a DNA sequence polymorphism, such as a single nucleotide polymorphism (SNP), restriction fragment length polymorphism (RFLP) or short tandem repeat (STR), or indirectly as a DNA sequence variant, such as a single-strand conformation polymorphism (SSCP). A marker can also be a variant at the level of a DNA-derived product, such as an RNA polymorphism/abundance, a protein polymorphism or a cell metabolite polymorphism, or any other biological characteristic that has a direct relationship with the underlying DNA variant or gene product.
Two types of markers are frequently used in mapping and marker-assisted breeding protocols, namely simple sequence repeat (SSR, also known as microsatellite) markers and single nucleotide polymorphism (SNP) markers. The term SSR refers generally to any type of molecular heterogeneity that results in length variability, and most typically is a short (up to several hundred base pairs) segment of DNA that consists of multiple tandem repeats of a two or three base-pair sequence. These repeated sequences result in highly polymorphic DNA regions of variable length due to poor replication fidelity, e.g., caused by polymerase slippage. SSRs appear to be randomly dispersed through the genome and are generally flanked by conserved regions. SSR markers can also be derived from RNA sequences (in the form of a cDNA, a partial cDNA or an EST) as well as from genomic material.
In one embodiment, the molecular marker is a single nucleotide polymorphism. Various techniques have been developed for the detection of SNPs, including allele-specific hybridization (ASH; see, e.g., Coryell et al., (1999) Theor. Appl. Genet., 98:690-696). Additional types of molecular markers are also widely used, including but not limited to expressed sequence tags (ESTs) and SSR markers derived from EST sequences, amplified fragment length polymorphism (AFLP), randomly amplified polymorphic DNA (RAPD) and isozyme markers. A wide range of protocols for detecting this variability are known to one of ordinary skill in the art, and these protocols are frequently specific for the type of polymorphism they are designed to detect. Examples include PCR amplification, single-strand conformation polymorphism (SSCP) and self-sustained sequence replication (3SR; see Chan and Fox, Reviews in Medical Microbiology 10:185-196).
DNA for marker analysis can be collected and screened in any convenient tissue, such as cells, seed or tissues from which new plants may be grown, or plant parts, such as leaves, stems, pollen, or cells, that can be cultured into a whole plant. In some embodiments, marker data are obtained from a tissue that is associated with the trait under study. In some embodiments of the invention, marker data are measured from multiple tissues of each plant under study. A sufficient number of cells is obtained to provide a sufficient amount of sample for analysis, although only a minimal sample size is needed where scoring is by amplification of nucleic acids. The DNA, RNA or protein can be isolated from the cell sample by standard isolation techniques known to those of ordinary skill in the art.
In one embodiment, the genotype values correspond to the values obtained for essentially all, or all, of the SNPs of a high-density, whole-genome SNP map. The advantage of this approach over traditional approaches is that, because it encompasses the whole genome, it identifies potential interactions of genomic products expressed from genes located anywhere on the genome, without requiring preexisting knowledge of a possible interaction between the genomic products. An example of a high-density, whole-genome SNP map is a map of at least about 1 SNP per 10,000 kb, at least 1 SNP per 500 kb or about 10 SNPs per 500 kb, or at least about 25 SNPs or more per 500 kb. The definition of marker density may change across the genome and is determined by the degree of linkage disequilibrium within a genome region.
In addition, a number of genetic marker screening platforms are now commercially available and can be used to obtain the genetic marker data required for the present methods. In many instances, these platforms take the form of genetic marker testing arrays (microarrays), which allow the simultaneous testing of many thousands of genetic markers. For example, the number of genetic markers these arrays can test is greater than 1,000, greater than 1,500, greater than 2,500, greater than 5,000, greater than 10,000, greater than 15,000, greater than 20,000, greater than 25,000, greater than 30,000, greater than 35,000, greater than 40,000, greater than 45,000, greater than 50,000 or greater than 100,000, greater than 250,000, greater than 500,000, greater than 1,000,000, greater than 5,000,000, greater than 10,000,000 or greater than 15,000,000. Examples of such commercially available products are those marketed by Affymetrix Inc (www.affymetrix.com) or Illumina (www.illumina.com). In one embodiment, the genotype values are obtained from at least 2 genetic markers.
It will be appreciated that, due to the nature of this information, filtering or preprocessing of the data and quality control of the data may be required. For example, marker data may be excluded according to particular criteria (e.g., data duplication or low frequency; see, for example, Zenger et al. (2007) Anim Genet. 38(1):7-14). Examples of such filtering are described below, although other methods of filtering data that would be readily understood by the skilled person can also be employed to obtain the working data set on which marker associations are determined.
In one embodiment, marker data are excluded from the analysis when the allele frequency of a particular marker is less than about 0.01 or less than about 0.05. "Allele frequency" or "marker allele frequency" (MAF) refers to the frequency (proportion or percentage) at which an allele is present at a locus within an individual, within a line, or within a population of lines. For example, for an allele "A", diploid individuals of genotype "AA", "Aa" or "aa" have allele frequencies of 1.0, 0.5 or 0.0, respectively. One can estimate the allele frequency within a line by averaging the allele frequencies of a sample of individuals from that line. Similarly, one can calculate the allele frequency within a population of lines by averaging the allele frequencies of the lines that make up the population. For a population with a finite number of individuals or lines, an allele frequency can be expressed as a count of the individuals or lines (or any other specified grouping) containing the allele.
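The averaging scheme and exclusion threshold described above can be sketched as follows. This is a minimal illustration only: the function names and the example line names are hypothetical, and applying the threshold to the rarer of two alleles at a biallelic marker is an assumption, not something the text specifies.

```python
from statistics import mean

def individual_frequency(genotype: str, allele: str = "A") -> float:
    """Allele frequency within one diploid individual: AA -> 1.0, Aa -> 0.5, aa -> 0.0."""
    return genotype.count(allele) / len(genotype)

def line_frequency(genotypes: list[str], allele: str = "A") -> float:
    """Average the individual frequencies over a sample of individuals from one line."""
    return mean(individual_frequency(g, allele) for g in genotypes)

def population_frequency(lines: dict[str, list[str]], allele: str = "A") -> float:
    """Average the line frequencies over the lines that make up the population."""
    return mean(line_frequency(g, allele) for g in lines.values())

def keep_marker(lines: dict[str, list[str]], allele: str = "A",
                threshold: float = 0.05) -> bool:
    """Hypothetical filter: drop the marker if its rarer allele falls below threshold.

    Assumes a biallelic marker, so the other allele has frequency 1 - f.
    """
    f = population_frequency(lines, allele)
    return min(f, 1.0 - f) >= threshold

# Hypothetical genotype calls at one marker for two lines.
example = {"line_1": ["AA", "Aa"], "line_2": ["aa", "aa"]}
print(population_frequency(example), keep_marker(example))
```

The threshold argument can be set to 0.01 or 0.05 to match either cutoff mentioned in the text.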
In various embodiments, the set of markers evaluated for a particular trait of interest can be random markers as described above, or can be markers that have been shown, or are suspected, to be associated with the trait of interest in a different plant species or line. A large number of molecular markers for different species are known in the art and can be validated in different species using the methods disclosed herein. For example, a set of candidate genes identified in maize based on the molecular function and/or performance of the candidate genes can be tested in soybean. Thus, the models described herein are useful for validating the effect of these candidate genes in different plant species. When a set of candidate markers is evaluated, random markers with no known association are generally also included in the analysis.
Traits of interest
The methods of the present invention are applicable to any phenotype with an underlying genetic component, i.e., any heritable trait. A "trait" is a characteristic of an organism that manifests itself in a phenotype, and refers to a biological, performance or any other measurable characteristic or characteristics, which can be any entity that can be quantified in, or from, a biological sample or organism, and which can then be used either alone or in combination with one or more other quantified entities. A "phenotype" is an outward appearance or other visible characteristic of an organism and refers to one or more traits of an organism.
Many different traits can be inferred by the methods disclosed herein. A phenotype can be observable to the naked eye, or by any other means of evaluation known in the art, e.g., microscopy, biochemical analysis, genomic analysis, an assay for a particular disease resistance, etc. In some cases, a phenotype is directly controlled by a single gene or genetic locus, i.e., a "single gene trait". In other cases, a phenotype is the result of several genes. A "quantitative trait locus" (QTL) is a genetic region that is polymorphic and affects a phenotype that can be described in quantitative terms, e.g., height, weight, oil content, days to germination, disease resistance, etc., and that can therefore be assigned a "phenotypic value" corresponding to a quantitative value for the phenotypic trait.
For any trait, a "relatively high" characteristic indicates greater than average, and a "relatively low" characteristic indicates less than average. For example, "relatively high yield" indicates a more abundant plant yield than the average yield for a particular plant population. Conversely, "relatively low yield" indicates a less abundant plant yield than the average yield for a particular plant population.
In the context of an exemplary plant breeding program, quantitative phenotypes include: yield (e.g., grain yield, silage yield), stress resistance (e.g., to mid-season stress, terminal stress, moisture stress, heat stress, etc.), disease resistance, insect resistance, resistance to density, kernel number, kernel size, ear size, ear number, pod number, number of seeds per pod, maturity, time to flower, heat units to flower, days to flower, root lodging resistance, stalk lodging resistance, ear height, grain moisture content, test weight, starch content, grain composition, starch composition, oil composition, protein composition, nutraceutical content, and the like.
In addition, the following phenotypic values can be correlated with the marker of interest: color, size, shape, skin thickness, pulp density, pigment content, oil deposits, protein content, enzyme activity, lipid content, sugar and starch content, chlorophyll content, minerals, salt content, pungency, aroma and flavor, and other such features. For each of these indices, the distribution of the parameter for a sample is determined by measuring the feature (e.g., weight) associated with each item in the sample and then determining the mean and standard deviation of that distribution.
Similarly, the methods are equally applicable to traits that are continuously variable, e.g., grain yield, height, oil content, response to stress (e.g., terminal or mid-season stress) and the like; to meristic traits that are multi-categorical but can be analyzed as if they were continuously variable, e.g., days to germination, days to flowering or days to fruiting; and to traits that are distributed in a discontinuous (interrupted) or discrete manner. It will be understood, however, that analogous or otherwise unique traits can be characterized using the methods described herein within any organism of interest.
In addition to phenotypes that are directly assessable by the naked eye, with or without the assistance of one or more manual or automated devices (including, e.g., microscopes, scales, rulers, calipers, etc.), many phenotypes can be assessed using biochemical and/or molecular means. For example, oil content, starch content, protein content and nutraceutical content, as well as their constituent components, can be assessed, optionally following one or more separation or purification steps, using one or more chemical or biochemical assays. Molecular phenotypes, such as metabolite profiles or expression profiles (at either the protein or the RNA level), are also amenable to evaluation according to the methods of the present invention. For example, metabolite profiles (whether of small-molecule metabolites or of large biomolecules produced by a metabolic pathway) supply valuable information about phenotypes of agronomic interest. Such metabolite profiles can be evaluated as direct or indirect measures of a phenotype of interest. Similarly, expression profiles can serve as indirect measures of a phenotype, or can themselves serve directly as the phenotype subject to analysis for purposes of marker correlation. Expression profiles are frequently evaluated at the level of RNA expression products, e.g., in an array format, but can also be evaluated at the protein level using antibodies or other binding proteins.
In addition, in some circumstances it is desirable to employ a mathematical relationship between phenotypic attributes rather than correlating marker information independently with multiple phenotypes of interest. For example, the ultimate goal of a breeding program may be to obtain crop plants that produce high yield under low-water (i.e., drought) conditions. Rather than independently correlating markers with yield and with resistance to low-water conditions, a mathematical indicator of the yield and of the stability of yield over water conditions can be correlated with markers. Such a mathematical indicator can take forms including: a statistically derived index value based on weighted contributions of values from a number of individual traits, or a variable that is a component of a crop growth and development model or an ecophysiological model (referred to collectively as crop growth models) of plant trait responses across multiple environmental conditions. These crop growth models are known in the art and have been used to study the effects of genetic variation for plant traits and to map QTL for plant trait responses. See the references by Hammer et al. 2002. European Journal of Agronomy 18:15-31, Chapman et al. 2003. Agronomy Journal 95:99-113, and Reymond et al. 2003. Plant Physiology 131:664-675.
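One possible form of an index value based on weighted contributions from several traits is sketched below. The text does not say how the index is derived; standardizing each trait to z-scores before weighting, the trait names and the weights are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize one trait across plants (assumes the trait is not constant)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def index_values(traits, weights):
    """Weighted sum of standardized traits, one index value per plant.

    traits:  {trait_name: [value for each plant, in the same plant order]}
    weights: {trait_name: weight}  (hypothetical weights chosen by the analyst)
    """
    standardized = {name: z_scores(vals) for name, vals in traits.items()}
    n_plants = len(next(iter(traits.values())))
    return [
        sum(weights[name] * standardized[name][i] for name in traits)
        for i in range(n_plants)
    ]

# Hypothetical yields of three plants under two water regimes.
traits = {"yield_irrigated": [10.0, 12.0, 14.0], "yield_drought": [4.0, 8.0, 6.0]}
weights = {"yield_irrigated": 0.5, "yield_drought": 0.5}
print(index_values(traits, weights))
```

The resulting index can then be treated as a single phenotypic value per plant when testing marker associations.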
Association analysis
The methods disclosed herein involve comparing positively associated markers that have been identified or validated by multiple linkage analysis strategies. In various embodiments, a genome-wide association (GWA) mapping strategy is used to identify markers. The positively associated markers are placed on the physical genetic map of the species under test. Positively associated markers identified or validated by other methods (e.g., eQTL analysis or NAM) are placed on the same physical map. A candidate marker is selected for further use if it has been identified or validated by GWA and by one or both of eQTL and NAM.
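The selection rule above amounts to a set operation, sketched below. Matching markers by identical identifier is a simplifying assumption; the text speaks of placing markers on a common physical map, which in practice may involve comparing positions. The marker names are hypothetical.

```python
def select_candidates(gwa, eqtl=(), nam=()):
    """Markers positively associated by GWA and also by eQTL and/or NAM."""
    return set(gwa) & (set(eqtl) | set(nam))

# Hypothetical marker identifiers from three analyses.
gwa_hits = {"m01", "m07", "m12", "m20"}
eqtl_hits = {"m07", "m33"}
nam_hits = {"m12", "m07", "m40"}

print(sorted(select_candidates(gwa_hits, eqtl_hits, nam_hits)))
```

Markers found only by GWA (here `m01` and `m20`) or only by the other methods are not carried forward under this rule.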
Genetic data have been used in the field of trait analysis in an attempt to identify the genes that affect such traits. A key development in these efforts has been the development of large collections of molecular/genetic markers, which can be used to construct detailed genetic maps of species. These maps are used in quantitative trait locus (QTL) mapping methods such as single-marker mapping, interval mapping, composite interval mapping and multiple-trait mapping. QTL mapping methods provide a statistical analysis of the association between phenotypes and genotypes for the purpose of understanding and dissecting the regions of a genome that affect traits.
Association mapping employs markers within candidate genes, which are genes thought to be functionally involved in development of the trait on the basis of information such as biochemistry, physiology, transcriptional profiling and reverse-genetic experiments in model organisms. In its simplest definition, association mapping is the practical application of linkage disequilibrium (also referred to as gametic phase disequilibrium) in natural populations to identify markers with significant allelic differences between individuals having the trait of interest and individuals not displaying the trait of interest. Genome-wide association (GWA) involves rapidly scanning markers across the complete (or nearly complete) sets of DNA, or genomes, of the organisms in the population to find genetic variations associated with a particular trait. A statistical association between genotypes at a marker locus and the trait of interest is considered to be evidence of close physical linkage between the marker and the QTL controlling the trait (Pritchard et al., 2000).
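As a rough illustration of linkage disequilibrium between two biallelic loci, the sketch below computes the disequilibrium coefficient D and the derived statistic r² from phased two-locus haplotypes. The text does not name a particular LD statistic; r² is a common choice and is used here as an assumption. The allele labels and haplotypes are hypothetical.

```python
def ld_r_squared(haplotypes, allele_1="A", allele_2="B"):
    """r^2 between two biallelic loci from phased two-locus haplotypes.

    Each haplotype is a 2-character string: the allele at locus 1, then at locus 2.
    """
    n = len(haplotypes)
    p1 = sum(h[0] == allele_1 for h in haplotypes) / n          # freq. of allele_1
    p2 = sum(h[1] == allele_2 for h in haplotypes) / n          # freq. of allele_2
    p12 = sum(h == allele_1 + allele_2 for h in haplotypes) / n  # freq. of both together
    d = p12 - p1 * p2                                           # disequilibrium coefficient D
    denom = p1 * (1 - p1) * p2 * (1 - p2)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d * d / denom

# Alleles always travel together: complete disequilibrium.
print(ld_r_squared(["AB", "AB", "ab", "ab"]))
# All four haplotypes equally frequent: no disequilibrium.
print(ld_r_squared(["AB", "Ab", "aB", "ab"]))
```

A marker in strong disequilibrium with a QTL allele tends to show the phenotype differences that association mapping looks for, whereas a marker in equilibrium with it does not.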
While classical gene mapping approaches are useful for genome-wide scans for multiple loci controlling quantitative traits, association mapping has emerged as a tool for the precise estimation of QTL position. For example, this approach has been used in medical genetics to identify genes for complex traits (Lander and Schork, 1994; Risch, 2000), and its application is gradually moving to other fields such as plant genetics. Because association mapping uses natural populations, many generations (and therefore many meioses) have elapsed, and recombination will have removed the association between a QTL and any marker not tightly linked to it. Association mapping thus allows much finer mapping than standard biparental cross approaches.
Marker data across regularly defined intervals of the genome under study, or within gene regions of interest, are used to monitor segregation or detect association in the population of interest. In some embodiments, these regularly defined intervals are expressed in Morgans or, more typically, in centimorgans (cM). A Morgan is a unit that expresses the genetic distance between markers on a chromosome. One Morgan is defined as the distance on a chromosome over which one recombination event is expected to occur per gamete per generation. In some embodiments, each regularly defined interval is less than 100 cM. In other embodiments, each regularly defined interval is less than 10 cM, less than 5 cM, less than 2.5 cM, less than 2 cM, less than 1.5 cM or less than 1 cM.
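A small sketch of checking marker spacing against one of the interval limits listed above. The marker positions are hypothetical, and treating the "interval" as the gap between adjacent markers on one chromosome is an assumption.

```python
CM_PER_MORGAN = 100.0  # one Morgan equals 100 centimorgans

def morgans_to_cm(morgans):
    """Convert a genetic distance from Morgans to centimorgans."""
    return morgans * CM_PER_MORGAN

def interval_lengths(positions_cm):
    """Gaps (in cM) between adjacent markers on one chromosome."""
    ordered = sorted(positions_cm)
    return [b - a for a, b in zip(ordered, ordered[1:])]

def intervals_within(positions_cm, max_cm):
    """True if every gap between adjacent markers is less than max_cm."""
    return all(length < max_cm for length in interval_lengths(positions_cm))

# Hypothetical marker positions (cM) on one chromosome.
positions = [0, 4, 13, 9]
print(interval_lengths(positions))      # gaps after sorting the positions
print(intervals_within(positions, 10))  # every gap below 10 cM?
print(intervals_within(positions, 5))   # every gap below 5 cM?
```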
Linkage models for genome-wide association
The goal of genetic mapping is to identify simply inherited markers in close proximity to genetic factors affecting quantitative traits, that is, to identify QTL. This localization relies on processes that create a statistical association between marker and QTL alleles and on processes that selectively reduce that association as a function of the marker's distance from the QTL. Several types of well-known statistical analysis can be used to infer marker/trait associations from phenotype/genotype data, but the basic idea is to detect markers, i.e., polymorphisms, whose alternative genotypes have significantly different average phenotypes. For example, if a given marker locus A has three alternative genotypes (AA, Aa and aa), and if those three classes of individuals have significantly different phenotypes, then one infers that locus A is associated with the trait. The significance of differences in phenotype may be tested by several types of standard statistical test, such as linear regression of phenotype on marker genotype or analysis of variance (ANOVA). A genetic map is produced by placing genetic markers in genetic (linear) map order so that the positional relationships between markers are understood.
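The single-marker comparison of the AA, Aa and aa classes can be sketched with a one-way ANOVA F statistic. To keep the example self-contained, significance is judged here with a permutation test rather than the F distribution; that substitution, the function names and the phenotype values are assumptions made for illustration.

```python
import random
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F statistic for phenotype values grouped by marker genotype."""
    values = [v for g in groups for v in g]
    grand = mean(values)
    means = [mean(g) for g in groups]
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def group_by_genotype(genotypes, phenotypes):
    """Collect phenotype values into one list per genotype class."""
    by_class = {}
    for g, p in zip(genotypes, phenotypes):
        by_class.setdefault(g, []).append(p)
    return list(by_class.values())

def permutation_test(genotypes, phenotypes, n_perm=1000, seed=0):
    """Observed F and the share of phenotype shufflings with F at least as large."""
    observed = f_statistic(group_by_genotype(genotypes, phenotypes))
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any genotype/phenotype association
        if f_statistic(group_by_genotype(genotypes, shuffled)) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical phenotypes (e.g. plant height) for twelve plants at locus A.
genotypes = ["AA"] * 4 + ["Aa"] * 4 + ["aa"] * 4
phenotypes = [10.1, 9.8, 10.3, 9.9, 8.0, 8.2, 7.9, 8.1, 6.1, 5.9, 6.0, 6.2]
f_value, p_value = permutation_test(genotypes, phenotypes)
print(f_value, p_value)
```

A large F with a small permutation p-value indicates that the three genotype classes differ in average phenotype, which under the reasoning above would be taken as evidence that locus A is associated with the trait.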
Many known programs can be used to perform association analysis according to this aspect of the invention. One such program is MapMaker/QTL, which is the companion program to MapMaker and is the original QTL mapping software. MapMaker/QTL analyzes marker data using standard interval mapping. Another such program is QTL Cartographer, which performs single-marker regression, interval mapping (Lander and Botstein, Id.), multiple interval mapping and composite interval mapping (Zeng, 1993, PNAS 90:10972-10976; and Zeng, 1994, Genetics 136:1457-1468). QTL Cartographer permits analysis of F2 or backcross populations. QTL Cartographer is available from statgen.ncsu.edu/qtlcart/cartographer.html (North Carolina State University). Another program that can be used is Qgene, which performs QTL mapping by either single-marker regression or interval regression (Martinez and Curnow 1994 Heredity 73:198-206). Using Qgene, a number of different population types (all derived from inbreeding) can be analyzed. Qgene is available from www.qgene.org. Yet another program is MapQTL, which conducts standard interval mapping (Lander and Botstein, Id.), multiple QTL mapping (MQM) (Jansen, 1993, Genetics 135:205-211; Jansen, 1994, Genetics 138:871-881) and nonparametric mapping (Kruskal-Wallis rank test). MapQTL can analyze a variety of pedigree types, including outbred pedigrees (cross-pollinators). MapQTL is available from Plant Research International, P.O. Box 16, 6700 AA Wageningen, The Netherlands (www.plant.wageningen-ur.nl/default.asp?section=products). Yet another program that may be used in some embodiments is Map Manager QT, which is a QTL mapping program (Manly and Olson, 1999, Mamm Genome 10:327-334). Map Manager QT conducts single-marker regression analysis, regression-based simple interval mapping (Haley and Knott, 1992, Heredity 69, 315-324), composite interval mapping (Zeng 1993, PNAS 90:10972-10976) and permutation tests. A description of Map Manager QT is provided by reference to Manly and Olson, 1999, Mammalian Genome 10:327-334.
Yet another program that can be used to perform linkage analysis is MultiCross QTL, which maps QTL from crosses originating from inbred lines. MultiCross QTL uses a linear regression model approach and handles different methods, such as interval mapping, all-marker mapping and multiple QTL mapping with cofactors. The program can handle a wide variety of simple mapping populations for inbred and outbred species. MultiCross QTL is available from Unite de Biometrie et Intelligence Artificielle, 31326 Castanet Tolosan, France.
Yet another program that can be used to perform linkage analysis is QTL Café. The program can analyze most populations derived from pure-line crosses, such as F2 crosses, backcrosses, recombinant inbred lines and doubled haploid lines. QTL Café incorporates a Java implementation of Haley & Knott flanking-marker regression as well as marker regression, and can handle multiple QTL. The program allows three types of QTL analysis: single-marker ANOVA, marker regression (Kearsey and Hyne, 1994, Theor. Appl. Genet., 89:698-702) and interval mapping by regression (Haley and Knott, 1992, Heredity 69:315-324). QTL Café is available from web.bham.ac.uk/g.g.seaton/.
Yet another program that can be used to perform linkage analysis is MAPL, which performs QTL analysis by either interval mapping (Hayashi and Ukai, 1994, Theor. Appl. Genet. 87:1021-1027) or analysis of variance. Different population types can be analyzed, including F2, backcross, and recombinant inbreds derived from F2 or backcross after a given number of generations of selfing. Automatic grouping and ordering of a large number of markers by metric multidimensional scaling is possible. MAPL is available from Yasuo Ukai, Institute of Statistical Genetics on the Internet (ISGI), web.bham.ac.uk/g.g.seaton/.
Another program that can be used for linkage analysis is R/qtl. This program provides an interactive environment for mapping QTL in experimental crosses. R/qtl uses hidden Markov model (HMM) technology to handle missing genotype data. R/qtl implements many HMM algorithms, with allowance for the presence of genotyping errors, for backcrosses, intercrosses, and phase-known four-way crosses. R/qtl includes facilities for estimating genetic maps, identifying genotyping errors, and performing single-QTL genome scans and two-QTL, two-dimensional genome scans by interval mapping, Haley-Knott regression, and multiple imputation. R/qtl is available from Karl W. Broman, Johns Hopkins University, biosun01.biostat.jhsph.edu/.about.kbroman/qtl/.
The Java-based software TASSEL (Trait Analysis by aSSociation, Evolution and Linkage) can be used to determine marker-trait associations. See Yu et al. (2005) Nature Genetics 38:203-208, incorporated herein by reference. TASSEL allows linkage disequilibrium statistics to be calculated and visualized graphically. TASSEL can merge data from different sources into a single analysis data set, impute missing data using the k-nearest neighbor algorithm (Cover and Hart (1967) Proc IEEE Trans Inform Theory 13), and perform principal component analysis (PCA) to reduce a set of correlated phenotypes. The open source code for the TASSEL software package is available at sourceforge.net/projects/tassel.
TASSEL can be used together with the Quantitative Inbred Pedigree Disequilibrium Test (QIPDT). QIPDT is a test for family-based association mapping with inbred lines from plant breeding programs. See Stich et al. (2006) Theor Appl Genet 113:1121-1130, incorporated herein by reference. QIPDT is a QTL detection method for data routinely collected in plant breeding programs. QIPDT is a family-based association test applicable to the genotypic information of parental inbred lines and the genotypic and phenotypic information of their offspring inbreds. QIPDT extends QPDT, a family-based association test. Nuclear families, each consisting of two parental inbred lines and at least one offspring inbred line, can be combined into an extended pedigree (the basis of QIPDT) if the parental lines of different nuclear families are related. As in the pedigree disequilibrium test, QIPDT also takes into account the correction of Martin et al. (2001) Am J Hum Genet 68:1065-1067.
An improved regression model, QIPDT2, can also be used. QIPDT2 uses the same approach as QIPDT1 for marker coding and phenotype adjustment, with two improvements: 1) a regression model is fitted to the marker and phenotype data, which allows estimation of the genetic effect and the phenotypic contribution of the marker in question; and 2) the method is extended to inbred hybrids (with different testers grown at multiple locations), whereas the original method is applicable only to inbreds. This extension is achieved by extracting the genetic values of the inbreds from a mixed model that accounts for tester effects and non-genetic effects (such as location). QIPDT2 is described in U.S. Patent Application No. 12/328,689, filed December 4, 2008.
Other commercially available statistical packages commonly used for such analyses include SAS Enterprise Miner (SAS Institute Inc., Cary, N.C.) and Splus (Insightful Corporation, Cambridge, Mass.). Those skilled in the art will appreciate that several other programs and algorithms can be used for the steps of the methods of the invention that require quantitative genetic analysis, and all such programs and algorithms are within the scope of the invention.
Nested association mapping
In various embodiments, candidate genes are identified or validated by comparing the positively correlated markers identified using GWA with the positively correlated markers identified using nested association mapping (NAM), and selecting for further use any marker that shows a positive correlation using both methods. The NAM strategy addresses complex trait analysis at a fundamental level by generating a common mapping resource that enables researchers to efficiently employ genetic, genomic, and systems biology tools.
Building on the genetic principles of earlier genome mapping strategies and methods (Meuwissen et al. 2002, Genetics 161:373-379; Mott and Flint 2002, Genetics 160:1609-1618; Darvasi and Shifman 2005, Nat. Genet. 37:118-119), NAM has the advantages of lower sensitivity to genetic heterogeneity and higher power, together with higher efficiency in the use of genome sequence or dense markers, while still maintaining the high allele richness that results from diverse founders. NAM creates a specially designed integrated mapping population for high-power genome-wide scans for quantitative trait loci (QTL) with effects of different sizes.
The first step in NAM involves selecting diverse founders and generating a large set of related mapping progeny. In various embodiments, the related progeny consist of a set of recombinant inbred lines (RILs) derived from crosses between a single common parent and a diverse set of founder lines. These RILs are created by multiple rounds of selfing. During RIL development, the shuffling of the genomes of the two parents of each cross, together with the joint analysis across all RILs from the multiple crosses, systematically minimizes the effects of the genetic backgrounds of the parental founders on the mapping of individual QTL. In general, the nested strategy of projecting sequence information onto informative markers (from the most connected individuals to the remaining individuals) is applicable to a wide range of species, including humans, mouse, Arabidopsis, and rice.
The founder lines are then either fully sequenced or densely genotyped, and a smaller number of tag markers are genotyped on both the founders and the progeny in order to define the inheritance of chromosome segments and to project the high-density marker information from the founders onto the progeny. The progeny are phenotyped for various traits, and genome-wide association analysis is performed to relate the phenotypic traits to the projected high-density markers of the progeny. See Yu et al. 2008, Genetics 178:539-551.
As in association mapping, the mapping resolution provided by NAM depends largely on the linkage disequilibrium (LD) among the founder individuals. Previous studies of sequenced maize candidate genes across diverse lines have shown rapid decay of LD over 2000 bp (Wilson et al. 2004, Plant Cell 16:2719-2733). Recent genome-wide analyses of diverse accessions of Arabidopsis (Nordborg et al. 2005, PLoS Biol. 3:e196) and of breeds of dog (Canis familiaris) (Lindblad-Toh et al. 2005, Nature 438:803-819) are consistent with this pattern: LD decays rapidly across genetically diverse germplasm. With the NAM strategy, this advantage in resolution is fully exploited without the accompanying disadvantage (the need for good candidate genes or a large number of markers) by projecting genomic information from the founders onto the RILs.
Models for NAM
In various embodiments of the invention, the NAM strategy used to identify or validate candidate genes employs a regression model to detect associations between a trait of interest and markers. In statistics, regression analysis is a collective name for techniques for modeling and analyzing numerical data consisting of values of a dependent variable (response variable) and of one or more independent variables (explanatory variables). The dependent variable in the regression equation is modeled as a function of the independent variables, the corresponding parameters ("constants"), and an error term. The error term is treated as a random variable. It represents unexplained variation in the dependent variable. The parameters are estimated so as to give a "best fit" of the data. Most commonly, the best fit is evaluated by the least squares method, but other criteria have also been used.
The least squares method can be interpreted as a method of fitting data. The best fit in the least squares sense is the instance of the model for which the sum of squared residuals has its minimum value, a residual being the difference between an observed value and the value given by the model. Least squares corresponds to the maximum likelihood criterion if the experimental errors have a normal distribution, and it can also be derived as a method of moments estimator. Regression analysis is available in most statistical software packages.
For purposes of the invention, any suitable regression method can be used to identify QTL in the nested population. Exemplary regression models are described herein. Two novel regression models (SMR and MMR) are further provided, which can be used to identify, validate, or prioritize a marker associated with a trait of interest for downstream use.
Single-marker regression (SMR):
A novel single-marker regression (SMR) tool for nested association mapping is provided herein. The method is similar to the single-marker regression used in standard QTL linkage analysis, with two key modifications. The first is the incorporation of polygenic background information from each subpopulation into the model. In doing so, the genetic variation caused by different genetic backgrounds can be separated out in the model, thereby improving QTL mapping efficiency. At the same time, including genetic background information removes the effect of population stratification on QTL mapping, which keeps the false positive discovery rate to a minimum. The second improvement over existing methods is the removal of marker data from distorted populations. This feature allows the model to avoid the effect of marker segregation distortion on QTL detection, which can pose a challenge in association mapping. The model further benefits from the experimental design of NAM, which is a combination of linkage and association mapping. The invention uses a unique linear model to describe the relationship between trait values and marker genotypes, which can be written as:
$$y_{ij} = \mu + x_{ij} a + g_i u_i + e_{ij}$$
where $y_{ij}$ is the phenotypic value of individual j in subpopulation i; $\mu$ is the overall mean; $a$ is the additive effect of the QTL; $g_i$ is the indicator variable of subpopulation i; $u_i$ is the effect of subpopulation i; and $e_{ij}$ is the residual, assumed to follow a normal distribution with mean 0 and variance $\sigma^2$. According to the invention, the genotype $x_{ij}$ is defined as 1 if individual j carries the allele from the common parent, and as -1 if individual j carries the allele from the other parent. This definition rests on there being only two different alleles for each marker. To exploit the simplicity of regression, the genetic background effect $u_i$ is assumed to be a fixed effect. As used herein, the term "fixed effect" preferably refers to seasonal, spatial, geographic, environmental, or management influences that cause a systematic effect on the phenotype, or to effects with levels deliberately arranged by the experimenter, or to the effect of a gene or marker that is consistent across the population being evaluated. Thus, the invention includes the genetic background effect $u_i$ in the model to account for the influence of population stratification, and thereby reduces the residual variance.
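The SMR model above can be illustrated with a minimal sketch, which is not the patented implementation. It assumes a single marker, complete data, and normally distributed residuals, and it absorbs the overall mean and the fixed subpopulation effects by centering the phenotype and genotype within each subpopulation, which gives the same least squares estimate of the additive effect as fitting subpopulation indicator variables. The function name `smr_fit` and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def smr_fit(y, x, subpop):
    """Fit y_ij = mu + x_ij*a + u_i + e_ij by least squares for one marker.

    y: phenotypes; x: genotypes coded +1 (common-parent allele) or -1;
    subpop: subpopulation label of each individual.
    Returns (a_hat, lod). Hypothetical helper, not from the source.
    """
    # Group individual indices by subpopulation.
    groups = defaultdict(list)
    for idx, label in enumerate(subpop):
        groups[label].append(idx)

    # Centering within each subpopulation absorbs mu and the fixed
    # subpopulation effects u_i (equivalent to fitting indicator variables).
    yc = [0.0] * len(y)
    xc = [0.0] * len(y)
    for members in groups.values():
        y_mean = sum(y[i] for i in members) / len(members)
        x_mean = sum(x[i] for i in members) / len(members)
        for i in members:
            yc[i] = y[i] - y_mean
            xc[i] = x[i] - x_mean

    sxx = sum(v * v for v in xc)
    if sxx == 0.0:
        raise ValueError("marker does not segregate in any subpopulation")
    a_hat = sum(u * v for u, v in zip(xc, yc)) / sxx

    # Residual sums of squares without (H0) and with (H1) the QTL effect.
    rss_reduced = sum(v * v for v in yc)
    rss_full = sum((v - a_hat * u) ** 2 for u, v in zip(xc, yc))

    # Likelihood ratio statistic under normal errors, converted to LOD.
    lr = len(y) * math.log(rss_reduced / rss_full)
    return a_hat, lr / (2.0 * math.log(10.0))


# Toy data: two subpopulations with different background means.
y = [10.0, 12.0, 9.5, 12.5, 20.0, 22.5, 19.5, 22.0]
x = [-1, 1, -1, 1, -1, 1, -1, 1]
subpop = ["A"] * 4 + ["B"] * 4
a_hat, lod = smr_fit(y, x, subpop)
print(a_hat, lod)
```

Centering is used here only to keep the sketch free of matrix code; a general regression routine with explicit subpopulation indicators would serve equally well.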
This SMR method differs from the original marker-based regression method for NAM (Yu et al. (2006) Nature Genetics 38(2):203-208) in its use of polymorphic markers. From NAM marker data, it is readily seen that some markers are polymorphic in some subpopulations but not in others. In that case, including non-informative markers can lead to segregation distortion of the marker genotypes at that locus, and this distortion can reduce QTL mapping efficiency, power, and precision. To avoid this problem, the invention uses a marker filtering step incorporated into the SMR model to reduce the potential risk caused by marker distortion. In this marker filtering step, only phenotype and genotype data from those subpopulations with segregating marker genotypes are included in each analysis. Thus, in various embodiments of the invention, subpopulations with non-informative genotypes are removed before SMR analysis. This step enables SMR to identify alleles present in the NAM population at very low frequency (less than 5%).
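The marker filtering step can be sketched as follows. This is an illustrative reading of the text, assuming that a subpopulation counts as informative for a marker when more than one genotype class is observed in it, and that missing genotype calls are represented by `None`. The function names are hypothetical.

```python
from collections import defaultdict


def informative_subpopulations(x, subpop):
    """Return the labels of subpopulations in which the marker segregates."""
    genotypes = defaultdict(set)
    for g, label in zip(x, subpop):
        if g is not None:  # None marks a missing genotype call (assumption)
            genotypes[label].add(g)
    return {label for label, seen in genotypes.items() if len(seen) > 1}


def filter_marker_data(y, x, subpop):
    """Keep only individuals from subpopulations where the marker segregates."""
    keep = informative_subpopulations(x, subpop)
    rows = [(yi, xi, si) for yi, xi, si in zip(y, x, subpop)
            if si in keep and xi is not None]
    return ([r[0] for r in rows], [r[1] for r in rows], [r[2] for r in rows])


# Subpopulation "B" is monomorphic for this marker and is dropped.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [1, -1, 1, 1, 1, -1]
subpop = ["A", "A", "B", "B", "C", "C"]
y_f, x_f, s_f = filter_marker_data(y, x, subpop)
print(y_f, x_f, s_f)
```

The filter is applied per marker, so a subpopulation excluded for one marker can still contribute to the analysis of another.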
Composite interval mapping
When multiple linked QTL are present, current single-marker and interval methods often place a QTL in the wrong position, for example by producing a ghost QTL at a position between two real QTL. One approach to handling multiple QTL is to modify standard interval mapping to include additional markers (also referred to herein as "covariates") as cofactors in the analysis. In general, the use of cofactors reduces the bias and sampling error of estimated QTL positions (Utz and Melchinger, Biometrics in Plant Breeding, Proceedings of the Ninth Meeting of the Eucarpia Section Biometrics in Plant Breeding, The Netherlands, 1994). Suitable unlinked markers can partly account for the segregation variance generated by unlinked QTL, while the effects of linked QTL can be reduced by including markers linked to the interval of interest. This general approach of adding marker cofactors to an otherwise standard interval analysis is commonly referred to as "composite interval mapping" or CIM, and it results in a substantial increase in the power to detect QTL and in the precision of estimated QTL positions.
By modifying standard interval mapping to include additional markers as cofactors in the analysis, CIM can handle multiple QTL by combining multilocus marker information from the organism. Among these approaches, one uses a subset of marker loci as covariates for interval mapping. These markers serve as proxies for other QTL and increase the resolution of interval mapping by accounting for linked QTL and reducing the residual variance. Exemplary CIM is described, for example, in Jansen, 1993, Genetics 135, p. 205; Zeng, 1994, Genetics 136, p. 1457, each of which is incorporated herein by reference in its entirety.
Other models can be used. Many modifications of, and alternatives to, interval mapping have been reported, including the use of nonparametric methods (Kruglyak and Lander, Genetics, 121:1421-1428, 1995). Multiple regression methods or models can also be used, in which the trait is regressed on a large number of markers (Jansen et al., Theor. Appl. Genet., 91:33-37, 1995; Weber and Wricke, Advances in Plant Breeding, Blackwell, 1994).
Multiple-marker regression (MMR)
To account for the influence of other QTL, a novel multiple-marker regression (MMR) method is described herein. This method uses cofactor markers to absorb the effects of other QTL. The linear model for MMR is:
$$y_{ij} = \mu + x_{ij} a + \sum_{k=1}^{m} c_{ijk} b_k + g_i u_i + e_{ij}$$
where $y_{ij}$ is the phenotypic value of individual j in subpopulation i; $\mu$ is the overall mean; $x_{ij}$ is the genotype of the QTL; $a$ is the additive effect of the QTL; $c_{ijk}$ is cofactor marker k for individual j in subpopulation i, and $b_k$ is the effect of cofactor marker k; $g_i$ is the indicator variable of subpopulation i; $u_i$ is the effect of subpopulation i; and $e_{ij}$ is the residual, assumed to follow a normal distribution with mean 0 and variance $\sigma^2$. This MMR model is similar to composite interval mapping (Zeng 1993, 1994, supra).
A key issue with CIM is the choice of suitable marker loci to serve as covariates; once these have been chosen, CIM turns the model selection problem into a one-dimensional scan. Prior to the invention, the selection of marker cofactors had not been resolved. In the invention, stepwise regression is used to select cofactor markers based on a significance level of 0.01. The linear model used to select cofactors is:
$$y_{ij} = \mu + x_{ij} a + c_{ijk} b_k + g_i u_i + e_{ij}$$
This stepwise regression model differs from the model used for conventional composite interval mapping (Zeng 1993, 1994) and from the original model used for NAM (Yu et al. 2008). In this MMR model, stepwise regression is applied to the NAM population with the genetic backgrounds of the different subpopulations included in the model. This selection method focuses on selecting those QTL that have stable effects across multiple subpopulations. It therefore effectively reduces the number of cofactors included in the model and avoids the problem of oversaturation.
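Cofactor selection at a significance level of 0.01 can be sketched as below. The text specifies stepwise regression but not its details, so this sketch substitutes plain forward selection, which is a simplification: it never removes a marker once added. It also assumes that each candidate is tested with a likelihood ratio statistic referred to a chi-square distribution with one degree of freedom, and that subpopulation background effects are absorbed by centering within subpopulations. All function names are hypothetical.

```python
import math
from collections import defaultdict


def center_within(values, subpop):
    """Subtract each subpopulation's mean (absorbs mu and the u_i terms)."""
    groups = defaultdict(list)
    for i, label in enumerate(subpop):
        groups[label].append(i)
    out = [0.0] * len(values)
    for members in groups.values():
        mean = sum(values[i] for i in members) / len(members)
        for i in members:
            out[i] = values[i] - mean
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def select_cofactors(y, markers, subpop, alpha=0.01):
    """Forward selection of cofactor markers; returns selected marker names.

    markers: dict mapping marker name -> list of genotypes coded +1/-1.
    """
    n = len(y)
    resid = center_within(y, subpop)
    cols = {m: center_within(g, subpop) for m, g in markers.items()}
    selected = []
    while cols:
        rss = dot(resid, resid)
        if rss <= 0.0:
            break
        # Drop in residual sum of squares if each candidate were added.
        gains = {m: dot(c, resid) ** 2 / dot(c, c)
                 for m, c in cols.items() if dot(c, c) > 1e-12}
        if not gains:
            break
        best = max(gains, key=gains.get)
        rss_new = rss - gains[best]
        if rss_new <= 1e-12 * rss:
            p_value = 0.0
        else:
            lr = n * math.log(rss / rss_new)
            # Upper tail of a chi-square with 1 df: P(X > lr).
            p_value = math.erfc(math.sqrt(lr / 2.0))
        if p_value >= alpha:
            break
        chosen = cols.pop(best)
        selected.append(best)
        cc = dot(chosen, chosen)
        b = dot(chosen, resid) / cc
        resid = [r - b * ci for r, ci in zip(resid, chosen)]
        # Orthogonalize remaining candidates against the chosen cofactor.
        for m in list(cols):
            proj = dot(cols[m], chosen) / cc
            cols[m] = [vi - proj * ci for vi, ci in zip(cols[m], chosen)]
    return selected
```

Orthogonalizing the remaining candidates after each step keeps every gain equal to the reduction in residual sum of squares that the candidate would give in the full multiple regression, so no matrix inversion is needed.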
With cofactor markers, it is possible to obtain a much clearer LOD profile from MMR than from SMR. The cofactor markers are used to reduce the residual error and thereby increase the significance of the QTL hypothesis test. The novel MMR model provided herein shows the ability to separate closely linked QTL and to locate QTL within narrow genomic regions. In various embodiments, all genotype data from all subpopulations are used for data analysis.
It is contemplated that SMR and MMR will give similar results for markers without segregation distortion, although differences may appear for markers with distorted genotype segregation. Thus, in some embodiments, SMR and MMR are used in combination as complementary analyses of a NAM data set. SMR and MMR can be run separately on the same set of trait phenotype data and marker data. The results obtained from each method can then be compared. For those QTL not consistently identified by both SMR and MMR, a marker segregation analysis can be performed to determine whether the inconsistency is caused by marker distortion. If the distortion of the marker genotypes follows a trend related to the trait, it can cause a real QTL to be missed (a false negative), or it can cause a false QTL to be detected. For those QTL consistently identified by SMR and MMR, marker segregation analysis may be unnecessary. In any case, however, the combined use of SMR and MMR is likely to result in QTL detection with improved power and a reduced false positive rate. Therefore, in this aspect of the invention, identifying positively correlated markers by both SMR and MMR is contemplated.
Testing QTL effects
In general, the goal of an association study is not simply to detect marker/trait associations, but to estimate the location of the genes directly affecting the trait (i.e., QTL) relative to the marker positions. In a simple approach to this goal, the magnitude of the differences among alternative genotypes, or the level of significance of those differences, is compared among marker loci. The trait gene is inferred to be located nearest the marker or markers with the greatest associated genotypic difference. In a more complex analysis, such as interval mapping (Lander and Botstein, Genetics 121:185-199, 1989), each of many positions along the genetic map (for example, at 1 cM intervals) is tested for the likelihood that a QTL is located at that position. The genotype/phenotype data are used to calculate a LOD score (logarithm of the likelihood ratio) for each test position. When the LOD score exceeds a critical threshold value, there is significant evidence for the location of a QTL at that position on the genetic map (which will fall between two particular marker loci).
The hypotheses for testing a QTL effect can be formulated as $H_0: a = 0$ and $H_1: a \neq 0$. The parameters under $H_0$ or $H_1$ are estimated by least squares from the regression model, depending on whether the QTL effect is included in the model. A likelihood ratio (LR) can then be obtained. The likelihood ratio is the ratio of the maximum likelihoods of the result under the two different hypotheses. A likelihood ratio test is a statistical test for deciding between the two hypotheses based on the value of this ratio. Being a function of the data x, the LR is a statistic. The likelihood ratio test rejects the null hypothesis if the value of this ratio is too small. How small is too small depends on the significance level of the test, that is, on what probability of a Type I error is considered tolerable (a Type I error consists of rejecting a null hypothesis that is true).
Lower values of the likelihood ratio mean that the observed result is less likely to occur under the null hypothesis. Higher values mean that the observed result is more likely to occur under the null hypothesis. The LR statistic can be obtained from the regression model as $LR = -2(l_{reduced} - l_{full})$, where $l_{reduced}$ is the log-likelihood of the reduced model, corresponding to $H_0$, and $l_{full}$ is the log-likelihood of the full model, corresponding to $H_1$ (Lander and Botstein 1989). Because this statistic is -2 times the logarithm of the ratio, small values of the ratio correspond to large values of LR.
The logarithm of odds (LOD) score is calculated from the LR. A LOD score is a statistical estimate of whether two loci are likely to lie near each other on a chromosome and are therefore likely to be genetically linked. In the present case, the LOD score is a statistical estimate of whether a given position in the genome under study is linked to the quantitative trait corresponding to a given gene. In one embodiment, the LOD score is calculated as LR/(2 ln 10). The LOD score indicates how much more likely the data are to have arisen assuming the presence of a QTL than assuming its absence. The LOD threshold value for avoiding a false positive with a given confidence (for example, 95%) depends on the number of markers and the length of the genome. Graphs indicating LOD thresholds are set forth in Lander and Botstein, Genetics, 121:185-199 (1989), and are further described by Arús and Moreno-González, Plant Breeding, Hayward, Bosemark, Romagosa (eds.) Chapman & Hall, London, pp. 314-331 (1993).
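The two conversions described above, from log-likelihoods to the LR statistic and from the LR statistic to a LOD score, amount to the following arithmetic. The function names and the example log-likelihood values are hypothetical.

```python
import math


def likelihood_ratio(loglik_reduced, loglik_full):
    """LR = -2 * (l_reduced - l_full), with natural-log likelihoods."""
    return -2.0 * (loglik_reduced - loglik_full)


def lod_score(lr):
    """LOD = LR / (2 ln 10), the base-10 log of the likelihood ratio."""
    return lr / (2.0 * math.log(10.0))


# Example: the full model improves the log-likelihood by 5 units.
lr = likelihood_ratio(-105.0, -100.0)
print(lr, lod_score(lr))
```

Dividing by 2 ln 10 simply converts twice the natural-log difference into a base-10 logarithm, so a LOD of 3 means the data are 1000 times more likely under the full model than under the reduced model.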
In general, a LOD score of three or more suggests that two loci are genetically linked, a LOD score of four or more is strong evidence that two loci are genetically linked, and a LOD score of five or more is very strong evidence that two loci are genetically linked. However, depending on the model used, the significance of any given LOD score in fact varies from species to species.
Permutation test for NAM
The original multiple regression method for NAM (Yu et al. 2008) uses a very low significance level of $10^{-7}$ as the threshold for QTL detection. That method is not suitable for determining the LOD threshold at a given significance level, particularly when based on a dense linkage map. To address this problem, the invention provides a novel permutation test method for determining empirical LOD thresholds at given significance levels of 0.05 and 0.01. The permutation method of the invention shuffles the phenotypic values within each subpopulation without destroying the structure of the subpopulations or the correlations among different traits of interest. To accomplish this, SMR and MMR are performed on the randomized phenotype data and the original marker data, and the maximum LOD score across all markers in the genome is then computed. This analysis is repeated 1000 times, and the maximum LOD score from each analysis is recorded. Finally, these LOD scores are sorted in ascending order. The LOD value at position (1-α)*n, where n is the number of permutations, is the empirical LOD threshold at significance level α. In some embodiments, the threshold at 0.01 may not be stable because of the limited number of permutations. Therefore, 10000 permutations are recommended at this significance level. It should be understood, however, that different numbers of permutations are possible while still obtaining the desired significance level. For example, about 2000, about 3000, about 4000, about 5000, about 6000, about 7000, about 8000, about 9000, or more permutations can be performed.
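The permutation procedure can be sketched as follows. The sketch assumes that each individual's phenotype record (possibly holding several traits) is shuffled as a unit among individuals of the same subpopulation, so that subpopulation structure and between-trait correlations are preserved, and that a caller-supplied function `max_lod` (hypothetical) runs the genome scan and returns the maximum LOD score for one permuted data set. Rounding the position (1-α)*n to the nearest integer is also an assumption.

```python
import random
from collections import defaultdict


def permute_within_subpops(phenotypes, subpop, rng):
    """Shuffle phenotype records among individuals of the same subpopulation.

    Each record moves as a unit, so correlations among traits stored in the
    same record are preserved.
    """
    groups = defaultdict(list)
    for i, label in enumerate(subpop):
        groups[label].append(i)
    permuted = list(phenotypes)
    for members in groups.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        for target, source in zip(members, shuffled):
            permuted[target] = phenotypes[source]
    return permuted


def empirical_lod_threshold(phenotypes, subpop, max_lod, alpha=0.05,
                            n_perm=1000, seed=0):
    """Empirical LOD threshold at significance level alpha.

    max_lod: hypothetical callable that scans all markers (for example with
    SMR or MMR) for the given phenotypes and returns the maximum LOD score.
    """
    rng = random.Random(seed)
    maxima = sorted(
        max_lod(permute_within_subpops(phenotypes, subpop, rng))
        for _ in range(n_perm)
    )
    # Value at position (1 - alpha) * n in the ascending list (1-based).
    position = int(round((1.0 - alpha) * n_perm))
    position = min(n_perm, max(1, position))
    return maxima[position - 1]
```

With 1000 permutations and α = 0.05 the threshold is the 950th smallest maximum LOD; with α = 0.01 only ten permuted maxima lie above the threshold, which is why more permutations are suggested at that level.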
Expression QTL analysis
Also encompassed by the invention is a further method for prioritizing candidate genes for downstream applications: the combination of GWA with digital gene expression (DGE) technology, in which the resolution of eQTL further prioritizes genes for implementation or validation. Certain marker discovery/genotyping platforms are designed in such a way that a sample provides an expression profile for a sufficient number of markers together with the genotype of each marker for GWA (for example, the Solexa SNP discovery/genotyping platform).
Thus, classical QTL analysis is combined with gene expression profiling, i.e., by DNA microarrays. Such expression QTL (eQTL) describe cis- and trans-acting regulatory elements for the expression of genes associated with the trait of interest. These methods can determine the relationship between markers on a linkage map and the expression of one or more markers used to identify statistically significant QTL. Expression can be monitored under a variety of conditions (such as developmental stage, environmental exposure, and the like) and correlated with the trait of interest. Any association method described herein or known to those skilled in the art (such as, but not limited to, single-point ANOVA, simple regression, interval mapping, composite interval mapping, and NAM) can be used to determine such a relationship.
Thus, eQTL analysis begins with gene expression data (for example, from gene expression studies or proteomics studies) and genotype data from the population under study. In one aspect of the invention, the expression level of a gene in an organism in the population of interest is determined by measuring the amount of at least one cellular constituent that corresponds to the gene in one or more cells of the organism. As used herein, the term "cellular constituent" includes individual genes, proteins, mRNA expressing a gene, and/or any other variable cellular component or protein activity, or degree of protein modification (for example, phosphorylation), such as is typically measured in biological experiments by those skilled in the art.
The expression level of a nucleotide sequence in a gene can be measured by any high-throughput technique. However measured, the result is either the absolute or the relative amount of transcript or response data, including, but not limited to, values representing abundances or abundance ratios. The expression profile can be measured by hybridization to transcript arrays (for example, "transcript arrays" or "profiling arrays"). Transcript arrays can be used to analyze the expression profile in a cell sample, and especially to measure the expression profile of a cell sample of a particular tissue type or developmental stage, or of a cell type exposed to particular environmental conditions.
The expression data are transformed into expression statistics, which are used to treat the abundance of each cellular constituent in the gene expression data as a quantitative trait. Then, for each gene in a plurality of genes expressed by the organisms in the population, a quantitative trait locus (QTL) analysis is performed using a genetic marker map to produce QTL data. A set of expression statistics representing a quantitative trait is used in each QTL analysis.
Expression statistics typically used as quantitative traits in the analysis include, but are not limited to, mean log ratio, log intensity, and background-corrected intensity. Other types of expression statistics can also be used as quantitative traits. For example, the transformation can be performed using a normalization module. In such embodiments, the expression levels of a plurality of genes in each organism under study are normalized. Any normalization routine can be used. Representative normalization routines include, but are not limited to, Z-score of intensity, median intensity, log median intensity, Z-score standard deviation log of intensity, Z-score mean absolute deviation of log intensity, user normalization gene set, calibration DNA gene set, ratio median intensity correction, and intensity background correction. In addition, combinations of normalization routines can be run.
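As one illustration of such a normalization routine, a Z-score of log intensity can be computed as below. This is a minimal sketch that assumes positive raw intensities, a base-2 logarithm, and the sample standard deviation; the text does not fix any of these choices, and the function name is hypothetical.

```python
import math
import statistics


def z_scores_of_log_intensity(intensities):
    """Return Z-scores of log2 intensities across the genes of one sample."""
    logs = [math.log2(v) for v in intensities]  # requires intensities > 0
    mean = statistics.mean(logs)
    sd = statistics.stdev(logs)  # sample standard deviation (assumption)
    return [(v - mean) / sd for v in logs]


# Three hypothetical gene intensities from one organism.
print(z_scores_of_log_intensity([1.0, 2.0, 4.0]))
```

The normalized values for each gene across the population can then serve as the quantitative trait in the QTL analysis.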
In the past ten years, several technologies have made it possible to monitor the expression levels of a large number of transcripts at any one time (see, e.g., Schena et al., 1995, Science 270:467-470; Lockhart et al., 1996, Nature Biotechnology 14:1675-1680; Blanchard et al., 1996, Nature Biotechnology 14, 1649; U.S. Patent No. 5,569,588). For example, digital gene expression (DGE) can be used to measure expression. DGE provides a hypothesis-free, global, and quantitative analysis of the entire transcriptome. The application analyzes the expression levels of virtually all genes by counting the number of individual mRNA molecules produced from each gene. The genes need not be identified and characterized before an experiment is performed. DGE platforms are commercially available, for example from Helicos BioSciences (Cambridge, Mass.) and Illumina, Inc. (San Diego, Calif.).
The ability to perform genome-wide single nucleotide polymorphism (SNP) analysis has made it possible to carry out GWA studies for identifying common trait variants. Genome-wide study of the epigenome, that is, of non-sequence information that is heritable during cell division, has lagged behind. Part of the reason is the heterogeneity of epigenetic regulatory elements, such as DNA methylation and multiple chromosome modifications. Standard array-based gene expression analysis that does not distinguish alleles may reveal epigenetic changes at individual genes, or may simply reflect dynamic changes in gene expression mediated by trans-acting regulatory components (such as transcription factors). The ability to distinguish the allele-specific expression (ASE) of the two alleles of a gene can reveal changes in epigenetic regulation, because the two alleles are subject to the same transcription factors but will differ in cis-acting regulatory elements.
Thus, eQTL analysis includes the assessment of allele-specific expression. In principle, standard QTL or marker association methods link a discrete segment of DNA (for example, a haplotype) to a percentage of the phenotypic variance at some level of significance. Typically, the phenotype is a quantitative measurement of plant performance (for example, yield). Similarly, eQTL analysis treats gene expression as a quantitative phenotype that can be associated with discrete segments of DNA. This approach is used to link specific expression profiles to specific locations on the genome, but it fails to account for cis/trans-acting sequences or epigenetic effects on gene expression.
Included herein is a method that further partitions the quantitative expression level of each gene, for each individual of a defined population, into haplotype-based ranges of expression values. For example, if gene ABC has 8 haplotypes, each haplotype is assigned an expression range based on the collective expression of that haplotype in each individual across the population. Subsequent association analyses can then be carried out among haplotype expression, the sequence haplotype, and the quantitative phenotype.
Put it briefly, these results of such analysis disclose one of Three models: (1) each haplotype monogenic has the expression scope of its uniqueness, and this may indicate the gene expression of cis acting allele-specific; (2) each haplotype monogenic has identical expression scope, and this may indicate the conservative adjustment of discussed gene; Or (3) monogenic special haplotype has multiple expression scope, this may indicate trans-acting allele-specific to express or epigenetic regulation.
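By way of non-limiting illustration, the three patterns can be distinguished with a short Python sketch. The function names (`expression_ranges`, `classify_gene`), the gap-based rule for splitting expression values into ranges, and the use of range overlap as a stand-in for "the same expression range" are assumptions made for this example only; the method described above does not prescribe how expression ranges are delimited.

```python
def expression_ranges(values, gap):
    """Split expression values into ranges wherever sorted neighbours differ by more than `gap`.

    `gap` is a hypothetical tuning parameter, not part of the described method.
    """
    vals = sorted(values)
    ranges = [[vals[0], vals[0]]]
    for v in vals[1:]:
        if v - ranges[-1][1] > gap:
            ranges.append([v, v])      # start a new expression range
        else:
            ranges[-1][1] = v          # extend the current range
    return [tuple(r) for r in ranges]


def classify_gene(expr_by_haplotype, gap):
    """Return 1, 2, or 3 for the three patterns, or None if no pattern applies cleanly."""
    ranges = {h: expression_ranges(v, gap) for h, v in expr_by_haplotype.items()}
    # Pattern 3: at least one haplotype shows more than one expression range.
    if any(len(r) > 1 for r in ranges.values()):
        return 3
    spans = sorted(r[0] for r in ranges.values())
    # Pattern 1: every haplotype has its own range (no two ranges overlap).
    if all(a[1] < b[0] for a, b in zip(spans, spans[1:])):
        return 1
    # Pattern 2: all haplotypes share a common range (all ranges overlap one another).
    if max(lo for lo, _ in spans) <= min(hi for _, hi in spans):
        return 2
    return None


cis_like = {"H1": [1.0, 1.1, 1.2], "H2": [3.0, 3.1], "H3": [5.0, 5.3]}
conserved = {"H1": [2.0, 2.2, 2.4], "H2": [2.1, 2.3]}
trans_like = {"H1": [1.0, 1.1, 4.0, 4.2], "H2": [1.0, 1.2]}
patterns = [classify_gene(g, gap=0.5) for g in (cis_like, conserved, trans_like)]
```

In practice the choice of `gap`, or of a different rule for delimiting ranges, would depend on the expression platform and its measurement noise.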
In some cases, such an analysis can provide independent corroboration of the association of a gene haplotype with a trait of interest. For example, if a particular haplotype is associated with increased yield and a specific expression value of that same haplotype is also associated with increased yield, there is a stronger indication that the haplotype is associated with the trait of interest.
Alternatively, or in addition, this analysis can facilitate the identification and association of epigenetic or cis/trans allele-specific effects on the trait of interest. For example, under normal conditions each unique haplotype of a single gene may have the same expression range. In that case, any association of a particular haplotype with a particular level of the trait of interest (e.g., increased yield in a plant) can be attributed to DNA variation at that locus. Alternatively, each haplotype may have one or more unique expression ranges of its own. In that case, the association with increased yield of a particular haplotype, alone or in combination with an expression range, can be attributed to epigenetic or cis/trans allele-specific effects on plant yield.
Methods for examining ASE are described in, for example, Lo et al. (2003) Genome Res. 13(8):1855-62; Pant et al. (2006) Genome Res. 16(3):331-9; and Bjornsson et al. (2008) Genome Research 18:771-779, each of which is incorporated herein by reference in its entirety.
Computer implemented method
The methods described above for assessing marker:trait associations can be carried out wholly or partly using computer programs or computer-implemented methods. These computer programs are appropriately configured to carry out the operations described herein.
The computer programs and computer program products of the present invention comprise a computer-usable medium having control logic stored therein for causing a computer to execute the algorithms described herein. The computer system of the present invention comprises a processor (operative to determine, accept, check, and display data); a memory connected to said processor for storing data; a display connected to said processor for displaying data; an input device connected to said processor for entering external data; and a computer-readable script, executable by said processor, with at least two modes of operation. The computer-readable script may be the computer program or the control logic of the computer program product of an embodiment of the invention.
It is not critical to the invention that the computer program be written in any particular computer language or operate on any particular type of computer system or operating system. The computer program may be written, for example, in the C++, Java, Perl, Python, Ruby, Pascal, or Basic programming language. It should be understood that such a program can be created in any of many different programming languages. In one aspect of the invention, the program is written to run on a computer using the Linux operating system. In another aspect of the invention, the program is written to run on a computer using the MS Windows or MacOS operating system.
Those of ordinary skill in the art will understand that, in accordance with the present invention, the code may be executed in any order, or simultaneously, as long as the order follows a logical flow.
Downstream uses of markers
Markers identified or confirmed using the methods disclosed herein may be used for genome-based diagnostics and selection techniques; for tracking the progeny of an organism; for determining the hybridity of an organism; for identifying variation in linked phenotypic traits, mRNA expression traits, or both phenotypic and mRNA expression traits; as genetic markers for constructing genetic linkage maps; for identifying individual progeny from a cross, wherein the progeny have a desired genetic contribution from the donor parent, the recipient parent, or both the donor and recipient parents; for isolating genomic DNA sequence surrounding a gene-coding or non-coding DNA sequence, such as, but not limited to, a promoter or regulatory sequence; in marker-assisted selection, map-based cloning, hybrid certification, fingerprinting, genotyping, and allele-specific markers; in transgenic plant development; and as markers in an organism of interest.
From the plant breeder's point of view, the initial motivation for developing molecular marker technologies was the possibility of increasing breeding efficiency through marker-assisted breeding. After positive markers have been identified by the statistical models described above, the corresponding genetic marker alleles can be used to identify plants that contain the desired genotype at multiple loci and that would be expected to transfer the desired genotype, together with the desired phenotype, to their progeny. Molecular marker alleles shown to be in linkage disequilibrium with a desired phenotypic trait (e.g., a quantitative trait locus, or QTL) provide a useful tool for selecting the desired trait in a plant population (i.e., marker-assisted breeding).
A "marker locus" is a locus that can be used to track the presence of a second, linked locus, e.g., a linked locus that encodes or contributes to the expression of a phenotypic trait. For example, a marker locus can be used to monitor the segregation of alleles at a locus, such as a QTL, that are genetically or physically linked to the marker locus. Thus, a "marker allele", alternatively an "allele of a marker locus", is one of a plurality of polymorphic nucleotide sequences found at a marker locus in a population that is polymorphic for that marker locus. In some aspects, the present invention provides methods for identifying and confirming marker loci associated with a phenotypic trait of interest. Each identified marker is expected to be in close physical and genetic proximity (resulting in physical and/or genetic linkage) to a genetic element, e.g., a QTL, that contributes to the trait of interest.
In various embodiments of the invention, the markers identified using the methods disclosed herein are used to select plants and to enrich the plant population for individuals with desired traits. By identifying marker alleles that show a statistically significant likelihood of co-segregating with a desired phenotype, a plant breeder can advantageously use molecular markers to identify desired individuals. By identifying and selecting the marker allele (or desired alleles from multiple markers) optimized for the desired phenotype, the plant breeder can rapidly select the desired phenotype by selecting for the appropriate molecular marker allele.
The presence and/or absence of a particular genetic marker allele in the genome of a plant exhibiting a preferred phenotypic trait is determined by the methods listed above, e.g., RFLP, AFLP, SSR, amplification of variable sequences, and ASH. If nucleic acid from the plant hybridizes with a probe specific for the desired genetic marker, the plant can be selfed to create a true-breeding line with the same genome, or it can be introgressed into one or more lines of interest. The term "introgression" refers to the transmission of a desired allele at a genetic locus from one genetic background to another. For example, a desired allele at a specific locus can be transmitted to at least one progeny by a sexual cross between two parents of the same species, wherein at least one of the parents has the desired allele in its genome. Alternatively, for example, transmission of the allele can occur by recombination between two donor genomes, e.g., in a fused protoplast, wherein at least one of the donor protoplasts has the desired allele in its genome. The desired allele can be, e.g., a selected allele of a marker, a QTL, a transgene, or the like. In any case, offspring comprising the desired allele can be repeatedly backcrossed to a line having a desired genetic background and selected for the desired allele, so that the allele becomes fixed in the selected genetic background.
Marker loci identified or confirmed using the methods of the invention can also be used to create dense genetic maps of molecular markers. A "genetic map" is a description of the genetic linkage relationships among loci on one or more chromosomes (or linkage groups) within a given species, generally depicted in diagrammatic or tabular form. "Genetic mapping" is the process of defining the linkage relationships of loci through the use of genetic markers, populations segregating for those markers, and standard genetic principles of recombination frequency. A "genetic map location" is a location on a genetic map, relative to surrounding genetic markers on the same linkage group, where a specified marker can be found within a given species. By contrast, a physical map of the genome refers to absolute distances (for example, measured in base pairs, or as isolated and overlapping contiguous genetic fragments, e.g., contigs). A physical map of the genome does not take into account the genetic behavior (e.g., recombination frequencies) between different points on the physical map.
In some applications it is advantageous to make or clone large nucleic acids in order to identify nucleic acids more distantly linked to a given marker, or to isolate nucleic acids linked to or responsible for a QTL as identified herein. It will be appreciated that a nucleic acid genetically linked to a polymorphic nucleotide sequence is optionally located up to about 50 centimorgans from the polymorphic nucleic acid, although the precise distance will vary depending on the crossover frequency of the particular chromosomal region. Typical distances from a polymorphic nucleotide are in the range of 1-50 centimorgans, e.g., often less than 1 centimorgan, less than about 1-5 centimorgans, about 1-5, 1, 5, 10, 15, 20, 25, 30, 35, 40, 45, or 50 centimorgans, etc.
Many methods of making large recombinant RNA and DNA nucleic acids (including recombinant plasmids, recombinant lambda phage, cosmids, yeast artificial chromosomes (YACs), P1 artificial chromosomes, bacterial artificial chromosomes (BACs), etc.) are known. A general introduction to YACs, BACs, PACs, and MACs as artificial chromosomes is given in Monaco & Larin, Trends Biotechnol. 12:280-286 (1994). Examples of suitable cloning techniques for making large nucleic acids, and instructions sufficient to direct persons of ordinary skill through many cloning exercises, can also be found in Berger, Sambrook, and Ausubel (all cited above).
In addition, any of the cloning or amplification strategies described herein are useful for creating contigs of overlapping clones, thereby providing overlapping nucleic acids that demonstrate, at the molecular level, the physical relationship of genetically linked nucleic acids. A common example of this strategy is found in whole-organism sequencing projects, in which overlapping clones are sequenced to provide the entire sequence of a chromosome. In this procedure, a library of the organism's cDNA or genomic DNA is made according to standard procedures (described, e.g., in the references above). Individual clones are isolated and sequenced, and the overlapping sequence information is ordered to provide the sequence of the organism.
Once one or more QTL significantly associated with the expression of a gene of interest have been identified, each of these loci and the linked markers can be further characterized to determine the gene or genes involved in the expression of the gene of interest (e.g., using map-based cloning methods, which will be known to those of ordinary skill in the art). For example, one or more known regulatory genes can be mapped to determine whether their genetic locations coincide with the QTL controlling mRNA expression of the gene of interest. Confirmation that such a coinciding regulatory gene affects expression of the one or more genes of interest can be obtained using techniques standard in the art, such as, but not limited to, genetic transformation, gene complementation, gene knockout, or overexpression. The genetic linkage map can also be used to isolate the regulatory gene (including any novel regulatory gene) by map-based cloning methods known in the art, whereby markers located at the QTL are used to walk to the gene of interest using contigs of large-insert genomic clones. Positional cloning is one such method that can be used to isolate one or more regulatory genes, as described by Martin et al. (1993, Science 262:1432-1436; incorporated herein by reference).
"Positional gene cloning" uses the proximity of a genetic marker to physically define a cloned chromosomal fragment that is linked to a QTL identified using the statistical methods described herein. Clones of linked nucleic acids have a variety of uses, including as genetic markers for identifying linked QTL in subsequent marker-assisted breeding schemes, and for improving desired properties in recombinant plants (where expression of the cloned sequence in a transgenic plant affects the identified trait). Commonly linked sequences that are desirably cloned include open reading frames (e.g., encoding nucleic acids or proteins that provide a molecular basis for an observed QTL). If markers are proximal to the open reading frame, they may hybridize to a given DNA clone, thereby identifying the clone on which the open reading frame is located. If flanking markers are more distant, a fragment containing the open reading frame may be identified by constructing a contig of overlapping clones. However, other suitable methods known to those of ordinary skill in the art can also be used. Again, confirmation that such a coinciding regulatory gene affects expression of the one or more genes of interest can be obtained by genetic transformation and complementation or by the knockout techniques described below.
When the gene or genes responsible for or contributing to a trait of interest have been identified, transgenic plants can be produced to achieve the desired trait. Plants exhibiting the trait of interest can be incorporated into plant lines by breeding or by common genetic engineering techniques. Breeding methods and techniques are well known in the art. See, e.g., Welsh J.R., Fundamentals of Plant Genetics and Breeding, John Wiley & Sons, NY (1981); Crop Breeding, Wood D.R. (Ed.), American Society of Agronomy, Madison, Wis. (1983); Mayo O., The Theory of Plant Breeding, Second Edition, Clarendon Press, Oxford (1987); Singh, D.P., Breeding for Resistance to Diseases and Insect Pests, Springer-Verlag, NY (1986); and Wricke and Weber, Quantitative Genetics and Selection in Plant Breeding, Walter de Gruyter and Co., Berlin (1986). Relevant techniques include, but are not limited to: hybridization, inbreeding, backcross breeding, multi-line breeding, dihaploid inbreeding, variety blend, interspecific hybridization, aneuploid techniques, etc.
In some embodiments, it may be necessary to genetically modify the plant using conventional plant engineering methods to obtain the trait of interest. In this case, one or more nucleotide sequences associated with the trait of interest can be introduced into the plant. The plants can be homozygous or heterozygous for the one or more nucleotide sequences. Expression (i.e., transcription and/or translation) of the sequence results in a plant exhibiting the trait of interest. Methods for plant transformation are known in the art.
The following examples are offered by way of illustration and not by way of limitation.
Experimental example
Example 1. QTL detection in nested populations
NAM analysis was carried out using SMR and MMR, combined with the permutation method described below to determine the LOD thresholds for NAM.
Single marker regression (SMR):
The linear model used to describe the relationship between trait values and marker genotypes is:
$$y_{ij} = \mu + x_{ij}\,a + g_i\,u_i + e_{ij} \qquad \text{(Model 1)}$$
where $y_{ij}$ is the phenotypic value of individual j in subpopulation i; $\mu$ is the overall mean; $a$ is the additive effect of the QTL; $g_i$ is the indicator variable for subpopulation i; $u_i$ is the effect of subpopulation i; $e_{ij}$ is the residual; and $x_{ij}$ is defined as 1 if individual j carries the allele from the common parent and as -1 if individual j carries the allele from the other parent.
This definition is based on the fact that only two different alleles exist for each marker. To take advantage of the simplicity of regression, the genetic background effect $u_i$ is assumed to be a fixed effect. It is included in the model to account for the effects of population stratification and thereby reduce the residual variance.
The hypotheses for testing the QTL effect can be formulated as $H_0: a = 0$ and $H_1: a \neq 0$. The parameters under $H_0$ or $H_1$ are estimated by least squares for the regression model, depending on whether the QTL effect is included in the model. The likelihood ratio statistic is $LR = -2(l_{reduced} - l_{full})$, where $l_{reduced}$ is the log-likelihood of the reduced model, corresponding to $H_0$, and $l_{full}$ is the log-likelihood of the full model, corresponding to $H_1$ (Lander and Botstein 1989). Both are calculated from the SMR model, and the LOD score is calculated as $LR/(2 \ln 10)$. Note that the MMR method below uses the same hypothesis test and the same method to calculate LOD.
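By way of non-limiting illustration, the LOD calculation for a single marker under Model 1 can be sketched in Python as follows. The function names are hypothetical. The sketch assumes normally distributed residuals with the variance estimated by maximum likelihood, in which case the likelihood ratio reduces to n ln(RSS_reduced / RSS_full); it fits the fixed subpopulation effects by centering phenotypes and genotypes within each subpopulation, which gives the same least-squares estimate of the additive effect as fitting the subpopulation indicators explicitly.

```python
import math
from collections import defaultdict


def center_within(values, groups):
    """Subtract each subpopulation's mean; this absorbs mu and the fixed effects u_i."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def smr_lod(y, x, subpop):
    """Return (estimated additive effect a, LOD score) for one marker.

    y: phenotypic values; x: genotypes coded +1 / -1; subpop: subpopulation labels.
    """
    n = len(y)
    yc = center_within(y, subpop)
    xc = center_within(x, subpop)
    sxx = sum(v * v for v in xc)
    if sxx == 0:
        raise ValueError("marker does not segregate within any subpopulation")
    sxy = sum(a * b for a, b in zip(xc, yc))
    a_hat = sxy / sxx
    rss_reduced = sum(v * v for v in yc)     # residual sum of squares under H0 (a = 0)
    rss_full = rss_reduced - a_hat * sxy     # residual sum of squares under H1
    lr = n * math.log(rss_reduced / rss_full)  # equals -2 (l_reduced - l_full)
    return a_hat, lr / (2 * math.log(10))


# Toy data: two subpopulations of four individuals each (values are illustrative only).
subpop = ["A"] * 4 + ["B"] * 4
x = [1, 1, -1, -1, 1, -1, 1, -1]
y = [12.0, 11.0, 9.0, 8.0, 21.0, 18.0, 22.0, 19.0]
a_hat, lod = smr_lod(y, x, subpop)
```

The sketch does not handle a perfect fit (zero residual sum of squares under the full model) or missing genotypes; a working implementation would need to treat both cases.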
This SMR method differs from the initial marker-based regression method for NAM (Yu et al. (2006) Nature Genetics 38(2):203-208) in its use of polymorphic markers. According to the NAM marker data, some markers are polymorphic in some subpopulations but not in others. In that case, including non-informative markers can cause segregation distortion of the marker genotypes at that locus, and this distortion can reduce the efficiency, power, and precision of QTL mapping. To avoid this problem, a marker filtering step is incorporated into the SMR model to reduce the potential risk caused by marker distortion. According to the present invention, only phenotype and genotype data from those subpopulations with segregating marker genotypes are included in each analysis. Thus, in various embodiments of the invention, subpopulations with non-informative genotypes are eliminated before the SMR analysis. This step enables SMR to identify alleles present in NAM at very low frequency (less than 5%).
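The marker filtering step can be illustrated with a minimal Python sketch. The function names, the +1 / -1 genotype coding, and the use of `None` for a missing genotype are assumptions for this example; the sketch simply retains, for a given marker, the records from subpopulations in which both genotypes are observed.

```python
def segregating_subpops(x, subpop):
    """Return the subpopulations in which both marker genotypes are observed."""
    observed = {}
    for g, s in zip(x, subpop):
        if g is not None:                      # None marks a missing genotype (assumption)
            observed.setdefault(s, set()).add(g)
    return {s for s, genotypes in observed.items() if len(genotypes) == 2}


def filter_for_marker(y, x, subpop):
    """Keep only records from subpopulations that are informative for this marker."""
    keep = segregating_subpops(x, subpop)
    rows = [(yi, xi, si) for yi, xi, si in zip(y, x, subpop)
            if si in keep and xi is not None]
    return ([r[0] for r in rows], [r[1] for r in rows], [r[2] for r in rows])


# Toy data: the marker segregates in subpopulation A only.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [1, -1, 1, 1, None, -1]
subpop = ["A", "A", "B", "B", "C", "C"]
y_f, x_f, subpop_f = filter_for_marker(y, x, subpop)
```

The filtered vectors can then be passed to a single-marker regression, so that each marker is tested only in the subpopulations where it carries information.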
Multiple marker regression (MMR):
To account for the effects of other QTL, an MMR method was developed that uses multiple cofactor markers to absorb the effects of other QTL. The linear model for MMR is:
$$y_{ij} = \mu + x_{ij}\,a + \sum_{k=1}^{m} c_{ijk}\,b_k + g_i\,u_i + e_{ij} \qquad \text{(Model 2)}$$
where $y_{ij}$ is the phenotypic value of individual j in subpopulation i; $\mu$ is the overall mean; $x_{ij}$ is the genotype of the QTL; $a$ is the additive effect of the QTL; $c_{ijk}$ is cofactor marker k for individual j in subpopulation i; $b_k$ is the effect of cofactor marker k; $g_i$ is the indicator variable for subpopulation i; $u_i$ is the effect of subpopulation i; and $e_{ij}$ is the residual.
Another aspect of the invention uses stepwise regression to select cofactor markers at a significance level of 0.01. The linear model used to select these cofactors is:
$$y_{ij} = \mu + c_{ijk}\,b_k + g_i\,u_i + e_{ij} \qquad \text{(Model 3)}$$
where $y_{ij}$ is the phenotypic value of individual j in subpopulation i; $\mu$ is the overall mean; $c_{ijk}$ is cofactor marker k for individual j in subpopulation i; $b_k$ is the effect of cofactor marker k; $g_i$ is the indicator variable for subpopulation i; $u_i$ is the effect of subpopulation i; and $e_{ij}$ is the residual. This stepwise regression model differs from the model used for conventional composite interval mapping (Zeng 1993, 1994) and from the initial model for NAM (Yu et al. 2008). One aspect of the invention performs stepwise regression for the NAM population with the genetic backgrounds of the different subpopulations included in Model 3. This method selects those QTL that have stable effects across multiple subpopulations. Stable effects are effects observed across multiple populations. The invention also effectively reduces the number of cofactors included in the model, avoiding the problem of oversaturation.
With cofactor markers, it is possible to obtain a much clearer LOD profile from MMR than from SMR. Cofactor markers are used to reduce the residual variance and therefore increase the significance of the QTL hypothesis test. MMR has demonstrated the ability to separate closely linked QTL and to position QTL within narrow genomic regions.
However, MMR has difficulty using the marker filtering step. The problem is caused by singularity of the design matrix for the regression model. Therefore, the present invention uses all genotype data from all subpopulations for the MMR data analysis, rather than filtering out non-informative markers. On this basis, SMR and MMR will give similar results for markers without segregation distortion, although they may give different results for markers with distorted genotype segregation. The invention is therefore designed to use SMR and MMR as a complementary combination for NAM data sets.
Permutation test for NAM:
The initial multiple regression method for NAM (Yu et al. 2008) used a very low significance level, $10^{-7}$, as the threshold for QTL detection. That approach is not suitable for determining the LOD threshold at a given significance level, particularly with a high-density linkage map. To solve this problem, the invention provides a novel permutation test for determining empirical LOD thresholds at the given significance levels 0.05 and 0.01. The method shuffles phenotypic values within each subpopulation, without destroying the subpopulation structure or the correlations among the different traits of interest. For both SMR and MMR, 1000 permutations are recommended. The LOD thresholds at the 0.05 and 0.01 levels are determined from these permutations. Note that, because of the limited number of permutations (1000 recommended), the 0.01 threshold may be unstable.
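The permutation scheme can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, and `max_lod` stands for any user-supplied function that scans all markers for a given phenotype vector and returns the largest LOD score (it is not defined here). Each individual's phenotypes are stored as one tuple, so that shuffling whole tuples within a subpopulation keeps the correlations among traits and the subpopulation structure intact. The threshold is taken as the (1 - alpha) empirical quantile of the permuted maximum LOD scores, which is one common convention and is assumed here.

```python
import math
import random


def permute_within_subpops(phenotypes, subpop, rng):
    """Shuffle phenotype records among individuals of the same subpopulation."""
    index_by_subpop = {}
    for i, s in enumerate(subpop):
        index_by_subpop.setdefault(s, []).append(i)
    permuted = list(phenotypes)
    for indices in index_by_subpop.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        for dst, src in zip(indices, shuffled):
            permuted[dst] = phenotypes[src]    # whole record moves, traits stay together
    return permuted


def empirical_thresholds(phenotypes, subpop, max_lod,
                         n_perm=1000, levels=(0.05, 0.01), seed=0):
    """Empirical LOD thresholds from the null distribution of the maximum LOD."""
    rng = random.Random(seed)
    null = sorted(max_lod(permute_within_subpops(phenotypes, subpop, rng))
                  for _ in range(n_perm))
    thresholds = {}
    for alpha in levels:
        # Index of the (1 - alpha) empirical quantile in the sorted null scores.
        k = math.ceil((1 - alpha) * n_perm - 1e-9) - 1
        thresholds[alpha] = null[min(max(k, 0), n_perm - 1)]
    return thresholds
```

With 1000 permutations, the 0.01 threshold rests on only about ten values in the upper tail of the null distribution, which is consistent with the instability noted above.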
Example 2. Method for selecting candidate leads for further validation after genome-wide association mapping
With the advent of "omics", identifying the key candidate genes (those that play a role in a phenotype or complex biological process) among the thousands of genes in a genome has, paradoxically, become one of the main obstacles. Indeed, contrary to some early concerns that a lack of sufficiently comprehensive data would remain a limiting factor, the opposite is true: it is now the sheer volume of information that challenges scientists. This has translated into a need for new tools to mine, integrate, and prioritize large amounts of information. The present invention helps prioritize candidate leads identified by genome-wide association mapping (e.g., using sequences from Solexa technology) for further validation and implementation in marker-assisted breeding.
QTL mapping of the trait of interest is performed using the nested association mapping population developed by the maize functional diversity group (Yu et al. Genetics 2008, 178:539-551). Because this linkage map has a resolution of about 1 cM (i.e., a marker density of about one marker per cM), the QTL identified in this population should be very precise. The QTL mapping is carried out using the shared allelic information of the parents used to form the population. The sequences used for genome-wide association mapping are placed on the maize physical map. The markers on the NAM linkage map are also placed on the maize physical map.
The QTL identified in the NAM population are then compared with the Solexa sequences on this physical map to determine where the two overlap. Sequences from Solexa sequencing that overlap with QTL identified in the NAM population are prioritized for further validation over sequences that do not overlap with such QTL. See Fig. 4.
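The prioritization step can be illustrated with a minimal Python sketch. The representation of a region as a (chromosome, start, end) tuple in physical-map coordinates, the function names, and the example identifiers and coordinates are all hypothetical; the sketch only shows the overlap rule, under the assumption that both the associated sequences and the QTL intervals have already been placed on the same physical map.

```python
def overlaps(region, qtl):
    """True if two (chromosome, start, end) regions share at least one position."""
    return region[0] == qtl[0] and region[1] <= qtl[2] and qtl[1] <= region[2]


def prioritize(sequences, qtl_intervals):
    """Split candidate sequences into those overlapping a NAM QTL and those that do not."""
    prioritized, deferred = [], []
    for name, region in sequences.items():
        if any(overlaps(region, q) for q in qtl_intervals):
            prioritized.append(name)
        else:
            deferred.append(name)
    return prioritized, deferred


# Hypothetical coordinates, for illustration only.
qtl_intervals = [("chr1", 1_000_000, 2_000_000), ("chr5", 500_000, 900_000)]
sequences = {
    "seq_a": ("chr1", 1_500_000, 1_500_400),   # inside the chr1 QTL interval
    "seq_b": ("chr1", 3_000_000, 3_000_400),   # same chromosome, outside the interval
    "seq_c": ("chr5", 899_900, 900_300),       # straddles the end of the chr5 interval
}
prioritized, deferred = prioritize(sequences, qtl_intervals)
```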
Example 3. QTL detection using NAM SMR and MMR
Experimental design and preparation of phenotype and genotype data
The NAM RIL lines were grown at five locations over two years. The traits of interest (mainly starch and protein, for the corn ethanol project) were evaluated across these locations and years. The phenotypic data from each location were unbalanced. These unbalanced data structures made it necessary to obtain the corresponding genotype data for the lines. To do so, all of the marker genotype data were downloaded (www.panzea.org/lit/data sets.html) and the genotype information was extracted for the NAM lines evaluated. In addition, to perform SMR and MMR, a consensus linkage map was located at the same website and downloaded for further use.
Data analysis method
NAM SMR and MMR were used to detect the QTL responsible for starch and protein in maize. Details of these methods are described in Example 1. Both SMR and MMR were used for QTL mapping. SMR has the advantage of reducing the impact of marker segregation distortion, while MMR can position QTL within narrow regions of the chromosome. Combining SMR and MMR maximizes the power of QTL detection while minimizing the risk of missing any QTL with a smaller effect.
A permutation method was developed for SMR and MMR so that the empirical LOD threshold at a given significance level of 0.05 could be determined for both methods. In this analysis, the permutation test was carried out with 1000 permutations for each of the two methods.
Results of QTL mapping
Eleven QTL were found for the maize starch trait and ten QTL were found for protein. Among these, six QTL for starch and five QTL for protein were identified consistently across all locations. In addition, six QTL were found to control both starch and protein, suggesting potential pleiotropic effects on the two traits. These six QTL were found to have large effects on the individual traits. The identification of these pleiotropic QTL may explain the strong phenotypic correlation between starch and protein in maize.
Conclusion
As expected from the NAM experimental design, SMR and MMR identified major and pleiotropic QTL for starch and protein. Both methods proved to be powerful tools for QTL detection in NAM. The permutation method provides the LOD threshold for QTL detection for either method.
Example 4. A biological example combining genome-wide association analysis with linkage mapping in a nested association mapping population to prioritize candidate gene leads for validation/implementation
Introduce
Genome-wide association (GWA) analysis is a powerful tool for identifying common genetic variants in a population that affect a trait of interest, and it provides high mapping resolution (down to a single nucleotide change). Association studies take advantage of recombination events across the genome, accumulated over many generations, which divide the genome into multiple linkage disequilibrium (LD) blocks in the population. Markers within each LD block usually show significant association with functional changes in genes within the same block, and can therefore be used as surrogates for the relevant functional changes in plant breeding, or as a basis for further pinpointing the responsible gene.
The objective of GWA is to detect markers that are physically closely linked to the relevant functional changes. However, it is common to detect associations with markers that are unlinked or only distantly linked to those changes (generally considered false positives). Although many other population genetic factors (e.g., migration, mutation, genetic drift, non-random mating) may also contribute to the false positive rate, population stratification, or population structure, has been identified as the main concern likely to cause a large number of false positives in GWA. Population structure occurs when allele frequencies differ systematically between subpopulations within a population (which can be caused by migration, non-random mating, etc.).
An example of GWA
Sample and data
1) Inbred panel for GWA: a maize inbred panel was assembled comprising 600 inbred lines selected from a pool of about 3000 maize inbred lines to maximize genetic diversity. 450 lines in the panel are known to derive from 3 subpopulations, namely the non-stiff stalk (NSS), stiff stalk (SS), and tropical-subtropical (TS) subgroups; for various practical reasons, no subgroup identity is available for the remaining 150 lines.
2) Genotype data on 500,000 SNPs: for genome-wide SNPs, Solexa sequencing technology was used to screen whole-genome cDNA libraries from the 600 different inbred lines of the panel, identifying about 500,000 high-quality SNPs.
3) Phenotypic data on 3 ethanol-related traits: the starch, oil, and protein content of maize kernels (the 3 main ethanol-related traits) was evaluated with a near-infrared spectroscopy (NIR) machine for each of the 600 inbred lines of the panel grown at 2 locations.
Data prediction
The phenotypic data were evaluated empirically and statistically to remove suspicious phenotyping data points, e.g., outliers. The statistical distribution of the phenotypic data can also be assessed to determine whether data transformation or permutation is needed in order to claim significant marker-trait associations. As shown in the trait histograms (Fig. 5), the 3 traits are approximately normally distributed, indicating that p-values estimated from association tests that depend on the normality assumption will be substantially valid.
The genotype data were evaluated to identify obvious errors, such as SNP markers with more than 2 alleles, and non-informative SNPs (singletons or those with a minor allele frequency < 0.05). Of the 500,000 SNPs, 200 were non-informative and were excluded from the data. Hardy-Weinberg equilibrium cannot be tested for an inbred population, because of the inherent deficiency of heterozygotes in such a population.
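The genotype checks can be illustrated with a minimal Python sketch. The function name, the return labels, and the representation of each SNP as a list with one allele call per inbred line (with `None` for a missing call) are assumptions for this example; the 0.05 cutoff is the minor allele frequency threshold stated above.

```python
from collections import Counter


def classify_snp(calls, min_maf=0.05):
    """Label one SNP as 'error', 'non-informative', or 'informative'.

    calls: one allele call per inbred line; None marks a missing call (assumption).
    """
    counts = Counter(c for c in calls if c is not None)
    if len(counts) > 2:
        return "error"                      # more than 2 alleles at a SNP
    if len(counts) < 2:
        return "non-informative"            # monomorphic in this panel
    minor = min(counts.values())
    total = sum(counts.values())
    if minor == 1 or minor / total < min_maf:
        return "non-informative"            # singleton or low minor allele frequency
    return "informative"


# Toy panel of three SNPs (values are illustrative only).
panel = {
    "snp1": ["A"] * 10 + ["G"] * 10,
    "snp2": ["A"] * 99 + ["G"],
    "snp3": ["A", "C", "G", None],
}
labels = {name: classify_snp(calls) for name, calls in panel.items()}
```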
Phenotypic data adjustment for individual inbred lines
A mixed-model approach was used to obtain the total genetic value of each inbred line in the sample while controlling for the effects of multiple locations and random blocks. In this model, the total genetic effect of each inbred line was treated as random, because the inbred lines in the panel are regarded as a random sample from the whole germplasm; the blocks were also treated as random; the locations were treated as fixed effects, because they are the target locations where any future hybrids would be grown. The statistical model used for the analysis can be written as:
$$Y_{hijk} = \mu + G_h + L_i + B_{j(i)} + e_{hijk}$$
where $\mu$ is the overall mean; $G_h$ is the total random genetic effect of the $h$-th inbred line; $L_i$ is the fixed effect of location $i$; $B_{j(i)}$ is the random effect of block $j$ within location $i$; and $e_{hijk}$ is the random residual, assumed to be normally distributed. The model was fitted in the statistical package R (www.r-project.org) using the lme4 library.
The best linear unbiased prediction (BLUP) values for $G_h$ were obtained and used as the phenotypic data in the mixed linear model association method as implemented in the software TASSEL.
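To illustrate what a BLUP of the line effect does, the following sketch handles a deliberately simplified case: balanced data, a single random line effect, no location or block terms, and variance components assumed to be known. It is not the lme4 fit described above, which estimates the variance components and includes the location and block effects; the function name and arguments are hypothetical.

```python
import numpy as np

def blup_line_effects(y, var_g, var_e):
    """BLUP of line effects for y[h, k] = mu + G_h + e_hk (balanced data).

    y: 2-D array, rows = inbred lines, columns = replicate observations.
    var_g, var_e: genetic and residual variance components, assumed known.
    """
    y = np.asarray(y, dtype=float)
    n_reps = y.shape[1]
    grand_mean = y.mean()                 # estimate of mu in the balanced case
    line_means = y.mean(axis=1)
    # Shrinkage factor: line-mean deviations are pulled toward zero.
    shrinkage = var_g / (var_g + var_e / n_reps)
    return shrinkage * (line_means - grand_mean)
```

The shrinkage factor approaches 1 as the number of replicates grows or the residual variance falls, so well-replicated lines keep most of their observed deviation from the overall mean.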
Estimation of population structure
Including population structure in the statistical model can effectively reduce false positives in association analysis. TASSEL achieves this by incorporating population structure as a model factor in the mixed linear model.
As mentioned above, there are 3 known subpopulations (SS, NSS, and TS) in the inbred panel, but 25% of the inbred lines have no subpopulation identity. One way around this is to estimate the population structure of the panel from the SNP data of these inbred lines. A random set of 2,000 SNPs was selected from all informative SNPs and used to estimate population structure.
As described in U.S. Patent Application 12/328,689, filed December 4, 2008, principal component analysis (PCA) provides accuracy in population structure estimation similar to that of the Bayesian method in STRUCTURE. PCA was carried out using all of these SNP data, and the top 50 principal components (PCs) and their eigenvectors were obtained. Stepwise regression analysis was used to select trait-specific principal components from the 50 PCs, which has been shown to provide better control of population structure effects in the association mixed model than simply using the top few PCs.
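The PCA step can be sketched with a singular value decomposition of the centered genotype matrix. This is a minimal sketch under the assumption that genotypes are coded numerically (for example 0/1 for the two homozygous classes) with no missing values; the function name is hypothetical, and the trait-specific stepwise selection of PCs is not shown.

```python
import numpy as np

def top_principal_components(genotypes, n_components=50):
    """Return PC scores (lines x components) and explained-variance ratios.

    genotypes: 2-D numeric array, rows = inbred lines, columns = SNPs.
    """
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)                      # center each SNP column
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    k = min(n_components, len(s))
    scores = u[:, :k] * s[:k]                          # coordinates of each line
    explained = s[:k] ** 2 / np.sum(s ** 2)            # share of total variance
    return scores, explained
```

The columns of `scores` are the candidate covariates from which a subset would be chosen for the structure matrix of the association model.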
Estimation of kinship coefficients
The kinship coefficient is a measure of relatedness between two individuals. It represents the probability that two genes sampled at random, one from each individual, are identical by descent. There are multiple methods for estimating kinship coefficients from marker data, each with its own advantages and disadvantages. The proportion of shared alleles across all SNP loci was chosen as the measure of kinship between a pair of inbred lines; it is essentially the probability that two random genes are identical by state. This kinship coefficient was calculated for all possible pairs.
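The shared-allele measure described above can be sketched as follows, assuming fully homozygous lines with one allele code per locus and no missing calls; the function name and data layout are hypothetical.

```python
import numpy as np

def shared_allele_kinship(genotypes):
    """Proportion of SNP loci at which each pair of lines carries the same allele.

    genotypes: 2-D array, rows = inbred lines, columns = SNPs, one allele
    code per line and locus (lines assumed fully homozygous).
    """
    g = np.asarray(genotypes)
    n = g.shape[0]
    k = np.ones((n, n))                    # a line shares all alleles with itself
    for a in range(n):
        for b in range(a + 1, n):
            k[a, b] = k[b, a] = np.mean(g[a] == g[b])
    return k
```

The result is a symmetric matrix with ones on the diagonal, suitable in form for use as a pairwise relatedness matrix.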
Association analysis using the mixed-model method
Mixed linear models have been used for association mapping in plants (Yu et al. 2006, Nature Genetics) and have been shown to be superior in controlling for population structure. The method was implemented in ASReml (Gilmour et al. (1995) Biometrics 51:1440-1450), a commercial package for fitting general mixed models, by means of a comprehensive Perl script that fully automates the data analysis for multiple traits. Compared with TASSEL (software implementing the mixed-model method of Yu et al. (2006, Nature Genetics, Vol. 38:203-208)), ASReml is faster, and the Perl script minimizes the attention required from the user.
The mixed linear model implemented in ASReml is the same as that in TASSEL and can be written in matrix form as:
$$y = X\beta + S\alpha + Qv + Zu + e, \qquad \operatorname{var}(y) = ZKZ'\sigma_v^2 + R\sigma_e^2$$
where $y$ is the vector of phenotypic values of all unique inbred lines; $\beta$ is the vector of all fixed experimental effects; $\alpha$ is the vector of genetic effects of the putative QTL at the tested position; $v$ is the vector of subpopulation effects; $u$ is the vector of polygenic effects of the individual inbred lines; and $e$ is the vector of random residuals. $X$, $S$, $Q$, and $Z$ are known incidence matrices.
In this analysis, the adjusted phenotypic data (total genetic effects) were used as $y$, and 10 PCA eigenvectors were used as the $Q$ matrix; the $X$ matrix is essentially a vector of 1s for the overall mean; $S$ is the genotype matrix, under an additive genetic model, of each SNP being tested; and $Z$ is the incidence matrix for the set of unique inbred lines.
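A single-SNP test under a model of this form can be sketched with generalized least squares. The sketch below is not the ASReml or TASSEL implementation: it assumes one record per line (so that Z is the identity), takes R as the identity, and treats both variance components as already known, whereas the packages estimate them. The function name and the parameter names `var_g` and `var_e` are hypothetical.

```python
import math
import numpy as np

def wald_test_snp(y, snp, q, kinship, var_g, var_e):
    """Wald test of one SNP's additive effect by generalized least squares.

    Assumes var(y) = kinship * var_g + I * var_e with known var_g, var_e.
    Returns (estimated SNP effect, p-value).
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Fixed-effect design: overall mean (X), tested SNP (S), structure covariates (Q).
    design = np.column_stack([np.ones(n), snp, q])
    v = var_g * np.asarray(kinship, dtype=float) + var_e * np.eye(n)
    v_inv = np.linalg.inv(v)
    xtvx_inv = np.linalg.inv(design.T @ v_inv @ design)
    coef = xtvx_inv @ design.T @ v_inv @ y
    effect = coef[1]
    se = math.sqrt(xtvx_inv[1, 1])
    wald = (effect / se) ** 2
    p_value = math.erfc(math.sqrt(wald / 2.0))   # upper tail of chi-square, 1 d.f.
    return effect, p_value
```

In a genome scan, this function would be called once per informative SNP with the same phenotype vector, structure covariates, and kinship matrix.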
Association results
In the association analysis, each informative SNP was tested for significance and a p-value was calculated (together with the phenotypic contribution R-squared and several other statistics). Both the false discovery rate (FDR) and the Bonferroni correction were used to control the false positives inflated by multiple testing. With the Bonferroni correction, the nominal p-value threshold at significance level α is calculated as α divided by the number of tests (SNPs); the FDR threshold was obtained from the estimated distribution of p-values. The average of the two thresholds at the same significance level (α) was used.
For all tests, α = 0.05 was chosen as the significance level. This resulted in 102 SNPs significantly associated with starch content, 134 SNPs associated with protein, and 97 SNPs associated with oil. These SNPs were found to come from 30, 35, and 23 linkage disequilibrium blocks on the genome for starch, protein, and oil, respectively.
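The threshold calculation described above can be sketched as follows. The specification does not state which FDR procedure was used; the Benjamini-Hochberg step-up rule is assumed here purely for illustration, and all function names are hypothetical.

```python
import numpy as np

def bonferroni_threshold(n_tests, alpha=0.05):
    """Nominal p-value threshold: alpha divided by the number of tests."""
    return alpha / n_tests

def bh_fdr_threshold(p_values, alpha=0.05):
    """Largest p-value passing the Benjamini-Hochberg rule (0.0 if none pass)."""
    p = np.sort(np.asarray(p_values, dtype=float))
    m = len(p)
    passing = p <= alpha * np.arange(1, m + 1) / m
    return float(p[passing].max()) if passing.any() else 0.0

def combined_threshold(p_values, alpha=0.05):
    """Average of the Bonferroni and FDR thresholds at the same alpha."""
    bonf = bonferroni_threshold(len(p_values), alpha)
    fdr = bh_fdr_threshold(p_values, alpha)
    return (bonf + fdr) / 2.0
```

SNPs whose p-values fall at or below the combined threshold would then be reported as significant.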
Overlaying GWA associations with linkage mapping results from NAM
A statistically significant association does not always indicate a true biological association (it may be caused by sampling error). Evidence from independent sources can therefore be useful for confirming the detected associations.
The nested association mapping (NAM) population in maize, a new type of mapping population, has been made publicly available (Yu et al. 2008, Genetics, Vol. 178:539-551). The advantage of such a population is that it is expected to provide higher statistical power and mapping precision than linkage mapping, while giving fewer false positives than association mapping with samples from a general population. A linkage mapping study with the NAM population was carried out previously (Example 3 above), which identified 11 QTL regions for starch, 10 regions for protein, and 8 regions for oil.
The associated SNPs from GWA were overlaid on the QTL regions detected by the NAM linkage analysis by placing the associated SNPs and the markers in these QTL regions on the same physical map and consensus genetic map (see Fig. 4). Table 1 shows that, for starch, 55% of the associated SNPs fall within 8 QTL regions; for protein, 31.1% of the associated SNPs fall within 6 QTLs; and for oil, 27.8% of all associated SNPs fall within 3 QTLs.
Table 1. Summary of overlapping QTLs and associated SNPs
SNPs from genes that overlap with the QTLs detected in the NAM population are given higher priority and will be used for further biological validation. These SNPs are also used in downstream applications (such as marker-assisted breeding).
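Once SNPs and QTL regions are placed on a common map, the overlay reduces to an interval lookup. The sketch below assumes positions on a shared coordinate system and inclusive interval bounds; the record layouts and function names are hypothetical.

```python
def overlapping_snps(snps, qtl_regions):
    """Return (snp_id, qtl_id) pairs for SNPs lying inside a QTL interval.

    snps: iterable of (snp_id, chromosome, position).
    qtl_regions: iterable of (qtl_id, chromosome, start, end), inclusive bounds.
    """
    hits = []
    for snp_id, chrom, pos in snps:
        for qtl_id, q_chrom, start, end in qtl_regions:
            if chrom == q_chrom and start <= pos <= end:
                hits.append((snp_id, qtl_id))
                break                      # count each SNP at most once
    return hits

def overlap_fraction(snps, qtl_regions):
    """Fraction of the SNPs that fall inside at least one QTL interval."""
    snps = list(snps)
    return len(overlapping_snps(snps, qtl_regions)) / len(snps)
```

The fraction returned by `overlap_fraction` corresponds to the kind of percentage reported per trait, and the list of hits identifies the SNPs to prioritize.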
All publications and patent applications mentioned in this specification are indicative of the level of skill of those of ordinary skill in the art to which this invention pertains. All publications and patent applications are incorporated herein by reference to the same extent as if each individual publication or patent application were specifically and individually indicated to be incorporated by reference.
Although the foregoing invention has been described in some detail by way of illustration and example for purposes of clarity of understanding, it will be obvious that certain changes and modifications may be practiced within the scope of the appended claims.

Claims (8)

1. A method of identifying a genetic marker associated with a trait of interest in a nested population of a non-human organism, the method comprising:
a) providing a genotype value for each of a plurality of genetic markers for each member of the nested population, wherein the nested population comprises members exhibiting the trait of interest;
b) providing a phenotypic value for the trait of interest for each member of the nested population;
c) applying a nested association model to determine whether one or more of the genetic markers are associated with the trait of interest, the nested association model comprising a combination of a single marker regression (SMR) model and a multiple marker regression (MMR) model, wherein:
i) non-informative genotypes are removed before the SMR model is used to assess the association between trait values and marker genotypes; and
ii) stepwise regression is used to select cofactor markers for inclusion in the MMR model;
wherein a genetic marker is deemed to be associated with the trait of interest if the SMR model or the MMR model detects an association, and wherein step c) is run on a suitably programmed computer.
2. the method for claim 1, wherein said SMR model comprises:
y ij=μ+x ija+g iu i+e ij
Wherein y ijit is the phenotypic number of the individual j in subpopulation i;
Wherein μ is overall mean;
Wherein a is the additive effect of QTL;
Wherein g iit is the indieating variable of subpopulation i;
Wherein u iit is the effect of subpopulation i;
Wherein e ijit is residual error; And
X when wherein if individual j carries from tester or original seed (elite) parent allele ijbe defined as 1, if x during individual j carries from inbreeding parent or external (exotic) parent allele ijbe defined as-1.
3. the method for claim 1, wherein said MMR model comprises:
y ij=μ+x ija+Σ(k=1,m)c ijkb k+g iu i+e ij
Wherein y ijit is the phenotypic number of the individual j in subpopulation i;
Wherein μ is overall mean;
Wherein x ijit is the genotype of QTL;
Wherein a is the additive effect of QTL;
Wherein m is co-factor sum;
Wherein c ijkthe co-factor mark k of the individual j in subpopulation i;
Wherein b kit is the effect of co-factor mark k;
Wherein g iit is the indieating variable of subpopulation i;
Wherein u iit is the effect of subpopulation i; And
Wherein e ijit is residual error.
4. the method for claim 1, wherein these co-factors are selected based on the level of significance of definition.
5. method as claimed in claim 4, wherein said level of significance is less than or equal to 0.1.
6. method as claimed in claim 4, wherein these co-factors use a model to select, and this model comprises:
y ij=μ+c ijkb k+g iu i+e ij
Wherein y ijit is the phenotypic number of the individual j in subpopulation i;
Wherein μ is overall mean;
Wherein c ijkthe co-factor mark k of the individual j in subpopulation i;
Wherein b kit is the effect of co-factor mark k;
Wherein g iit is the indieating variable of subpopulation i;
Wherein u iit is the effect of subpopulation i; And
Wherein e ijit is residual error.
7. the method for claim 1, wherein said nested colony is by single common parent system and multiple inbreeding population founded the intermolecular hybrid of each in being and produce.
8. method as claimed in claim 7, wherein said population comprises by described single common parent system and a described multiple population of taking turns or taking turns selfing more and producing of founding the filial generation of the intermolecular hybrid of each being.
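By way of illustration only, and not as part of the claims, the regression form shared by the SMR and MMR models can be sketched with ordinary least squares. The sketch assumes the tested marker is coded +1 / -1 as in claim 2 and that subpopulation effects enter as fixed indicator columns; the function name and arguments are hypothetical, and the stepwise selection of cofactors and the significance test are not shown.

```python
import numpy as np

def fit_nested_regression(y, subpop, x, cofactors=None):
    """Least-squares fit of y_ij = mu + x_ij*a + sum_k c_ijk*b_k + g_i*u_i + e_ij.

    y: phenotypic values; subpop: subpopulation label per individual;
    x: tested marker coded +1 / -1; cofactors: optional (n, m) array of
    cofactor markers (omit it to obtain the single-marker model).
    Returns the estimated additive effect a.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    subpop = np.asarray(subpop)
    labels = np.unique(subpop)
    # Indicator columns g_i; the first subpopulation is absorbed into mu.
    indicators = (subpop[:, None] == labels[None, 1:]).astype(float)
    columns = [np.ones(len(y)), x]
    if cofactors is not None:
        columns.append(np.asarray(cofactors, dtype=float))
    columns.append(indicators)
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]
```

Calling the function without `cofactors` corresponds to the single-marker form, and passing the selected cofactor markers corresponds to the multiple-marker form.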
CN201080015488.5A 2009-02-06 2010-02-05 Method for selecting statistically validated candidate genes Active CN102369531B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US12/367,045 2009-02-06
US12/367,045 US8170805B2 (en) 2009-02-06 2009-02-06 Method for selecting statistically validated candidate genes
PCT/US2010/023312 WO2010091248A1 (en) 2009-02-06 2010-02-05 Method for selecting statistically validated candidate genes

Publications (2)

Publication Number Publication Date
CN102369531A CN102369531A (en) 2012-03-07
CN102369531B true CN102369531B (en) 2015-03-18

Family

ID=42084599

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201080015488.5A Active CN102369531B (en) 2009-02-06 2010-02-05 Method for selecting statistically validated candidate genes

Country Status (8)

Country Link
US (1) US8170805B2 (en)
EP (1) EP2399214B1 (en)
CN (1) CN102369531B (en)
AU (1) AU2010210552B2 (en)
BR (1) BRPI1011489B1 (en)
CA (1) CA2750225C (en)
ES (1) ES2757827T3 (en)
WO (1) WO2010091248A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1434131A (en) * 2003-02-20 2003-08-06 邓红文 BsrBI molecular marker linked with Chinese herd lumber bone density in IL-6 gene

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5569588A (en) * 1995-08-09 1996-10-29 The Regents Of The University Of California Methods for drug screening
AU2003303502A1 (en) * 2002-12-27 2004-07-29 Rosetta Inpharmatics Llc Computer systems and methods for associating genes with traits using cross species data

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A unified mixed-model method for association mapping that accounts for multiple levels of relatedness; Jianming Yu et al.; Nature Genetics; 2006-02-28; Vol. 38, No. 2, pp. 203-208 *
Genetic design and statistical power of nested association mapping in maize; Jianming Yu et al.; Genetics; 2008-01-31; Vol. 178, No. 1, pp. 539-551 *
How to interpret a genome-wide association study; Thomas A. Pearson; Journal of the American Medical Association; 2008-03-01; Vol. 299, No. 11, pp. 1335-1344 *
Precision mapping of quantitative trait loci; Zhao-Bang Zeng; Genetics; 1994-04-01; Vol. 136, No. 4, pp. 1457-1468 *
Sequence Polymorphisms Cause Many False cis eQTLs; Rudi Alberts et al.; PLoS ONE; 2007-07-31; Vol. 2, No. 7, pp. 1-5 *

Also Published As

Publication number Publication date
WO2010091248A1 (en) 2010-08-12
CA2750225A1 (en) 2010-08-12
BRPI1011489A2 (en) 2016-03-22
ES2757827T3 (en) 2020-04-30
CA2750225C (en) 2017-05-16
EP2399214B1 (en) 2019-09-11
AU2010210552B2 (en) 2016-01-28
AU2010210552A1 (en) 2011-08-11
CN102369531A (en) 2012-03-07
US20100204921A1 (en) 2010-08-12
US8170805B2 (en) 2012-05-01
BRPI1011489B1 (en) 2021-02-17
EP2399214A1 (en) 2011-12-28

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant